New strategies for freezing, advancing relfrozenxid early

Started by Peter Geoghegan, over 3 years ago, 140 messages
4 attachment(s)

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Recap
=====

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once, causing significant
disruption to production workloads. The patches teach VACUUM to care
about how far behind it is on freezing for each table -- the number of
unfrozen all-visible pages that have accumulated so far is directly
and explicitly kept under control over time. Unfrozen pages can be
seen as debt. There isn't necessarily anything wrong with getting into
debt (getting into debt to a small degree is all but inevitable), but
debt can be dangerous when it isn't managed carefully. Accumulating
large amounts of debt doesn't always end badly, but it does seem to
reliably create the *risk* that things will end badly.

Right now, a standard append-only table could easily do *all* freezing
in aggressive/antiwraparound VACUUM, without any earlier
non-aggressive VACUUM operations triggered by
autovacuum_vacuum_insert_threshold doing any freezing at all (unless
the user goes out of their way to tune vacuum_freeze_min_age). There
is currently no natural limit on the number of unfrozen all-visible
pages that can accumulate -- unless you count age(relfrozenxid), the
triggering condition for antiwraparound autovacuum. But relfrozenxid
age predicts almost nothing about how much freezing is required (or
will be required later on). The overall result is that it often takes
far too long for freezing to finally happen, even when the table
receives plenty of autovacuums (they all could freeze something, but
in practice just don't freeze anything). It's very hard to avoid that
through tuning, because what we really care about is something pretty
closely related to (if not exactly) the number of unfrozen heap pages
in the system. XID age is fundamentally "the wrong unit" here -- the
physical cost of freezing is the most important thing, by far.

In short, the goal of the patch series/project is to make autovacuum
scheduling much more predictable over time, especially with very large
append-only tables. The patches improve the performance stability of
VACUUM by managing costs holistically, over time. What happens in one
single VACUUM operation is much less important than the behavior of
successive VACUUM operations over time.

What's new: freezing/skipping strategies
========================================

This newly overhauled version introduces the concept of
per-VACUUM-operation strategies, which we decide on once per VACUUM,
at the very start. There are 2 choices to be made at this point (right
after we acquire OldestXmin and similar cutoffs):

1) Do we scan all-visible pages, or do we skip instead? (Added by
second patch, involves a trade-off between eagerness and laziness.)
2) How should we freeze -- eagerly or lazily? (Added by third patch)
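
As a rough illustration, both decisions can be thought of as a pair of
booleans that get locked in once, before the first heap page is read
(a minimal sketch only -- the struct and field names here are
hypothetical, not the patches' actual data structures):

#include <stdbool.h>

/* Hypothetical container for the two up-front, per-VACUUM decisions */
typedef struct VacuumStrategies
{
    /* (1) skipping strategy: scan all-visible pages, or skip them? */
    bool        scan_all_visible;
    /* (2) freezing strategy: freeze eagerly, or lazily (classic behavior)? */
    bool        freeze_eagerly;
} VacuumStrategies;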

The strategy-based approach can be thought of as something that blurs
the distinction between aggressive and non-aggressive VACUUM, giving
VACUUM more freedom to do either more or less work, based on known
costs and benefits. This doesn't completely supersede
aggressive/antiwraparound VACUUMs, but should make them much rarer
with larger tables, where controlling freeze debt actually matters.
There is a need to keep laziness and eagerness in balance here. We try
to get the benefit of lazy behaviors/strategies, but will still course
correct when it doesn't work out.

A new GUC/reloption called vacuum_freeze_strategy_threshold is added
to control freezing strategy (also influences our choice of skipping
strategy). It defaults to 4GB, so tables smaller than that cutoff
(which are usually the majority of all tables) will continue to freeze
in much the same way as today by default. Our current lazy approach to
freezing makes sense there, and should be preserved for its own sake.
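
To make that concrete, the freezing-strategy decision in the third
patch boils down to a simple size comparison. Here's a condensed sketch
(assuming, as the attached v1-0003 patch does, that the threshold is
resolved from the reloption when set and from the GUC otherwise, with
both ultimately expressed in heap blocks; the helper name is made up):

typedef unsigned int BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/*
 * Resolve the effective threshold and pick a freezing strategy for one
 * VACUUM.  A reloption value of -1 means "fall back to the GUC", mirroring
 * the other autovacuum_* reloptions.  Aggressive VACUUMs always freeze
 * eagerly.
 */
static bool
use_eager_freezing(BlockNumber rel_pages, int reloption_threshold,
                   int guc_threshold, bool aggressive)
{
    int         eager_threshold = (reloption_threshold < 0) ?
        guc_threshold : reloption_threshold;

    return aggressive || rel_pages >= (BlockNumber) eager_threshold;
}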

Compatibility
=============

Structuring the new freezing behavior as an explicit user-configurable
strategy is also useful as a bridge between the old and new freezing
behaviors. It makes it fairly easy to get the old/current behavior
where that's preferred -- which, I must admit, is something that
wasn't well thought through last time around. The
vacuum_freeze_strategy_threshold GUC is effectively (though not
explicitly) a compatibility option. Users that want something close to
the old/current behavior can use the GUC or reloption to more or less
opt-out of the new freezing behavior, and can do so selectively. The
GUC should be easy for users to understand, too -- it's just a table
size cutoff.

Skipping pages using a snapshot of the visibility map
=====================================================

We now take a copy of the visibility map at the point that VACUUM
begins, and work off of that when skipping, instead of working off of
the mutable/authoritative VM -- this is a visibility map snapshot.
This new infrastructure helps us to decide on a skipping strategy.
Every non-aggressive VACUUM operation now has a choice to make: Which
skipping strategy should it use? (This was introduced as
item/question #1 a moment ago.)
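
Conceptually, the snapshot is just an immutable copy of the VM's two
bits for every heap page, captured once when VACUUM starts. A minimal
sketch (the struct and function names below are hypothetical -- the
actual vmsnap infrastructure in the second patch may look quite
different):

#include <stdbool.h>

typedef unsigned int BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/* Immutable per-page VM state, captured when VACUUM begins */
typedef struct VMSnapshot
{
    BlockNumber rel_pages;
    bool       *all_visible;        /* one flag per heap page */
    bool       *all_frozen;         /* one flag per heap page */
} VMSnapshot;

/*
 * Decide whether to skip a page, consulting only the snapshot.  Concurrent
 * changes to the authoritative VM cannot affect the outcome.
 */
static bool
vmsnap_should_skip(const VMSnapshot *vmsnap, BlockNumber blkno,
                   bool skip_all_visible)
{
    if (vmsnap->all_frozen[blkno])
        return true;                /* all-frozen pages are always skippable */
    return skip_all_visible && vmsnap->all_visible[blkno];
}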

The decision on skipping strategy is a decision about our priorities
for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page). If it's the latter, we'll scan exactly 0
all-visible pages. Either way, once a decision has been made, we don't
leave much to chance -- we commit. ISTM that this is the only approach
that really makes sense. Fundamentally, we advance relfrozenxid a
table at a time, and at most once per VACUUM operation. And for larger
tables it's just impossible as a practical matter to have frequent
VACUUM operations. We ought to be *somewhat* biased in the direction
of advancing relfrozenxid by *some* amount during each VACUUM, even
when relfrozenxid isn't all that old right now.

A strategy (whether for skipping or for freezing) is a big, up-front
decision -- and there are certain kinds of risks that naturally
accompany that approach. The information driving the decision had
better be fairly reliable! By using a VM snapshot, we can choose our
skipping strategy based on precise information about how many *extra*
pages we will have to scan if we go with eager scanning/relfrozenxid
advancement. Concurrent activity cannot change what we scan and what
we skip, either -- everything is locked in from the start. That seems
important to me. It justifies trying to advance relfrozenxid early,
just because the added cost of scanning any all-visible pages happens
to be low.
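
The decision itself then reduces to comparing the two candidate
scanned_pages totals that the snapshot gives us. A sketch of the idea
(the 5% figure comes from SKIPALLVIS_THRESHOLD_PAGES in the attached
patches; the exact rule and the helper name here are simplifications,
not the patches' literal code):

typedef unsigned int BlockNumber;

/* Threshold for scanning extra all-visible pages (5% of rel_pages) */
#define SKIPALLVIS_THRESHOLD_PAGES  0.05

/*
 * Choose between eager scanning (advance relfrozenxid this VACUUM) and
 * lazy scanning (skip all-visible pages).  Both candidate totals come from
 * the VM snapshot, so the choice is locked in from the start.
 */
static bool
scan_all_visible_pages(BlockNumber rel_pages,
                       BlockNumber scanned_pages_skipallvis,
                       BlockNumber scanned_pages_skipallfrozen,
                       bool eager_freezing)
{
    BlockNumber extra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;

    /* Eager freezing implies eager scanning of all-visible pages */
    if (eager_freezing)
        return true;

    /* Otherwise be eager only when the added cost is relatively low */
    return extra < (BlockNumber) (rel_pages * SKIPALLVIS_THRESHOLD_PAGES);
}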

This is quite a big shift for VACUUM, at least in some ways. The patch
adds a DETAIL to the "starting vacuuming" INFO message shown by VACUUM
VERBOSE. The VERBOSE output is already supposed to work as a
rudimentary progress indicator (at least when it is run at the
database level), so it now shows the final scanned_pages up-front,
before the physical scan of the heap even begins:

regression=# vacuum verbose tenk1;
INFO: vacuuming "regression.public.tenk1"
DETAIL: total table size is 486 pages, 3 pages (0.62% of total) must be scanned
INFO: finished vacuuming "regression.public.tenk1": index scans: 0
pages: 0 removed, 486 remain, 3 scanned (0.62% of total)
*** SNIP ***
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM

I included this VERBOSE tweak in the second patch because it became
natural with VM snapshots, and not because it felt particularly
compelling -- scanned_pages just works like this now (an assertion
verifies that our initial scanned_pages is always an exact match to
what happened during the physical scan, in fact).

There are many things that VM snapshots might also enable that aren't
particularly related to freeze debt. VM snapshotting has the potential
to enable more flexible behavior by VACUUM. I'm thinking of things
like suspend-and-resume for VACUUM/autovacuum, or even autovacuum
scheduling that coordinates autovacuum workers before and during
processing by vacuumlazy.c. Locking in scanned_pages up-front avoids
the main downside that comes with throttling VACUUM right now: the
fact that simply taking our time during VACUUM will tend to increase
the number of concurrently modified pages that we end up scanning.
These pages are bound to mostly just contain "recently dead" tuples
that the ongoing VACUUM can't do much about anyway -- we could dirty a
lot more heap pages as a result, for little to no benefit.

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable
consequence of the work done by the earlier patches: it makes VACUUM
avoid allocating new MultiXactIds (unless avoiding that is fundamentally
impossible, as in a VACUUM FREEZE). With just the first 3 patches
applied, VACUUM will naively process xmax using a cutoff XID that
comes from OldestXmin (and not FreezeLimit, which is how it works on
HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
of either OldestXmin or FreezeLimit selectively, based on the costs
and benefits for any given xmax.

Just like in lazy_scan_noprune, the low level xmax-freezing code can
pick and choose as it goes, within certain reasonable constraints. We
must accept an older final relfrozenxid/relminmxid value for the rel's
authoritative pg_class tuple as a consequence of avoiding xmax
processing, of course, but that shouldn't matter at all (it's
definitely better than the alternative).
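
The key mechanism is in the patch's changes to FreezeMultiXactId(): the
expensive second pass over the Multi (the one that can allocate a
replacement MultiXactId) is only taken when some member XID really is
below FreezeLimit, or when the Multi itself is below MultiXactCutoff.
Here is a condensed sketch of that check (types and comparisons
simplified; the real code uses the wraparound-aware TransactionIdPrecedes
and MultiXactIdPrecedes, not raw "<"):

typedef unsigned int TransactionId;
typedef unsigned int MultiXactId;

typedef struct MultiXactMember
{
    TransactionId xid;
    int           status;
} MultiXactMember;

/*
 * Return true only when xmax processing cannot be put off: a member XID is
 * older than limit_xid (VACUUM's FreezeLimit), or the Multi itself is older
 * than limit_multi (MultiXactCutoff).  Otherwise keep xmax as-is and avoid
 * the second pass that might allocate a new MultiXactId.
 */
static bool
multi_needs_replacement(MultiXactId multi, const MultiXactMember *members,
                        int nmembers, TransactionId limit_xid,
                        MultiXactId limit_multi)
{
    for (int i = 0; i < nmembers; i++)
    {
        if (members[i].xid < limit_xid)     /* TransactionIdPrecedes */
            return true;
    }
    return multi < limit_multi;             /* MultiXactIdPrecedes */
}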

Reducing the WAL space overhead of freezing
===========================================

Not included in this new v1 are other patches that control the
overhead of added freezing -- my focus since joining AWS has been on
getting these more strategic patches into shape, and on telling the
right story about what I'm trying to do here. I'm going to say a little
about the patches that I have in the pipeline here, though. Getting the
low-level/mechanical overhead of freezing under control will probably
require a few complementary techniques, not just high-level strategies
(though the strategy stuff is the most important piece).

The really interesting omitted-in-v1 patch adds deduplication of
xl_heap_freeze_page WAL records. This reduces the space overhead of
WAL records used to freeze by ~5x in most cases. It works in the
obvious way: we just store the 12 byte freeze plans that appear in
each xl_heap_freeze_page record only once, and then store an array of
item offset numbers for each entry (rather than naively storing a full
12 bytes per tuple frozen per page-level WAL record). This means that
we only need an "extra" ~2 bytes of WAL space per "extra" tuple frozen
(2 bytes for an OffsetNumber) once we decide to freeze something on
the same page. The *marginal* cost can be much lower than it is today,
which makes page-based batching of freezing much more compelling IMV.
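
To make the arithmetic concrete: today each frozen tuple carries its own
~12 byte entry in the WAL record, whereas with deduplication one "freeze
plan" is stored per distinct plan and each additional tuple only adds a
2 byte item offset. A hypothetical layout (that patch isn't part of v1,
so this is only a sketch of the idea, not its actual struct):

#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint16_t OffsetNumber;

/*
 * One deduplicated freeze plan within an xl_heap_freeze_page record: the
 * ~12 bytes describing how to freeze a tuple are stored once, followed by
 * an array of offsets for every tuple on the page that the plan covers.
 */
typedef struct xl_heap_freeze_plan
{
    TransactionId xmax;
    uint16_t      t_infomask2;
    uint16_t      t_infomask;
    uint8_t       frzflags;
    uint16_t      ntuples;          /* number of offsets that follow */
    /* OffsetNumber offsets[ntuples] follows -- 2 bytes per extra tuple */
} xl_heap_freeze_plan;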

Thoughts?
--
Peter Geoghegan

Attachments:

v1-0001-Add-page-level-freezing-to-VACUUM.patch (application/octet-stream)
From 3ccc7220a18c028e535d1e7617b8997a17e586e4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v1 1/4] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit).  For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
---
 src/include/access/heapam.h          |   4 +-
 src/include/access/heapam_xlog.h     |  37 +++++-
 src/backend/access/heap/heapam.c     | 171 ++++++++++++++++-----------
 src/backend/access/heap/vacuumlazy.c | 152 ++++++++++++++----------
 4 files changed, 230 insertions(+), 134 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index abf62d9df..c201f8ae6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,8 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									TransactionId limit_xid, MultiXactId limit_multi,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 1705e736b..40556271d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -330,6 +330,38 @@ typedef struct xl_heap_freeze_tuple
 	uint8		frzflags;
 } xl_heap_freeze_tuple;
 
+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determing whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.
+ *
+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
+ * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ */
+typedef struct page_frozenxid_tracker
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze;
+
+	/* Values used when page is to be frozen based on freeze plans */
+	TransactionId relfrozenxid_out;
+	MultiXactId relminmxid_out;
+
+	/* Used by caller for '!freeze' pages */
+	TransactionId relfrozenxid_nofreeze_out;
+	MultiXactId relminmxid_nofreeze_out;
+
+} page_frozenxid_tracker;
+
 /*
  * This is what we need to know about a block being frozen during vacuum
  *
@@ -409,10 +441,11 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId relminmxid,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
+									  TransactionId limit_xid,
+									  MultiXactId limit_multi,
 									  xl_heap_freeze_tuple *frz,
 									  bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  page_frozenxid_tracker *xtrack);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *xlrec_tp);
 extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index aab8d6fa4..d6aea370f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6431,26 +6431,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * will be totally frozen after these operations are performed and false if
  * more freezing will eventually be required.
  *
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * Caller must initialize xtrack fields for page as a whole before calling
+ * here with first tuple for the page.  See page_frozenxid_tracker comments.
+ *
+ * Caller must set frz->offset itself if heap_execute_freeze_tuple is called.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6463,34 +6452,46 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  TransactionId limit_xid, MultiXactId limit_multi,
 						  xl_heap_freeze_tuple *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  page_frozenxid_tracker *xtrack)
 {
 	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
+	/*
+	 * limit_xid *must* be <= cutoff_xid, to ensure that any XID older than it
+	 * can neither be running nor seen as running by any open transaction.
+	 * This ensures that we only freeze XIDs that are safe to freeze -- those
+	 * that are already unambiguously visible to everybody.
+	 *
+	 * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+	 * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
+	 */
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
+
 	frz->frzflags = 0;
 	frz->t_infomask2 = tuple->t_infomask2;
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen iff our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for relfrozenxid_out handling for already-frozen xmin */
+	}
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6499,8 +6500,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoff_xid);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
@@ -6514,8 +6515,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 
@@ -6526,7 +6527,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
+	 * Make sure to keep heap_tuple_would_freeze in sync with this.  It needs
+	 * to return true for any tuple that we would force to be frozen here.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6534,7 +6536,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = xtrack->relfrozenxid_out;
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
@@ -6553,8 +6555,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6582,10 +6584,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6613,10 +6615,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			if (MultiXactIdPrecedes(xid, xtrack->relminmxid_out))
+				xtrack->relminmxid_out = xid;
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 		}
 		else
 		{
@@ -6656,8 +6658,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 	else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
@@ -6673,6 +6675,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(changed);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6703,11 +6710,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we ignore the cutoff_xid and just always perform the
 		 * freeze operation.  The oldest release in which such a value can
 		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
+		 * was removed in PostgreSQL 9.0.
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
@@ -6721,18 +6724,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
 			changed = true;
+			xtrack->freeze = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen (provided caller executes freeze plan for the page)
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
+
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < limit_xid (and MXIDs < limit_multi) must never remain
+	 */
+	if (!xtrack->freeze && !(xmin_already_frozen && xmax_already_frozen))
+		xtrack->freeze =
+			heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+									&xtrack->relfrozenxid_nofreeze_out,
+									&xtrack->relminmxid_nofreeze_out);
+
 	return changed;
 }
 
@@ -6785,14 +6806,20 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	xl_heap_freeze_tuple frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	page_frozenxid_tracker dummy;
+
+	dummy.freeze = true;
+	dummy.relfrozenxid_out = cutoff_xid;
+	dummy.relminmxid_out = cutoff_multi;
+	dummy.relfrozenxid_nofreeze_out = cutoff_xid;
+	dummy.relminmxid_nofreeze_out = cutoff_multi;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
+										  cutoff_xid, cutoff_multi,
 										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+										  &dummy);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7218,17 +7245,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * heap_tuple_would_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/XMID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * force freezing of any of the XID/XMID fields from the tuple, given the same
+ * limits.  We must also deal with dead tuples here, since (xmin, xmax, xvac)
+ * fields could be processed by pruning away the whole tuple instead of
+ * freezing.
+ *
+ * Note: VACUUM refers to limit_xid and limit_multi as "FreezeLimit" and
+ * "MultiXactCutoff" respectively.  These should not be confused with the
+ * absolute cutoffs for freezing.  We just determine whether caller's tuple
+ * and limits trigger heap_prepare_freeze_tuple to force freezing.
  *
  * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
  * like the heap_prepare_freeze_tuple arguments that they're based on.  We
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						TransactionId limit_xid, MultiXactId limit_multi,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
@@ -7242,7 +7275,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	{
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 
@@ -7259,7 +7292,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		/* xmax is a non-permanent XID */
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
@@ -7282,7 +7315,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
+		if (MultiXactIdPrecedes(multi, limit_multi))
 			would_freeze = true;
 
 		/* need to check whether any member of the mxact is old */
@@ -7295,7 +7328,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			Assert(TransactionIdIsNormal(xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(xid, limit_xid))
 				would_freeze = true;
 		}
 		if (nmembers > 0)
@@ -7309,7 +7342,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		{
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			would_freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b802ed247..75cb31e75 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoffs for freezing and pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -507,6 +508,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
@@ -1554,8 +1556,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				recently_dead_tuples;
 	int			nnewlpdead;
 	int			nfrozen;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	page_frozenxid_tracker xtrack;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
 
@@ -1571,8 +1573,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.freeze = false;
+	xtrack.relfrozenxid_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_out = vacrel->NewRelminMxid;
+	xtrack.relfrozenxid_nofreeze_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_nofreeze_out = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -1625,27 +1630,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether the page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1777,10 +1778,12 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
 									  vacrel->FreezeLimit,
 									  vacrel->MultiXactCutoff,
 									  &frozen[nfrozen], &tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
 		{
 			/* Will execute freeze below */
 			frozen[nfrozen++].offset = offnum;
@@ -1801,9 +1804,33 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.freeze || nfrozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_out;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_nofreeze_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_nofreeze_out;
+
+		/* Might still set page all-visible, but never all-frozen */
+		nfrozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
@@ -1811,7 +1838,7 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
-		Assert(prunestate->hastup);
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
@@ -1841,7 +1868,7 @@ retry:
 		{
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+			recptr = log_heap_freeze(rel, buf, vacrel->NewRelfrozenXid,
 									 frozen, nfrozen);
 			PageSetLSN(page, recptr);
 		}
@@ -1850,6 +1877,41 @@ retry:
 	}
 
 	/*
+	 * Now save details of the LP_DEAD items from the page in vacrel
+	 */
+	if (lpdead_items > 0)
+	{
+		VacDeadItems *dead_items = vacrel->dead_items;
+		ItemPointerData tmp;
+
+		vacrel->lpdead_item_pages++;
+
+		ItemPointerSetBlockNumber(&tmp, blkno);
+
+		for (int i = 0; i < lpdead_items; i++)
+		{
+			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+			dead_items->items[dead_items->num_items++] = tmp;
+		}
+
+		Assert(dead_items->num_items <= dead_items->max_items);
+		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->has_lpdead_items = true;
+		prunestate->all_visible = false;
+	}
+
+	/* Finally, add page-local counts to whole-VACUUM counts */
+	vacrel->tuples_deleted += tuples_deleted;
+	vacrel->lpdead_items += lpdead_items;
+	vacrel->live_tuples += live_tuples;
+	vacrel->recently_dead_tuples += recently_dead_tuples;
+
+	/*
+	 * We're done, but assert that some postconditions hold before returning.
+	 *
 	 * The second pass over the heap can also set visibility map bits, using
 	 * the same approach.  This is important when the table frequently has a
 	 * few old LP_DEAD items on each page by the time we get to it (typically
@@ -1873,7 +1935,7 @@ retry:
 			Assert(false);
 
 		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1885,38 +1947,6 @@ retry:
 			   cutoff == prunestate->visibility_cutoff_xid);
 	}
 #endif
-
-	/*
-	 * Now save details of the LP_DEAD items from the page in vacrel
-	 */
-	if (lpdead_items > 0)
-	{
-		VacDeadItems *dead_items = vacrel->dead_items;
-		ItemPointerData tmp;
-
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
-		vacrel->lpdead_item_pages++;
-
-		ItemPointerSetBlockNumber(&tmp, blkno);
-
-		for (int i = 0; i < lpdead_items; i++)
-		{
-			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
-			dead_items->items[dead_items->num_items++] = tmp;
-		}
-
-		Assert(dead_items->num_items <= dead_items->max_items);
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_items->num_items);
-	}
-
-	/* Finally, add page-local counts to whole-VACUUM counts */
-	vacrel->tuples_deleted += tuples_deleted;
-	vacrel->lpdead_items += lpdead_items;
-	vacrel->live_tuples += live_tuples;
-	vacrel->recently_dead_tuples += recently_dead_tuples;
 }
 
 /*
-- 
2.34.1

v1-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch (application/octet-stream)
From 720bf46318947e0b05e64022126f71c50d5b4071 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v1 4/4] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit.  Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations.  We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle.  VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

FreezeMultiXactId() is now eager when it's cheap to process xmax, and
lazy when it's expensive/risky to process xmax (because an expensive
second pass that might result in allocating a new Multi is required).
We make a similar trade-off to the one made by vacuumlazy.c when a
cleanup lock isn't available on some heap page.  We can usually put off
freezing (for the time being) when it's inconvenient to proceed.  The
only downside to this approach is that it necessitates pushing back the
final relfrozenxid/relminmxid value that can be set in pg_class.
---
 src/backend/access/heap/heapam.c | 49 +++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 699a5acae..e18000d81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6111,11 +6111,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
  *
  * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ *
+ * Allocating new MultiXactIds during VACUUM is something that we always
+ * prefer to avoid.  Caller passes us down all the context required to do this
+ * on their behalf: cutoff_xid, cutoff_multi, limit_xid, and limit_multi.
+ *
+ * We must never leave behind XIDs/MXIDs from before the "limits" given.
+ * There is lots of leeway around XIDs/MXIDs >= caller's "limits", though.
+ * When it's possible to inexpensively process xmax right away, we're eager
+ * about it.  Otherwise we're lazy about it -- next time it might be cheap.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
+				  TransactionId limit_xid, MultiXactId limit_multi,
 				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
 {
 	TransactionId xid = InvalidTransactionId;
@@ -6208,13 +6218,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below the limit for Xids, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
+	 *
+	 * We use limit_xid for this (VACUUM's FreezeLimit), rather than using
+	 * cutoff_xid (VACUUM's OldestXmin).  We greatly prefer to avoid a second
+	 * pass over the Multi that results in allocating a new replacement Multi.
 	 */
-
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6225,12 +6238,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP optimization */
 	for (i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		if (TransactionIdPrecedes(members[i].xid, limit_xid))
 		{
 			need_replace = true;
 			break;
@@ -6239,11 +6251,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			temp_xid_out = members[i].xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* The Multi itself must be >= limit_multi, too */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, limit_multi);
+
 	if (!need_replace)
 	{
 		/*
@@ -6260,6 +6271,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_would_freeze will indicate that the tuple must be frozen.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6359,7 +6373,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				/*
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
+				 * cutoff_xid is VACUUM's OldestXmin, which is also the
 				 * initial value used for top-level relfrozenxid_out tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
@@ -6529,7 +6543,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given limit for Xids.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6547,6 +6561,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi,
+									limit_xid, limit_multi,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6587,12 +6602,18 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * MultiXactId, to carry forward two or more original member XIDs.
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
+			 *
+			 * We only do this when we have no choice; heap_tuple_would_freeze
+			 * will definitely force the page to be frozen below.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
 			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 xtrack->relfrozenxid_out));
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
 			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
-- 
2.34.1

v1-0003-Add-eager-freezing-strategy-to-VACUUM.patch (application/octet-stream)
From 5a42f1ac5ed231e6a314ba79611ddb3e92992436 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v1 3/4] Add eager freezing strategy to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach (actually, we always
use eager freezing in aggressive VACUUMs, though they are expected to be
much rarer now).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have aggressive/antiwraparound
autovacuums for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.
---
 src/include/access/heapam_xlog.h              |  8 +-
 src/include/commands/vacuum.h                 |  4 +
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++
 src/backend/access/heap/heapam.c              |  8 +-
 src/backend/access/heap/vacuumlazy.c          | 80 ++++++++++++++++---
 src/backend/commands/vacuum.c                 |  3 +
 src/backend/postmaster/autovacuum.c           | 11 +++
 src/backend/utils/misc/guc.c                  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 15 ++++
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++
 13 files changed, 155 insertions(+), 18 deletions(-)

diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 40556271d..9ea1db505 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -345,7 +345,11 @@ typedef struct xl_heap_freeze_tuple
  * pg_class tuple.
  *
  * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
- * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ * relminmxid_nofreeze_out must also be maintained.  If vacuumlazy.c caller
+ * opts to not execute freeze plans produced by heap_prepare_freeze_tuple for
+ * its own reasons, then new relfrozenxid and relminmxid values must reflect
+ * that that choice was made.  (This is only safe when 'freeze' is still unset
+ * after the final last heap_prepare_freeze_tuple call for the page.)
  */
 typedef struct page_frozenxid_tracker
 {
@@ -356,7 +360,7 @@ typedef struct page_frozenxid_tracker
 	TransactionId relfrozenxid_out;
 	MultiXactId relminmxid_out;
 
-	/* Used by caller for '!freeze' pages */
+	/* Used by caller that opts not to freeze a '!freeze' page */
 	TransactionId relfrozenxid_nofreeze_out;
 	MultiXactId relminmxid_nofreeze_out;
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index f38e1148f..cb86bfa5f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -214,6 +214,9 @@ typedef enum VacOptValue
 typedef struct VacuumParams
 {
 	bits32		options;		/* bitmask of VACOPT_* */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	int			freeze_min_age; /* min freeze age, -1 to use default */
 	int			freeze_table_age;	/* age at which to scan whole table */
 	int			multixact_freeze_min_age;	/* min multixact freeze age, -1 to
@@ -252,6 +255,7 @@ typedef struct VacDeadItems
 
 /* GUC parameters */
 extern PGDLLIMPORT int default_statistics_target;	/* PGDLLIMPORT for PostGIS */
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 7dc401cf0..c6d8265cf 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 609329bb2..f4e2109e7 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d6aea370f..699a5acae 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6429,7 +6429,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * WAL-log what we would need to do, and return true.  Return false if nothing
  * is to be changed.  In addition, set *totally_frozen to true if the tuple
  * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * more freezing will eventually be required (assuming page is to be frozen).
+ *
+ * Although this interface is primarily tuple-based, caller decides on whether
+ * or not to freeze the page as a whole.  We'll often help caller to prepare a
+ * complete "freeze plan" that it ultimately discards.  However, our caller
+ * doesn't always get to choose; it must freeze when xtrack.freeze is set
+ * here.  This ensures that any XIDs < limit_xid are never left behind.
  *
  * Caller must initialize xtrack fields for page as a whole before calling
  * here with first tuple for the page.  See page_frozenxid_tracker comments.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2fd703668..990f2eebb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -150,6 +150,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Proactively freeze all tuples on pages about to be set all-visible? */
+	bool		allvis_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -252,6 +254,7 @@ typedef struct LVSavedErrInfo
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -327,6 +330,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				eager_threshold,
 				all_visible,
 				all_frozen,
 				scanned_pages,
@@ -366,6 +370,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
 	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
 	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 *
+	 * Also determine our cutoff for applying the eager/all-visible freezing
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
+	 * even during non-aggressive VACUUMs.
 	 */
 	aggressive = vacuum_set_xid_limits(rel,
 									   params->freeze_min_age,
@@ -374,6 +382,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   params->multixact_freeze_table_age,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
+	eager_threshold = params->freeze_strategy_threshold < 0 ?
+		vacuum_freeze_strategy_threshold :
+		params->freeze_strategy_threshold;
 
 	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -523,10 +534,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->NewRelminMxid = OldestMxact;
 
 	/*
-	 * Use visibility map snapshot to determine whether we'll skip all-visible
-	 * pages using vmsnap in lazy_scan_heap
+	 * Use visibility map snapshot to determine freezing strategy, and whether
+	 * we'll skip all-visible pages using vmsnap in lazy_scan_heap
 	 */
-	scanned_pages = lazy_scan_strategy(vacrel,
+	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
 									   all_visible, all_frozen);
 	if (verbose)
 	{
@@ -1305,17 +1316,28 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing totally avoids work that turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ * Antiwraparound VACUUMs of append-only tables should generally be avoided.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * important that relfrozenxid advance in affected tables, which are larger.
+ * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
+ * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
+ * depending on the extra cost - we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1348,21 +1370,48 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
 	else if (vacrel->aggressive)
+	{
+		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
+	}
+	else if (rel_pages >= eager_threshold)
+	{
+		/*
+		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
+		 * GUC-based threshold for eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
+		 */
+		vacrel->allvis_freeze_strategy = true;
+		vacrel->skipallvis = false;
+	}
 	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		vacrel->allvis_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1880,8 +1929,15 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When ongoing VACUUM opted to use the all-visible freezing strategy we
+	 * freeze any page that will become all-visible, making it all-frozen
+	 * instead. (Actually, there are edge-cases where this might not result in
+	 * marking the page all-frozen in the visibility map, but that should have
+	 * only a negligible impact.)
 	 */
-	if (xtrack.freeze || nfrozen == 0)
+	if (xtrack.freeze || nfrozen == 0 ||
+		(vacrel->allvis_freeze_strategy && prunestate->all_visible))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3670d1f18..6eaa3521d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -63,6 +63,7 @@
 /*
  * GUC parameters
  */
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
@@ -250,6 +251,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	 */
 	if (params.options & VACOPT_FREEZE)
 	{
+		params.freeze_strategy_threshold = -1;
 		params.freeze_min_age = 0;
 		params.freeze_table_age = 0;
 		params.multixact_freeze_min_age = 0;
@@ -257,6 +259,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	}
 	else
 	{
+		params.freeze_strategy_threshold = -1;
 		params.freeze_min_age = -1;
 		params.freeze_table_age = -1;
 		params.multixact_freeze_min_age = -1;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index b3b1afba8..110b3ffe7 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -146,6 +146,7 @@ static TransactionId recentXid;
 static MultiXactId recentMulti;
 
 /* Default freeze ages to use for autovacuum (varies by database) */
+static int	default_freeze_strategy_threshold;
 static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
@@ -2002,6 +2003,7 @@ do_autovacuum(void)
 
 	if (dbForm->datistemplate || !dbForm->datallowconn)
 	{
+		default_freeze_strategy_threshold = 0;
 		default_freeze_min_age = 0;
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
@@ -2009,6 +2011,7 @@ do_autovacuum(void)
 	}
 	else
 	{
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 		default_freeze_min_age = vacuum_freeze_min_age;
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
@@ -2793,6 +2796,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 	/* OK, it needs something done */
 	if (doanalyze || dovacuum)
 	{
+		int			freeze_strategy_threshold;
 		int			freeze_min_age;
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
@@ -2828,6 +2832,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			: Log_autovacuum_min_duration;
 
 		/* these do not have autovacuum-specific settings */
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		freeze_min_age = (avopts && avopts->freeze_min_age >= 0)
 			? avopts->freeze_min_age
 			: default_freeze_min_age;
@@ -2864,6 +2873,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.truncate = VACOPTVALUE_UNSPECIFIED;
 		/* As of now, we don't support parallel vacuum for autovacuum */
 		tab->at_params.nworkers = -1;
+
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.freeze_min_age = freeze_min_age;
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9fbbfb1be..9b9179868 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2696,6 +2696,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_freeze_min_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Minimum age at which VACUUM should freeze a table row."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502..e701e464e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -694,6 +694,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a5cd4e44c..ba3e012a0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9147,6 +9147,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c14b2010d..7e684d187 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1680,6 +1680,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.34.1

v1-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (application/octet-stream)
From d0cf595e7f587b0b8991156c3e08aadc32b81755 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v1 2/4] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now our policy on
skipping all-visible pages exactly matches the condition for whether it
is safe to advance relfrozenxid later on; nothing is left to
chance.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
---
 src/include/access/visibilitymap.h      |   7 +
 src/backend/access/heap/vacuumlazy.c    | 323 +++++++++++++++---------
 src/backend/access/heap/visibilitymap.c | 162 ++++++++++++
 3 files changed, 367 insertions(+), 125 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 75cb31e75..2fd703668 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -146,8 +146,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -171,13 +173,14 @@ typedef struct LVRelState
 	TransactionId OldestXmin;
 	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
+	/* Snapshot of visibility map, taken just after OldestXmin acquired */
+	vmsnapshot *vmsnap;
 	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -248,10 +251,12 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
 								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+								  bool *next_unskippable_allvis);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -314,7 +319,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
+				skipallfrozen,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
@@ -322,6 +327,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -367,7 +375,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
 
-	skipwithvm = true;
+	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
@@ -375,7 +383,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * visibility map (even those set all-frozen)
 		 */
 		aggressive = true;
-		skipwithvm = false;
+		skipallfrozen = false;
 	}
 
 	/*
@@ -400,20 +408,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -440,7 +434,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	/* Set skipallvis/skipallfrozen provisionally (before lazy_scan_strategy) */
+	vacrel->skipallvis = (!aggressive && skipallfrozen);
+	vacrel->skipallfrozen = skipallfrozen;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -505,11 +501,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
 	 * XIDs must at least be considered for freezing (though not necessarily
 	 * frozen) during its scan.
+	 *
+	 * Also acquire a read-only snapshot of the visibility map at this point.
+	 * We can work off of the snapshot when deciding which heap pages are safe
+	 * to skip.  This approach allows VACUUM to avoid scanning pages whose VM
+	 * bit gets unset concurrently, which is important with large tables that
+	 * take a long time to VACUUM.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
 	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
 	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
@@ -517,7 +521,35 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * Use visibility map snapshot to determine whether we'll skip all-visible
+	 * pages using vmsnap in lazy_scan_heap
+	 */
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   all_visible, all_frozen);
+	if (verbose)
+	{
+		Assert(!IsAutoVacuumWorkerProcess());
+		if (aggressive)
+			ereport(INFO,
+					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
+							get_database_name(MyDatabaseId),
+							vacrel->relnamespace, vacrel->relname),
+					 errdetail_internal("total table size is %u pages, %u pages (%.2f%% of total) must be scanned",
+										orig_rel_pages, scanned_pages,
+										orig_rel_pages == 0 ? 100.0 :
+										100.0 * scanned_pages / orig_rel_pages)));
+		else
+			ereport(INFO,
+					(errmsg("vacuuming \"%s.%s.%s\"",
+							get_database_name(MyDatabaseId),
+							vacrel->relnamespace, vacrel->relname),
+					 errdetail_internal("total table size is %u pages, %u pages (%.2f%% of total) must be scanned",
+										orig_rel_pages, scanned_pages,
+										orig_rel_pages == 0 ? 100.0 :
+										100.0 * scanned_pages / orig_rel_pages)));
+	}
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -534,6 +566,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -580,12 +613,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -630,6 +662,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -853,8 +888,7 @@ lazy_scan_heap(LVRelState *vacrel)
 				next_fsm_block_to_vacuum = 0;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
+	bool		next_unskippable_allvis;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -868,43 +902,33 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	/* Set up an initial range of skippable blocks using VM snapshot */
+	next_unskippable_block = lazy_scan_skip(vacrel, 0,
+											&next_unskippable_allvis);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
-
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
+		if (blkno < next_unskippable_block)
 		{
 			/* Last page always scanned (may need to set nonempty_pages) */
 			Assert(blkno < rel_pages - 1);
 
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
+			/* Skip (don't scan) this page */
+			continue;
 		}
 
+		/*
+		 * Can't skip this page safely.  Must scan the page.  But determine
+		 * the next skippable range after the page first.
+		 */
+		all_visible_according_to_vmsnap = next_unskippable_allvis;
+		next_unskippable_block = lazy_scan_skip(vacrel, blkno + 1,
+												&next_unskippable_allvis);
+
 		vacrel->scanned_pages++;
 
 		/* Report as block scanned, update error traceback information */
@@ -1113,10 +1137,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Handle setting visibility map bit based on information from our VM
+		 * snapshot, and from prunestate
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1145,11 +1169,11 @@ lazy_scan_heap(LVRelState *vacrel)
 
 		/*
 		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * the page-level bit is clear.  However, lazy_scan_skip works off of
+		 * a snapshot of the VM that might be quite old by now.  Recheck with
+		 * a buffer lock held before concluding that the VM is corrupt.
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1188,7 +1212,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1281,7 +1305,97 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
+ *
+ * Determines if the ongoing VACUUM operation should skip all-visible pages
+ * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ *
+ * Returns final scanned_pages for the VACUUM operation.
+ */
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
+
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
+	if (!vacrel->skipallfrozen)
+	{
+		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		return rel_pages;
+	}
+	else if (vacrel->aggressive)
+		Assert(!vacrel->skipallvis);
+	else
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		vacrel->skipallvis = nextra >= nextra_threshold;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+		return scanned_pages_skipallvis;
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* keep compiler quiet */
+}
+
+/*
+ *	lazy_scan_skip() -- set up the next range of skippable blocks.
  *
  * lazy_scan_heap() calls here every time it needs to set up a new range of
  * blocks to skip via the visibility map.  Caller passes the next block in
@@ -1289,34 +1403,25 @@ lazy_scan_heap(LVRelState *vacrel)
  * no skippable blocks we just return caller's next_block.  The all-visible
  * status of the returned block is set in *next_unskippable_allvis for caller,
  * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * can be for rel's last page, and when DISABLE_PAGE_SKIPPING is used.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * This function operates on a snapshot of the visibility map that was taken
+ * just after OldestXmin was acquired.  VACUUM only needs to scan all pages
+ * whose tuples might contain XIDs < OldestXmin (or MXIDs < OldestMxact),
+ * which excludes pages treated as all-frozen here (pages >= rel_pages, too).
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block,
+			   bool *next_unskippable_allvis)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				next_unskippable_block = next_block;
 
 	*next_unskippable_allvis = true;
 	while (next_unskippable_block < rel_pages)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_unskippable_block);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
@@ -1332,55 +1437,23 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
 		if (next_unskippable_block == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
 		next_unskippable_block++;
-		nskippable_blocks++;
-	}
-
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
 	}
 
 	return next_unskippable_block;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index ed72eb7b6..6848576fd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	char		vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -368,6 +390,146 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snap_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of heap pages whose VM bit is concurrently unset.  VACUUM
+ * prefers to leave them to be scanned during the next VACUUM operation.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) + BLCKSZ * nvmpages);
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
-- 
2.34.1

#2 Jeremy Schneider
schnjere@amazon.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On 8/25/22 2:21 PM, Peter Geoghegan wrote:

New patch to avoid allocating MultiXacts
========================================

The fourth and final patch is also new. It corrects an undesirable
consequence of the work done by the earlier patches: it makes VACUUM
avoid allocating new MultiXactIds (unless it's fundamentally
impossible, like in a VACUUM FREEZE). With just the first 3 patches
applied, VACUUM will naively process xmax using a cutoff XID that
comes from OldestXmin (and not FreezeLimit, which is how it works on
HEAD). But with the fourth patch in place VACUUM applies an XID cutoff
of either OldestXmin or FreezeLimit selectively, based on the costs
and benefits for any given xmax.

Just like in lazy_scan_noprune, the low level xmax-freezing code can
pick and choose as it goes, within certain reasonable constraints. We
must accept an older final relfrozenxid/relminmxid value for the rel's
authoritative pg_class tuple as a consequence of avoiding xmax
processing, of course, but that shouldn't matter at all (it's
definitely better than the alternative).

We should be careful here. IIUC, the current autovac behavior helps
bound the "spread" or range of active multixact IDs in the system, which
directly determines the number of distinct pages that contain those
multixacts. If the proposed change herein causes the spread/range of
MXIDs to significantly increase, then it will increase the number of
blocks and increase the probability of thrashing on the SLRUs for these
data structures. There may be another separate thread or two about
issues with SLRUs already?

-Jeremy

PS. see also
/messages/by-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com

--
Jeremy Schneider
Database Engineer
Amazon Web Services

In reply to: Jeremy Schneider (#2)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 3:35 PM Jeremy Schneider <schnjere@amazon.com> wrote:

We should be careful here. IIUC, the current autovac behavior helps
bound the "spread" or range of active multixact IDs in the system, which
directly determines the number of distinct pages that contain those
multixacts. If the proposed change herein causes the spread/range of
MXIDs to significantly increase, then it will increase the number of
blocks and increase the probability of thrashing on the SLRUs for these
data structures.

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

With the patch, VACUUM pretty much uses the most aggressive possible
XID-wise/MXID-wise cutoffs in almost all cases (though only when we
actually decide to freeze a page at all, which is now a separate
question). The fourth patch in the patch series introduces a very
limited exception, where we use the same cutoffs that we'll always use
on HEAD (FreezeLimit + MultiXactCutoff) instead of the aggressive
variants (OldestXmin and OldestMxact). This isn't just *any* xmax
containing a MultiXact: it's a Multi that contains *some* XIDs that
*need* to go away during the ongoing VACUUM, and others that *cannot*
go away. Oh, and there usually has to be a need to keep two or more
XIDs for this to happen -- if there is only one XID then we can
usually swap xmax with that XID without any fuss.

PS. see also
/messages/by-id/247e3ce4-ae81-d6ad-f54d-7d3e0409a950@ardentperf.com

I think that the problem you describe here is very real, though I
suspect that it needs to be addressed by making opportunistic cleanup
of Multis happen more reliably. Running VACUUM more often just isn't
practical once a table reaches a certain size. In general, any kind of
processing that is time sensitive probably shouldn't be happening
solely during VACUUM -- it's just too risky. VACUUM might take a
relatively long time to get to the affected page. It might not even be
that long in wall clock time or whatever -- just too long to reliably
avoid the problem.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#3)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 4:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

As a general rule VACUUM will tend to do more eager freezing with the
patch set compared to HEAD, though it should never do less eager
freezing. Not even in corner cases -- never.

Come to think of it, I don't think that that's quite true. Though the
fourth patch isn't particularly related to the problem.

It *is* true that VACUUM will do at least as much freezing of XID
based tuple header fields as before. That just leaves MXIDs. It's even
true that we will do just as much freezing as before if you measure
purely by MultiXact age. But I'm the one that likes to point out that age is
altogether the wrong approach for stuff like this -- so that won't cut
it.

More concretely, I think that the patch series will fail to do certain
inexpensive eager processing of tuple xmax that happens today,
regardless of what the user has set vacuum_freeze_min_age or
vacuum_multixact_freeze_min_age to. Although we currently only care
about XID age when processing simple XIDs, we already sometimes make
trade-offs similar to the trade-off I propose to make in the fourth
patch for Multis.

In other words, on HEAD, we promise to process any MXID <
MultiXactCutoff inside FreezeMultiXactId(). But we also manage to do
"eager processing of xmax" when it's cheap and easy to do so, without
caring about MultiXactCutoff at all -- this is likely the common case,
even. This preexisting eager processing of Multis is likely important
in many applications.

The problem that I think I've created is that page-level freezing as
implemented in lazy_scan_prune by the third patch doesn't know that we
might be a good idea to execute a subset of freeze plans, in order to
remove a multi from a page right away. It mostly has the right idea by
holding off on freezing until it looks like a good idea at the level
of the whole page, but I think that this is a plausible exception.
Just because we're much more sensitive to leaving behind an Multi, and
right now the only code path that can remove a Multi that isn't needed
anymore is FreezeMultiXactId().

If xmax was an updater that aborted instead of a multi then we could
rely on hint bits being set by pruning to avoid clog lookups.
Technically nobody has violated their contract here, I think, but it
still seems like it could easily be unacceptable.

I need to come up with my own microbenchmark suite for Multis -- that
was on my TODO list already. I still believe that the fourth patch
addresses Andres' complaint about allocating new Multis during VACUUM.
But it seems like I need to think about the nuances of Multis some
more. In particular, what the performance impact might be of making a
decision on freezing at the page level, in light of the special
performance considerations for Multis.

Maybe it needs to be more granular than that, more often. Or maybe we
can comprehensively solve the problem in some other way entirely.
Maybe pruning should do this instead, in general. Something like that
might put this right, and be independently useful.

--
Peter Geoghegan

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once, causing significant
disruption to production workloads.

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

A new GUC/reloption called vacuum_freeze_strategy_threshold is added
to control freezing strategy (also influences our choice of skipping
strategy). It defaults to 4GB, so tables smaller than that cutoff
(which are usually the majority of all tables) will continue to freeze
in much the same way as today by default. Our current lazy approach to
freezing makes sense there, and should be preserved for its own sake.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 10000 freezable pages on this run, and then
leave the rest for a later run?

Thoughts?

What if we thought about this more like a "background freezer". It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an all-
frozen page?

Regards,
Jeff Davis

In reply to: Jeff Davis (#5)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Aug 29, 2022 at 11:47 AM Jeff Davis <pgsql@j-davis.com> wrote:

Sounds like a good goal, and loosely follows the precedent of
checkpoint targets and vacuum cost delays.

Right.

Why is the threshold per-table? Imagine someone who has a bunch of 4GB
partitions that add up to a huge amount of deferred freezing work.

I think it's possible that our cost model will eventually become very
sophisticated, and weigh all kinds of different factors, and work as
one component of a new framework that dynamically schedules autovacuum
workers. My main goal in posting this v1 was validating the *general
idea* of strategies with cost models, and the related question of how
we might use VM snapshots for that. After all, even the basic concept
is totally novel.

The initial problem you described is a system-level problem, so it
seems we should track the overall debt in the system in order to keep
up.

I agree that the problem is fundamentally a system-level problem. One
reason why vacuum_freeze_strategy_threshold works at the table level
right now is to get the ball rolling. In any case the specifics of how
we trigger each strategy are from from settled. That's not the only
reason why we think about things at the table level in the patch set,
though.

There *are* some fundamental reasons why we need to care about
individual tables, rather than caring about unfrozen pages at the
system level *exclusively*. This is something that
vacuum_freeze_strategy_threshold kind of gets right already, despite
its limitations. There are 2 aspects of the design that seemingly have
to work at the whole table level:

1. Concentration matters when it comes to wraparound risk.

Fundamentally, each VACUUM still targets exactly one heap rel, and
advances relfrozenxid at most once per VACUUM operation. While the
total number of "unfrozen heap pages" across the whole database is the
single most important metric, it's not *everything*.

As a general rule, there is much less risk in having a certain fixed
number of unfrozen heap pages spread fairly evenly among several
larger tables, compared to the case where the same number of unfrozen
pages are all concentrated in one particular table -- right now it'll
often be one particular table that is far larger than any other table.
Right now the pain is generally felt with large tables only.

2. We need to think about things at the table level to manage costs
*over time* holistically. (Closely related to #1.)

The ebb and flow of VACUUM for one particular table is a big part of
the picture here -- and will be significantly affected by table size.
We can probably always afford to risk falling behind on
freezing/relfrozenxid (i.e. we should prefer laziness) if we know that
we'll almost certainly be able to catch up later when things don't
quite work out. That makes small tables much less trouble, even when
there are many more of them (at least up to a point).

As you know, my high level goal is to avoid ever having to make huge
balloon payments to catch up on freezing, which is a much bigger risk
with a large table -- this problem is mostly a per-table problem (both
now and in the future).

A large table will naturally require fewer, larger VACUUM operations
than a small table, no matter what approach is taken with the strategy
stuff. We therefore have fewer VACUUM operations in a given
week/month/year/whatever to spread out the burden -- there will
naturally be fewer opportunities. We want to create the impression
that each autovacuum does approximately the same amount of work (or at
least the same per new heap page for large append-only tables).

It also becomes much more important to only dirty each heap page
during vacuuming ~once with larger tables. With a smaller table, there
is a much higher chance that the pages we modify will already be dirty
from user queries.

for this table, at this time: Is it more important to advance
relfrozenxid early (be eager), or to skip all-visible pages instead
(be lazy)? If it's the former, then we must scan every single page
that isn't all-frozen according to the VM snapshot (including every
all-visible page).

This feels too absolute, to me. If the goal is to freeze more
incrementally, well in advance of wraparound limits, then why can't we
just freeze 1000 out of 10000 freezable pages on this run, and then
leave the rest for a later run?

My remarks here applied only to the question of relfrozenxid
advancement -- not to freezing. Skipping strategy (relfrozenxid
advancement) is a distinct, though related, concept from freezing
strategy. So I was making a very narrow statement about
invariants/basic correctness rules -- I wasn't arguing against
alternative approaches to freezing beyond the 2 freezing strategies
(not to be confused with skipping strategies) that appear in v1.
That's all I meant -- there is definitely no point in scanning only a
subset of the table's all-visible pages, as far as relfrozenxid
advancement is concerned (and skipping strategy is fundamentally a
choice about relfrozenxid advancement vs work avoidance, eagerness vs
laziness).

Maybe you're right that there is room for additional freezing
strategies, besides the two added by v1-0003-*patch. Definitely seems
possible. The freezing strategy concept should be usable as a
framework for adding additional strategies, including (just for
example) a strategy that decides ahead of time to freeze only a
certain number of pages and leave the rest (without regard for the
fact that the pages we do freeze may not be very different from those
we don't freeze in the current VACUUM).
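
For illustration only, such a "budgeted" strategy might slot in as a
third case alongside the lazy and eager ones. All of the names below
are hypothetical; none of them appear in the patches:

#include <stdbool.h>

typedef unsigned int BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/* Hypothetical set of freezing strategies (only the first two exist in v1) */
typedef enum
{
    FREEZE_STRATEGY_LAZY,       /* freeze only when FreezeLimit forces it */
    FREEZE_STRATEGY_EAGER,      /* freeze whole pages whenever eligible */
    FREEZE_STRATEGY_BUDGETED    /* eager, but only up to a page budget */
} FreezeStrategy;

static bool
should_freeze_page(FreezeStrategy strategy,
                   bool forced_by_freeze_limit,
                   BlockNumber pages_frozen_so_far,
                   BlockNumber freeze_budget_pages)
{
    /* Freezing forced by FreezeLimit/MultiXactCutoff is never optional */
    if (forced_by_freeze_limit)
        return true;

    switch (strategy)
    {
        case FREEZE_STRATEGY_LAZY:
            return false;
        case FREEZE_STRATEGY_EAGER:
            return true;
        case FREEZE_STRATEGY_BUDGETED:
            /* Stay eager only while the per-VACUUM budget lasts */
            return pages_frozen_so_far < freeze_budget_pages;
    }

    return false;               /* unreachable; keeps compilers quiet */
}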

I'm definitely open to that. It's just a matter of characterizing what
set of workload characteristics this third strategy would solve, how
users might opt in or opt out, etc. Both the eager and the lazy
freezing strategies are based on some notion of what's important for
the table, based on its known characteristics, and based on what seems
likely to happen to the table in the future (the next VACUUM, at least).
I'm not completely sure how many strategies we'll end up needing.
Though it seems like the eager/lazy trade-off is a really important
part of how these strategies will need to work, in general.

(Thinks some more) I guess that such an alternative freezing strategy
would probably have to affect the skipping strategy too. It's tricky
to tease apart because it breaks the idea that skipping strategy and
freezing strategy are basically distinct questions. That is a factor
that makes it a bit more complicated to discuss. In any case, as I
said, I have an open mind about alternative freezing strategies beyond
the 2 basic lazy/eager freezing strategies from the patch.

What if we thought about this more like a "background freezer". It
would keep track of the total number of unfrozen pages in the system,
and freeze them at some kind of controlled/adaptive rate.

I like the idea of storing metadata in shared memory, and of
scheduling and deprioritizing running autovacuums. Being able to slow
down or even totally halt a given autovacuum worker without much
consequence is enabled by the VM snapshot concept.

That said, this seems like future work to me. Worth discussing, but
trying to keep out of scope in the first version of this that is
committed.

Regular autovacuum's job would be to keep advancing relfrozenxid for
all tables and to do other cleanup, and the background freezer's job
would be to keep the absolute number of unfrozen pages under some
limit. Conceptually those two jobs seem different to me.

The problem with making it such a sharp distinction is that it can be
very useful to manage costs by making it the job of VACUUM to do both
-- we can avoid dirtying the same page multiple times.

I think that we can accomplish the same thing by giving VACUUM more
freedom to do either more or less work, based on the observed
characteristics of the table, and some sense of how costs will tend to
work over time, across multiple distinct VACUUM operations. In
practice that might end up looking very similar to what you describe.

It seems undesirable for VACUUM to ever be too sure of itself -- the
information that triggers autovacuum may not be particularly reliable,
which can be solved to some degree by making as many decisions as
possible at runtime, dynamically, based on the most authoritative and
recent information. Delaying committing to one particular course of
action isn't always possible, but when it is possible (and not too
expensive) we should do it that way on general principle.

Also, regarding patch v1-0001-Add-page-level-freezing, do you think
that narrows the conceptual gap between an all-visible page and an
all-frozen page?

Yes, definitely. However, I don't think that we can just get rid of
the distinction completely -- though I did think about it for a while.
For one thing we need to be able to handle cases where
heap_lock_tuple() modifies an all-frozen page, leaving it merely
all-visible, without the page becoming completely unskippable to every
VACUUM operation.

--
Peter Geoghegan

#7 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#1)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 2022-08-25 at 14:21 -0700, Peter Geoghegan wrote:

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Recap
=====

The main high level goal of this work is to avoid painful, disruptive
antiwraparound autovacuums (and other aggressive VACUUMs) that do way
too much "catch up" freezing, all at once

I agree with the motivation: that keeping around a lot of deferred work
(unfrozen pages) is risky, and that administrators would want a way to
control that risk.

The solution involves more changes to the philosophy and mechanics of
vacuum than I would expect, though. For instance, VM snapshotting,
page-level-freezing, and a cost model all might make sense, but I don't
see why they are critical for solving the problem above. I think I'm
still missing something. My mental model is closer to the bgwriter and
checkpoint_completion_target.

Allow me to make a naive counter-proposal (not a real proposal, just so
I can better understand the contrast with your proposal):

* introduce a reloption unfrozen_pages_target (default -1, meaning
  infinity, which is the current behavior)
* introduce two fields to LVRelState: n_pages_frozen and
  delay_skip_count, both initialized to zero
* when marking a page frozen: n_pages_frozen++
* when vacuum begins:

    if (unfrozen_pages_target >= 0 &&
        current_unfrozen_page_count > unfrozen_pages_target)
    {
        vacrel->delay_skip_count = current_unfrozen_page_count -
            unfrozen_pages_target;
        /* ?also use more aggressive freezing thresholds? */
    }

* in lazy_scan_skip(), have a final check:

    if (vacrel->n_pages_frozen < vacrel->delay_skip_count)
    {
        break;
    }

I know there would still be some problem cases, but to me it seems like
we solve 80% of the problem in a couple dozen lines of code.

a. Can you clarify some of the problem cases, and why it's worth
spending more code to fix them?

b. How much of your effort is groundwork for related future
improvements? If it's a substantial part, can you explain in that
larger context?

c. Can some of your patches be separated into independent discussions?
For instance, patch 1 has been discussed in other threads and seems
independently useful, and I don't see the current work as dependent on
it. Patch 4 also seems largely independent.

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

Regards,
Jeff Davis

In reply to: Jeff Davis (#7)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 11:11 AM Jeff Davis <pgsql@j-davis.com> wrote:

The solution involves more changes to the philosophy and mechanics of
vacuum than I would expect, though. For instance, VM snapshotting,
page-level-freezing, and a cost model all might make sense, but I don't
see why they are critical for solving the problem above.

I certainly wouldn't say that they're critical. I tend to doubt that I
can be perfectly crisp about what the exact relationship is between
each component in isolation and how it contributes towards addressing
the problems we're concerned with.

I think I'm
still missing something. My mental model is closer to the bgwriter and
checkpoint_completion_target.

That's not a bad starting point. The main thing that that mental model
is missing is how the timeframes work with VACUUM, and the fact that
there are multiple timeframes involved (maybe the system's vacuuming
work could be seen as having one timeframe at the highest level, but
it's more of a fractal picture overall). Checkpoints just don't take
that long, and checkpoint duration has a fairly low variance (barring
pathological performance problems).

You only have so many buffers that you can dirty, too -- it's a
self-limiting process. This is even true when (for whatever reason)
the checkpoint_completion_target logic just doesn't do what it's
supposed to do. There is more or less a natural floor on how bad
things can get, so you don't have to invent a synthetic floor at all.
LSM-based DB systems like the MyRocks storage engine for MySQL don't
use checkpoints at all -- the closest analog is compaction, which is
closer to a hybrid of VACUUM and checkpointing than anything else.

The LSM compaction model necessitates adding artificial throttling to
keep the system stable over time [1]. There is a disconnect between
the initial ingest of data, and the compaction process. And so
top-down modelling of costs and benefits with compaction is more
natural with an LSM [2] -- and not a million miles from the strategy
stuff I'm proposing.

Allow me to make a naive counter-proposal (not a real proposal, just so
I can better understand the contrast with your proposal):

I know there would still be some problem cases, but to me it seems like
we solve 80% of the problem in a couple dozen lines of code.

It's not that this statement is wrong, exactly. It's that I believe
that it is all but mandatory for me to ameliorate the downside that
goes with more eager freezing, for example by not doing it at all when
it doesn't seem to make sense. I want to solve the big problem of
freeze debt, without creating any new problems. And if I should also
make things in adjacent areas better too, so much the better.

Why stop at a couple of dozens of lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

a. Can you clarify some of the problem cases, and why it's worth
spending more code to fix them?

For one thing if we're going to do a lot of extra freezing, we really
want to "get credit" for it afterwards, by updating relfrozenxid to
reflect the new oldest extant XID, and so avoid getting an
antiwraparound VACUUM early, in the near future.

That isn't strictly true, of course. But I think that we at least
ought to have a strong bias in the direction of updating relfrozenxid,
having decided to do significantly more freezing in some particular
VACUUM operation.

b. How much of your effort is groundwork for related future
improvements? If it's a substantial part, can you explain in that
larger context?

Hard to say. It's true that the idea of VM snapshots is quite general,
and could have been introduced in a number of different ways. But I
don't think that that should count against it. It's also not something
that seems contrived or artificial -- it's at least as good of a
reason to add VM snapshots as any other I can think of.

Does it really matter if this project is the freeze debt project, or
the VM snapshot project? Do we even need to decide which one it is
right now?

c. Can some of your patches be separated into independent discussions?
For instance, patch 1 has been discussed in other threads and seems
independently useful, and I don't see the current work as dependent on
it.

I simply don't know if I can usefully split it up just yet.

Patch 4 also seems largerly independent.

Patch 4 directly compensates for a problem created by the earlier
patches. The patch series as a whole isn't supposed to ameliorate the
problem of MultiXacts being allocated in VACUUM. It only needs to
avoid making the situation any worse than it is today IMV (I suspect
that the real fix is to make the VACUUM FREEZE command not tune
vacuum_freeze_min_age).

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Since relfrozenxid advancement fundamentally works at the table level,
it seems natural to make it a top-down, VACUUM-level thing -- even
within non-aggressive VACUUMs (I guess it already meets that
description in aggressive VACUUMs). And since we really want to
advance relfrozenxid when we do extra freezing (for the reasons I just
went into), it seems natural to me to view it as one problem. I accept
that it's not clear cut, though.

[1]: https://docs.google.com/presentation/d/1WgP-SlKay5AnSoVDSvOIzmu7edMmtYhdywoa0oAR4JQ/edit?usp=sharing
[2]: https://disc-projects.bu.edu/compactionary/research.html
--
Peter Geoghegan

In reply to: Peter Geoghegan (#8)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

d. Can you help give me a sense of scale of the problems solved by
visibilitymap snapshots and the cost model? Do those need to be in v1?

I'm not sure. I think that having certainty that we'll be able to scan
only so many pages up-front is very broadly useful, though. Plus it
removes the SKIP_PAGES_THRESHOLD stuff, which was intended to enable
relfrozenxid advancement in non-aggressive VACUUMs, but does so in a
way that results in scanning many more pages needlessly. See commit
bf136cf6, which added the SKIP_PAGES_THRESHOLD stuff back in 2009,
shortly after the visibility map first appeared.

Here is a better example:

Right now the second patch adds both VM snapshots and the skipping
strategy stuff. The VM snapshot is used in the second patch, as a
source of reliable information about how we need to process the table,
in terms of the total number of scanned_pages -- which drives our
choice of strategy. Importantly, we can assess the question of which
skipping strategy to take (in non-aggressive VACUUM) based on 100%
accurate information about how many *extra* pages we'll have to scan
in the event of being eager (i.e. in the event that we prioritize
early relfrozenxid advancement over skipping some pages). Importantly,
that cannot change later on, since VM snapshots are immutable --
everything is locked in. That already seems quite valuable to me.
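
As a rough illustration of the kind of up-front decision this enables
(not the patch's actual code -- the 5% figure just mirrors the
SKIPALLVIS_THRESHOLD_PAGES constant, and the counts are assumed to
come from the VM snapshot the same way visibilitymap_count() reports
them, with all-frozen pages counted within all-visible):

#include <stdbool.h>

typedef unsigned int BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/*
 * Sketch only: decide up front whether a non-aggressive VACUUM should be
 * eager (scan every all-visible page, enabling relfrozenxid advancement)
 * or lazy (skip them all).  The extra cost of eagerness is exactly the
 * number of all-visible-but-not-all-frozen pages in the snapshot.
 */
static bool
scan_all_visible_pages(BlockNumber rel_pages,
                       BlockNumber all_visible,
                       BlockNumber all_frozen)
{
    BlockNumber extra_scanned_pages = all_visible - all_frozen;

    /* Be eager when the added scan cost is under ~5% of the table */
    return extra_scanned_pages < (BlockNumber) (rel_pages * 0.05);
}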

This general concept could be pushed a lot further without great
difficulty. Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns. We could allow VACUUM to
"change its mind about skipping" in cases where it initially thought
that skipping was the best strategy, only to discover much later on
that that was the wrong choice after all.

A huge amount of new, reliable information will come to light from
scanning the heap rel. In particular, the current value of
vacrel->NewRelfrozenXid seems like it would be particularly
interesting when the time came to consider if a second scan made sense
-- if NewRelfrozenXid is a recent-ish value already, then that argues
for finishing off the all-visible pages in a second heap pass, with
the aim of setting relfrozenxid to a similarly recent value when it
happens to be cheap to do so.

The actual process of scanning precisely those all-visible pages that
were skipped the first time around during a second call to
lazy_scan_heap() can be implemented in the obvious way: by teaching
the VM snapshot infrastructure/lazy_scan_skip() to arrange for pages
that were skipped the first time around to get scanned during the
second pass over the heap instead. Also, those pages that were scanned the
first time around can/must be skipped on our second pass (excluding
all-frozen pages, which won't be scanned in either heap pass).

I've used the term "second heap pass" here, but that term is slightly
misleading. The final outcome of this whole process is that every heap
page that the vmsnap says VACUUM will need to scan in order for it to
be able to safely advance relfrozenxid will be scanned, precisely
once. The overall order that the heap pages are scanned in will of
course differ from the simple case, but I don't think that it makes
very much difference. In reality there will have only been one heap
pass, consisting of two distinct phases. No individual heap page will
ever be considered for pruning/freezing more than once, no matter
what. This is just a case of *reordering* work. Immutability makes
reordering work easy in general.
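
To sketch the shape of that "change its mind" decision (a hypothetical
helper with illustrative thresholds, and plain integer math standing in
for real TransactionId arithmetic -- none of this is in the patches):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;     /* stand-in for PostgreSQL's TransactionId */
typedef unsigned int BlockNumber;

/*
 * Sketch only: after the first phase, decide whether scanning the
 * all-visible pages we skipped is worth it.  A recent-ish NewRelfrozenXid
 * means the payoff (setting relfrozenxid to a similarly recent value) is
 * high; the cost is bounded by the number of pages we skipped.
 */
static bool
second_phase_worthwhile(TransactionId next_xid,
                        TransactionId NewRelfrozenXid,
                        BlockNumber skipped_all_visible_pages,
                        BlockNumber rel_pages)
{
    uint32_t    would_be_age = next_xid - NewRelfrozenXid;
    bool        payoff_is_high = (would_be_age < 5 * 1000 * 1000);
    bool        cost_is_low = (skipped_all_visible_pages < rel_pages / 20);

    return payoff_is_high && cost_is_low;
}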

--
Peter Geoghegan

#10 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#9)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:

Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns.

I like this idea.

Regards,
Jeff Davis

In reply to: Jeff Davis (#10)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 9:37 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2022-08-30 at 18:50 -0700, Peter Geoghegan wrote:

Since VM snapshots are immutable, it should be relatively
easy to have the implementation make its final decision on skipping
only *after* lazy_scan_heap() returns.

I like this idea.

I was hoping that you would. I imagine that this idea (with minor
variations) could enable an approach that's much closer to what you
were thinking of: one that mostly focuses on controlling the number of
unfrozen pages, and not so much on advancing relfrozenxid early, just
because we can and we might not get another chance for a long time. In
other words your idea of a design that can freeze more during a
non-aggressive VACUUM, while still potentially skipping all-visible
pages.

I said earlier on that we ought to at least have a strong bias in the
direction of advancing relfrozenxid in larger tables, especially when
we decide to freeze whole pages more eagerly -- we only get one chance
to advance relfrozenxid per VACUUM, and those opportunities will
naturally be few and far between. We cannot really justify all this
extra freezing if it doesn't prevent antiwraparound autovacuums. That
was more or less my objection to going in that direction.

But if we can more or less double the number of opportunities to at
least ask the question "is now a good time to advance relfrozenxid?"
without really paying much for keeping this option open (and I think
that we can), my concern about relfrozenxid advancement becomes far
less important. Just being able to ask that question is significantly
less rare and precious. Plus we'll probably be able to make
significantly better decisions about relfrozenxid overall with the
"second phase because I changed my mind about skipping" concept in
place.

--
Peter Geoghegan

#12 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#8)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-08-30 at 13:45 -0700, Peter Geoghegan wrote:

It's that I believe
that it is all but mandatory for me to ameliorate the downside that
goes with more eager freezing, for example by not doing it at all when
it doesn't seem to make sense. I want to solve the big problem of
freeze debt, without creating any new problems. And if I should also
make things in adjacent areas better too, so much the better.

That clarifies your point. It's still a challenge for me to reason
about which of these potential new problems really need to be solved in
v1, though.

Why stop at a couple of dozens of lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

I don't think that would actually solve the unbounded buildup of
unfrozen pages. It would still be possible for pages to be marked all
visible before being frozen, and then end up being skipped until an
aggressive vacuum is forced, right?

Or did you mean vacuum_freeze_table_age?

Regards,
Jeff Davis

In reply to: Jeff Davis (#12)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Aug 30, 2022 at 11:28 PM Jeff Davis <pgsql@j-davis.com> wrote:

That clarifies your point. It's still a challenge for me to reason
about which of these potential new problems really need to be solved in
v1, though.

I don't claim to understand it that well myself -- not just yet.
I feel like I have the right general idea, but the specifics
aren't all there (which is very often the case for me at this
point in the cycle). That seems like a good basis for further
discussion.

It's going to be quite a few months before some version of this
patchset is committed, at the very earliest. Obviously these are
questions that need answers, but the process of getting to those
answers is a significant part of the work itself IMV.

Why stop at a couple of dozens of lines of code? Why not just change
the default of vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age to 0?

I don't think that would actually solve the unbounded buildup of
unfrozen pages. It would still be possible for pages to be marked all
visible before being frozen, and then end up being skipped until an
aggressive vacuum is forced, right?

With the 15 work in place, and with the insert-driven autovacuum
behavior from 13, it is likely that this would be enough to avoid all
antiwraparound vacuums in an append-only table. There is still one
case where we can throw away the opportunity to advance relfrozenxid
during non-aggressive VACUUMs for no good reason -- I haven't fixed
every such case just yet. But the remaining case (which is in
lazy_scan_skip()) is very narrow.

With vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age =
0, any page that is eligible to be set all-visible is also eligible to
have its tuples frozen and be set all-frozen instead, immediately.
When it isn't then we'll scan it in the next VACUUM anyway.
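
Roughly speaking (and glossing over the wraparound/clamping guards that
the real vacuum_set_xid_limits() code applies), that's because
FreezeLimit is just OldestXmin backed up by vacuum_freeze_min_age:

#include <stdint.h>

typedef uint32_t TransactionId;     /* stand-in for PostgreSQL's TransactionId */

/*
 * Sketch only, with wraparound handling elided: with vacuum_freeze_min_age
 * set to 0, FreezeLimit collapses into OldestXmin, so any XID old enough
 * for its page to be considered all-visible is also old enough to trigger
 * freezing -- the page can be marked all-frozen right away.
 */
static TransactionId
compute_freeze_limit(TransactionId OldestXmin, uint32_t freeze_min_age)
{
    return OldestXmin - freeze_min_age;
}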

Actually I'm also ignoring some subtleties with Multis that could make
this not quite happen, but again, that's only a super obscure corner case.
The idea that just setting vacuum_freeze_min_age = 0 and
vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
in spirit. You don't need to touch vacuum_freeze_table_age (if you did
then you'd get aggressive VACUUMs, and one goal here is to avoid
those whenever possible -- especially aggressive antiwraparound
autovacuums).

--
Peter Geoghegan

In reply to: Peter Geoghegan (#1)
4 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached patch series is a completely overhauled version of earlier
work on freezing. Related work from the Postgres 15 cycle became
commits 0b018fab, f3c15cbe, and 44fa8488.

Attached is v2.

This is just to keep CFTester happy, since v1 now has conflicts when
applied against HEAD. There are no notable changes in this v2 compared
to v1.

--
Peter Geoghegan

Attachments:

v2-0001-Add-page-level-freezing-to-VACUUM.patch (application/octet-stream)
From eb8bb8275417ddd8ed127a90575493457d069bbd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v2 1/4] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit).  For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
---
 src/include/access/heapam.h          |   4 +-
 src/include/access/heapam_xlog.h     |  37 +++++-
 src/backend/access/heap/heapam.c     | 171 ++++++++++++++++-----------
 src/backend/access/heap/vacuumlazy.c | 152 ++++++++++++++----------
 4 files changed, 230 insertions(+), 134 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index abf62d9df..c201f8ae6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,8 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									TransactionId limit_xid, MultiXactId limit_multi,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 1705e736b..40556271d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -330,6 +330,38 @@ typedef struct xl_heap_freeze_tuple
 	uint8		frzflags;
 } xl_heap_freeze_tuple;
 
+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determing whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.
+ *
+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
+ * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ */
+typedef struct page_frozenxid_tracker
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze;
+
+	/* Values used when page is to be frozen based on freeze plans */
+	TransactionId relfrozenxid_out;
+	MultiXactId relminmxid_out;
+
+	/* Used by caller for '!freeze' pages */
+	TransactionId relfrozenxid_nofreeze_out;
+	MultiXactId relminmxid_nofreeze_out;
+
+} page_frozenxid_tracker;
+
 /*
  * This is what we need to know about a block being frozen during vacuum
  *
@@ -409,10 +441,11 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId relminmxid,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
+									  TransactionId limit_xid,
+									  MultiXactId limit_multi,
 									  xl_heap_freeze_tuple *frz,
 									  bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  page_frozenxid_tracker *xtrack);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *xlrec_tp);
 extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 588716606..d6aea370f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6431,26 +6431,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * will be totally frozen after these operations are performed and false if
  * more freezing will eventually be required.
  *
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * Caller must initialize xtrack fields for page as a whole before calling
+ * here with first tuple for the page.  See page_frozenxid_tracker comments.
+ *
+ * Caller must set frz->offset itself if heap_execute_freeze_tuple is called.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6463,34 +6452,46 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  TransactionId limit_xid, MultiXactId limit_multi,
 						  xl_heap_freeze_tuple *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  page_frozenxid_tracker *xtrack)
 {
 	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
+	/*
+	 * limit_xid *must* be <= cutoff_xid, to ensure that any XID older than it
+	 * can neither be running nor seen as running by any open transaction.
+	 * This ensures that we only freeze XIDs that are safe to freeze -- those
+	 * that are already unambiguously visible to everybody.
+	 *
+	 * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+	 * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
+	 */
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
+
 	frz->frzflags = 0;
 	frz->t_infomask2 = tuple->t_infomask2;
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen iff our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for relfrozenxid_out handling for already-frozen xmin */
+	}
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6499,8 +6500,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoff_xid);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
@@ -6514,8 +6515,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 
@@ -6526,7 +6527,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
+	 * Make sure to keep heap_tuple_would_freeze in sync with this.  It needs
+	 * to return true for any tuple that we would force to be frozen here.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6534,7 +6536,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = xtrack->relfrozenxid_out;
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
@@ -6553,8 +6555,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6582,10 +6584,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6613,10 +6615,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			if (MultiXactIdPrecedes(xid, xtrack->relminmxid_out))
+				xtrack->relminmxid_out = xid;
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 		}
 		else
 		{
@@ -6656,8 +6658,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 	else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
@@ -6673,6 +6675,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(changed);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6703,11 +6710,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we ignore the cutoff_xid and just always perform the
 		 * freeze operation.  The oldest release in which such a value can
 		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
+		 * was removed in PostgreSQL 9.0.
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
@@ -6721,18 +6724,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
 			changed = true;
+			xtrack->freeze = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen (provided caller executes freeze plan for the page)
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
+
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < limit_xid (and MXIDs < limit_multi) must never remain
+	 */
+	if (!xtrack->freeze && !(xmin_already_frozen && xmax_already_frozen))
+		xtrack->freeze =
+			heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+									&xtrack->relfrozenxid_nofreeze_out,
+									&xtrack->relminmxid_nofreeze_out);
+
 	return changed;
 }
 
@@ -6785,14 +6806,20 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	xl_heap_freeze_tuple frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	page_frozenxid_tracker dummy;
+
+	dummy.freeze = true;
+	dummy.relfrozenxid_out = cutoff_xid;
+	dummy.relminmxid_out = cutoff_multi;
+	dummy.relfrozenxid_nofreeze_out = cutoff_xid;
+	dummy.relminmxid_nofreeze_out = cutoff_multi;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
+										  cutoff_xid, cutoff_multi,
 										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+										  &dummy);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7218,17 +7245,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * heap_tuple_would_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * force freezing of any of the XID/XMID fields from the tuple, given the same
+ * limits.  We must also deal with dead tuples here, since (xmin, xmax, xvac)
+ * fields could be processed by pruning away the whole tuple instead of
+ * freezing.
+ *
+ * Note: VACUUM refers to limit_xid and limit_multi as "FreezeLimit" and
+ * "MultiXactCutoff" respectively.  These should not be confused with the
+ * absolute cutoffs for freezing.  We just determine whether caller's tuple
+ * and limits trigger heap_prepare_freeze_tuple to force freezing.
  *
  * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
  * like the heap_prepare_freeze_tuple arguments that they're based on.  We
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						TransactionId limit_xid, MultiXactId limit_multi,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
@@ -7242,7 +7275,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	{
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 
@@ -7259,7 +7292,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		/* xmax is a non-permanent XID */
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
@@ -7282,7 +7315,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
+		if (MultiXactIdPrecedes(multi, limit_multi))
 			would_freeze = true;
 
 		/* need to check whether any member of the mxact is old */
@@ -7295,7 +7328,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			Assert(TransactionIdIsNormal(xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(xid, limit_xid))
 				would_freeze = true;
 		}
 		if (nmembers > 0)
@@ -7309,7 +7342,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		{
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			would_freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 21eaf1d8c..07630c34e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoffs for freezing and pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -507,6 +508,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
@@ -1554,8 +1556,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				recently_dead_tuples;
 	int			nnewlpdead;
 	int			nfrozen;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	page_frozenxid_tracker xtrack;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
 
@@ -1571,8 +1573,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.freeze = false;
+	xtrack.relfrozenxid_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_out = vacrel->NewRelminMxid;
+	xtrack.relfrozenxid_nofreeze_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_nofreeze_out = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -1625,27 +1630,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether the page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1777,10 +1778,12 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
 									  vacrel->FreezeLimit,
 									  vacrel->MultiXactCutoff,
 									  &frozen[nfrozen], &tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
 		{
 			/* Will execute freeze below */
 			frozen[nfrozen++].offset = offnum;
@@ -1801,9 +1804,33 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.freeze || nfrozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_out;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_nofreeze_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_nofreeze_out;
+
+		/* Might still set page all-visible, but never all-frozen */
+		nfrozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
@@ -1811,7 +1838,7 @@ retry:
 	 */
 	if (nfrozen > 0)
 	{
-		Assert(prunestate->hastup);
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		/*
 		 * At least one tuple with storage needs to be frozen -- execute that
@@ -1841,7 +1868,7 @@ retry:
 		{
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+			recptr = log_heap_freeze(rel, buf, vacrel->NewRelfrozenXid,
 									 frozen, nfrozen);
 			PageSetLSN(page, recptr);
 		}
@@ -1850,6 +1877,41 @@ retry:
 	}
 
 	/*
+	 * Now save details of the LP_DEAD items from the page in vacrel
+	 */
+	if (lpdead_items > 0)
+	{
+		VacDeadItems *dead_items = vacrel->dead_items;
+		ItemPointerData tmp;
+
+		vacrel->lpdead_item_pages++;
+
+		ItemPointerSetBlockNumber(&tmp, blkno);
+
+		for (int i = 0; i < lpdead_items; i++)
+		{
+			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+			dead_items->items[dead_items->num_items++] = tmp;
+		}
+
+		Assert(dead_items->num_items <= dead_items->max_items);
+		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->has_lpdead_items = true;
+		prunestate->all_visible = false;
+	}
+
+	/* Finally, add page-local counts to whole-VACUUM counts */
+	vacrel->tuples_deleted += tuples_deleted;
+	vacrel->lpdead_items += lpdead_items;
+	vacrel->live_tuples += live_tuples;
+	vacrel->recently_dead_tuples += recently_dead_tuples;
+
+	/*
+	 * We're done, but assert that some postconditions hold before returning.
+	 *
 	 * The second pass over the heap can also set visibility map bits, using
 	 * the same approach.  This is important when the table frequently has a
 	 * few old LP_DEAD items on each page by the time we get to it (typically
@@ -1873,7 +1935,7 @@ retry:
 			Assert(false);
 
 		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1885,38 +1947,6 @@ retry:
 			   cutoff == prunestate->visibility_cutoff_xid);
 	}
 #endif
-
-	/*
-	 * Now save details of the LP_DEAD items from the page in vacrel
-	 */
-	if (lpdead_items > 0)
-	{
-		VacDeadItems *dead_items = vacrel->dead_items;
-		ItemPointerData tmp;
-
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
-		vacrel->lpdead_item_pages++;
-
-		ItemPointerSetBlockNumber(&tmp, blkno);
-
-		for (int i = 0; i < lpdead_items; i++)
-		{
-			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
-			dead_items->items[dead_items->num_items++] = tmp;
-		}
-
-		Assert(dead_items->num_items <= dead_items->max_items);
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_items->num_items);
-	}
-
-	/* Finally, add page-local counts to whole-VACUUM counts */
-	vacrel->tuples_deleted += tuples_deleted;
-	vacrel->lpdead_items += lpdead_items;
-	vacrel->live_tuples += live_tuples;
-	vacrel->recently_dead_tuples += recently_dead_tuples;
 }
 
 /*
-- 
2.34.1

v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (application/octet-stream)
From 4998d2e383c883810844eb48b4e589e9717c1b3b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v2 2/4] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now our policy around
skipping all-visible pages is exactly the same condition as whether or
not it's safe to advance relfrozenxid later on; nothing is left to
chance.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
---
 src/include/access/visibilitymap.h      |   7 +
 src/backend/access/heap/vacuumlazy.c    | 325 +++++++++++++++---------
 src/backend/access/heap/visibilitymap.c | 162 ++++++++++++
 3 files changed, 369 insertions(+), 125 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 07630c34e..9ba975c1a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -146,8 +146,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -171,13 +173,14 @@ typedef struct LVRelState
 	TransactionId OldestXmin;
 	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
+	/* Snapshot of visibility map, taken just after OldestXmin acquired */
+	vmsnapshot *vmsnap;
 	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -248,10 +251,12 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
 								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+								  bool *next_unskippable_allvis);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -314,7 +319,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
+				skipallfrozen,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
@@ -322,6 +327,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -367,7 +375,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
 
-	skipwithvm = true;
+	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
@@ -375,7 +383,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * visibility map (even those set all-frozen)
 		 */
 		aggressive = true;
-		skipwithvm = false;
+		skipallfrozen = false;
 	}
 
 	/*
@@ -400,20 +408,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -440,7 +434,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	/* Set skipallvis/skipallfrozen provisionally (before lazy_scan_strategy) */
+	vacrel->skipallvis = (!aggressive && skipallfrozen);
+	vacrel->skipallfrozen = skipallfrozen;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -505,11 +501,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
 	 * XIDs must at least be considered for freezing (though not necessarily
 	 * frozen) during its scan.
+	 *
+	 * Also acquire a read-only snapshot of the visibility map at this point.
+	 * We can work off of the snapshot when deciding which heap pages are safe
+	 * to skip.  This approach allows VACUUM to avoid scanning pages whose VM
+	 * bit gets unset concurrently, which is important with large tables that
+	 * take a long time to VACUUM.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
 	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
 	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
@@ -517,7 +521,37 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * Use visibility map snapshot to determine whether we'll skip all-visible
+	 * pages using vmsnap in lazy_scan_heap
+	 */
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   all_visible, all_frozen);
+	if (verbose)
+	{
+		char	   *msgfmt;
+		StringInfoData buf;
+
+		Assert(!IsAutoVacuumWorkerProcess());
+
+		if (aggressive)
+			msgfmt = _("aggressively vacuuming \"%s.%s.%s\"");
+		else
+			msgfmt = _("vacuuming \"%s.%s.%s\"");
+
+		initStringInfo(&buf);
+		appendStringInfo(&buf, msgfmt, get_database_name(MyDatabaseId),
+						 vacrel->relnamespace, vacrel->relname);
+
+		ereport(INFO,
+				(errmsg_internal("%s", buf.data),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
+		pfree(buf.data);
+	}
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -534,6 +568,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -580,12 +615,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -630,6 +664,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -853,8 +890,7 @@ lazy_scan_heap(LVRelState *vacrel)
 				next_fsm_block_to_vacuum = 0;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
+	bool		next_unskippable_allvis;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -868,43 +904,33 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	/* Set up an initial range of skippable blocks using VM snapshot */
+	next_unskippable_block = lazy_scan_skip(vacrel, 0,
+											&next_unskippable_allvis);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
-
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
+		if (blkno < next_unskippable_block)
 		{
 			/* Last page always scanned (may need to set nonempty_pages) */
 			Assert(blkno < rel_pages - 1);
 
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
+			/* Skip (don't scan) this page */
+			continue;
 		}
 
+		/*
+		 * Can't skip this page safely.  Must scan the page.  But determine
+		 * the next skippable range after the page first.
+		 */
+		all_visible_according_to_vmsnap = next_unskippable_allvis;
+		next_unskippable_block = lazy_scan_skip(vacrel, blkno + 1,
+												&next_unskippable_allvis);
+
 		vacrel->scanned_pages++;
 
 		/* Report as block scanned, update error traceback information */
@@ -1113,10 +1139,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Handle setting visibility map bit based on information from our VM
+		 * snapshot, and from prunestate
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1145,11 +1171,11 @@ lazy_scan_heap(LVRelState *vacrel)
 
 		/*
 		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * the page-level bit is clear.  However, lazy_scan_skip works off of
+		 * a snapshot of the VM that might be quite old by now.  Recheck with
+		 * a buffer lock held before concluding that the VM is corrupt.
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1188,7 +1214,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1281,7 +1307,97 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
+ *
+ * Determines if the ongoing VACUUM operation should skip all-visible pages
+ * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ *
+ * Returns final scanned_pages for the VACUUM operation.
+ */
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
+
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
+	if (!vacrel->skipallfrozen)
+	{
+		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		return rel_pages;
+	}
+	else if (vacrel->aggressive)
+		Assert(!vacrel->skipallvis);
+	else
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		vacrel->skipallvis = nextra >= nextra_threshold;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+		return scanned_pages_skipallvis;
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* keep compiler quiet */
+}
+
+/*
+ *	lazy_scan_skip() -- set up the next range of skippable blocks.
  *
  * lazy_scan_heap() calls here every time it needs to set up a new range of
  * blocks to skip via the visibility map.  Caller passes the next block in
@@ -1289,34 +1405,25 @@ lazy_scan_heap(LVRelState *vacrel)
  * no skippable blocks we just return caller's next_block.  The all-visible
  * status of the returned block is set in *next_unskippable_allvis for caller,
  * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * can be for rel's last page, and when DISABLE_PAGE_SKIPPING is used.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * This function operates on a snapshot of the visibility map that was taken
+ * just after OldestXmin was acquired.  VACUUM only needs to scan all pages
+ * whose tuples might contain XIDs < OldestXmin (or MXIDs < OldestMxact),
+ * which excludes pages treated as all-frozen here (pages >= rel_pages, too).
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block,
+			   bool *next_unskippable_allvis)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				next_unskippable_block = next_block;
 
 	*next_unskippable_allvis = true;
 	while (next_unskippable_block < rel_pages)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_unskippable_block);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
@@ -1332,55 +1439,23 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
 		if (next_unskippable_block == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
 		next_unskippable_block++;
-		nskippable_blocks++;
-	}
-
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
 	}
 
 	return next_unskippable_block;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index ed72eb7b6..6848576fd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	char		vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -368,6 +390,146 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snap_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of heap pages whose VM bit is concurrently unset.  VACUUM
+ * prefers to leave such pages to be scanned by the next VACUUM operation.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) + BLCKSZ * nvmpages);
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all transactions, or even all frozen,
+ * according to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
-- 
2.34.1
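
(Aside, to illustrate how the new VM snapshot interface from v2-0002 fits
together: the minimal caller sketch below is not part of the patch, and the
helper name is made up.  It just shows the intended call sequence -- take a
snapshot once, classify heap pages from the snapshot rather than from the
live VM, then release it.  The real lazy_scan_strategy/lazy_scan_skip code
also handles the "always scan the last page" rule and DISABLE_PAGE_SKIPPING,
which are omitted here.)

/* Illustrative sketch only -- not part of v2-0002 */
static void
vmsnap_usage_sketch(Relation rel)
{
	BlockNumber rel_pages = RelationGetNumberOfBlocks(rel);
	BlockNumber all_visible,
				all_frozen;
	vmsnapshot *vmsnap;

	/* Copy the VM once, right after VACUUM's cutoffs are established */
	vmsnap = visibilitymap_snap(rel, rel_pages, &all_visible, &all_frozen);

	for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
	{
		uint8		mapbits = visibilitymap_snap_status(vmsnap, blkno);

		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
			continue;			/* skippable under either skipping policy */
		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
			continue;			/* skippable only under "skipallvis" */

		/* ... page must be scanned ... */
	}

	visibilitymap_snap_release(vmsnap);
}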

Attachment: v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch
From ad713713814b380bd20cd0ce8e0a33a3b4fa572d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v2 3/4] Add eager freezing strategy to VACUUM.

Avoid large build-ups of unfrozen all-visible pages by making
non-aggressive VACUUMs freeze pages proactively in tables where eager
freezing is deemed appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach (actually, we always
use eager freezing in aggressive VACUUMs, though they are expected to be
much rarer now).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have aggressive/antiwraparound
autovacuums for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.
---
 src/include/access/heapam_xlog.h              |  8 +-
 src/include/commands/vacuum.h                 |  4 +
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++
 src/backend/access/heap/heapam.c              |  8 +-
 src/backend/access/heap/vacuumlazy.c          | 80 ++++++++++++++++---
 src/backend/commands/vacuum.c                 |  4 +
 src/backend/postmaster/autovacuum.c           | 10 +++
 src/backend/utils/misc/guc.c                  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 15 ++++
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++
 13 files changed, 155 insertions(+), 18 deletions(-)

diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 40556271d..9ea1db505 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -345,7 +345,11 @@ typedef struct xl_heap_freeze_tuple
  * pg_class tuple.
  *
  * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
- * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ * relminmxid_nofreeze_out must also be maintained.  If vacuumlazy.c caller
+ * opts to not execute freeze plans produced by heap_prepare_freeze_tuple for
+ * its own reasons, then new relfrozenxid and relminmxid values must reflect
+ * that that choice was made.  (This is only safe when 'freeze' is still unset
+ * after the final heap_prepare_freeze_tuple call for the page.)
  */
 typedef struct page_frozenxid_tracker
 {
@@ -356,7 +360,7 @@ typedef struct page_frozenxid_tracker
 	TransactionId relfrozenxid_out;
 	MultiXactId relminmxid_out;
 
-	/* Used by caller for '!freeze' pages */
+	/* Used by caller that opts not to freeze a '!freeze' page */
 	TransactionId relfrozenxid_nofreeze_out;
 	MultiXactId relminmxid_nofreeze_out;
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f..52379f819 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -256,6 +259,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 7dc401cf0..c6d8265cf 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 609329bb2..f4e2109e7 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d6aea370f..699a5acae 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6429,7 +6429,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * WAL-log what we would need to do, and return true.  Return false if nothing
  * is to be changed.  In addition, set *totally_frozen to true if the tuple
  * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * more freezing will eventually be required (assuming page is to be frozen).
+ *
+ * Although this interface is primarily tuple-based, caller decides on whether
+ * or not to freeze the page as a whole.  We'll often help caller to prepare a
+ * complete "freeze plan" that it ultimately discards.  However, our caller
+ * doesn't always get to choose; it must freeze when xtrack.freeze is set
+ * here.  This ensures that any XIDs < limit_xid are never left behind.
  *
  * Caller must initialize xtrack fields for page as a whole before calling
  * here with first tuple for the page.  See page_frozenxid_tracker comments.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9ba975c1a..ba54e5767 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -150,6 +150,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Proactively freeze all tuples on pages about to be set all-visible? */
+	bool		allvis_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -252,6 +254,7 @@ typedef struct LVSavedErrInfo
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -327,6 +330,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				eager_threshold,
 				all_visible,
 				all_frozen,
 				scanned_pages,
@@ -366,6 +370,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
 	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
 	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 *
+	 * Also determine our cutoff for applying the eager/all-visible freezing
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
+	 * even during non-aggressive VACUUMs.
 	 */
 	aggressive = vacuum_set_xid_limits(rel,
 									   params->freeze_min_age,
@@ -374,6 +382,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   params->multixact_freeze_table_age,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
+	eager_threshold = params->freeze_strategy_threshold < 0 ?
+		vacuum_freeze_strategy_threshold :
+		params->freeze_strategy_threshold;
 
 	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -523,10 +534,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->NewRelminMxid = OldestMxact;
 
 	/*
-	 * Use visibility map snapshot to determine whether we'll skip all-visible
-	 * pages using vmsnap in lazy_scan_heap
+	 * Use visibility map snapshot to determine freezing strategy, and whether
+	 * we'll skip all-visible pages using vmsnap in lazy_scan_heap
 	 */
-	scanned_pages = lazy_scan_strategy(vacrel,
+	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
 									   all_visible, all_frozen);
 	if (verbose)
 	{
@@ -1307,17 +1318,28 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing totally avoids work that turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ * Antiwraparound VACUUMs of append-only tables should generally be avoided.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * important that relfrozenxid advance in affected tables, which are larger.
+ * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
+ * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
+ * depending on the extra cost - we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1350,21 +1372,48 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
 	else if (vacrel->aggressive)
+	{
+		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
+	}
+	else if (rel_pages >= eager_threshold)
+	{
+		/*
+		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
+		 * GUC-based threshold for eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
+		 */
+		vacrel->allvis_freeze_strategy = true;
+		vacrel->skipallvis = false;
+	}
 	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		vacrel->allvis_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1882,8 +1931,15 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When ongoing VACUUM opted to use the all-visible freezing strategy we
+	 * freeze any page that will become all-visible, making it all-frozen
+	 * instead. (Actually, there are edge-cases where this might not result in
+	 * marking the page all-frozen in the visibility map, but that should have
+	 * only a negligible impact.)
 	 */
-	if (xtrack.freeze || nfrozen == 0)
+	if (xtrack.freeze || nfrozen == 0 ||
+		(vacrel->allvis_freeze_strategy && prunestate->all_visible))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ccde07de..b837e0331 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index b3b1afba8..ff78152b4 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -150,6 +150,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2006,6 +2007,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2013,6 +2015,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2797,6 +2800,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2846,6 +2850,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2868,6 +2877,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9fbbfb1be..06b1bf764 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2736,6 +2736,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502..e701e464e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -694,6 +694,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a5cd4e44c..ba3e012a0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9147,6 +9147,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to use its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c14b2010d..7e684d187 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1680,6 +1680,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.34.1
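
(Another aside, to make the strategy arithmetic concrete: the sketch below is
a simplified restatement of lazy_scan_strategy as of v2-0002 + v2-0003, not
new behavior.  The function name is invented, and the last-page adjustment,
DISABLE_PAGE_SKIPPING, and the aggressive-VACUUM asserts are all left out.)

/* Simplified sketch of lazy_scan_strategy's two decisions */
static void
choose_strategies_sketch(BlockNumber rel_pages,
						 BlockNumber all_visible, BlockNumber all_frozen,
						 BlockNumber eager_threshold, bool aggressive,
						 bool *eager_freeze, bool *skipallvis)
{
	/* all-visible-but-not-all-frozen pages: the extra cost of eager scanning */
	BlockNumber nextra = all_visible - all_frozen;
	BlockNumber nextra_threshold = Max(32, (BlockNumber) (rel_pages * 0.05));

	if (aggressive || rel_pages >= eager_threshold)
	{
		/* Freeze eagerly, and scan every all-visible page */
		*eager_freeze = true;
		*skipallvis = false;
	}
	else
	{
		/* Freeze lazily; skip all-visible pages only when there are many */
		*eager_freeze = false;
		*skipallvis = (nextra >= nextra_threshold);
	}
}

With the default vacuum_freeze_strategy_threshold of 4GB and the default 8kB
block size, eager_threshold works out to 524288 heap pages.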

Attachment: v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch
From 103bb567cf7c0e92a2f352cf2ad19d6d350cb8d7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v2 4/4] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit.  Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations.  We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle.  VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

FreezeMultiXactId() is now eager when it's cheap to process xmax, and
lazy when it's expensive/risky to process xmax (because an expensive
second pass that might result in allocating a new Multi is required).
We make a similar trade-off to the one made by vacuumlazy.c when a
cleanup lock isn't available on some heap page.  We can usually put off
freezing (for the time being) when it's inconvenient to proceed.  The
only downside to this approach is that it necessitates pushing back the
final relfrozenxid/relminmxid value that can be set in pg_class.
---
 src/backend/access/heap/heapam.c | 49 +++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 699a5acae..e18000d81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6111,11 +6111,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
  *
  * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ *
+ * Allocating new MultiXactIds during VACUUM is something that we always
+ * prefer to avoid.  Caller passes us down all the context required to do this
+ * on their behalf: cutoff_xid, cutoff_multi, limit_xid, and limit_multi.
+ *
+ * We must never leave behind XIDs/MXIDs from before the "limits" given.
+ * There is lots of leeway around XIDs/MXIDs >= caller's "limits", though.
+ * When it's possible to inexpensively process xmax right away, we're eager
+ * about it.  Otherwise we're lazy about it -- next time it might be cheap.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
+				  TransactionId limit_xid, MultiXactId limit_multi,
 				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
 {
 	TransactionId xid = InvalidTransactionId;
@@ -6208,13 +6218,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below the limit for Xids, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
+	 *
+	 * We use limit_xid for this (VACUUM's FreezeLimit), rather than using
+	 * cutoff_xid (VACUUM's OldestXmin).  We greatly prefer to avoid a second
+	 * pass over the Multi that results in allocating a new replacement Multi.
 	 */
-
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6225,12 +6238,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP optimization */
 	for (i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		if (TransactionIdPrecedes(members[i].xid, limit_xid))
 		{
 			need_replace = true;
 			break;
@@ -6239,11 +6251,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			temp_xid_out = members[i].xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* The Multi itself must be >= limit_multi, too */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, limit_multi);
+
 	if (!need_replace)
 	{
 		/*
@@ -6260,6 +6271,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_would_freeze will indicate that the tuple must be frozen.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6359,7 +6373,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				/*
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
+				 * cutoff_xid is VACUUM's OldestXmin, which is also the
 				 * initial value used for top-level relfrozenxid_out tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
@@ -6529,7 +6543,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given limit for Xids.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6547,6 +6561,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi,
+									limit_xid, limit_multi,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6587,12 +6602,18 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * MultiXactId, to carry forward two or more original member XIDs.
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
+			 *
+			 * We only do this when we have no choice; heap_tuple_would_freeze
+			 * will definitely force the page to be frozen below.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
 			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 xtrack->relfrozenxid_out));
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
 			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
-- 
2.34.1

In reply to: Peter Geoghegan (#13)
5 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Aug 31, 2022 at 12:03 AM Peter Geoghegan <pg@bowt.ie> wrote:

> Actually I'm also ignoring some subtleties with Multis that could make
> this not quite happen, but again, that's only a super obscure corner case.
> The idea that just setting vacuum_freeze_min_age = 0 and
> vacuum_multixact_freeze_min_age = 0 will be enough is definitely true
> in spirit. You don't need to touch vacuum_freeze_table_age (if you did
> then you'd get aggressive VACUUMs, and one goal here is to avoid
> those whenever possible -- especially aggressive antiwraparound
> autovacuums).

Attached is v3. There is a new patch included here -- v3-0004-*patch,
or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
notable changes.

I decided to work on this now because it seems like it might give a
more complete picture of the high level direction that I'm pushing
towards. Perhaps this will make it easier to review the patch series
as a whole, even. The new patch unifies the concept of antiwraparound
VACUUM with the concept of aggressive VACUUM. Now there is only
antiwraparound and regular VACUUM (uh, barring VACUUM FULL). And now
antiwraparound VACUUMs are not limited to antiwraparound autovacuums
-- a manual VACUUM can also be antiwraparound (that's just the new
name for "aggressive").

We will typically only get antiwraparound vacuuming in a regular
VACUUM when the user goes out of their way to get that behavior.
VACUUM FREEZE is the best example. For the most part the
skipping/freezing strategy stuff has a good sense of what matters
already, and shouldn't need to be guided very often.

The patch relegates vacuum_freeze_table_age to a compatibility option,
making its default -1, meaning "just use autovacuum_freeze_max_age". I
always thought that having two table age based GUCs was confusing.
There was a period between 2006 and 2009 when we had
autovacuum_freeze_max_age, but did not yet have
vacuum_freeze_table_age. This change can almost be thought of as a
return to the simpler user interface that existed at that time. Of
course we must not accidentally resurrect the problems that
vacuum_freeze_table_age was intended to address (see originating commit
65878185); we still need an improved version of the same basic concept.
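
To be concrete, the intended "-1 means defer to autovacuum_freeze_max_age"
behavior amounts to something like the following (a minimal sketch of the
resolution step, with an assumed helper name; not necessarily how the patch
wires it up):

    int
    effective_freeze_table_age(int vacuum_freeze_table_age,
                               int autovacuum_freeze_max_age)
    {
        /* new default of -1 simply defers to autovacuum_freeze_max_age */
        if (vacuum_freeze_table_age < 0)
            return autovacuum_freeze_max_age;

        return vacuum_freeze_table_age;
    }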

The patch more or less replaces the table-age-aggressive-escalation
concept (previously implemented using vacuum_freeze_table_age) with
new logic that makes vacuumlazy.c's choice of skipping strategy *also*
depend upon table age -- it is now one more factor to be considered.
Both costs and benefits are weighed here. We now give just a little
weight to table age at a relatively early stage (XID-age-wise early),
and escalate from there. As the table's relfrozenxid gets older and
older, we give less and less weight to putting off the cost of
freezing. This general approach is possible because the false
dichotomy that is "aggressive vs non-aggressive" has mostly been
eliminated. This makes things less confusing for users and hackers.
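
The exact weighting isn't settled (more on that below). Purely for
illustration, with made-up names, one possible shape is a taper that starts
from the existing 5% SKIPALLVIS_THRESHOLD_PAGES budget and approaches the
whole table as age(relfrozenxid) approaches autovacuum_freeze_max_age:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* Illustrative only -- not the v3 algorithm */
    bool
    should_scan_all_visible(BlockNumber rel_pages,
                            BlockNumber extra_allvisible_pages,
                            double table_age, double freeze_max_age)
    {
        double  threshold = 0.05 * rel_pages;  /* lazy-side starting budget */
        double  age_frac = table_age / freeze_max_age;

        if (age_frac > 1.0)
            age_frac = 1.0;

        /* Give table age more and more weight as relfrozenxid gets older */
        threshold += age_frac * ((double) rel_pages - threshold);

        /* extra_allvisible_pages: all-visible but not all-frozen pages */
        return (double) extra_allvisible_pages <= threshold;
    }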

The details of the skipping-strategy-choice algorithm are still
unsettled in v3 (no real change there). ISTM that the important thing
is still the high level concepts. Jeff was slightly puzzled by the
emphasis placed on the cost model/strategy stuff, at least at one
point. Hopefully my intent will be made clearer by the ideas featured
in the new patch. The skipping strategy decision making process isn't
particularly complicated, but it now looks more like an optimization
problem of some kind or other.

It might make sense to go further in the same direction by making
"regular vs aggressive/antiwraparound" into a *strict* continuum. In
other words, it might make sense to get rid of the two remaining cases
where VACUUM conditions its behavior on whether this VACUUM operation
is antiwraparound/aggressive or not. I'm referring to the cleanup lock
skipping behavior around lazy_scan_noprune(), as well as the
PROC_VACUUM_FOR_WRAPAROUND no-auto-cancellation behavior enforced in
autovacuum workers. We will still need to keep roughly the same two
behaviors, but the timelines can be totally different. We must be
reasonably sure that the cure won't be worse than the disease -- I'm
aware of quite a few individual cases that felt that way [1].
Aggressive interventions can make sense, but they need to be
proportionate to the problem that's right in front of us. "Kicking the
can down the road" is often the safest and most responsible approach
-- it all depends on the details.

[1]: https://www.tritondatacenter.com/blog/manta-postmortem-7-27-2015 is
the most high profile example, but I have personally been called in to
deal with similar problems in the past
--
Peter Geoghegan

Attachments:

v3-0001-Add-page-level-freezing-to-VACUUM.patch (application/x-patch)
From a884a48561cc1f0ae91d49bb86b42240e7686035 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v3 1/5] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit).  For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
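
In other words (a standalone sketch with stand-in types, not the real
heapam/vacuumlazy interfaces): freeze plans are prepared for every tuple
that is eligible per OldestXmin, but actually executing all of a page's
plans is only triggered when the page contains an XID from before
FreezeLimit:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* Returns true when vacuumlazy.c must execute the page's freeze plans */
    bool
    page_freezing_triggered(const TransactionId *unfrozen_xids, size_t nxids,
                            TransactionId FreezeLimit)
    {
        for (size_t i = 0; i < nxids; i++)
        {
            /* real code uses the wraparound-aware TransactionIdPrecedes() */
            if (unfrozen_xids[i] < FreezeLimit)
                return true;    /* freeze all eligible tuples on the page */
        }

        return false;           /* discard the prepared freeze plans for now */
    }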
---
 src/include/access/heapam.h          |   4 +-
 src/include/access/heapam_xlog.h     |  37 +++++-
 src/backend/access/heap/heapam.c     | 171 ++++++++++++++++-----------
 src/backend/access/heap/vacuumlazy.c | 154 ++++++++++++++----------
 4 files changed, 231 insertions(+), 135 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index abf62d9df..c201f8ae6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,8 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									TransactionId limit_xid, MultiXactId limit_multi,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 1705e736b..40556271d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -330,6 +330,38 @@ typedef struct xl_heap_freeze_tuple
 	uint8		frzflags;
 } xl_heap_freeze_tuple;
 
+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determining whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.
+ *
+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
+ * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ */
+typedef struct page_frozenxid_tracker
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze;
+
+	/* Values used when page is to be frozen based on freeze plans */
+	TransactionId relfrozenxid_out;
+	MultiXactId relminmxid_out;
+
+	/* Used by caller for '!freeze' pages */
+	TransactionId relfrozenxid_nofreeze_out;
+	MultiXactId relminmxid_nofreeze_out;
+
+} page_frozenxid_tracker;
+
 /*
  * This is what we need to know about a block being frozen during vacuum
  *
@@ -409,10 +441,11 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId relminmxid,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
+									  TransactionId limit_xid,
+									  MultiXactId limit_multi,
 									  xl_heap_freeze_tuple *frz,
 									  bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  page_frozenxid_tracker *xtrack);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *xlrec_tp);
 extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 588716606..d6aea370f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6431,26 +6431,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * will be totally frozen after these operations are performed and false if
  * more freezing will eventually be required.
  *
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * Caller must initialize xtrack fields for page as a whole before calling
+ * here with first tuple for the page.  See page_frozenxid_tracker comments.
+ *
+ * Caller must set frz->offset itself if heap_execute_freeze_tuple is called.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6463,34 +6452,46 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  TransactionId limit_xid, MultiXactId limit_multi,
 						  xl_heap_freeze_tuple *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  page_frozenxid_tracker *xtrack)
 {
 	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
+	/*
+	 * limit_xid *must* be <= cutoff_xid, to ensure that any XID older than it
+	 * can neither be running nor seen as running by any open transaction.
+	 * This ensures that we only freeze XIDs that are safe to freeze -- those
+	 * that are already unambiguously visible to everybody.
+	 *
+	 * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+	 * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
+	 */
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
+
 	frz->frzflags = 0;
 	frz->t_infomask2 = tuple->t_infomask2;
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen iff our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for relfrozenxid_out handling for already-frozen xmin */
+	}
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6499,8 +6500,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoff_xid);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
@@ -6514,8 +6515,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 
@@ -6526,7 +6527,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
+	 * Make sure to keep heap_tuple_would_freeze in sync with this.  It needs
+	 * to return true for any tuple that we would force to be frozen here.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6534,7 +6536,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = xtrack->relfrozenxid_out;
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
@@ -6553,8 +6555,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6582,10 +6584,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6613,10 +6615,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			if (MultiXactIdPrecedes(xid, xtrack->relminmxid_out))
+				xtrack->relminmxid_out = xid;
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 		}
 		else
 		{
@@ -6656,8 +6658,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 	else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
@@ -6673,6 +6675,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(changed);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6703,11 +6710,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we ignore the cutoff_xid and just always perform the
 		 * freeze operation.  The oldest release in which such a value can
 		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
+		 * was removed in PostgreSQL 9.0.
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
@@ -6721,18 +6724,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
 			changed = true;
+			xtrack->freeze = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen (provided caller executes freeze plan for the page)
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
+
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < limit_xid (and MXIDs < limit_multi) must never remain
+	 */
+	if (!xtrack->freeze && !(xmin_already_frozen && xmax_already_frozen))
+		xtrack->freeze =
+			heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+									&xtrack->relfrozenxid_nofreeze_out,
+									&xtrack->relminmxid_nofreeze_out);
+
 	return changed;
 }
 
@@ -6785,14 +6806,20 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	xl_heap_freeze_tuple frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	page_frozenxid_tracker dummy;
+
+	dummy.freeze = true;
+	dummy.relfrozenxid_out = cutoff_xid;
+	dummy.relminmxid_out = cutoff_multi;
+	dummy.relfrozenxid_nofreeze_out = cutoff_xid;
+	dummy.relminmxid_nofreeze_out = cutoff_multi;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
+										  cutoff_xid, cutoff_multi,
 										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+										  &dummy);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7218,17 +7245,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * heap_tuple_would_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * force freezing of any of the XID/XMID fields from the tuple, given the same
+ * limits.  We must also deal with dead tuples here, since (xmin, xmax, xvac)
+ * fields could be processed by pruning away the whole tuple instead of
+ * freezing.
+ *
+ * Note: VACUUM refers to limit_xid and limit_multi as "FreezeLimit" and
+ * "MultiXactCutoff" respectively.  These should not be confused with the
+ * absolute cutoffs for freezing.  We just determine whether caller's tuple
+ * and limits trigger heap_prepare_freeze_tuple to force freezing.
  *
  * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
  * like the heap_prepare_freeze_tuple arguments that they're based on.  We
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						TransactionId limit_xid, MultiXactId limit_multi,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
@@ -7242,7 +7275,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	{
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 
@@ -7259,7 +7292,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		/* xmax is a non-permanent XID */
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
@@ -7282,7 +7315,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
+		if (MultiXactIdPrecedes(multi, limit_multi))
 			would_freeze = true;
 
 		/* need to check whether any member of the mxact is old */
@@ -7295,7 +7328,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			Assert(TransactionIdIsNormal(xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(xid, limit_xid))
 				would_freeze = true;
 		}
 		if (nmembers > 0)
@@ -7309,7 +7342,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		{
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			would_freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472..fad274621 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoffs for freezing and pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -511,6 +512,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
@@ -1563,8 +1565,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	page_frozenxid_tracker xtrack;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
 
@@ -1580,8 +1582,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.freeze = false;
+	xtrack.relfrozenxid_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_out = vacrel->NewRelminMxid;
+	xtrack.relfrozenxid_nofreeze_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_nofreeze_out = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1634,27 +1639,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether the page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1786,11 +1787,13 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
 									  vacrel->FreezeLimit,
 									  vacrel->MultiXactCutoff,
 									  &frozen[tuples_frozen],
 									  &tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
 		{
 			/* Will execute freeze below */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1811,9 +1814,33 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.freeze || tuples_frozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_out;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_nofreeze_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_nofreeze_out;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
@@ -1821,7 +1848,7 @@ retry:
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
@@ -1853,7 +1880,7 @@ retry:
 		{
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+			recptr = log_heap_freeze(rel, buf, vacrel->NewRelfrozenXid,
 									 frozen, tuples_frozen);
 			PageSetLSN(page, recptr);
 		}
@@ -1862,6 +1889,42 @@ retry:
 	}
 
 	/*
+	 * Now save details of the LP_DEAD items from the page in vacrel
+	 */
+	if (lpdead_items > 0)
+	{
+		VacDeadItems *dead_items = vacrel->dead_items;
+		ItemPointerData tmp;
+
+		vacrel->lpdead_item_pages++;
+
+		ItemPointerSetBlockNumber(&tmp, blkno);
+
+		for (int i = 0; i < lpdead_items; i++)
+		{
+			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+			dead_items->items[dead_items->num_items++] = tmp;
+		}
+
+		Assert(dead_items->num_items <= dead_items->max_items);
+		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->has_lpdead_items = true;
+		prunestate->all_visible = false;
+	}
+
+	/* Finally, add page-local counts to whole-VACUUM counts */
+	vacrel->tuples_deleted += tuples_deleted;
+	vacrel->tuples_frozen += tuples_frozen;
+	vacrel->lpdead_items += lpdead_items;
+	vacrel->live_tuples += live_tuples;
+	vacrel->recently_dead_tuples += recently_dead_tuples;
+
+	/*
+	 * We're done, but assert that some postconditions hold before returning.
+	 *
 	 * The second pass over the heap can also set visibility map bits, using
 	 * the same approach.  This is important when the table frequently has a
 	 * few old LP_DEAD items on each page by the time we get to it (typically
@@ -1885,7 +1948,7 @@ retry:
 			Assert(false);
 
 		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1897,39 +1960,6 @@ retry:
 			   cutoff == prunestate->visibility_cutoff_xid);
 	}
 #endif
-
-	/*
-	 * Now save details of the LP_DEAD items from the page in vacrel
-	 */
-	if (lpdead_items > 0)
-	{
-		VacDeadItems *dead_items = vacrel->dead_items;
-		ItemPointerData tmp;
-
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
-		vacrel->lpdead_item_pages++;
-
-		ItemPointerSetBlockNumber(&tmp, blkno);
-
-		for (int i = 0; i < lpdead_items; i++)
-		{
-			ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
-			dead_items->items[dead_items->num_items++] = tmp;
-		}
-
-		Assert(dead_items->num_items <= dead_items->max_items);
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_items->num_items);
-	}
-
-	/* Finally, add page-local counts to whole-VACUUM counts */
-	vacrel->tuples_deleted += tuples_deleted;
-	vacrel->tuples_frozen += tuples_frozen;
-	vacrel->lpdead_items += lpdead_items;
-	vacrel->live_tuples += live_tuples;
-	vacrel->recently_dead_tuples += recently_dead_tuples;
 }
 
 /*
-- 
2.34.1

v3-0005-Avoid-allocating-MultiXacts-during-VACUUM.patch (application/x-patch)
From b4135b387fba17b95f1fc4327d26cf9b95f5fe2f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v3 5/5] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit.  Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations.  We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle.  VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

FreezeMultiXactId() is now eager when it's cheap to process xmax, and
lazy when it's expensive/risky to process xmax (because an expensive
second pass that might result in allocating a new Multi is required).
We make a similar trade-off to the one made by vacuumlazy.c when a
cleanup lock isn't available on some heap page.  We can usually put off
freezing (for the time being) when it's inconvenient to proceed.  The
only downside to this approach is that it necessitates pushing back the
final relfrozenxid/relminmxid value that can be set in pg_class.
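
The cheap first pass now effectively reduces to the check below (a
standalone sketch with stand-in types; the real logic lives in
FreezeMultiXactId and uses MultiXactIdPrecedes/TransactionIdPrecedes).
Only when it returns true do we go on to the expensive second pass that
can allocate a replacement MultiXactId:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    typedef uint32_t MultiXactId;

    bool
    multi_needs_replacement(MultiXactId multi, const TransactionId *members,
                            size_t nmembers, TransactionId limit_xid,
                            MultiXactId limit_multi)
    {
        /* xmax's Multi itself must not be from before limit_multi */
        if (multi < limit_multi)
            return true;

        /* ...and no member XID may be from before limit_xid */
        for (size_t i = 0; i < nmembers; i++)
        {
            if (members[i] < limit_xid)
                return true;
        }

        /* FRM_NOOP case: keep xmax as-is, no second pass, no new Multi */
        return false;
    }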
---
 src/backend/access/heap/heapam.c | 49 +++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 699a5acae..e18000d81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6111,11 +6111,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
  *
  * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ *
+ * Allocating new MultiXactIds during VACUUM is something that we always
+ * prefer to avoid.  Caller passes us down all the context required to do this
+ * on their behalf: cutoff_xid, cutoff_multi, limit_xid, and limit_multi.
+ *
+ * We must never leave behind XIDs/MXIDs from before the "limits" given.
+ * There is lots of leeway around XIDs/MXIDs >= caller's "limits", though.
+ * When it's possible to inexpensively process xmax right away, we're eager
+ * about it.  Otherwise we're lazy about it -- next time it might be cheap.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
+				  TransactionId limit_xid, MultiXactId limit_multi,
 				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
 {
 	TransactionId xid = InvalidTransactionId;
@@ -6208,13 +6218,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below the limit for Xids, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
+	 *
+	 * We use limit_xid for this (VACUUM's FreezeLimit), rather than using
+	 * cutoff_xid (VACUUM's OldestXmin).  We greatly prefer to avoid a second
+	 * pass over the Multi that results in allocating a new replacement Multi.
 	 */
-
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6225,12 +6238,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP optimization */
 	for (i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		if (TransactionIdPrecedes(members[i].xid, limit_xid))
 		{
 			need_replace = true;
 			break;
@@ -6239,11 +6251,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			temp_xid_out = members[i].xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* The Multi itself must be >= limit_multi, too */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, limit_multi);
+
 	if (!need_replace)
 	{
 		/*
@@ -6260,6 +6271,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_would_freeze will indicate that the tuple must be frozen.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6359,7 +6373,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				/*
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
+				 * cutoff_xid is VACUUM's OldestXmin, which is also the
 				 * initial value used for top-level relfrozenxid_out tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
@@ -6529,7 +6543,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given limit for Xids.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6547,6 +6561,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi,
+									limit_xid, limit_multi,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6587,12 +6602,18 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * MultiXactId, to carry forward two or more original member XIDs.
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
+			 *
+			 * We only do this when we have no choice; heap_tuple_would_freeze
+			 * will definitely force the page to be frozen below.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
 			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 xtrack->relfrozenxid_out));
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
 			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
-- 
2.34.1

v3-0003-Add-eager-freezing-strategy-to-VACUUM.patch (application/x-patch)
From 5071ca4e60888fffbf4e2ef526f102da41c7cab7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v3 3/5] Add eager freezing strategy to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach (actually, we always
use eager freezing in aggressive VACUUMs, though they are expected to be
much rarer now).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have aggressive/antiwraparound
autovacuums for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.
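
Condensed to its essentials, the strategy decision added to
lazy_scan_strategy is roughly the following (a simplified standalone
sketch with stand-in types; it omits the DISABLE_PAGE_SKIPPING case and
the lazy-strategy skipping decision):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    typedef struct
    {
        bool    aggressive;
        bool    skipallvis;              /* skip all-visible (not all-frozen) pages? */
        bool    allvis_freeze_strategy;  /* freeze pages that will become all-visible? */
    } StrategyStub;

    void
    choose_freeze_strategy(StrategyStub *vac, BlockNumber rel_pages,
                           BlockNumber eager_threshold)
    {
        if (vac->aggressive || rel_pages >= eager_threshold)
        {
            /* eager freezing, and always scan all-visible pages */
            vac->allvis_freeze_strategy = true;
            vac->skipallvis = false;
        }
        else
        {
            /* classic lazy freezing; skipping decided separately */
            vac->allvis_freeze_strategy = false;
        }
    }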
---
 src/include/access/heapam_xlog.h              |  8 +-
 src/include/commands/vacuum.h                 |  4 +
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++
 src/backend/access/heap/heapam.c              |  8 +-
 src/backend/access/heap/vacuumlazy.c          | 76 ++++++++++++++++---
 src/backend/commands/vacuum.c                 |  4 +
 src/backend/postmaster/autovacuum.c           | 10 +++
 src/backend/utils/misc/guc.c                  | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 31 ++++++--
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++
 13 files changed, 162 insertions(+), 23 deletions(-)

diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 40556271d..9ea1db505 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -345,7 +345,11 @@ typedef struct xl_heap_freeze_tuple
  * pg_class tuple.
  *
  * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
- * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ * relminmxid_nofreeze_out must also be maintained.  If vacuumlazy.c caller
+ * opts to not execute freeze plans produced by heap_prepare_freeze_tuple for
+ * its own reasons, then new relfrozenxid and relminmxid values must reflect
+ * that that choice was made.  (This is only safe when 'freeze' is still unset
+ * after the last heap_prepare_freeze_tuple call for the page.)
  */
 typedef struct page_frozenxid_tracker
 {
@@ -356,7 +360,7 @@ typedef struct page_frozenxid_tracker
 	TransactionId relfrozenxid_out;
 	MultiXactId relminmxid_out;
 
-	/* Used by caller for '!freeze' pages */
+	/* Used by caller that opts not to freeze a '!freeze' page */
 	TransactionId relfrozenxid_nofreeze_out;
 	MultiXactId relminmxid_nofreeze_out;
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f..52379f819 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -256,6 +259,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 7dc401cf0..c6d8265cf 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 609329bb2..f4e2109e7 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d6aea370f..699a5acae 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6429,7 +6429,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * WAL-log what we would need to do, and return true.  Return false if nothing
  * is to be changed.  In addition, set *totally_frozen to true if the tuple
  * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * more freezing will eventually be required (assuming page is to be frozen).
+ *
+ * Although this interface is primarily tuple-based, caller decides on whether
+ * or not to freeze the page as a whole.  We'll often help caller to prepare a
+ * complete "freeze plan" that it ultimately discards.  However, our caller
+ * doesn't always get to choose; it must freeze when xtrack.freeze is set
+ * here.  This ensures that any XIDs < limit_xid are never left behind.
  *
  * Caller must initialize xtrack fields for page as a whole before calling
  * here with first tuple for the page.  See page_frozenxid_tracker comments.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 97e272081..8d750ac7f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -150,6 +150,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Proactively freeze all tuples on pages about to be set all-visible? */
+	bool		allvis_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -254,6 +256,7 @@ typedef struct LVSavedErrInfo
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -328,6 +331,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				eager_threshold,
 				all_visible,
 				all_frozen,
 				scanned_pages,
@@ -367,6 +371,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
 	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
 	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 *
+	 * Also determine our cutoff for applying the eager/all-visible freezing
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
+	 * even during non-aggressive VACUUMs.
 	 */
 	aggressive = vacuum_set_xid_limits(rel,
 									   params->freeze_min_age,
@@ -375,6 +383,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   params->multixact_freeze_table_age,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
+	eager_threshold = params->freeze_strategy_threshold < 0 ?
+		vacuum_freeze_strategy_threshold :
+		params->freeze_strategy_threshold;
 
 	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -529,7 +540,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
-	scanned_pages = lazy_scan_strategy(vacrel,
+	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
 									   all_visible, all_frozen);
 	if (verbose)
 	{
@@ -1304,17 +1315,28 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing totally avoids work that turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ * Antiwraparound VACUUMs of append-only tables should generally be avoided.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * important that relfrozenxid advance in affected tables, which are larger.
+ * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
+ * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
+ * depending on the extra cost - we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1347,21 +1369,48 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
 	else if (vacrel->aggressive)
+	{
+		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
+	}
+	else if (rel_pages >= eager_threshold)
+	{
+		/*
+		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
+		 * GUC-based threshold for eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
+		 */
+		vacrel->allvis_freeze_strategy = true;
+		vacrel->skipallvis = false;
+	}
 	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		vacrel->allvis_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1873,8 +1922,15 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When ongoing VACUUM opted to use the all-visible freezing strategy we
+	 * freeze any page that will become all-visible, making it all-frozen
+	 * instead. (Actually, there are edge-cases where this might not result in
+	 * marking the page all-frozen in the visibility map, but that should have
+	 * only a negligible impact.)
 	 */
-	if (xtrack.freeze || tuples_frozen == 0)
+	if (xtrack.freeze || tuples_frozen == 0 ||
+		(vacrel->allvis_freeze_strategy && prunestate->all_visible))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ccde07de..b837e0331 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9dc6bf947..18a8e8b80 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -150,6 +150,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2009,6 +2010,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2016,6 +2018,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2800,6 +2803,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2849,6 +2853,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2871,6 +2880,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 55bf99851..bb018ae62 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2733,6 +2733,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502..e701e464e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -694,6 +694,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a5cd4e44c..091be17c3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9147,6 +9147,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9155,9 +9170,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9234,10 +9251,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages with an older multixact ID.  The
+        default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c14b2010d..7e684d187 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1680,6 +1680,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.34.1

Attachment: v3-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patch (application/x-patch)
From f7d265d6ec8783786d642aea6cb0964552bcce9a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v3 4/5] Unify aggressive VACUUM with antiwraparound VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Before then, every lazy VACUUM
was "equally aggressive": each operation froze whatever tuples fell before
the age-wise cutoff and so needed to be frozen.  And each table's relfrozenxid
was updated at the end.  In short, the previous behavior was much less
efficient, but did at least have one thing going for it: it was much
easier to understand.

Even when the visibility map first went in, there was some awareness of
the problem of successive VACUUMs against the same table that had
unpredictable performance characteristics.  The vacuum_freeze_table_age
GUC was added to 8.4 (around the same time as the visibility map itself)
by commit 65878185, to ameliorate problems in this area.  The GUC made
the behavior of successive VACUUMs somewhat more continuous by forcing
"early aggressive VACUUMs" of tables whose relfrozenxid had already
attained an age that exceeded vacuum_freeze_table_age when VACUUM began.
An aggressive VACUUM could therefore sometimes take place before an
aggressive antiwraparound autovacuum was triggered.

Until now, antiwraparound autovacuums were (for the most part) just another
way for autovacuum.c to trigger an autovacuum worker.  Although
antiwraparound "implies aggressive", aggressive has never "implied
antiwraparound".  This is a consequence of having two table-age GUCs
that both influence VACUUM's behavior around relfrozenxid advancement.
While table age certainly does matter, it's far from the only thing that
matters.  And while we should sometimes "behave aggressively", it's more
useful to structure everything as being on a continuum between laziness
and eagerness/aggressiveness.

Rather than relying on vacuum_freeze_table_age to "escalate to an
aggressive VACUUM early", the skipping strategy infrastructure now gives
some consideration to table age (in addition to the target relation's
physical characteristics, in particular the number of extra all-visible
blocks).  The costs and the benefits are weighed against each other.
The closer we get to needing an antiwraparound VACUUM, the less
concerned we are about the added cost of advancing relfrozenxid in the
ongoing VACUUM.

We now _start_ to give _some_ weight to table age at a relatively early
stage, when the table's age(relfrozenxid) first crosses the half-way
point (autovacuum_freeze_max_age/2, or the multixact equivalent).  Once
we're very close to the point of antiwraparound for a given table, any
VACUUM against that table will automatically choose the eager skipping
strategy.
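
To make the new rule concrete, here is a rough standalone sketch
(illustrative C only -- the helper name is made up, and the real logic
lives in lazy_scan_strategy and vacuum_set_xid_limits) of how the skip
decision scales with how much of the XID space has been consumed:

#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified sketch of the skipping decision for a non-antiwraparound
 * VACUUM using the lazy freezing strategy.  antiwrapfrac is 0.0 right
 * after relfrozenxid/relminmxid were advanced and 1.0 at the point where
 * an antiwraparound VACUUM is forced.  Returns true to skip all-visible
 * pages (no relfrozenxid advancement this time around).
 */
static bool
skip_allvisible_pages(uint32_t rel_pages, uint32_t nextra_allvisible,
                      double antiwrapfrac)
{
    double      threshold_frac;
    uint32_t    nextra_threshold;

    if (antiwrapfrac >= 0.9)
        return false;           /* very close: always advance relfrozenxid */

    /* past the half-way point, tolerate scanning more extra pages */
    threshold_frac = (antiwrapfrac < 0.5) ? 0.05 : 0.15;

    nextra_threshold = (uint32_t) (rel_pages * threshold_frac);
    if (nextra_threshold < 32)
        nextra_threshold = 32;

    return nextra_allvisible >= nextra_threshold;
}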

The concept of aggressive VACUUM is now merged with the concept of
antiwraparound VACUUM.  Note that this means that a manually issued
VACUUM command can now sometimes be classified as an antiwraparound
VACUUM (and get reported as such in VERBOSE output).

The default value of vacuum_freeze_table_age is now -1, which is
interpreted as "the current value of the autovacuum_freeze_max_age GUC".
The "table age" GUCs/reloptions can be used as compatibility options,
but are otherwise superseded by VACUUM's freezing and skipping
strategies.
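
A similarly hedged sketch of how the new -1 default resolves (mirroring
the vacuum_set_xid_limits() hunk below; the helper name here is invented
purely for illustration):

int
effective_freeze_table_age(int freeze_table_age, int autovacuum_freeze_max_age)
{
    /* -1 (the new default) means "use autovacuum_freeze_max_age" */
    if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
        return autovacuum_freeze_max_age;

    return freeze_table_age;
}
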
---
 src/include/commands/vacuum.h                 |   5 +-
 src/include/storage/proc.h                    |   4 +-
 src/backend/access/heap/vacuumlazy.c          | 171 ++++---
 src/backend/commands/cluster.c                |   3 +-
 src/backend/commands/vacuum.c                 |  96 ++--
 src/backend/postmaster/autovacuum.c           |   4 +-
 src/backend/storage/lmgr/proc.c               |   2 +-
 src/backend/utils/activity/pgstat_relation.c  |   6 +-
 src/backend/utils/misc/guc.c                  |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   4 +-
 doc/src/sgml/config.sgml                      |  80 ++--
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 451 ++++++------------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  23 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   5 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  27 +-
 src/test/regress/expected/reloptions.out      |   4 +-
 src/test/regress/sql/reloptions.sql           |   4 +-
 21 files changed, 424 insertions(+), 503 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 52379f819..5bb33a2bb 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -223,7 +223,7 @@ typedef struct VacuumParams
 	int			freeze_strategy_threshold;	/* threshold to use eager
 											 * freezing, in total heap blocks,
 											 * -1 to use default */
-	bool		is_wraparound;	/* force a for-wraparound vacuum */
+	bool		is_antiwrap_autovac;	/* antiwraparound autovacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
 									 * default */
@@ -298,7 +298,8 @@ extern bool vacuum_set_xid_limits(Relation rel,
 								  TransactionId *oldestXmin,
 								  MultiXactId *oldestMxact,
 								  TransactionId *freezeLimit,
-								  MultiXactId *multiXactCutoff);
+								  MultiXactId *multiXactCutoff,
+								  double *antiwrapfrac);
 extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
 									  MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619e..785ef610d 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -57,7 +57,7 @@ struct XidCache
 										 * CONCURRENTLY or REINDEX
 										 * CONCURRENTLY on non-expressional,
 										 * non-partial index */
-#define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
+#define		PROC_AUTOVACUUM_FOR_WRAPAROUND	0x08	/* affects cancellation */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
 #define		PROC_AFFECTS_ALL_HORIZONS	0x20	/* this proc's xmin must be
@@ -66,7 +66,7 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_AUTOVACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8d750ac7f..950347c50 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,11 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages when using the lazy freezing strategy
+ * Thresholds that control whether VACUUM will skip any all-visible pages when
+ * using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
+#define SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES	0.15
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -144,9 +145,9 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
-	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	/* Antiwraparound VACUUM? (must set relfrozenxid >= FreezeLimit) */
+	bool		antiwraparound;
+	/* Skip (don't scan) all-visible pages? (must be !antiwraparound) */
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
@@ -258,7 +259,8 @@ static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
 									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
-									  BlockNumber all_frozen);
+									  BlockNumber all_frozen,
+									  double antiwrapfrac);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
 								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
@@ -322,7 +324,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
+				antiwraparound,
 				skipallfrozen,
 				frozenxid_updated,
 				minmulti_updated;
@@ -330,6 +332,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				FreezeLimit;
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
+	double		antiwrapfrac;
 	BlockNumber orig_rel_pages,
 				eager_threshold,
 				all_visible,
@@ -369,32 +372,42 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
 	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 * an antiwraparound VACUUM then lazy_scan_heap cannot leave behind
+	 * unfrozen XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to
+	 * go away).
 	 *
 	 * Also determine our cutoff for applying the eager/all-visible freezing
 	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
-	 * even during non-aggressive VACUUMs.
+	 * even during regular VACUUMs.
 	 */
-	aggressive = vacuum_set_xid_limits(rel,
-									   params->freeze_min_age,
-									   params->multixact_freeze_min_age,
-									   params->freeze_table_age,
-									   params->multixact_freeze_table_age,
-									   &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
+	antiwraparound = vacuum_set_xid_limits(rel,
+										   params->freeze_min_age,
+										   params->multixact_freeze_min_age,
+										   params->freeze_table_age,
+										   params->multixact_freeze_table_age,
+										   &OldestXmin, &OldestMxact,
+										   &FreezeLimit, &MultiXactCutoff,
+										   &antiwrapfrac);
 	eager_threshold = params->freeze_strategy_threshold < 0 ?
 		vacuum_freeze_strategy_threshold :
 		params->freeze_strategy_threshold;
 
+	/*
+	 * An autovacuum to prevent wraparound should already be recognized as
+	 * antiwraparound based on generic criteria.  Even so, make sure that
+	 * autovacuum.c always gets what it asked for.
+	 */
+	if (params->is_antiwrap_autovac)
+		antiwraparound = true;
+
 	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
-		 * Force aggressive mode, and disable skipping blocks using the
+		 * Force antiwraparound mode, and disable skipping blocks using the
 		 * visibility map (even those set all-frozen)
 		 */
-		aggressive = true;
+		antiwraparound = true;
 		skipallfrozen = false;
 	}
 
@@ -445,9 +458,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
+	vacrel->antiwraparound = antiwraparound;
 	/* Set skipallvis/skipallfrozen provisionally (before lazy_scan_strategy) */
-	vacrel->skipallvis = (!aggressive && skipallfrozen);
+	vacrel->skipallvis = (!antiwraparound && skipallfrozen);
 	vacrel->skipallfrozen = skipallfrozen;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
@@ -541,7 +554,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
 	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
-									   all_visible, all_frozen);
+									   all_visible, all_frozen,
+									   antiwrapfrac);
 	if (verbose)
 	{
 		char	   *msgfmt;
@@ -549,8 +563,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 		Assert(!IsAutoVacuumWorkerProcess());
 
-		if (aggressive)
-			msgfmt = _("aggressively vacuuming \"%s.%s.%s\"");
+		if (antiwraparound)
+			msgfmt = _("vacuuming \"%s.%s.%s\" to prevent wraparound");
 		else
 			msgfmt = _("vacuuming \"%s.%s.%s\"");
 
@@ -617,25 +631,25 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
+	 * Antiwraparound VACUUMs must always be able to advance relfrozenxid to a
 	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * Regular VACUUMs may advance them by any amount, or not at all.
 	 */
 	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
+		   TransactionIdPrecedesOrEquals(antiwraparound ? FreezeLimit :
 										 vacrel->relfrozenxid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
+		   MultiXactIdPrecedesOrEquals(antiwraparound ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
 	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * Must keep original relfrozenxid in a regular VACUUM whose
 		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
-		Assert(!aggressive);
+		Assert(!antiwraparound);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -709,29 +723,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			if (verbose)
 			{
 				/*
-				 * Aggressiveness already reported earlier, in dedicated
+				 * Antiwraparound-ness already reported earlier, in dedicated
 				 * VACUUM VERBOSE ereport
 				 */
-				Assert(!params->is_wraparound);
+				Assert(!params->is_antiwrap_autovac);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
-			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
 			else
 			{
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+				/*
+				 * Note that we don't differentiate between an antiwraparound
+				 * autovacuum that was launched by autovacuum.c as
+				 * antiwraparound and one that only became antiwraparound
+				 * because freeze_table_age is set.
+				 */
+				Assert(IsAutoVacuumWorkerProcess());
+				if (antiwraparound)
+					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -1063,7 +1071,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
 			 */
-			Assert(vacrel->aggressive);
+			Assert(vacrel->antiwraparound);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
 		}
@@ -1325,18 +1333,22 @@ lazy_scan_heap(LVRelState *vacrel)
  * Antiwraparound VACUUMs of append-only tables should generally be avoided.
  *
  * Also determines if the ongoing VACUUM operation should skip all-visible
- * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
- * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * pages for regular VACUUMs, where advancing relfrozenxid is optional.  When
+ * VACUUM freezes eagerly it always also scans pages eagerly, since it's
  * important that relfrozenxid advance in affected tables, which are larger.
  * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
  * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
  * depending on the extra cost - we might need to scan only a few extra pages.
+ * Decision is based in part on caller's antiwrapfrac argument, which is a
+ * value from 0.0 to 1.0 that represents how close the table age is to needing
+ * an antiwraparound VACUUM.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
 lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
-				   BlockNumber all_visible, BlockNumber all_frozen)
+				   BlockNumber all_visible, BlockNumber all_frozen,
+				   double antiwrapfrac)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1379,21 +1391,21 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		Assert(vacrel->antiwraparound && !vacrel->skipallvis);
 		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
-	else if (vacrel->aggressive)
+	else if (vacrel->antiwraparound)
 	{
-		/* Always freeze all-visible pages during aggressive VACUUMs */
+		/* Always freeze all-visible pages during antiwraparound VACUUMs */
 		Assert(!vacrel->skipallvis);
 		vacrel->allvis_freeze_strategy = true;
 	}
 	else if (rel_pages >= eager_threshold)
 	{
 		/*
-		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
-		 * GUC-based threshold for eager freezing.
+		 * Regular VACUUM of table whose rel_pages now exceeds GUC-based
+		 * threshold for eager freezing.
 		 *
 		 * We always scan all-visible pages when the threshold is crossed, so
 		 * that relfrozenxid can be advanced.  There will typically be few or
@@ -1408,7 +1420,7 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		BlockNumber nextra,
 					nextra_threshold;
 
-		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		/* Regular VACUUM of small table -- use lazy freeze strategy */
 		vacrel->allvis_freeze_strategy = false;
 
 		/*
@@ -1424,13 +1436,32 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		 * that way, so be lazy (just skip) unless the added cost is very low.
 		 * We opt for a skipallfrozen-only VACUUM when the number of extra
 		 * pages (extra scanned pages that are all-visible but not all-frozen)
-		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small)
+		 * if relfrozenxid has yet to attain an age that uses 50% of the XID
+		 * space available before an antiwraparound VACUUM becomes necessary.
+		 * A more aggressive threshold of 15% is used when relfrozenxid is
+		 * older than that.
 		 */
 		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
-		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+
+		if (antiwrapfrac < 0.5)
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_THRESHOLD_PAGES;
+		else
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES;
+
 		nextra_threshold = Max(32, nextra_threshold);
 
-		vacrel->skipallvis = nextra >= nextra_threshold;
+		/*
+		 * We always prefer eagerly advancing relfrozenxid when it has already
+		 * attained an age that consumes >= 90% of the available XID space
+		 * before the crossover point for antiwraparound VACUUM.
+		 */
+		if (antiwrapfrac < 0.9)
+			vacrel->skipallvis = nextra >= nextra_threshold;
+		else
+			vacrel->skipallvis = false;
 	}
 
 	/* Return the appropriate variant of scanned_pages */
@@ -2083,11 +2114,11 @@ retry:
  * operation left LP_DEAD items behind.  We'll at least collect any such items
  * in the dead_items array for removal from indexes.
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * For antiwraparound VACUUM callers, we may return false to indicate that a
+ * full cleanup lock is required for processing by lazy_scan_prune.  This is
+ * only necessary when the antiwraparound VACUUM needs to freeze some tuple
+ * XIDs from one or more tuples on the page.  We always return true for
+ * regular VACUUM callers.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
@@ -2160,13 +2191,13 @@ lazy_scan_noprune(LVRelState *vacrel,
 									&NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
+			if (vacrel->antiwraparound)
 			{
 				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
+				 * Antiwraparound VACUUMs must always be able to advance rel's
 				 * relfrozenxid to a value >= FreezeLimit (and be able to
 				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
+				 * The ongoing antiwraparound VACUUM won't be able to do that
 				 * unless it can freeze an XID (or MXID) from this tuple now.
 				 *
 				 * The only safe option is to have caller perform processing
@@ -2178,8 +2209,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 			}
 
 			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
+			 * Regular VACUUMs are under no obligation to advance relfrozenxid
+			 * (even by one XID).  We can be much laxer here.
 			 *
 			 * Currently we always just accept an older final relfrozenxid
 			 * and/or relminmxid value.  We never make caller wait or work a
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index dc35b0291..bfcf157ab 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -825,6 +825,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 				FreezeXid;
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
+	double		antiwrapfrac;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -913,7 +914,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+						  &FreezeXid, &MultiXactCutoff, &antiwrapfrac);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b837e0331..81686ccce 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -267,8 +267,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	/* Determine freezing strategy later on using GUC or reloption */
 	params.freeze_strategy_threshold = -1;
 
-	/* user-invoked vacuum is never "for wraparound" */
-	params.is_wraparound = false;
+	/* user-invoked vacuum isn't an autovacuum */
+	params.is_antiwrap_autovac = false;
 
 	/* user-invoked vacuum uses VACOPT_VERBOSE instead of log_min_duration */
 	params.log_min_duration = -1;
@@ -943,14 +943,20 @@ get_all_vacuum_rels(int options)
  * - oldestMxact is the Mxid below which MultiXacts are definitely not
  *   seen as visible by any running transaction.
  * - freezeLimit is the Xid below which all Xids are definitely replaced by
- *   FrozenTransactionId during aggressive vacuums.
+ *   FrozenTransactionId during antiwraparound vacuums.
  * - multiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
+ *   removed from Xmax during antiwraparound vacuums.
  *
  * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
- * minimum).
+ * operation antiwraparound.  An antiwraparound VACUUM is required to advance
+ * relfrozenxid up to FreezeLimit (at a minimum), and relminmxid up to
+ * multiXactCutoff (at a minimum).  Otherwise VACUUM advances relfrozenxid on
+ * a best-effort basis.
+ *
+ * Sets *antiwrapfrac to give caller a sense of how close we came to requiring
+ * an antiwraparound VACUUM in terms of XID/MXID space consumed.  This is set
+ * to a value between 0.0 and 1.0, where 1.0 represents the point at which
+ * an antiwraparound VACUUM will be (or already has been) forced.
  *
  * oldestXmin and oldestMxact are the most recent values that can ever be
  * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
@@ -966,15 +972,20 @@ vacuum_set_xid_limits(Relation rel,
 					  TransactionId *oldestXmin,
 					  MultiXactId *oldestMxact,
 					  TransactionId *freezeLimit,
-					  MultiXactId *multiXactCutoff)
+					  MultiXactId *multiXactCutoff,
+					  double *antiwrapfrac)
 {
 	TransactionId nextXID,
 				safeOldestXmin,
-				aggressiveXIDCutoff;
+				antiwrapXIDCutoff;
 	MultiXactId nextMXID,
 				safeOldestMxact,
-				aggressiveMXIDCutoff;
-	int			effective_multixact_freeze_max_age;
+				antiwrapMXIDCutoff;
+	double		XIDFrac,
+				MXIDFrac;
+	int			effective_multixact_freeze_max_age,
+				relfrozenxid_age,
+				relminmxid_age;
 
 	/*
 	 * Acquire oldestXmin.
@@ -1065,8 +1076,8 @@ vacuum_set_xid_limits(Relation rel,
 		*multiXactCutoff = *oldestMxact;
 
 	/*
-	 * Done setting output parameters; check if oldestXmin or oldestMxact are
-	 * held back to an unsafe degree in passing
+	 * Done setting cutoff output parameters; check if oldestXmin or
+	 * oldestMxact are held back to an unsafe degree in passing
 	 */
 	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
 	if (!TransactionIdIsNormal(safeOldestXmin))
@@ -1085,48 +1096,59 @@ vacuum_set_xid_limits(Relation rel,
 				 errhint("Close open transactions soon to avoid wraparound problems.\n"
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 
+	*antiwrapfrac = 1.0;		/* Initialize */
+
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
+	 * Finally, figure out if caller needs to do an antiwraparound VACUUM now.
 	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
+	antiwrapXIDCutoff = nextXID - freeze_table_age;
+	if (!TransactionIdIsNormal(antiwrapXIDCutoff))
+		antiwrapXIDCutoff = FirstNormalTransactionId;
 	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
+									  antiwrapXIDCutoff))
 		return true;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.   The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
+	antiwrapMXIDCutoff = nextMXID - multixact_freeze_table_age;
+	if (antiwrapMXIDCutoff < FirstMultiXactId)
+		antiwrapMXIDCutoff = FirstMultiXactId;
 	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
+									antiwrapMXIDCutoff))
 		return true;
 
-	/* Non-aggressive VACUUM */
+	/*
+	 * Regular VACUUM for vacuumlazy.c caller.  Need to work out how close we
+	 * came to needing an antiwraparound VACUUM.
+	 */
+	relfrozenxid_age = Max(nextXID - rel->rd_rel->relfrozenxid, 1);
+	relminmxid_age = Max(nextMXID - rel->rd_rel->relminmxid, 1);
+	XIDFrac = (double) relfrozenxid_age / (double) freeze_table_age;
+	MXIDFrac = (double) relminmxid_age / (double) multixact_freeze_table_age;
+
+	*antiwrapfrac = Max(XIDFrac, MXIDFrac);
+
 	return false;
 }
 
@@ -1869,8 +1891,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyProc->statusFlags |= PROC_IN_VACUUM;
-		if (params->is_wraparound)
-			MyProc->statusFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+		if (params->is_antiwrap_autovac)
+			MyProc->statusFlags |= PROC_AUTOVACUUM_FOR_WRAPAROUND;
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 		LWLockRelease(ProcArrayLock);
 	}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 18a8e8b80..112c84b01 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2881,7 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
 		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
-		tab->at_params.is_wraparound = wraparound;
+		tab->at_params.is_antiwrap_autovac = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
 		tab->at_vacuum_cost_delay = vac_cost_delay;
@@ -3193,7 +3193,7 @@ autovac_report_activity(autovac_table *tab)
 
 	snprintf(activity + len, MAX_AUTOVAC_ACTIV_LEN - len,
 			 " %s.%s%s", tab->at_nspname, tab->at_relname,
-			 tab->at_params.is_wraparound ? " (to prevent wraparound)" : "");
+			 tab->at_params.is_antiwrap_autovac ? " (to prevent wraparound)" : "");
 
 	/* Set statement_timestamp() to current time for pg_stat_activity */
 	SetCurrentStatementStartTimestamp();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab133..158eab321 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1384,7 +1384,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 * wraparound.
 			 */
 			if ((statusFlags & PROC_IS_AUTOVACUUM) &&
-				!(statusFlags & PROC_VACUUM_FOR_WRAPAROUND))
+				!(statusFlags & PROC_AUTOVACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb..3b7618b6d 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -234,9 +234,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->n_dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
-	 * It's currently used only to track when we need to perform an "insert"
+	 * It is quite possible that a regular VACUUM ended up skipping various
+	 * pages, however, we'll zero the insert counter here regardless.  It's
+	 * currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
 	 * until enough tuples have been inserted to trigger another insert
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb018ae62..6b337a6af 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2706,10 +2706,10 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2726,10 +2726,10 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e701e464e..81502d0ca 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,11 +693,11 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
+#vacuum_freeze_table_age = -1
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
+#vacuum_multixact_freeze_table_age = -1
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 091be17c3..01f246464 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8217,7 +8217,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8406,7 +8406,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9130,20 +9130,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> performs antiwraparound vacuuming if the
+        table's <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        Antiwraparound vacuuming differs from regular vacuuming in
+        that it will reliably advance
+        <structfield>relfrozenxid</structfield> to a recent value,
+        even when <command>VACUUM</command> wouldn't usually deem it
+        necessary.  The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and
+         advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9179,9 +9190,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
-        that there is not an unreasonably short time between forced
+        that there is not an unreasonably short time between forced antiwraparound
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9227,19 +9238,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+        <command>VACUUM</command> performs antiwraparound vacuuming if
+        the table's <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+        field has reached the multixact age specified by this setting.
+        Antiwraparound vacuuming differs from regular vacuuming in
+        that it will reliably advance
+        <structfield>relminmxid</structfield> to a recent value, even
+        when <command>VACUUM</command> wouldn't usually deem it
+        necessary.  The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>
+        is used.  For more information see <xref
+         linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and
+         advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9258,7 +9278,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
-        so that there is not an unreasonably short time between forced
+        so that there is not an unreasonably short time between forced antiwraparound
         autovacuums.
         For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 554b3a75d..80fd3d548 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,18 +400,8 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
-
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
-
-    <indexterm>
-     <primary>wraparound</primary>
-     <secondary>of transaction IDs</secondary>
-    </indexterm>
+  <sect2 id="vacuum-xid-space">
+   <title>Managing the 32-bit Transaction ID Address Space</title>
 
    <para>
     <productname>PostgreSQL</productname>'s
@@ -419,165 +409,23 @@
     depend on being able to compare transaction ID (<acronym>XID</acronym>)
     numbers: a row version with an insertion XID greater than the current
     transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
+    to the current transaction.  But since the on-disk representation
+    of transaction IDs is only 32 bits wide, the system is incapable of
+    representing <emphasis>distances</emphasis> between any two XIDs
+    that exceed about 2 billion transaction IDs.
    </para>
 
    <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
-    <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
-    </para>
-    <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
-    </para>
-   </note>
-
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
+    One of the purposes of periodic vacuuming is to manage the
+    transaction ID address space.  <command>VACUUM</command> will mark
+    rows as <emphasis>frozen</emphasis>, indicating that they were
+    inserted by a transaction that committed sufficiently far in the
+    past that the effects of the inserting transaction are certain to
+    be visible to all current and future transactions.  There is, in
+    effect, an infinite distance between a frozen transaction ID and
+    any unfrozen transaction ID.  This allows the on-disk
+    representation of transaction IDs to recycle the 32-bit address
+    space efficiently.
    </para>
 
    <para>
@@ -587,15 +435,15 @@
     <structname>pg_database</structname>.  In particular,
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
-    transactions older than this cutoff XID are guaranteed to have been frozen.
-    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    XID at the end of the most recent <command>VACUUM</command>.  All
+    rows inserted by transactions older than this cutoff XID are
+    guaranteed to have been frozen.  Similarly, the
+    <structfield>datfrozenxid</structfield> column of a database's
+    <structname>pg_database</structname> row is a lower bound on the
+    unfrozen XIDs appearing in that database &mdash; it is just the
+    minimum of the per-table <structfield>relfrozenxid</structfield>
+    values within the database.  A convenient way to examine this
+    information is to execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -611,89 +459,13 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
     cutoff XID to the current transaction's XID.
    </para>
 
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
-    </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
-
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId Address Space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
     </indexterm>
 
-    <indexterm>
-     <primary>wraparound</primary>
-     <secondary>of multixact IDs</secondary>
-    </indexterm>
-
     <para>
      <firstterm>Multixact IDs</firstterm> are used to support row locking by
      multiple transactions.  Since there is only limited space in a tuple
@@ -704,49 +476,137 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
+    </para>
+    <para>
+     Each table's <structfield>relminmxid</structfield> field can be
+     advanced any time its <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space with rules analogous to those used for
+     transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="controlling-freezing">
+    <title>Controlling Freezing</title>
+   <para>
+    As a general rule, the more tuples that <command>VACUUM</command>
+    freezes, the more recent the values that <command>VACUUM</command>
+    can set the table's <structfield>relfrozenxid</structfield> and
+    <structfield>relminmxid</structfield> fields to afterwards.
+    <xref linkend="guc-vacuum-freeze-min-age"/> and <xref
+     linkend="guc-vacuum-multixact-freeze-min-age"/> control how old
+    an XID or MultiXactId value has to be before the row that bears
+    it will be frozen (absent any other factor that triggers freezing).
+    These cutoffs are only applied in smaller tables, which use the
+    lazy freezing strategy (controlled by
+    <xref linkend="guc-vacuum-freeze-strategy-threshold"/>).
+    Increasing these settings may avoid unnecessary work, but doing
+    so isn't generally recommended.
+   </para>
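
A quick sketch of what those cutoffs look like in practice (assuming a
table named "mytbl" exists): they can be overridden for a single manual
VACUUM, much as the isolation spec later in this patch does:

    SET vacuum_freeze_min_age = 0;
    SET vacuum_multixact_freeze_min_age = 0;
    VACUUM mytbl;   -- freezes every eligible tuple, regardless of XID/MXID age
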
+
+   <tip>
+    <para>
+     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+     parameter is specified, <command>VACUUM</command> prints various
+     statistics about the table.  This includes information about how
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> advanced.  The same details appear
+     in the server log when autovacuum logging (controlled by <xref
+      linkend="guc-log-autovacuum-min-duration"/>) reports on a
+     <command>VACUUM</command> operation executed by autovacuum.
+    </para>
+   </tip>
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-Wraparound VACUUM</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> can only be advanced when
+     <command>VACUUM</command> actually runs.  Even then,
+     <command>VACUUM</command> must scan every page of the table that
+     might contain unfrozen XIDs.  <command>VACUUM</command> usually
+     advances <structfield>relfrozenxid</structfield> on a best-effort
+     basis, weighing costs against benefits.  This approach spreads
+     out the burden of freezing over time, across multiple
+     <command>VACUUM</command> operations.  However, if no
+     <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be forced for the table.
+     This will reliably set <structfield>relfrozenxid</structfield>
+     and <structfield>relminmxid</structfield> to relatively recent
+     values.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
     </para>
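
To get a rough sense of how close each table is to being forced into an
anti-wraparound autovacuum, a query along these lines can be used (a
sketch only; it reads the global settings and ignores per-table
reloptions and the 2GB members limit):

    SELECT c.oid::regclass AS table_name,
           age(c.relfrozenxid) AS xid_age,
           mxid_age(c.relminmxid) AS mxid_age,
           current_setting('autovacuum_freeze_max_age')::int AS xid_trigger,
           current_setting('autovacuum_multixact_freeze_max_age')::int AS mxid_trigger
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm')
    ORDER BY greatest(age(c.relfrozenxid), mxid_age(c.relminmxid)) DESC;
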
 
     <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
 
-    <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover without data loss, by manually executing the
+     required <command>VACUUM</command> commands.  However, since the system will not
+     execute commands once it has gone into the safety shutdown mode,
+     the only way to do this is to stop the server and start the server in single-user
+     mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
+     in single-user mode.  See the <xref linkend="app-postgres"/> reference
+     page for details about using single-user mode.
     </para>
    </sect3>
+
   </sect2>
 
   <sect2 id="autovacuum">
@@ -832,22 +692,13 @@ vacuum insert threshold = vacuum base insert threshold + vacuum insert scale fac
     and vacuum insert scale factor is
     <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
     Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
+    <firstterm>all visible</firstterm> and also allow tuples to be frozen.
+    The number of obsolete tuples and
     the number of inserted tuples are obtained from the cumulative statistics system;
     it is a semi-accurate count updated by each <command>UPDATE</command>,
     <command>DELETE</command> and <command>INSERT</command> operation.  (It is
     only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
+    load.)
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 7e684d187..74a61abe2 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1501,7 +1501,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c582021d2..42360f165 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,12 +119,12 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
+      Selects eager <quote>freezing</quote> of tuples, and forces
+      antiwraparound mode.  Specifying <literal>FREEZE</literal> is
+      equivalent to performing <command>VACUUM</command> with the
+      <xref linkend="guc-vacuum-freeze-min-age"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters
+      set to zero.  Eager freezing is always performed when the
       table is rewritten, so this option is redundant when <literal>FULL</literal>
       is specified.
      </para>
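
In practical terms (a sketch, assuming a table named "mytbl"), the
following now freezes every eligible tuple eagerly and reports on how
relfrozenxid and relminmxid were advanced:

    VACUUM (FREEZE, VERBOSE) mytbl;
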
@@ -154,13 +154,8 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>DISABLE_PAGE_SKIPPING</literal></term>
     <listitem>
      <para>
-      Normally, <command>VACUUM</command> will skip pages based on the <link
-      linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      Normally, <command>VACUUM</command> will skip pages based on the
+      <link linkend="vacuum-for-visibility-map">visibility map</link>.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -215,7 +210,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..fdc81a237 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,8 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples and force antiwraparound
+        mode.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +260,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..6a266033a 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_regular_vacuum vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_regular_vacuum vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_regular_vacuum vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_regular_vacuum vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_regular_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_regular_vacuum pinholder_commit vacuumer_regular_vacuum
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..28fb52433 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,15 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs regular VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin):
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_regular_vacuum
 {
   VACUUM smalltbl;
 }
@@ -75,15 +75,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page ("missed dead" tuples are counted in
+# reltuples, much like "recently dead" tuples).
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +91,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +102,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +115,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +127,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which only antiwraparound VACUUM is willing to do).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +135,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..a02348900 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,7 +102,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
@@ -128,7 +128,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..6c727695e 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,7 +61,7 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
@@ -72,7 +72,7 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
-- 
2.34.1

v3-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patchapplication/x-patch; name=v3-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patchDownload
From 478dbb3405a88e550fbbeb1865a7ff6ae0ca8a27 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v3 2/5] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now the decision to skip
all-visible pages is driven by exactly the same condition that determines
whether it's safe to advance relfrozenxid later on; nothing is left to
chance.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
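
To get an intuition for what the snapshot captures (and for the inputs to
the skipping strategy decision), the authoritative VM can be summarized
from SQL -- a sketch that assumes the pg_visibility extension and a table
named "mytbl"; the patch itself reads the VM fork directly, not via SQL:

    CREATE EXTENSION IF NOT EXISTS pg_visibility;

    -- pages VACUUM must scan, pages it may either scan (eager) or skip
    -- (lazy), pages it can always skip, and rel_pages
    SELECT count(*) FILTER (WHERE NOT all_visible)                AS must_scan,
           count(*) FILTER (WHERE all_visible AND NOT all_frozen) AS allvisible_not_frozen,
           count(*) FILTER (WHERE all_frozen)                     AS allfrozen,
           count(*)                                               AS rel_pages
    FROM pg_visibility_map('mytbl');

The allvisible_not_frozen count is presumably what lazy_scan_strategy
weighs against the new SKIPALLVIS_THRESHOLD_PAGES cutoff (5% of rel_pages)
when choosing between the eager and lazy skipping strategies.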
---
 src/include/access/visibilitymap.h      |   7 +
 src/backend/access/heap/vacuumlazy.c    | 356 ++++++++++++++----------
 src/backend/access/heap/visibilitymap.c | 162 +++++++++++
 3 files changed, 375 insertions(+), 150 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index fad274621..97e272081 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -146,8 +146,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -177,7 +179,8 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map used by lazy_scan_skip */
+	vmsnapshot *vmsnap;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -250,10 +253,11 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
+								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -316,7 +320,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
+				skipallfrozen,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
@@ -324,6 +328,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -369,7 +376,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
 
-	skipwithvm = true;
+	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
@@ -377,7 +384,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * visibility map (even those set all-frozen)
 		 */
 		aggressive = true;
-		skipwithvm = false;
+		skipallfrozen = false;
 	}
 
 	/*
@@ -402,20 +409,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -442,7 +435,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	/* Set skipallvis/skipallfrozen provisionally (before lazy_scan_strategy) */
+	vacrel->skipallvis = (!aggressive && skipallfrozen);
+	vacrel->skipallfrozen = skipallfrozen;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -503,12 +498,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
@@ -521,7 +510,51 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
+	 * headers to be able to safely advance relfrozenxid later on.  There is
+	 * no good reason to scan any additional pages. (Actually we might opt to
+	 * skip all-visible pages.  Either way we won't scan pages for no reason.)
+	 *
+	 * Now that OldestXmin and rel_pages are acquired, acquire an immutable
+	 * snapshot of the visibility map as well.  lazy_scan_skip works off of
+	 * the vmsnap, not the authoritative VM, which can continue to change.
+	 * Pages that lazy_scan_heap will scan are fixed and known in advance.
+	 *
+	 * The exact number of pages that lazy_scan_heap will scan also depends on
+	 * our choice of skipping strategy.  VACUUM can either choose to skip any
+	 * all-visible pages lazily, or choose to scan those same pages instead.
+	 * Decide on a skipping strategy to determine final scanned_pages.
+	 */
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   all_visible, all_frozen);
+	if (verbose)
+	{
+		char	   *msgfmt;
+		StringInfoData buf;
+
+		Assert(!IsAutoVacuumWorkerProcess());
+
+		if (aggressive)
+			msgfmt = _("aggressively vacuuming \"%s.%s.%s\"");
+		else
+			msgfmt = _("vacuuming \"%s.%s.%s\"");
+
+		initStringInfo(&buf);
+		appendStringInfo(&buf, msgfmt, get_database_name(MyDatabaseId),
+						 vacrel->relnamespace, vacrel->relname);
+
+		ereport(INFO,
+				(errmsg_internal("%s", buf.data),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
+		pfree(buf.data);
+	}
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -538,6 +571,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -584,12 +618,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -634,6 +667,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -857,13 +893,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -877,42 +912,24 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = lazy_scan_skip(vacrel, 0, &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		if (blkno < next_block_to_scan)
+			continue;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = lazy_scan_skip(vacrel, blkno + 1,
+											&next_all_visible);
 
 		vacrel->scanned_pages++;
 
@@ -1122,10 +1139,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1153,12 +1169,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1197,7 +1211,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1290,47 +1304,121 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
  *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * Determines if the ongoing VACUUM operation should skip all-visible pages
+ * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
 
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
+	if (!vacrel->skipallfrozen)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		return rel_pages;
+	}
+	else if (vacrel->aggressive)
+		Assert(!vacrel->skipallvis);
+	else
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		vacrel->skipallvis = nextra >= nextra_threshold;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+		return scanned_pages_skipallvis;
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* keep compiler quiet */
+}
+
+/*
+ *	lazy_scan_skip() -- get the next block to scan according to vmsnap.
+ *
+ * lazy_scan_heap() caller passes the next block in line.  We return the next
+ * block to scan.  Caller skips the blocks preceding the returned block, if any.
+ *
+ * The all-visible status of the returned block is set in *all_visible, too.
+ * Block usually won't be all-visible (since it's unskippable), but it can be
+ * when next_block is rel's last page or when DISABLE_PAGE_SKIPPING is in use.
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block, bool *all_visible)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				next_block_to_scan = next_block;
+
+	*all_visible = true;
+	while (next_block_to_scan < rel_pages)
+	{
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_block_to_scan);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
 			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
+			*all_visible = false;
 			break;
 		}
 
@@ -1341,58 +1429,26 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
-		if (next_unskippable_block == rel_pages - 1)
+		if (next_block_to_scan == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
+		next_block_to_scan++;
 	}
 
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
-	}
-
-	return next_unskippable_block;
+	return next_block_to_scan;
 }
 
 /*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index ed72eb7b6..6848576fd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	char		vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -368,6 +390,146 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snap_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of pages whose visibility map bit is concurrently cleared.
+ * VACUUM prefers to leave such pages to be scanned by the next VACUUM.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) + BLCKSZ * nvmpages);
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for heap block rel_pages (first block past the end) */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
-- 
2.34.1

In reply to: Peter Geoghegan (#15)
6 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Sep 8, 2022 at 1:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

> Attached is v3. There is a new patch included here -- v3-0004-*patch,
> or "Unify aggressive VACUUM with antiwraparound VACUUM". No other
> notable changes.
>
> I decided to work on this now because it seems like it might give a
> more complete picture of the high level direction that I'm pushing
> towards. Perhaps this will make it easier to review the patch series
> as a whole, even.

This needed to be rebased over the guc.c work recently pushed to HEAD.

Attached is v4. This isn't just to fix bitrot, though; I'm also
including one new patch -- v4-0006-*.patch. This small patch teaches
VACUUM to size dead_items while capping the allocation at the space
required for "scanned_pages * MaxHeapTuplesPerPage" item pointers. In
other words, we now use scanned_pages instead of rel_pages to cap the
size of dead_items, potentially saving quite a lot of memory. There is
no possible downside to this approach, because we already know exactly
how many pages will be scanned from the VM snapshot -- there is zero
added risk of a second pass over the indexes.
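
To illustrate the capping rule (a rough sketch only, not the v4-0006
code itself; the constant stands in for MaxHeapTuplesPerPage on 8kB
pages):

#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

#define MAX_HEAP_TUPLES_PER_PAGE 291	/* ~MaxHeapTuplesPerPage, 8kB pages */

/*
 * Cap the dead_items allocation using scanned_pages from the VM snapshot
 * rather than rel_pages.  scanned_pages is known exactly before the scan
 * starts, so the cap can never force a second round of index vacuuming.
 */
static size_t
dead_items_cap(size_t max_items_from_mem, BlockNumber scanned_pages)
{
	size_t		cap = (size_t) scanned_pages * MAX_HEAP_TUPLES_PER_PAGE;

	return (max_items_from_mem < cap) ? max_items_from_mem : cap;
}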

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c. It
could help us to replace a simple array of item pointers (the current
dead_items array) with a faster and more space-efficient data
structure. Masahiko Sawada has done a lot of work on this recently, so
this may interest him.
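
Just as one hypothetical example of the kind of structure I mean (this
isn't Masahiko's design, nor anything in the attached patches): a
per-block bitmap of dead offsets can be smaller than a flat
ItemPointerData array once a page has more than a handful of dead
items, and it makes each ambulkdelete() membership test O(1):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_OFFSETS		291		/* roughly MaxHeapTuplesPerPage, 8kB pages */
#define BITMAP_BYTES	((MAX_OFFSETS + 7) / 8)

/* One entry per heap block that can contain LP_DEAD items */
typedef struct DeadItemsBlock
{
	uint32_t	blkno;					/* heap block number */
	uint8_t		bitmap[BITMAP_BYTES];	/* bit n set => offset n+1 is dead */
} DeadItemsBlock;

static void
dead_block_init(DeadItemsBlock *block, uint32_t blkno)
{
	block->blkno = blkno;
	memset(block->bitmap, 0, sizeof(block->bitmap));
}

static void
dead_block_add(DeadItemsBlock *block, uint16_t offnum)	/* offnum is 1-based */
{
	block->bitmap[(offnum - 1) / 8] |= (uint8_t) (1 << ((offnum - 1) % 8));
}

static bool
dead_block_member(const DeadItemsBlock *block, uint16_t offnum)
{
	return (block->bitmap[(offnum - 1) / 8] >> ((offnum - 1) % 8)) & 1;
}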

We don't just have up-front knowledge of the total number of
scanned_pages with VM snapshots -- we also have up-front knowledge of
which specific pages will be scanned. So we have reliable information
about the final distribution of dead_items (which specific heap blocks
might have dead_items) right from the start. While this extra
information/context is not a totally complete picture, it still seems
like it could be very useful as a way of driving how some new
dead_items data structure compresses TIDs. That will depend on the
distribution of TIDs -- the final "heap TID key space".
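
For example (a sketch that assumes the vmsnap interface from the
patches upthread, i.e. visibilitymap_snap_status and the usual
visibilitymap.h flags), the exact set of blocks lazy_scan_heap will
visit -- and so the set of blocks that could possibly contribute
dead_items -- can be enumerated before the scan even begins:

/*
 * Sketch only: count the heap blocks that this VACUUM will scan, per the
 * VM snapshot.  The same loop could just as easily record the block
 * numbers themselves, to pre-size or pre-key a dead_items structure.
 */
static BlockNumber
blocks_to_scan(vmsnapshot *vmsnap, BlockNumber rel_pages, bool skipallvis)
{
	BlockNumber nscan = 0;

	for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
	{
		uint8		mapbits = visibilitymap_snap_status(vmsnap, blkno);

		if (skipallvis)
		{
			/* skipallvis policy: skip every all-visible page */
			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
				nscan++;
		}
		else
		{
			/* skipallfrozen policy: only skip all-frozen pages */
			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
				nscan++;
		}
	}

	/* (The real code also always scans the last heap page; omitted here) */
	return nscan;
}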

VM snapshots could also make it practical for the new data structure
to spill to disk to avoid multiple index scans/passes by VACUUM.
Perhaps this will result in behavior that's similar to how hash joins
spill to disk -- having 90% of the memory required to do everything
in-memory *usually* has similar performance characteristics to just
doing everything in memory. Most individual TID lookups from
ambulkdelete() will find that the TID *doesn't* need to be deleted --
a little like a hash join with low join selectivity (the common case
for hash joins). It's not like a merge join + sort, where we must
either spill everything or nothing (a merge join can be better than a
hash join with high join selectivity).

--
Peter Geoghegan

Attachments:

v4-0001-Add-page-level-freezing-to-VACUUM.patch (application/octet-stream)
From bf53c9eb71cefd380aace360935dd82d2e85fae3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v4 1/6] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit).  For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
---
 src/include/access/heapam.h          |   4 +-
 src/include/access/heapam_xlog.h     |  37 +++++-
 src/backend/access/heap/heapam.c     | 171 ++++++++++++++++-----------
 src/backend/access/heap/vacuumlazy.c |  95 +++++++++------
 4 files changed, 200 insertions(+), 107 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index abf62d9df..c201f8ae6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,8 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									TransactionId limit_xid, MultiXactId limit_multi,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 1705e736b..40556271d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -330,6 +330,38 @@ typedef struct xl_heap_freeze_tuple
 	uint8		frzflags;
 } xl_heap_freeze_tuple;
 
+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determining whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.
+ *
+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
+ * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ */
+typedef struct page_frozenxid_tracker
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze;
+
+	/* Values used when page is to be frozen based on freeze plans */
+	TransactionId relfrozenxid_out;
+	MultiXactId relminmxid_out;
+
+	/* Used by caller for '!freeze' pages */
+	TransactionId relfrozenxid_nofreeze_out;
+	MultiXactId relminmxid_nofreeze_out;
+
+} page_frozenxid_tracker;
+
 /*
  * This is what we need to know about a block being frozen during vacuum
  *
@@ -409,10 +441,11 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId relminmxid,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
+									  TransactionId limit_xid,
+									  MultiXactId limit_multi,
 									  xl_heap_freeze_tuple *frz,
 									  bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  page_frozenxid_tracker *xtrack);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *xlrec_tp);
 extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 588716606..d6aea370f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6431,26 +6431,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * will be totally frozen after these operations are performed and false if
  * more freezing will eventually be required.
  *
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * Caller must initialize xtrack fields for page as a whole before calling
+ * here with first tuple for the page.  See page_frozenxid_tracker comments.
+ *
+ * Caller must set frz->offset itself if heap_execute_freeze_tuple is called.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6463,34 +6452,46 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  TransactionId limit_xid, MultiXactId limit_multi,
 						  xl_heap_freeze_tuple *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  page_frozenxid_tracker *xtrack)
 {
 	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
+	/*
+	 * limit_xid *must* be <= cutoff_xid, to ensure that any XID older than it
+	 * can neither be running nor seen as running by any open transaction.
+	 * This ensures that we only freeze XIDs that are safe to freeze -- those
+	 * that are already unambiguously visible to everybody.
+	 *
+	 * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+	 * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
+	 */
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
+
 	frz->frzflags = 0;
 	frz->t_infomask2 = tuple->t_infomask2;
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen iff our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for relfrozenxid_out handling for already-frozen xmin */
+	}
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6499,8 +6500,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoff_xid);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
@@ -6514,8 +6515,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 
@@ -6526,7 +6527,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
+	 * Make sure to keep heap_tuple_would_freeze in sync with this.  It needs
+	 * to return true for any tuple that we would force to be frozen here.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6534,7 +6536,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = xtrack->relfrozenxid_out;
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
@@ -6553,8 +6555,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6582,10 +6584,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6613,10 +6615,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			if (MultiXactIdPrecedes(xid, xtrack->relminmxid_out))
+				xtrack->relminmxid_out = xid;
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 		}
 		else
 		{
@@ -6656,8 +6658,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 	else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
@@ -6673,6 +6675,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(changed);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6703,11 +6710,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we ignore the cutoff_xid and just always perform the
 		 * freeze operation.  The oldest release in which such a value can
 		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
+		 * was removed in PostgreSQL 9.0.
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
@@ -6721,18 +6724,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
 			changed = true;
+			xtrack->freeze = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen (provided caller executes freeze plan for the page)
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
+
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < limit_xid (and MXIDs < limit_multi) must never remain
+	 */
+	if (!xtrack->freeze && !(xmin_already_frozen && xmax_already_frozen))
+		xtrack->freeze =
+			heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+									&xtrack->relfrozenxid_nofreeze_out,
+									&xtrack->relminmxid_nofreeze_out);
+
 	return changed;
 }
 
@@ -6785,14 +6806,20 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	xl_heap_freeze_tuple frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	page_frozenxid_tracker dummy;
+
+	dummy.freeze = true;
+	dummy.relfrozenxid_out = cutoff_xid;
+	dummy.relminmxid_out = cutoff_multi;
+	dummy.relfrozenxid_nofreeze_out = cutoff_xid;
+	dummy.relminmxid_nofreeze_out = cutoff_multi;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
+										  cutoff_xid, cutoff_multi,
 										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+										  &dummy);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7218,17 +7245,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * heap_tuple_would_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * force freezing of any of the XID/MXID fields from the tuple, given the same
+ * limits.  We must also deal with dead tuples here, since (xmin, xmax, xvac)
+ * fields could be processed by pruning away the whole tuple instead of
+ * freezing.
+ *
+ * Note: VACUUM refers to limit_xid and limit_multi as "FreezeLimit" and
+ * "MultiXactCutoff" respectively.  These should not be confused with the
+ * absolute cutoffs for freezing.  We just determine whether caller's tuple
+ * and limits trigger heap_prepare_freeze_tuple to force freezing.
  *
  * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
  * like the heap_prepare_freeze_tuple arguments that they're based on.  We
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						TransactionId limit_xid, MultiXactId limit_multi,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
@@ -7242,7 +7275,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	{
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 
@@ -7259,7 +7292,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		/* xmax is a non-permanent XID */
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
@@ -7282,7 +7315,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
+		if (MultiXactIdPrecedes(multi, limit_multi))
 			would_freeze = true;
 
 		/* need to check whether any member of the mxact is old */
@@ -7295,7 +7328,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			Assert(TransactionIdIsNormal(xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(xid, limit_xid))
 				would_freeze = true;
 		}
 		if (nmembers > 0)
@@ -7309,7 +7342,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		{
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			would_freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472..abda286b7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoffs for freezing and pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -511,6 +512,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
@@ -1563,8 +1565,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	page_frozenxid_tracker xtrack;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
 
@@ -1580,8 +1582,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.freeze = false;
+	xtrack.relfrozenxid_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_out = vacrel->NewRelminMxid;
+	xtrack.relfrozenxid_nofreeze_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_nofreeze_out = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1634,27 +1639,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether or not a page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1786,11 +1787,13 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
 									  vacrel->FreezeLimit,
 									  vacrel->MultiXactCutoff,
 									  &frozen[tuples_frozen],
 									  &tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
 		{
 			/* Will execute freeze below */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1811,9 +1814,33 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.freeze || tuples_frozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_out;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_nofreeze_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_nofreeze_out;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
@@ -1821,7 +1848,7 @@ retry:
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
@@ -1853,7 +1880,7 @@ retry:
 		{
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+			recptr = log_heap_freeze(rel, buf, vacrel->NewRelfrozenXid,
 									 frozen, tuples_frozen);
 			PageSetLSN(page, recptr);
 		}
@@ -1876,7 +1903,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1884,8 +1911,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1906,9 +1932,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1922,6 +1945,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
-- 
2.34.1

v4-0003-Add-eager-freezing-strategy-to-VACUUM.patch (application/octet-stream)
From 3400903c3a4419106c36c14149a0f95e19133ecc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v4 3/6] Add eager freezing strategy to VACUUM.

Avoid large build-ups of unfrozen all-visible pages by making non-aggressive
VACUUMs freeze pages proactively in tables where eager freezing is deemed
appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach (actually, we always
use eager freezing in aggressive VACUUMs, though they are expected to be
much rarer now).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have aggressive/antiwraparound
autovacuums for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.
---
 src/include/access/heapam_xlog.h              |  8 +-
 src/include/commands/vacuum.h                 |  4 +
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++
 src/backend/access/heap/heapam.c              |  8 +-
 src/backend/access/heap/vacuumlazy.c          | 76 ++++++++++++++++---
 src/backend/commands/vacuum.c                 |  4 +
 src/backend/postmaster/autovacuum.c           | 10 +++
 src/backend/utils/misc/guc_tables.c           | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 31 ++++++--
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++
 13 files changed, 162 insertions(+), 23 deletions(-)

diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 40556271d..9ea1db505 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -345,7 +345,11 @@ typedef struct xl_heap_freeze_tuple
  * pg_class tuple.
  *
  * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
- * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ * relminmxid_nofreeze_out must also be maintained.  If vacuumlazy.c caller
+ * opts to not execute freeze plans produced by heap_prepare_freeze_tuple for
+ * its own reasons, then new relfrozenxid and relminmxid values must reflect
+ * that that choice was made.  (This is only safe when 'freeze' is still unset
+ * after the final heap_prepare_freeze_tuple call for the page.)
  */
 typedef struct page_frozenxid_tracker
 {
@@ -356,7 +360,7 @@ typedef struct page_frozenxid_tracker
 	TransactionId relfrozenxid_out;
 	MultiXactId relminmxid_out;
 
-	/* Used by caller for '!freeze' pages */
+	/* Used by caller that opts not to freeze a '!freeze' page */
 	TransactionId relfrozenxid_nofreeze_out;
 	MultiXactId relminmxid_nofreeze_out;
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f..52379f819 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -256,6 +259,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 7dc401cf0..c6d8265cf 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 0aa4b334a..82f1aab89 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d6aea370f..699a5acae 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6429,7 +6429,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * WAL-log what we would need to do, and return true.  Return false if nothing
  * is to be changed.  In addition, set *totally_frozen to true if the tuple
  * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * more freezing will eventually be required (assuming page is to be frozen).
+ *
+ * Although this interface is primarily tuple-based, caller decides on whether
+ * or not to freeze the page as a whole.  We'll often help caller to prepare a
+ * complete "freeze plan" that it ultimately discards.  However, our caller
+ * doesn't always get to choose; it must freeze when xtrack.freeze is set
+ * here.  This ensures that any XIDs < limit_xid are never left behind.
  *
  * Caller must initialize xtrack fields for page as a whole before calling
  * here with first tuple for the page.  See page_frozenxid_tracker comments.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0004e7a44..c0b30c659 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -150,6 +150,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Proactively freeze all tuples on pages about to be set all-visible? */
+	bool		allvis_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -254,6 +256,7 @@ typedef struct LVSavedErrInfo
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -328,6 +331,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				eager_threshold,
 				all_visible,
 				all_frozen,
 				scanned_pages,
@@ -367,6 +371,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
 	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
 	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 *
+	 * Also determine our cutoff for applying the eager/all-visible freezing
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
+	 * even during non-aggressive VACUUMs.
 	 */
 	aggressive = vacuum_set_xid_limits(rel,
 									   params->freeze_min_age,
@@ -375,6 +383,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   params->multixact_freeze_table_age,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
+	eager_threshold = params->freeze_strategy_threshold < 0 ?
+		vacuum_freeze_strategy_threshold :
+		params->freeze_strategy_threshold;
 
 	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -529,7 +540,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
-	scanned_pages = lazy_scan_strategy(vacrel,
+	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
 									   all_visible, all_frozen);
 	if (verbose)
 	{
@@ -1304,17 +1315,28 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing totally avoids work that turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ * Antiwraparound VACUUMs of append-only tables should generally be avoided.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * important that relfrozenxid advance in affected tables, which are larger.
+ * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
+ * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
+ * depending on the extra cost - we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1347,21 +1369,48 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
 	else if (vacrel->aggressive)
+	{
+		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
+	}
+	else if (rel_pages >= eager_threshold)
+	{
+		/*
+		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
+		 * GUC-based threshold for eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
+		 */
+		vacrel->allvis_freeze_strategy = true;
+		vacrel->skipallvis = false;
+	}
 	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		vacrel->allvis_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1873,8 +1922,15 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When ongoing VACUUM opted to use the all-visible freezing strategy we
+	 * freeze any page that will become all-visible, making it all-frozen
+	 * instead. (Actually, there are edge-cases where this might not result in
+	 * marking the page all-frozen in the visibility map, but that should have
+	 * only a negligible impact.)
 	 */
-	if (xtrack.freeze || tuples_frozen == 0)
+	if (xtrack.freeze || tuples_frozen == 0 ||
+		(vacrel->allvis_freeze_strategy && prunestate->all_visible))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ccde07de..b837e0331 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1e90b72b7..2e4dd4090 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 87e625aa7..1c594dbe1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2487,6 +2487,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 90bec0502..e701e464e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -694,6 +694,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a5cd4e44c..091be17c3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9147,6 +9147,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9155,9 +9170,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9234,10 +9251,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages with an older multixact ID.  The
+        default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c14b2010d..7e684d187 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1680,6 +1680,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.34.1
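
To put the new GUC's default in concrete terms (a back-of-the-envelope
calculation, assuming the standard 8kB BLCKSZ): the boot value of
(4 * 1024 * 1024 * 1024) / BLCKSZ works out to 4294967296 / 8192 = 524288
heap pages.  In other words, any VACUUM of a table whose rel_pages is at or
above roughly 4GB of heap picks the eager freezing strategy, unless the GUC
or the per-table autovacuum_freeze_strategy_threshold reloption says
otherwise.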

Attachment: v4-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patch
From e031909109773e93b25b5111500bf6205c42de02 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 23 Jul 2022 17:19:01 -0700
Subject: [PATCH v4 6/6] Size VACUUM's dead_items space using VM snapshot.

Following recent work, VACUUM knows precisely how many pages it will scan
ahead of time, from its snapshot of the visibility map.  Apply that
information to size the dead_items space for TIDs more precisely (use
scanned_pages instead of rel_pages to cap the allocation).

This can make the memory allocation significantly smaller, without any
added risk of undersizing the array.  Since VACUUM's final scanned_pages
is fully predetermined (by the visibility map snapshot), there is no
question of interference from another backend that concurrently unsets
some heap page's visibility map bit.  Many details of how VACUUM will
process the target relation are "locked in" from the very beginning.
---
 src/backend/access/heap/vacuumlazy.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index db7136601..5762fc029 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -292,7 +292,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -589,7 +590,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
@@ -3277,14 +3278,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3293,15 +3293,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3323,12 +3321,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
-- 
2.34.1
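
To make the memory savings concrete (rough numbers only, assuming 8kB pages,
where MaxHeapTuplesPerPage is 291 and each dead TID takes 6 bytes): with
maintenance_work_mem set to 1GB, a very large table whose VM snapshot shows
just 50000 pages needing a scan now gets a dead_items cap of
50000 * 291 = 14550000 TIDs, an allocation of well under 100MB.  The old
rel_pages-based cap would have permitted an allocation of roughly the full
1GB for the same VACUUM.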

Attachment: v4-0005-Avoid-allocating-MultiXacts-during-VACUUM.patch
From 7dd869bb5578e92d07f91c690c92e5c9c51eaaae Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v4 5/6] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit.  Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations.  We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle.  VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

FreezeMultiXactId() is now eager when it's cheap to process xmax, and
lazy when it's expensive/risky to process xmax (because an expensive
second pass that might result in allocating a new Multi is required).
We make a similar trade-off to the one made by vacuumlazy.c when a
cleanup lock isn't available on some heap page.  We can usually put off
freezing (for the time being) when it's inconvenient to proceed.  The
only downside to this approach is that it necessitates pushing back the
final relfrozenxid/relminmxid value that can be set in pg_class.
---
 src/backend/access/heap/heapam.c | 49 +++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 699a5acae..e18000d81 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6111,11 +6111,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
  *
  * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ *
+ * Allocating new MultiXactIds during VACUUM is something that we always
+ * prefer to avoid.  Caller passes us down all the context required to do this
+ * on their behalf: cutoff_xid, cutoff_multi, limit_xid, and limit_multi.
+ *
+ * We must never leave behind XIDs/MXIDs from before the "limits" given.
+ * There is lots of leeway around XIDs/MXIDs >= caller's "limits", though.
+ * When it's possible to inexpensively process xmax right away, we're eager
+ * about it.  Otherwise we're lazy about it -- next time it might be cheap.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
+				  TransactionId limit_xid, MultiXactId limit_multi,
 				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
 {
 	TransactionId xid = InvalidTransactionId;
@@ -6208,13 +6218,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below the limit for Xids, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
+	 *
+	 * We use limit_xid for this (VACUUM's FreezeLimit), rather than using
+	 * cutoff_xid (VACUUM's OldestXmin).  We greatly prefer to avoid a second
+	 * pass over the Multi that results in allocating a new replacement Multi.
 	 */
-
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6225,12 +6238,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP optimization */
 	for (i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		if (TransactionIdPrecedes(members[i].xid, limit_xid))
 		{
 			need_replace = true;
 			break;
@@ -6239,11 +6251,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			temp_xid_out = members[i].xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* The Multi itself must be >= limit_multi, too */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, limit_multi);
+
 	if (!need_replace)
 	{
 		/*
@@ -6260,6 +6271,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_would_freeze will indicate that the tuple must be frozen.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6359,7 +6373,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				/*
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
+				 * cutoff_xid is VACUUM's OldestXmin, which is also the
 				 * initial value used for top-level relfrozenxid_out tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
@@ -6529,7 +6543,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given limit for Xids.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6547,6 +6561,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi,
+									limit_xid, limit_multi,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6587,12 +6602,18 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * MultiXactId, to carry forward two or more original member XIDs.
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
+			 *
+			 * We only do this when we have no choice; heap_tuple_would_freeze
+			 * will definitely force the page to be frozen below.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
 			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 xtrack->relfrozenxid_out));
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
 			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
-- 
2.34.1
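
A concrete example of the new laziness, sketched from the hunks above:
suppose a tuple's xmax is a MultiXactId that is >= limit_multi, and every
member XID is >= limit_xid (FreezeLimit), even though one or two members are
still < cutoff_xid (OldestXmin).  The first pass over the members now finds
nothing that must be replaced, so FreezeMultiXactId takes the FRM_NOOP path:
the existing xmax is left in place, and the oldest member XID is merely
carried forward via *mxid_oldest_xid_out (holding back the relfrozenxid that
this VACUUM can ultimately set), rather than performing the expensive second
pass that might have allocated a replacement MultiXactId.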

Attachment: v4-0004-Unify-aggressive-VACUUM-with-antiwraparound-VACUU.patch
From e1bba74d79e3746cb3f873de42b55374027397ed Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v4 4/6] Unify aggressive VACUUM with antiwraparound VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Before then, every lazy VACUUM
was "equally aggressive": each operation froze whatever tuples before
the age-wise cutoff needed to be frozen.  And each table's relfrozenxid
was updated at the end.  In short, the previous behavior was much less
efficient, but did at least have one thing going for it: it was much
easier to understand.

Even when the visibility map first went in, there was some awareness of
the problem of successive VACUUMs against the same table that had
unpredictable performance characteristics.  The vacuum_freeze_table_age
GUC was added to 8.4 (around the same time as the visibility map itself)
by commit 65878185, to ameliorate problems in this area.  The GUC made
the behavior of successive VACUUMs somewhat more continuous by forcing
"early aggressive VACUUMs" of tables whose relfrozenxid had already
attained an age that exceeded vacuum_freeze_table_age when VACUUM began.
An aggressive VACUUM could therefore sometimes take place before an
aggressive antiwraparound autovacuum was triggered.

Before now, antiwraparound autovacuums were (for the most part) just another
way for autovacuum.c to trigger an autovacuum worker.  Although
antiwraparound "implies aggressive", aggressive has never "implied
antiwraparound".  This is a consequence of having two table-age GUCs
that both influence VACUUM's behavior around relfrozenxid advancement.
While table age certainly does matter, it's far from the only thing that
matters.  And while we should sometimes "behave aggressively", it's more
useful to structure everything as being on a continuum between laziness
and eagerness/aggressiveness.

Rather than relying on vacuum_freeze_table_age to "escalate to an
aggressive VACUUM early", the skipping strategy infrastructure now gives
some consideration to table age (in addition to the target relation's
physical characteristics, in particular the number of extra all-visible
blocks).  The costs and the benefits are weighed against each other.
The closer we get to needing an antiwraparound VACUUM, the less
concerned we are about the added cost of advancing relfrozenxid in the
ongoing VACUUM.

We now _start_ to give _some_ weight to table age at a relatively early
stage, when the table's age(relfrozenxid) first crosses the half-way
point (autovacuum_freeze_max_age/2, or the multixact equivalent).  Once
we're very close to the point of antiwraparound for a given table, any
VACUUM against that table will automatically choose the eager skipping
strategy.

The concept of aggressive VACUUM is now merged with the concept of
antiwraparound VACUUM.  Note that this means that a manually issued
VACUUM command can now sometimes be classified as an antiwraparound
VACUUM (and get reported as such in VERBOSE output).

The default value of vacuum_freeze_table_age is now -1, which is
interpreted as "the current value of the autovacuum_freeze_max_age GUC".
The "table age" GUCs/reloptions can be used as compatibility options,
but are otherwise superseded by VACUUM's freezing and skipping
strategies.
---
 src/include/commands/vacuum.h                 |   5 +-
 src/include/storage/proc.h                    |   4 +-
 src/backend/access/heap/vacuumlazy.c          | 171 ++++---
 src/backend/commands/cluster.c                |   3 +-
 src/backend/commands/vacuum.c                 |  96 ++--
 src/backend/postmaster/autovacuum.c           |   4 +-
 src/backend/storage/lmgr/proc.c               |   2 +-
 src/backend/utils/activity/pgstat_relation.c  |   6 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   4 +-
 doc/src/sgml/config.sgml                      |  80 ++--
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 451 ++++++------------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  23 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   5 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  27 +-
 src/test/regress/expected/reloptions.out      |   4 +-
 src/test/regress/sql/reloptions.sql           |   4 +-
 21 files changed, 424 insertions(+), 503 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 52379f819..5bb33a2bb 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -223,7 +223,7 @@ typedef struct VacuumParams
 	int			freeze_strategy_threshold;	/* threshold to use eager
 											 * freezing, in total heap blocks,
 											 * -1 to use default */
-	bool		is_wraparound;	/* force a for-wraparound vacuum */
+	bool		is_antiwrap_autovac;	/* antiwraparound autovacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
 									 * default */
@@ -298,7 +298,8 @@ extern bool vacuum_set_xid_limits(Relation rel,
 								  TransactionId *oldestXmin,
 								  MultiXactId *oldestMxact,
 								  TransactionId *freezeLimit,
-								  MultiXactId *multiXactCutoff);
+								  MultiXactId *multiXactCutoff,
+								  double *antiwrapfrac);
 extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
 									  MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619e..785ef610d 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -57,7 +57,7 @@ struct XidCache
 										 * CONCURRENTLY or REINDEX
 										 * CONCURRENTLY on non-expressional,
 										 * non-partial index */
-#define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
+#define		PROC_AUTOVACUUM_FOR_WRAPAROUND	0x08	/* affects cancellation */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
 #define		PROC_AFFECTS_ALL_HORIZONS	0x20	/* this proc's xmin must be
@@ -66,7 +66,7 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_AUTOVACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c0b30c659..db7136601 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,11 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages when using the lazy freezing strategy
+ * Thresholds that control whether VACUUM will skip any all-visible pages when
+ * using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
+#define SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES	0.15
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -144,9 +145,9 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
-	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	/* Antiwraparound VACUUM? (must set relfrozenxid >= FreezeLimit) */
+	bool		antiwraparound;
+	/* Skip (don't scan) all-visible pages? (must be !antiwraparound) */
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
@@ -258,7 +259,8 @@ static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
 									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
-									  BlockNumber all_frozen);
+									  BlockNumber all_frozen,
+									  double antiwrapfrac);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
 								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
@@ -322,7 +324,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
+				antiwraparound,
 				skipallfrozen,
 				frozenxid_updated,
 				minmulti_updated;
@@ -330,6 +332,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				FreezeLimit;
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
+	double		antiwrapfrac;
 	BlockNumber orig_rel_pages,
 				eager_threshold,
 				all_visible,
@@ -369,32 +372,42 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
 	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 * an antiwraparound VACUUM then lazy_scan_heap cannot leave behind
+	 * unfrozen XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to
+	 * go away).
 	 *
 	 * Also determine our cutoff for applying the eager/all-visible freezing
 	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
-	 * even during non-aggressive VACUUMs.
+	 * even during regular VACUUMs.
 	 */
-	aggressive = vacuum_set_xid_limits(rel,
-									   params->freeze_min_age,
-									   params->multixact_freeze_min_age,
-									   params->freeze_table_age,
-									   params->multixact_freeze_table_age,
-									   &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
+	antiwraparound = vacuum_set_xid_limits(rel,
+										   params->freeze_min_age,
+										   params->multixact_freeze_min_age,
+										   params->freeze_table_age,
+										   params->multixact_freeze_table_age,
+										   &OldestXmin, &OldestMxact,
+										   &FreezeLimit, &MultiXactCutoff,
+										   &antiwrapfrac);
 	eager_threshold = params->freeze_strategy_threshold < 0 ?
 		vacuum_freeze_strategy_threshold :
 		params->freeze_strategy_threshold;
 
+	/*
+	 * An autovacuum to prevent wraparound should already be recognized as
+	 * antiwraparound based on generic criteria.  Even still, make sure that
+	 * antiwraparound based on generic criteria.  Even so, make sure that
+	 */
+	if (params->is_antiwrap_autovac)
+		antiwraparound = true;
+
 	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
-		 * Force aggressive mode, and disable skipping blocks using the
+		 * Force antiwraparound mode, and disable skipping blocks using the
 		 * visibility map (even those set all-frozen)
 		 */
-		aggressive = true;
+		antiwraparound = true;
 		skipallfrozen = false;
 	}
 
@@ -445,9 +458,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
+	vacrel->antiwraparound = antiwraparound;
 	/* Set skipallvis/skipallfrozen provisionally (before lazy_scan_strategy) */
-	vacrel->skipallvis = (!aggressive && skipallfrozen);
+	vacrel->skipallvis = (!antiwraparound && skipallfrozen);
 	vacrel->skipallfrozen = skipallfrozen;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
@@ -541,7 +554,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
 	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
-									   all_visible, all_frozen);
+									   all_visible, all_frozen,
+									   antiwrapfrac);
 	if (verbose)
 	{
 		char	   *msgfmt;
@@ -549,8 +563,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 		Assert(!IsAutoVacuumWorkerProcess());
 
-		if (aggressive)
-			msgfmt = _("aggressively vacuuming \"%s.%s.%s\"");
+		if (antiwraparound)
+			msgfmt = _("vacuuming \"%s.%s.%s\" to prevent wraparound");
 		else
 			msgfmt = _("vacuuming \"%s.%s.%s\"");
 
@@ -617,25 +631,25 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
+	 * Antiwraparound VACUUMs must always be able to advance relfrozenxid to a
 	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * Regular VACUUMs may advance them by any amount, or not at all.
 	 */
 	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
+		   TransactionIdPrecedesOrEquals(antiwraparound ? FreezeLimit :
 										 vacrel->relfrozenxid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
+		   MultiXactIdPrecedesOrEquals(antiwraparound ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
 	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * Must keep original relfrozenxid in a regular VACUUM whose
 		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
-		Assert(!aggressive);
+		Assert(!antiwraparound);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -709,29 +723,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			if (verbose)
 			{
 				/*
-				 * Aggressiveness already reported earlier, in dedicated
+				 * Antiwraparound-ness already reported earlier, in dedicated
 				 * VACUUM VERBOSE ereport
 				 */
-				Assert(!params->is_wraparound);
+				Assert(!params->is_antiwrap_autovac);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
-			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
 			else
 			{
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+				/*
+				 * Note that we don't differentiate between an antiwraparound
+				 * autovacuum that was launched by autovacuum.c as
+				 * antiwraparound and one that only became antiwraparound
+				 * because freeze_table_age is set.
+				 */
+				Assert(IsAutoVacuumWorkerProcess());
+				if (antiwraparound)
+					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -1063,7 +1071,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
 			 */
-			Assert(vacrel->aggressive);
+			Assert(vacrel->antiwraparound);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
 		}
@@ -1325,18 +1333,22 @@ lazy_scan_heap(LVRelState *vacrel)
  * Antiwraparound VACUUMs of append-only tables should generally be avoided.
  *
  * Also determines if the ongoing VACUUM operation should skip all-visible
- * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
- * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * pages for regular VACUUMs, where advancing relfrozenxid is optional.  When
+ * VACUUM freezes eagerly it always also scans pages eagerly, since it's
  * important that relfrozenxid advance in affected tables, which are larger.
  * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
  * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
  * depending on the extra cost - we might need to scan only a few extra pages.
+ * Decision is based in part on caller's antiwrapfrac argument, which is a
+ * value from 0.0 - 1.0 that represents how close the table age is to needing
+ * an antiwraparound VACUUM.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
 lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
-				   BlockNumber all_visible, BlockNumber all_frozen)
+				   BlockNumber all_visible, BlockNumber all_frozen,
+				   double antiwrapfrac)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1379,21 +1391,21 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		Assert(vacrel->antiwraparound && !vacrel->skipallvis);
 		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
-	else if (vacrel->aggressive)
+	else if (vacrel->antiwraparound)
 	{
-		/* Always freeze all-visible pages during aggressive VACUUMs */
+		/* Always freeze all-visible pages during antiwraparound VACUUMs */
 		Assert(!vacrel->skipallvis);
 		vacrel->allvis_freeze_strategy = true;
 	}
 	else if (rel_pages >= eager_threshold)
 	{
 		/*
-		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
-		 * GUC-based threshold for eager freezing.
+		 * Regular VACUUM of table whose rel_pages now exceeds GUC-based
+		 * threshold for eager freezing.
 		 *
 		 * We always scan all-visible pages when the threshold is crossed, so
 		 * that relfrozenxid can be advanced.  There will typically be few or
@@ -1408,7 +1420,7 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		BlockNumber nextra,
 					nextra_threshold;
 
-		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		/* Regular VACUUM of small table -- use lazy freeze strategy */
 		vacrel->allvis_freeze_strategy = false;
 
 		/*
@@ -1424,13 +1436,32 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		 * that way, so be lazy (just skip) unless the added cost is very low.
 		 * We opt for a skipallfrozen-only VACUUM when the number of extra
 		 * pages (extra scanned pages that are all-visible but not all-frozen)
-		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small)
+		 * if relfrozenxid has yet to attain an age that uses 50% of the XID
+		 * space available before an antiwraparound VACUUM becomes necessary.
+		 * A more aggressive threshold of 15% is used when relfrozenxid is
+		 * older than that.
 		 */
 		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
-		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+
+		if (antiwrapfrac < 0.5)
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_THRESHOLD_PAGES;
+		else
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES;
+
 		nextra_threshold = Max(32, nextra_threshold);
 
-		vacrel->skipallvis = nextra >= nextra_threshold;
+		/*
+		 * We always prefer eagerly advancing relfrozenxid when it has already
+		 * attained an age that consumes >= 90% of the available XID space
+		 * before the crossover point for antiwraparound VACUUM.
+		 */
+		if (antiwrapfrac < 0.9)
+			vacrel->skipallvis = nextra >= nextra_threshold;
+		else
+			vacrel->skipallvis = false;
 	}
 
 	/* Return the appropriate variant of scanned_pages */
@@ -2080,11 +2111,11 @@ retry:
  * operation left LP_DEAD items behind.  We'll at least collect any such items
  * in the dead_items array for removal from indexes.
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * For antiwraparound VACUUM callers, we may return false to indicate that a
+ * full cleanup lock is required for processing by lazy_scan_prune.  This is
+ * only necessary when the antiwraparound VACUUM needs to freeze some tuple
+ * XIDs from one or more tuples on the page.  We always return true for
+ * regular VACUUM callers.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
@@ -2157,13 +2188,13 @@ lazy_scan_noprune(LVRelState *vacrel,
 									&NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
+			if (vacrel->antiwraparound)
 			{
 				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
+				 * Antiwraparound VACUUMs must always be able to advance rel's
 				 * relfrozenxid to a value >= FreezeLimit (and be able to
 				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
+				 * The ongoing antiwraparound VACUUM won't be able to do that
 				 * unless it can freeze an XID (or MXID) from this tuple now.
 				 *
 				 * The only safe option is to have caller perform processing
@@ -2175,8 +2206,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 			}
 
 			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
+			 * Regular VACUUMs are under no obligation to advance relfrozenxid
+			 * (even by one XID).  We can be much laxer here.
 			 *
 			 * Currently we always just accept an older final relfrozenxid
 			 * and/or relminmxid value.  We never make caller wait or work a
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 1976a373e..24aa096e0 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -826,6 +826,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 				FreezeXid;
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
+	double		antiwrapfrac;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -914,7 +915,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+						  &FreezeXid, &MultiXactCutoff, &antiwrapfrac);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b837e0331..81686ccce 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -267,8 +267,8 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 	/* Determine freezing strategy later on using GUC or reloption */
 	params.freeze_strategy_threshold = -1;
 
-	/* user-invoked vacuum is never "for wraparound" */
-	params.is_wraparound = false;
+	/* user-invoked vacuum isn't an autovacuum */
+	params.is_antiwrap_autovac = false;
 
 	/* user-invoked vacuum uses VACOPT_VERBOSE instead of log_min_duration */
 	params.log_min_duration = -1;
@@ -943,14 +943,20 @@ get_all_vacuum_rels(int options)
  * - oldestMxact is the Mxid below which MultiXacts are definitely not
  *   seen as visible by any running transaction.
  * - freezeLimit is the Xid below which all Xids are definitely replaced by
- *   FrozenTransactionId during aggressive vacuums.
+ *   FrozenTransactionId during antiwraparound vacuums.
  * - multiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
+ *   removed from Xmax during antiwraparound vacuums.
  *
  * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
- * minimum).
+ * operation antiwraparound.  An antiwraparound VACUUM is required to advance
+ * relfrozenxid up to FreezeLimit (at a minimum), and relminmxid up to
+ * multiXactCutoff (at a minimum).  Otherwise VACUUM advances relfrozenxid on
+ * a best-effort basis.
+ *
+ * Sets *antiwrapfrac to give caller a sense of how close we came to requiring
+ * an antiwraparound VACUUM in terms of XID/MXID space consumed.  This is set
+ * to a value between 0.0 and 1.0, where 1.0 represents the point that an
+ * antiwraparound VACUUM will be (or has already) been forced.
  *
  * oldestXmin and oldestMxact are the most recent values that can ever be
  * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
@@ -966,15 +972,20 @@ vacuum_set_xid_limits(Relation rel,
 					  TransactionId *oldestXmin,
 					  MultiXactId *oldestMxact,
 					  TransactionId *freezeLimit,
-					  MultiXactId *multiXactCutoff)
+					  MultiXactId *multiXactCutoff,
+					  double *antiwrapfrac)
 {
 	TransactionId nextXID,
 				safeOldestXmin,
-				aggressiveXIDCutoff;
+				antiwrapXIDCutoff;
 	MultiXactId nextMXID,
 				safeOldestMxact,
-				aggressiveMXIDCutoff;
-	int			effective_multixact_freeze_max_age;
+				antiwrapMXIDCutoff;
+	double		XIDFrac,
+				MXIDFrac;
+	int			effective_multixact_freeze_max_age,
+				relfrozenxid_age,
+				relminmxid_age;
 
 	/*
 	 * Acquire oldestXmin.
@@ -1065,8 +1076,8 @@ vacuum_set_xid_limits(Relation rel,
 		*multiXactCutoff = *oldestMxact;
 
 	/*
-	 * Done setting output parameters; check if oldestXmin or oldestMxact are
-	 * held back to an unsafe degree in passing
+	 * Done setting cutoff output parameters; check if oldestXmin or
+	 * oldestMxact are held back to an unsafe degree in passing
 	 */
 	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
 	if (!TransactionIdIsNormal(safeOldestXmin))
@@ -1085,48 +1096,59 @@ vacuum_set_xid_limits(Relation rel,
 				 errhint("Close open transactions soon to avoid wraparound problems.\n"
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 
+	*antiwrapfrac = 1.0;		/* Initialize */
+
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
+	 * Finally, figure out if caller needs to do an antiwraparound VACUUM now.
 	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
+	antiwrapXIDCutoff = nextXID - freeze_table_age;
+	if (!TransactionIdIsNormal(antiwrapXIDCutoff))
+		antiwrapXIDCutoff = FirstNormalTransactionId;
 	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
+									  antiwrapXIDCutoff))
 		return true;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.   The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
+	antiwrapMXIDCutoff = nextMXID - multixact_freeze_table_age;
+	if (antiwrapMXIDCutoff < FirstMultiXactId)
+		antiwrapMXIDCutoff = FirstMultiXactId;
 	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
+									antiwrapMXIDCutoff))
 		return true;
 
-	/* Non-aggressive VACUUM */
+	/*
+	 * Regular VACUUM for vacuumlazy.c caller.  Need to work out how close we
+	 * came to needing an antiwraparound VACUUM.
+	 */
+	relfrozenxid_age = Max(nextXID - rel->rd_rel->relfrozenxid, 1);
+	relminmxid_age = Max(nextMXID - rel->rd_rel->relminmxid, 1);
+	XIDFrac = (double) relfrozenxid_age / (double) freeze_table_age;
+	MXIDFrac = (double) relminmxid_age / (double) multixact_freeze_table_age;
+
+	*antiwrapfrac = Max(XIDFrac, MXIDFrac);
+
 	return false;
 }
 
@@ -1869,8 +1891,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 */
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyProc->statusFlags |= PROC_IN_VACUUM;
-		if (params->is_wraparound)
-			MyProc->statusFlags |= PROC_VACUUM_FOR_WRAPAROUND;
+		if (params->is_antiwrap_autovac)
+			MyProc->statusFlags |= PROC_AUTOVACUUM_FOR_WRAPAROUND;
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 		LWLockRelease(ProcArrayLock);
 	}
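
(Aside, not part of the patch: the new antiwrapfrac output parameter is just
relfrozenxid/relminmxid age expressed as a fraction of the effective
freeze_table_age cutoffs computed above.  Assuming the new -1 defaults are in
effect -- so the cutoffs resolve to autovacuum_freeze_max_age and its
multixact analog -- and ignoring per-table reloptions, a roughly equivalent
per-table figure can be eyeballed from SQL with something like:

SELECT c.oid::regclass AS table_name,
       greatest(
           age(c.relfrozenxid)::numeric
               / current_setting('autovacuum_freeze_max_age')::numeric,
           mxid_age(c.relminmxid)::numeric
               / current_setting('autovacuum_multixact_freeze_max_age')::numeric
       ) AS antiwrap_fraction
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
ORDER BY antiwrap_fraction DESC;

Values approaching 1.0 indicate tables that are about to be forced into
antiwraparound mode.)
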
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2e4dd4090..4b68d0c5e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2882,7 +2882,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
 		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
-		tab->at_params.is_wraparound = wraparound;
+		tab->at_params.is_antiwrap_autovac = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
 		tab->at_vacuum_cost_delay = vac_cost_delay;
@@ -3194,7 +3194,7 @@ autovac_report_activity(autovac_table *tab)
 
 	snprintf(activity + len, MAX_AUTOVAC_ACTIV_LEN - len,
 			 " %s.%s%s", tab->at_nspname, tab->at_relname,
-			 tab->at_params.is_wraparound ? " (to prevent wraparound)" : "");
+			 tab->at_params.is_antiwrap_autovac ? " (to prevent wraparound)" : "");
 
 	/* Set statement_timestamp() to current time for pg_stat_activity */
 	SetCurrentStatementStartTimestamp();
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 37aaab133..158eab321 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1384,7 +1384,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 			 * wraparound.
 			 */
 			if ((statusFlags & PROC_IS_AUTOVACUUM) &&
-				!(statusFlags & PROC_VACUUM_FOR_WRAPAROUND))
+				!(statusFlags & PROC_AUTOVACUUM_FOR_WRAPAROUND))
 			{
 				int			pid = autovac->pid;
 
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index a846d9ffb..3b7618b6d 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -234,9 +234,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->n_dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
-	 * It's currently used only to track when we need to perform an "insert"
+	 * It is quite possible that a regular VACUUM ended up skipping various
+	 * pages, however, we'll zero the insert counter here regardless.  It's
+	 * currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
 	 * until enough tuples have been inserted to trigger another insert
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c594dbe1..913a7df36 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2460,10 +2460,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2480,10 +2480,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e701e464e..81502d0ca 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,11 +693,11 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
+#vacuum_freeze_table_age = -1
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
+#vacuum_multixact_freeze_table_age = -1
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 091be17c3..01f246464 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8217,7 +8217,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8406,7 +8406,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9130,20 +9130,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> performs antiwraparound vacuuming if the
+        table's <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        Antiwraparound vacuuming differs from regular vacuuming in
+        that it will reliably advance
+        <structfield>relfrozenxid</structfield> to a recent value,
+        even when <command>VACUUM</command> wouldn't usually deem it
+        necessary.  The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and
+         advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9179,9 +9190,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
-        that there is not an unreasonably short time between forced
+        that there is not an unreasonably short time between forced antiwraparound
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9227,19 +9238,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+        <command>VACUUM</command> performs antiwraparound vacuuming if
+        the table's <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+        field has reached the multixact age specified by this setting.
+        Antiwraparound vacuuming differs from regular vacuuming in
+        that it will reliably advance
+        <structfield>relminmxid</structfield> to a recent value, even
+        when <command>VACUUM</command> wouldn't usually deem it
+        necessary.  The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>
+        is used.  For more information see <xref
+         linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and
+         advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9258,7 +9278,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
-        so that there is not an unreasonably short time between forced
+        so that there is not an unreasonably short time between forced antiwraparound
         autovacuums.
         For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 554b3a75d..80fd3d548 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,18 +400,8 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
-
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
-
-    <indexterm>
-     <primary>wraparound</primary>
-     <secondary>of transaction IDs</secondary>
-    </indexterm>
+  <sect2 id="vacuum-xid-space">
+   <title>Managing the 32-bit Transaction ID Address Space</title>
 
    <para>
     <productname>PostgreSQL</productname>'s
@@ -419,165 +409,23 @@
     depend on being able to compare transaction ID (<acronym>XID</acronym>)
     numbers: a row version with an insertion XID greater than the current
     transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
+    to the current transaction.  But since the on-disk representation
+    of transaction IDs is only 32 bits wide, the system is incapable of
+    representing <emphasis>distances</emphasis> between any two XIDs
+    that exceed about 2 billion transaction IDs.
    </para>
 
    <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
-    <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
-    </para>
-    <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
-    </para>
-   </note>
-
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
+    One of the purposes of periodic vacuuming is to manage the
+    transaction ID address space.  <command>VACUUM</command> will mark
+    rows as <emphasis>frozen</emphasis>, indicating that they were
+    inserted by a transaction that committed sufficiently far in the
+    past that the effects of the inserting transaction are certain to
+    be visible to all current and future transactions.  There is, in
+    effect, an infinite distance between a frozen transaction ID and
+    any unfrozen transaction ID.  This allows the on-disk
+    representation of transaction IDs to recycle the 32-bit address
+    space efficiently.
    </para>
 
    <para>
@@ -587,15 +435,15 @@
     <structname>pg_database</structname>.  In particular,
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
-    transactions older than this cutoff XID are guaranteed to have been frozen.
-    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    XID at the end of the most recent <command>VACUUM</command>.  All
+    rows inserted by transactions older than this cutoff XID are
+    guaranteed to have been frozen.  Similarly, the
+    <structfield>datfrozenxid</structfield> column of a database's
+    <structname>pg_database</structname> row is a lower bound on the
+    unfrozen XIDs appearing in that database &mdash; it is just the
+    minimum of the per-table <structfield>relfrozenxid</structfield>
+    values within the database.  A convenient way to examine this
+    information is to execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -611,89 +459,13 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
     cutoff XID to the current transaction's XID.
    </para>
 
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
-    </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
-
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId Address Space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
     </indexterm>
 
-    <indexterm>
-     <primary>wraparound</primary>
-     <secondary>of multixact IDs</secondary>
-    </indexterm>
-
     <para>
      <firstterm>Multixact IDs</firstterm> are used to support row locking by
      multiple transactions.  Since there is only limited space in a tuple
@@ -704,49 +476,137 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
+    </para>
+    <para>
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="controlling-freezing">
+    <title>Controlling Freezing</title>
+   <para>
+    As a general rule, the more tuples that <command>VACUUM</command>
+    freezes, the further <command>VACUUM</command> can advance the
+    table's <structfield>relfrozenxid</structfield> and
+    <structfield>relminmxid</structfield> fields afterwards.
+    <xref linkend="guc-vacuum-freeze-min-age"/> and <xref
+     linkend="guc-vacuum-multixact-freeze-min-age"/> control how old
+    an XID or MultiXactId value has to be before a row bearing that
+    value will be frozen (absent any other factor that triggers
+    freezing).  These settings only take effect in smaller tables that
+    use the lazy freezing strategy (controlled by
+    <xref linkend="guc-vacuum-freeze-strategy-threshold"/>).
+    Increasing these settings may avoid unnecessary work, but doing so
+    isn't generally recommended.
+   </para>
+
+   <tip>
+    <para>
+     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+     parameter is specified, <command>VACUUM</command> prints various
+     statistics about the table.  This includes information about how
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> advanced.  The same details appear
+     in the server log when autovacuum logging (controlled by <xref
+      linkend="guc-log-autovacuum-min-duration"/>) reports on a
+     <command>VACUUM</command> operation executed by autovacuum.
+    </para>
+   </tip>
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-Wraparound VACUUM</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> can only be advanced when
+     <command>VACUUM</command> actually runs.  Even then,
+     <command>VACUUM</command> must scan every page of the table that
+     might contain unfrozen XIDs.  <command>VACUUM</command> usually
+     advances <structfield>relfrozenxid</structfield> on a best-effort
+     basis, weighing costs against benefits.  This approach spreads
+     out the burden of freezing over time, across multiple
+     <command>VACUUM</command> operations.  However, if no
+     <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be forced for the table.
+     This will reliably set <structfield>relfrozenxid</structfield>
+     and <structfield>relminmxid</structfield> to relatively recent
+     values.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuums might occur more often than this.
     </para>
 
     <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
 
-    <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover without data loss, by manually executing the
+     required <command>VACUUM</command> commands.  However, since the system will not
+     execute commands once it has gone into the safety shutdown mode,
+     the only way to do this is to stop the server and start the server in single-user
+     mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
+     in single-user mode.  See the <xref linkend="app-postgres"/> reference
+     page for details about using single-user mode.
     </para>
    </sect3>
+
   </sect2>
 
   <sect2 id="autovacuum">
@@ -832,22 +692,13 @@ vacuum insert threshold = vacuum base insert threshold + vacuum insert scale fac
     and vacuum insert scale factor is
     <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
     Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
+    <firstterm>all visible</firstterm> and also allow tuples to be frozen.
+    The number of obsolete tuples and
     the number of inserted tuples are obtained from the cumulative statistics system;
     it is a semi-accurate count updated by each <command>UPDATE</command>,
     <command>DELETE</command> and <command>INSERT</command> operation.  (It is
     only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
+    load.)
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 7e684d187..74a61abe2 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1501,7 +1501,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c582021d2..42360f165 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,12 +119,12 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
+      Selects eager <quote>freezing</quote> of tuples, and forces
+      antiwraparound mode.  Specifying <literal>FREEZE</literal> is
+      equivalent to performing <command>VACUUM</command> with the
+      <xref linkend="guc-vacuum-freeze-min-age"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters
+      set to zero.  Eager freezing is always performed when the
       table is rewritten, so this option is redundant when <literal>FULL</literal>
       is specified.
      </para>
@@ -154,13 +154,8 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>DISABLE_PAGE_SKIPPING</literal></term>
     <listitem>
      <para>
-      Normally, <command>VACUUM</command> will skip pages based on the <link
-      linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      Normally, <command>VACUUM</command> will skip pages based on the
+      <link linkend="vacuum-for-visibility-map">visibility map</link>.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -215,7 +210,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..fdc81a237 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,8 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples and force antiwraparound
+        mode.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +260,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..6a266033a 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_regular_vacuum vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_regular_vacuum vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_regular_vacuum vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_regular_vacuum vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_regular_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_regular_vacuum pinholder_commit vacuumer_regular_vacuum
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_regular_vacuum: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..28fb52433 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,15 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs regular VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin):
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_regular_vacuum
 {
   VACUUM smalltbl;
 }
@@ -75,15 +75,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page ("missed dead" tuples are counted in
+# reltuples, much like "recently dead" tuples).
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +91,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +102,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +115,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +127,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which only antiwraparound VACUUM is willing to do).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +135,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_regular_vacuum
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..a02348900 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,7 +102,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
@@ -128,7 +128,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..6c727695e 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,7 +61,7 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
@@ -72,7 +72,7 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
+-- Do an antiwraparound vacuum to prevent page-skipping.
 VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
-- 
2.34.1

Attachment: v4-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch
From 10085894d6a664706f1f08a173dea6f82f03a035 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v4 2/6] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now the policy on
skipping all-visible pages coincides exactly with the condition for
safely advancing relfrozenxid later on; nothing is left to chance.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
---
 src/include/access/visibilitymap.h      |   7 +
 src/backend/access/heap/vacuumlazy.c    | 356 ++++++++++++++----------
 src/backend/access/heap/visibilitymap.c | 162 +++++++++++
 3 files changed, 375 insertions(+), 150 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index abda286b7..0004e7a44 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -146,8 +146,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -177,7 +179,8 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map used by lazy_scan_skip */
+	vmsnapshot *vmsnap;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -250,10 +253,11 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
+								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -316,7 +320,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
+				skipallfrozen,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
@@ -324,6 +328,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -369,7 +376,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
 
-	skipwithvm = true;
+	skipallfrozen = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
@@ -377,7 +384,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * visibility map (even those set all-frozen)
 		 */
 		aggressive = true;
-		skipwithvm = false;
+		skipallfrozen = false;
 	}
 
 	/*
@@ -402,20 +409,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -442,7 +435,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	/* Set skipallvis/skipallfrozen provisionally (before lazy_scan_strategy) */
+	vacrel->skipallvis = (!aggressive && skipallfrozen);
+	vacrel->skipallfrozen = skipallfrozen;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -503,12 +498,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
@@ -521,7 +510,51 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
+	 * headers to be able to safely advance relfrozenxid later on.  There is
+	 * no good reason to scan any additional pages. (Actually we might opt to
+	 * skip all-visible pages.  Either way we won't scan pages for no reason.)
+	 *
+	 * Now that OldestXmin and rel_pages are acquired, acquire an immutable
+	 * snapshot of the visibility map as well.  lazy_scan_skip works off of
+	 * the vmsnap, not the authoritative VM, which can continue to change.
+	 * Pages that lazy_scan_heap will scan are fixed and known in advance.
+	 *
+	 * The exact number of pages that lazy_scan_heap will scan also depends on
+	 * our choice of skipping strategy.  VACUUM can either choose to skip any
+	 * all-visible pages lazily, or choose to scan those same pages instead.
+	 * Decide on a skipping strategy to determine final scanned_pages.
+	 */
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   all_visible, all_frozen);
+	if (verbose)
+	{
+		char	   *msgfmt;
+		StringInfoData buf;
+
+		Assert(!IsAutoVacuumWorkerProcess());
+
+		if (aggressive)
+			msgfmt = _("aggressively vacuuming \"%s.%s.%s\"");
+		else
+			msgfmt = _("vacuuming \"%s.%s.%s\"");
+
+		initStringInfo(&buf);
+		appendStringInfo(&buf, msgfmt, get_database_name(MyDatabaseId),
+						 vacrel->relnamespace, vacrel->relname);
+
+		ereport(INFO,
+				(errmsg_internal("%s", buf.data),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
+		pfree(buf.data);
+	}
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -538,6 +571,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -584,12 +618,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -634,6 +667,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -857,13 +893,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -877,42 +912,24 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = lazy_scan_skip(vacrel, 0, &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		if (blkno < next_block_to_scan)
+			continue;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = lazy_scan_skip(vacrel, blkno + 1,
+											&next_all_visible);
 
 		vacrel->scanned_pages++;
 
@@ -1122,10 +1139,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1153,12 +1169,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1197,7 +1211,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1290,47 +1304,121 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
  *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * Determines whether a non-aggressive VACUUM (for which advancing
+ * relfrozenxid is optional) should skip all-visible pages.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
 
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
+	if (!vacrel->skipallfrozen)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		return rel_pages;
+	}
+	else if (vacrel->aggressive)
+		Assert(!vacrel->skipallvis);
+	else
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		vacrel->skipallvis = nextra >= nextra_threshold;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+		return scanned_pages_skipallvis;
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* keep compiler quiet */
+}
+
+/*
+ *	lazy_scan_skip() -- get the next block to scan according to vmsnap.
+ *
+ * lazy_scan_heap() caller passes the next block in line.  We return the next
+ * block to scan.  Caller skips the blocks preceding returned block, if any.
+ *
+ * The all-visible status of the returned block is set in *all_visible, too.
+ * The block usually won't be all-visible (since it's unskippable), but it can
+ * be when next_block is rel's last page, or when DISABLE_PAGE_SKIPPING is in use.
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block, bool *all_visible)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				next_block_to_scan = next_block;
+
+	*all_visible = true;
+	while (next_block_to_scan < rel_pages)
+	{
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_block_to_scan);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
 			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
+			*all_visible = false;
 			break;
 		}
 
@@ -1341,58 +1429,26 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
-		if (next_unskippable_block == rel_pages - 1)
+		if (next_block_to_scan == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
+		next_block_to_scan++;
 	}
 
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
-	}
-
-	return next_unskippable_block;
+	return next_block_to_scan;
 }
 
 /*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index ed72eb7b6..6848576fd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	char		vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -368,6 +390,146 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *buf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snap_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of heap pages whose VM bit is concurrently unset; VACUUM
+ * prefers to leave such pages to be scanned by the next VACUUM operation.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) + BLCKSZ * nvmpages);
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all transactions, or marked frozen,
+ * according to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
-- 
2.34.1

#17 John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#16)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c. It
could help us to replace a simple array of item pointers (the current
dead_items array) with a faster and more space-efficient data
structure. Masahiko Sawada has done a lot of work on this recently, so
this may interest him.

I don't quite see how it helps "enable" that. It'd be more logical to
me to say the VM snapshot *requires* you to think harder about
resource management, since a palloc'd snapshot should surely be
counted as part of the configured memory cap that admins control.
(Commonly, it'll be less than a few dozen MB, so I'll leave that
aside.) Since Masahiko hasn't (to my knowledge) gone as far as
integrating his ideas into vacuum, I'm not sure if the current state
of affairs has some snag that a snapshot will ease, but if there is,
you haven't described what it is.

I do remember your foreshadowing in the radix tree thread a while
back, and I do think it's an intriguing idea to combine pages-to-scan
and dead TIDs in the same data structure. The devil is in the details,
of course. It's worth looking into.

VM snapshots could also make it practical for the new data structure
to spill to disk to avoid multiple index scans/passes by VACUUM.

I'm not sure spilling to disk is solving the right problem (as opposed
to the hash join case, or to the proposed conveyor belt system which
has a broader aim). I've found several times that a customer will ask
if raising maintenance work mem from 1GB to 10GB will make vacuum
faster. Looking at the count of index scans, it's pretty much always
"1", so even if the current approach could scale above 1GB, "no" it
wouldn't help to raise that limit.

Your mileage may vary, of course.

Continuing my customer example, searching the dead TID list faster
*will* make vacuum faster. The proposed tree structure is more memory
efficient, and IIUC could scale beyond 1GB automatically since each
node is a separate allocation, so the answer will be "yes" in the rare
case the current setting is in fact causing multiple index scans.
Furthermore, it doesn't have to anticipate the maximum size, so there
is no up front calculation assuming max-tuples-per-page, so it
automatically uses less memory for less demanding tables.

(But +1 for changing that calculation for as long as we do have the
single array.)

--
John Naylor
EDB: http://www.enterprisedb.com

In reply to: John Naylor (#17)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 3:18 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

This is still only scratching the surface of what is possible with
dead_items. The visibility map snapshot concept can enable a far more
sophisticated approach to resource management in vacuumlazy.c.

I don't quite see how it helps "enable" that.

I have already written a simple throwaway patch that can use the
current VM snapshot data structure (which is just a local copy of the
VM's pages) to do a cheap precheck ahead of actually doing a binary
search in dead_items -- if a TID's heap page is all-visible or
all-frozen (depending on the type of VACUUM) then we're 100%
guaranteed to not visit it, and so it's 100% guaranteed to not have
any dead_items (actually it could have LP_DEAD items by the time the
index scan happens, but they won't be in our dead_items array in any
case). Since we're working off of an immutable source, this
optimization is simple to implement already. Very simple.

I haven't even bothered to benchmark this throwaway patch (I literally
wrote it in 5 minutes to show Masahiko what I meant). I can't see why
even that throwaway prototype wouldn't significantly improve
performance, though. After all, the VM snapshot data structure is far
denser than dead_items, and the largest tables often have most heap
pages skipped via the VM.

I'm not really interested in pursuing this simple approach because it
conflicts with Masahiko's work on the data structure, and there are
other good reasons to expect that to help. Plus I'm already very busy
with what I have here.

It'd be more logical to
me to say the VM snapshot *requires* you to think harder about
resource management, since a palloc'd snapshot should surely be
counted as part of the configured memory cap that admins control.

That's clearly true -- it creates a new problem for resource
management that will need to be solved. But that doesn't mean that it
can't ultimately make resource management better and easier.

Remember, we don't randomly visit some skippable pages for no good
reason in the patch, since the SKIP_PAGES_THRESHOLD stuff is
completely gone. The VM snapshot isn't just a data structure that
vacuumlazy.c uses as it sees fit -- it's actually more like a set of
instructions on which pages to scan, that vacuumlazy.c *must* follow.
There is no way that vacuumlazy.c can accidentally pick up a few extra
dead_items here and there due to concurrent activity that unsets VM
pages. We don't need to leave that to chance -- it is locked in from
the start.

I do remember your foreshadowing in the radix tree thread a while
back, and I do think it's an intriguing idea to combine pages-to-scan
and dead TIDs in the same data structure. The devil is in the details,
of course. It's worth looking into.

Of course.

Looking at the count of index scans, it's pretty much always
"1", so even if the current approach could scale above 1GB, "no" it
wouldn't help to raise that limit.

I agree that multiple index scans are rare. But I also think that
they're disproportionately involved in really problematic cases for
VACUUM. That said, I agree that simply making lookups to dead_items as
fast as possible is the single most important way to improve VACUUM by
improving dead_items.

Furthermore, it doesn't have to anticipate the maximum size, so there
is no up front calculation assuming max-tuples-per-page, so it
automatically uses less memory for less demanding tables.

The final number of TIDs doesn't seem like the most interesting
information that VM snapshots could provide us when it comes to
building the dead_items TID data structure -- the *distribution* of
TIDs across heap pages seems much more interesting. The "shape" can be
known ahead of time, at least to some degree. It can help with
compression, which will reduce cache misses.

Andres made remarks about memory usage with sparse dead TID patterns
at this point on the "Improve dead tuple storage for lazy vacuum"
thread:

/messages/by-id/20210710025543.37sizjvgybemkdus@alap3.anarazel.de

I haven't studied the radix tree stuff in great detail, so I am
uncertain of how much the VM snapshot concept could help, and where
exactly it would help. I'm just saying that it seems promising,
especially as a way of addressing concerns like this.

--
Peter Geoghegan

#19 John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#18)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Sep 14, 2022 at 3:18 AM John Naylor

Furthermore, it doesn't have to anticipate the maximum size, so there
is no up front calculation assuming max-tuples-per-page, so it
automatically uses less memory for less demanding tables.

The final number of TIDs doesn't seem like the most interesting
information that VM snapshots could provide us when it comes to
building the dead_items TID data structure -- the *distribution* of
TIDs across heap pages seems much more interesting. The "shape" can be
known ahead of time, at least to some degree. It can help with
compression, which will reduce cache misses.

My point here was simply that spilling to disk is an admission of
failure to utilize memory efficiently and thus shouldn't be a selling
point of VM snapshots. Other selling points could still be valid.

--
John Naylor
EDB: http://www.enterprisedb.com

In reply to: John Naylor (#19)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Sep 15, 2022 at 12:09 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

The final number of TIDs doesn't seem like the most interesting
information that VM snapshots could provide us when it comes to
building the dead_items TID data structure -- the *distribution* of
TIDs across heap pages seems much more interesting. The "shape" can be
known ahead of time, at least to some degree. It can help with
compression, which will reduce cache misses.

My point here was simply that spilling to disk is an admission of
failure to utilize memory efficiently and thus shouldn't be a selling
point of VM snapshots. Other selling points could still be valid.

I was trying to explain the goals of this work in a way that was as
accessible as possible. It's not easy to get the high level ideas
across, let alone all of the details.

It's true that I have largely ignored the question of how VM snapshots
will need to spill up until now. There are several reasons for this,
most of which you could probably guess. FWIW it wouldn't be at all
difficult to add *some* reasonable spilling behavior very soon; the
underlying access patterns are highly sequential and predictable, in
the obvious way.

--
Peter Geoghegan

#21 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#15)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 2022-09-08 at 13:23 -0700, Peter Geoghegan wrote:

The new patch unifies the concept of antiwraparound
VACUUM with the concept of aggressive VACUUM. Now there is only
antiwraparound and regular VACUUM (uh, barring VACUUM FULL). And now
antiwraparound VACUUMs are not limited to antiwraparound autovacuums
-- a manual VACUUM can also be antiwraparound (that's just the new
name for "aggressive").

I like this general approach. The existing GUCs have evolved in a
confusing way.

For the most part the
skipping/freezing strategy stuff has a good sense of what matters
already, and shouldn't need to be guided very often.

I'd like to know more clearly where manual VACUUM fits in here. Will it
use a more aggressive strategy than an autovacuum, and how so?

The patch relegates vacuum_freeze_table_age to a compatibility
option,
making its default -1, meaning "just use autovacuum_freeze_max_age".

The purpose of vacuum_freeze_table_age seems to be that, if you
regularly issue VACUUM commands, it will prevent a surprise
antiwraparound vacuum. Is that still the case?

Maybe it would make more sense to have vacuum_freeze_table_age be a
fraction of autovacuum_freeze_max_age, and be treated as a maximum so
that other intelligence might kick in and freeze sooner?

This makes things less confusing for users and hackers.

It may take an adjustment period ;-)

The details of the skipping-strategy-choice algorithm are still
unsettled in v3 (no real change there). ISTM that the important thing
is still the high level concepts. Jeff was slightly puzzled by the
emphasis placed on the cost model/strategy stuff, at least at one
point. Hopefully my intent will be made clearer by the ideas featured
in the new patch.

Yes, it's clearing things up, but it's still a complex problem.
There's:

a. xid age vs the actual amount of deferred work to be done
b. advancing relfrozenxid vs skipping all-visible pages
c. difficulty in controlling reasonable behavior (e.g.
vacuum_freeze_min_age often being ignored, freezing
individual tuples rather than pages)

Your first email described the motivation in terms of (a), but the
patches seem more focused on (b) and (c).

The skipping strategy decision making process isn't
particularly complicated, but it now looks more like an optimization
problem of some kind or other.

There's another important point here, which is that it gives an
opportunity to decide to freeze some all-visible pages in a given round
just to reduce the deferred work, without worrying about advancing
relfrozenxid.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In reply to: Jeff Davis (#21)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Oct 3, 2022 at 5:41 PM Jeff Davis <pgsql@j-davis.com> wrote:

I like this general approach. The existing GUCs have evolved in a
confusing way.

Thanks for taking a look!

For the most part the
skipping/freezing strategy stuff has a good sense of what matters
already, and shouldn't need to be guided very often.

I'd like to know more clearly where manual VACUUM fits in here. Will it
user a more aggressive strategy than an autovacuum, and how so?

There is no change whatsoever in the relationship between manually
issued VACUUMs and autovacuums. We interpret autovacuum_freeze_max_age
in almost the same way as HEAD. The only detail that's changed is that
we almost always interpret "freeze_table_age" as "just use
autovacuum_freeze_max_age" in the patch, rather than as
"vacuum_freeze_table_age, though never more than 95% of
autovacuum_freeze_max_age", as on HEAD.

Maybe this would be less confusing if I went just a bit further, and
totally got rid of what vacuumlazy.c on HEAD calls an aggressive
VACUUM -- then there really would be exactly one kind of
VACUUM, just like before the visibility map was first introduced back
in 2009. This would relegate antiwraparound-ness to just another
condition that autovacuum.c used to launch VACUUMs.

Giving VACUUM the freedom to choose where and how to freeze and
advance relfrozenxid based on both costs and benefits is key here.
Anything that needlessly imposes a rigid rule on vacuumlazy.c
undermines that -- it ties VACUUM's hands. The user can still
influence many of the details using high-level GUCs that work at the
table level, rather than GUCs that can only work at the level of
individual VACUUM operations (that leaves too much to chance). Users
shouldn't want or need to micromanage VACUUM.

The patch relegates vacuum_freeze_table_age to a compatibility
option,
making its default -1, meaning "just use autovacuum_freeze_max_age".

The purpose of vacuum_freeze_table_age seems to be that, if you
regularly issue VACUUM commands, it will prevent a surprise
antiwraparound vacuum. Is that still the case?

The user really shouldn't need to do anything with
vacuum_freeze_table_age at all now. It's mostly just a way for the
user to optionally insist on advancing relfrozenxid via an
antiwraparound/aggressive VACUUM -- like in a manual VACUUM FREEZE.
Even VACUUM FREEZE shouldn't be necessary very often.

Maybe it would make more sense to have vacuum_freeze_table_age be a
fraction of autovacuum_freeze_max_age, and be treated as a maximum so
that other intelligence might kick in and freeze sooner?

That's kind of how the newly improved skipping strategy stuff works.
It gives some weight to table age as one additional factor (based on
how close the table's age is to autovacuum_freeze_max_age or its Multi
equivalent).

If table age is (say) 60% of autovacuum_freeze_max_age, then VACUUM
should be "60% as aggressive" as a conventional
aggressive/antiwraparound autovacuum would be. What that actually
means is that the VACUUM will tend to prefer advancing relfrozenxid
the closer we get to the cutoff, gradually giving less and less
consideration to putting off work as we get closer and closer. When we
get to 100% then we'll definitely advance relfrozenxid (via a
conventional aggressive/antiwraparound VACUUM).

The precise details are unsettled, but I'm pretty sure that the
general idea is sound. Basically we're replacing
vacuum_freeze_table_age with a dynamic, flexible version of the same
basic idea. Now we don't just care about the need to advance
relfrozenxid (benefits), though; we also care about costs.

This makes things less confusing for users and hackers.

It may take an adjustment period ;-)

Perhaps this is more of an aspiration at this point. :-)

Yes, it's clearing things up, but it's still a complex problem.
There's:

a. xid age vs the actual amount of deferred work to be done
b. advancing relfrozenxid vs skipping all-visible pages
c. difficulty in controlling reasonable behavior (e.g.
vacuum_freeze_min_age often being ignored, freezing
individual tuples rather than pages)

Your first email described the motivation in terms of (a), but the
patches seem more focused on (b) and (c).

I think that all 3 areas are deeply and hopelessly intertwined.

For example, vacuum_freeze_min_age is effectively ignored in many
important cases right now precisely because we senselessly skip
all-visible pages with unfrozen tuples, no matter what -- the problem
actually comes from the visibility map, which vacuum_freeze_min_age
predates by quite a few years. So how can you possibly address the
vacuum_freeze_min_age issues without also significantly revising VM
skipping behavior? They're practically the same problem!

And once you've fixed vacuum_freeze_min_age (and skipping), how can
you then pass up the opportunity to advance relfrozenxid early when
doing so will require only a little extra work? I'm going to regress
some cases if I simply ignore the relfrozenxid factor. Finally, the
debt issue is itself a consequence of the other problems.

Perhaps this is an example of the inventor's paradox, where the more
ambitious plan may actually be easier and more likely to succeed than
a more limited plan that just focuses on one immediate problem. All of
these problems seem to be a result of adding accretion after accretion
over the years. A high-level rethink is well overdue. We need to
return to basics.

The skipping strategy decision making process isn't
particularly complicated, but it now looks more like an optimization
problem of some kind or other.

There's another important point here, which is that it gives an
opportunity to decide to freeze some all-visible pages in a given round
just to reduce the deferred work, without worrying about advancing
relfrozenxid.

True. Though I think that a strong bias in the direction of advancing
relfrozenxid by some amount (not necessarily by very many XIDs) still
makes sense, especially when we're already freezing aggressively.

--
Peter Geoghegan

#23 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#22)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, 2022-10-03 at 20:11 -0700, Peter Geoghegan wrote:

True. Though I think that a strong bias in the direction of advancing
relfrozenxid by some amount (not necessarily by very many XIDs) still
makes sense, especially when we're already freezing aggressively.

Take the case where you load a lot of data in one transaction. After
the loading transaction finishes, those new pages will soon be marked
all-visible.

In the future, vacuum runs will have to decide what to do. If a vacuum
decides to do an aggressive scan to freeze all of those pages, it may
be at some unfortunate time and disrupt the workload. But if it skips
them all, then it's just deferring the work until it runs up against
autovacuum_freeze_max_age, which might also be at an unfortunate time.

So how does your patch series handle this case? I assume there's some
mechanism to freeze a moderate number of pages without worrying about
advancing relfrozenxid?

Regards,
Jeff Davis

In reply to: Jeff Davis (#23)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Oct 3, 2022 at 10:13 PM Jeff Davis <pgsql@j-davis.com> wrote:

Take the case where you load a lot of data in one transaction. After
the loading transaction finishes, those new pages will soon be marked
all-visible.

In the future, vacuum runs will have to decide what to do. If a vacuum
decides to do an aggressive scan to freeze all of those pages, it may
be at some unfortunate time and disrupt the workload. But if it skips
them all, then it's just deferring the work until it runs up against
autovacuum_freeze_max_age, which might also be at an unfortunate time.

Predicting the future accurately is intrinsically hard. We're already
doing that today by freezing lazily. I think that we can come up with
a better overall strategy, but there is always a risk that we'll come
out worse off in some individual cases. I think it's worth it if it
avoids ever really flying off the rails.

So how does your patch series handle this case? I assume there's some
mechanism to freeze a moderate number of pages without worrying about
advancing relfrozenxid?

It mostly depends on whether or not the table exceeds the new
vacuum_freeze_strategy_threshold GUC in size at the time of the
VACUUM. This is 4GB by default, at least right now.

The case where the table size doesn't exceed that threshold yet will
see each VACUUM advance relfrozenxid when it happens to be very cheap
to do so, in terms of the amount of extra scanned_pages. If the number
of extra scanned_pages is less than 5% of the total table size
(current rel_pages), then we'll advance relfrozenxid early by making
sure to scan any all-visible pages.

Actually, this scanned_pages threshold starts at 5% and usually stays there,
but it will eventually start to grow (i.e. make VACUUM freeze eagerly
more often) once table age exceeds 50% of autovacuum_freeze_max_age at
the start of the VACUUM. So the skipping strategy threshold is more or
less a blend of physical units (heap pages) and logical units (XID
age).

Then there is the case where it's already a larger table at the point
a given VACUUM begins -- a table that ends up exceeding the same table
size threshold, vacuum_freeze_strategy_threshold. When that happens
we'll freeze all pages that are going to be marked all-visible as a
matter of policy (i.e. use eager freezing strategy), so that the same
pages can be marked all-frozen instead. We won't freeze pages that
aren't full of all-visible tuples (except for LP_DEAD items), unless
they have XIDs that are so old that vacuum_freeze_min_age triggers
freezing.

Once a table becomes larger than vacuum_freeze_strategy_threshold,
VACUUM stops marking pages all-visible in the first place,
consistently marking them all-frozen instead. So naturally there just
cannot be any all-visible pages after the first eager freezing VACUUM
(actually there are some obscure edge cases that can result in the odd
all-visible page here or there, but this should be extremely rare, and
have only negligible impact).

Bigger tables always have pages frozen eagerly, and in practice always
advance relfrozenxid early. In other words, eager freezing strategy
implies eager skipping strategy -- though not the other way around.
Again, these are details that may change in the future. My focus is
validating the high level concepts.

So we avoid big spikes, and try to do the work when it's cheapest.

--
Peter Geoghegan

#25 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#24)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, 2022-10-03 at 22:45 -0700, Peter Geoghegan wrote:

Once a table becomes larger than vacuum_freeze_strategy_threshold,
VACUUM stops marking pages all-visible in the first place,
consistently marking them all-frozen instead.

What are the trade-offs here? Why does it depend on table size?

Regards,
Jeff Davis

In reply to: Jeff Davis (#25)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Oct 4, 2022 at 10:39 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2022-10-03 at 22:45 -0700, Peter Geoghegan wrote:

Once a table becomes larger than vacuum_freeze_strategy_threshold,
VACUUM stops marking pages all-visible in the first place,
consistently marking them all-frozen instead.

What are the trade-offs here? Why does it depend on table size?

That's a great question. The table-level threshold
vacuum_freeze_strategy_threshold more or less buckets every table into
one of two categories: small tables and big tables. Perhaps this seems
simplistic to you. That would be an understandable reaction, given the
central importance of this threshold. The current default of 4GB could
have easily been 8GB or perhaps even 16GB instead.

It's not so much size as the rate of growth over time that matters. We
really want to do eager freezing on "growth tables", particularly
append-only tables. On the other hand we don't want to do useless
freezing on small, frequently updated tables, like pgbench_tellers or
pgbench_branches -- those tables may well require zero freezing, and
yet each VACUUM will advance relfrozenxid to a very recent value
consistently (even on Postgres 15). But "growth" is hard to capture,
because in general we have to infer things about the future from the
past, which is difficult and messy.

Since it's hard to capture "growth table vs fixed size table"
directly, we use table size as a proxy. It's far from perfect, but I
think that it will work quite well in practice because most individual
tables simply never get very large. It's very common for a relatively
small number of tables to consistently grow, without bound (perhaps
not strictly append-only tables, but tables where nothing is ever
deleted and inserts keep happening). So a simplistic threshold
(combined with dynamic per-page decisions about freezing) should be
enough to avoid most of the downside of eager freezing. In particular,
we will still freeze lazily in tables where it's obviously very
unlikely to be worth it.

In general I think that being correct on average is overrated. It's
more important to always avoid being dramatically wrong -- especially
if there is no way to course correct in the next VACUUM. Although I
think that we have a decent chance of coming out ahead by every
available metric, that isn't really the goal. Why should performance
stability not have some cost, at least in some cases? I want to keep
the cost as low as possible (often "negative cost" relative to
Postgres 15), but overall I am consciously making a trade-off. There
are downsides.

--
Peter Geoghegan

#27Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#26)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-10-04 at 11:09 -0700, Peter Geoghegan wrote:

So a simplistic threshold
(combined with dynamic per-page decisions about freezing) should be
enough to avoid most of the downside of eager freezing.

...

I want to keep
the cost as low as possible (often "negative cost" relative to
Postgres 15), but overall I am consciously making a trade-off. There
are downsides.

I am fine with that, but I'd like us all to understand what the
downsides are.

If I understand correctly:

1. Eager freezing (meaning to freeze at the same time as setting all-
visible) causes a modest amount of WAL traffic, hopefully before the
next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all-
visible but don't freeze) defers the work, and it might never need to
be done; but if it does, it can cause spikes at unfortunate times and
is more likely to generate more FPIs.

2. You're trying to mitigate the downsides of eager freezing by:
a. when freezing a tuple, eagerly freeze other tuples on that page
b. optimize WAL freeze records

3. You're trying to capture the trade-off in #1 by using the table size
as a proxy. Deferred work is only really a problem for big tables, so
that's where you use eager freezing. But maybe we can just always use
eager freezing?:
a. You're mitigating the WAL work for freezing.
b. A lot of people run with checksums on, meaning that setting the
all-visible bit requires WAL work anyway, and often FPIs.
c. All-visible is conceptually similar to freezing, but less
important, and it feels more and more like the design concept of all-
visible isn't carrying its weight.
d. (tangent) I had an old patch[1] that actually removed
PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
perhaps its time has come?

Regards,
Jeff Davis

[1]: /messages/by-id/1353551097.11440.128.camel@sussancws0025

In reply to: Jeff Davis (#27)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Oct 4, 2022 at 7:59 PM Jeff Davis <pgsql@j-davis.com> wrote:

I am fine with that, but I'd like us all to understand what the
downsides are.

Although I'm sure that there must be some case that loses measurably,
it's not particularly obvious where to start looking for one. It's easy
to imagine individual pages that we lose on, but a practical test case
where most of the pages reliably look like that is harder to imagine.

If I understand correctly:

1. Eager freezing (meaning to freeze at the same time as setting all-
visible) causes a modest amount of WAL traffic, hopefully before the
next checkpoint so we can avoid FPIs. Lazy freezing (meaning set all-
visible but don't freeze) defers the work, and it might never need to
be done; but if it does, it can cause spikes at unfortunate times and
is more likely to generate more FPIs.

Lazy freezing means to freeze every eligible tuple (every XID <
OldestXmin) when one or more XIDs are before FreezeLimit. Eager
freezing means freezing every eligible tuple when the page is about to
be set all-visible, or whenever lazy freezing would trigger freezing.

Eager freezing tends to avoid big spikes in larger tables, which is
very important. It can sometimes be cheaper and better in every way
than lazy freezing. Though lazy freezing sometimes retains an
advantage by avoiding freezing that is never going to be needed
altogether, typically only in small tables.

Lazy freezing is fairly similar to what we do on HEAD now, though not
identical: it's still "page level freezing", just with lazy criteria
for triggering it on any given page.
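
Expressed as a sketch, the per-page trigger looks roughly like this
(parameter names are illustrative; the condition mirrors the test that
the v5-0003 patch further down adds to lazy_scan_prune):

static bool
freeze_this_page(bool has_xid_before_freezelimit, /* lazy trigger */
                 int  tuples_with_freeze_plans,
                 bool eager_strategy,
                 bool page_will_become_all_visible) /* eager-only trigger */
{
    if (has_xid_before_freezelimit)
        return true;    /* both strategies must freeze this page */
    if (tuples_with_freeze_plans == 0)
        return true;    /* "freezing" the page costs nothing here */
    if (eager_strategy && page_will_become_all_visible)
        return true;    /* extra trigger used by the eager strategy */
    return false;
}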

2. You're trying to mitigate the downsides of eager freezing by:
a. when freezing a tuple, eagerly freeze other tuples on that page
b. optimize WAL freeze records

Sort of.

Both of these techniques apply with either freezing strategy, in fact.
It's just that eager freezing is likely to do the bulk of the freezing
that actually goes ahead, so it is helped disproportionately (even when
most VACUUM operations use the lazy freezing strategy, which is
probably the common case -- just because lazy freezing freezes lazily).

3. You're trying to capture the trade-off in #1 by using the table size
as a proxy. Deferred work is only really a problem for big tables, so
that's where you use eager freezing.

Right.

But maybe we can just always use
eager freezing?:

That doesn't seem like a bad idea, though it might be tricky to put
into practice. It might be possible to totally unite the concept of
all-visible and all-frozen pages in the scope of this work. But there
are surprisingly many tricky details involved. I'm not surprised that
you're suggesting this -- it basically makes sense to me. It's just
the practicalities that I worry about here.

a. You're mitigating the WAL work for freezing.

I don't see why this would be true. Lazy vs Eager are exactly the same
for a given page at the point that freezing is triggered. We'll freeze
all eligible tuples (often though not always every tuple), or none at
all.

Lazy vs Eager describe the policy for deciding to freeze a page, but
do not affect the actual execution steps taken once we decide to
freeze.

b. A lot of people run with checksums on, meaning that setting the
all-visible bit requires WAL work anyway, and often FPIs.

The idea of rolling the WAL records into one does seem appealing, but
we'd still need the original WAL record to set a page all-visible in
VACUUM's second heap pass (only setting a page all-visible in the
first heap pass could be optimized by making the FREEZE_PAGE WAL
record mark the page all-visible too). Or maybe we'd roll that into
the VACUUM WAL record at the same time.

In any case the second heap pass would have to have a totally
different WAL logging strategy to the first heap pass. Not
insurmountable, but not exactly an easy thing to do in passing either.

c. All-visible is conceptually similar to freezing, but less
important, and it feels more and more like the design concept of all-
visible isn't carrying its weight.

Well, not quite -- at least not on the VM side itself.

There are cases where heap_lock_tuple() will update a tuple's xmax,
replacing it with a new Multi. This will necessitate clearing the
page's all-frozen bit in the VM -- but the all-visible bit will stay
set. This is why it's possible for small numbers of all-visible pages
to appear even in large tables that have been eagerly frozen.
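
As a sketch of that effect (this is not the actual heap_lock_tuple()
code, just an illustration built on the visibilitymap_clear()
interface):

static void
unfreeze_locked_page(Relation relation, BlockNumber block, Buffer vmbuffer)
{
    /*
     * The tuple's xmax was just replaced with a new MultiXactId, so the
     * page can no longer be considered all-frozen.  It may well remain
     * all-visible, though, so only the all-frozen bit is cleared.
     */
    visibilitymap_clear(relation, block, vmbuffer, VISIBILITYMAP_ALL_FROZEN);
}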

d. (tangent) I had an old patch[1] that actually removed
PD_ALL_VISIBLE (the page bit, not the VM bit), which was rejected, but
perhaps its time has come?

I remember that pgCon developer meeting well. :-)

If anything your original argument for getting rid of PD_ALL_VISIBLE
is weakened by the proposal to merge together the WAL records for
freezing and for setting a heap page all visible. You'd know for sure
that the page will be dirtied when such a WAL record needed to be
written, so there is actually no reason to care about dirtying the
page. No?

I'm in favor of reducing the number of WAL records required in common
cases if at all possible -- purely because the generic WAL record
overhead of having an extra WAL record does probably add to the WAL
overhead for work performed in lazy_scan_prune(). But it seems like
separate work to me.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#15)
7 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Sep 8, 2022 at 1:23 PM Peter Geoghegan <pg@bowt.ie> wrote:

It might make sense to go further in the same direction by making
"regular vs aggressive/antiwraparound" into a *strict* continuum. In
other words, it might make sense to get rid of the two remaining cases
where VACUUM conditions its behavior on whether this VACUUM operation
is antiwraparound/aggressive or not.

I decided to go ahead with this in the attached revision, v5. This
revision totally gets rid of the general concept of discrete
aggressive/non-aggressive modes for each VACUUM operation (see
"v5-0004-Make-VACUUM-s-aggressive-behaviors-continuous.patch" and its
commit message). My new approach turned out to be simpler than the
previous half measures that I described as "unifying aggressive and
antiwraparound" (which itself first appeared in v3).

I now wish that I had all of these pieces in place for v1, since this
was the direction I was thinking of all along -- that might have made
life easier for reviewers like Jeff. What we have in v5 is pretty much
the end point I had in mind, and it turns out to require only a little
extra code. It might have been less confusing if I'd started this
thread with something like v5; the story I need to tell would have been
simpler that way.

Note that we still retain what were previously "aggressive only"
behaviors. We only remove "aggressive" as a distinct mode of operation
that exclusively applies the aggressive behaviors. We're now selective
in how we apply each of the behaviors, based on the needs of the
table. We want to behave in a way that's proportionate to the problem
at hand, which is made easy by not tying anything to a discrete mode
of operation. It's a false dichotomy; why should we ever have only one
reason for running VACUUM, determined up front?

There are still antiwraparound autovacuums in v5, but that is really
just another way that autovacuum can launch an autovacuum worker (much
like it was before the introduction of the visibility map in 8.4) --
both conceptually, and in terms of how the code works in vacuumlazy.c.
In practice an antiwraparound autovacuum is guaranteed to advance
relfrozenxid in roughly the same way as on HEAD (otherwise what's the
point?), but that doesn't make the VACUUM operation itself special in
any way. Besides, antiwraparound autovacuums will naturally be rare,
because there are many more opportunities for a VACUUM to advance
relfrozenxid "early" now (only "early" relative to how it would work
on earlier Postgres versions). It's already clear that having
antiwraparound autovacuums and aggressive mode VACUUMs as two separate
concepts that are closely associated has some problems [1]. Formally
making antiwraparound autovacuums just another way to launch a VACUUM
via autovacuum seems quite useful to me.

For the most part users are expected to just take relfrozenxid
advancement for granted now. They should mostly be able to assume that
VACUUM will do whatever is required to keep it sufficiently current
over time. They can influence VACUUM's behavior, but that mostly works
at the level of the table (not the level of any individual VACUUM
operation). The freezing and skipping strategy stuff should do what is
necessary to keep up in the long run. We don't want to put too much
emphasis on relfrozenxid in the short run, because it isn't a reliable
proxy for how we've kept up with the physical work of freezing --
that's what really matters. It should be okay to "fall behind on table
age" in the short run, provided we don't fall behind on the physical
work of freezing. Those two things shouldn't be conflated.

We now use a separate pair of XID/MXID-based cutoffs to determine
whether or not we're willing to wait for a cleanup lock the hard way
(which can happen in any VACUUM, since of course there is no longer
any special VACUUM with special behaviors). The new pair of cutoffs
replace the use of FreezeLimit/MultiXactCutoff by lazy_scan_noprune
(those are now only used to decide on what to freeze inside
lazy_scan_prune). Same concept, but with a different, independent
timeline. This was necessary just to get an existing isolation test
(vacuum-no-cleanup-lock) to continue to work. But it just makes sense
to have a different timeline for a completely different behavior. And
it'll be more robust.
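
As a sketch, the decision now looks something like this (the cutoff
name is made up; the point is only that it is a dedicated, older cutoff
that is independent of FreezeLimit/MultiXactCutoff):

static bool
must_wait_for_cleanup_lock(TransactionId oldest_xid_on_page,
                           TransactionId waitfreeze_xid_cutoff)
{
    /*
     * Only insist on freezing this particular page right now (and so on
     * acquiring a cleanup lock, however long that takes) when some XID
     * on the page is older than the dedicated cutoff.  Otherwise just
     * move on; a later VACUUM can deal with the page.
     */
    return TransactionIdPrecedes(oldest_xid_on_page, waitfreeze_xid_cutoff);
}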

It's a really bad idea for VACUUM to try to wait indefinitely long for
a cleanup lock, since that's totally outside of its control. It
typically won't take very long at all for VACUUM to acquire a cleanup
lock, of course, but that is beside the point -- who really cares
what's true on average, for something like this? Sometimes it'll take
hours to acquire a cleanup lock, and there is no telling when that
might happen! And so pausing VACUUM/freezing of all other pages just
to freeze one page makes little sense. Waiting for a cleanup lock
before we really need to is just an overreaction, which risks making
the situation worse. The cure must not be worse than the disease.

This revision also resolves problems with freezing MultiXactIds too
lazily [2]. We now always trigger page level freezing in the event of
encountering a Multi. This is more consistent with the behavior on
HEAD, where we can easily process a Multi well before the cutoff
represented by vacuum_multixact_freeze_min_age (e.g., we notice that a
Multi has no members still running, making it safe to remove before
the cutoff is reached).

Also attaching a prebuilt copy of the "routine vacuuming" docs as of
v5. This is intended to be a convenience for reviewers, or anybody
with a general interest in the patch series. The docs certainly still
need work, but I feel that I'm making progress on that side of things
(especially in this latest revision). Making life easier for DBAs is
the single most important goal of this work, so the user docs are of
central importance. The current "Routine Vacuuming" docs have lots of
problems, but to some extent the problems are with the concepts
themselves.

[1]: /messages/by-id/CAH2-Wz=DJAokY_GhKJchgpa8k9t_H_OVOvfPEn97jGNr9W=deg@mail.gmail.com
[2]: /messages/by-id/CAH2-Wz=+B5f1izRDPYKw+sUgOr6=AkWXp2NikU5cub0ftbRQhA@mail.gmail.com
--
Peter Geoghegan

Attachments:

v5-0003-Add-eager-freezing-strategy-to-VACUUM.patch (application/octet-stream)
From 730fe3a84cac7b2c9562168255bbc443345fcee8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v5 3/6] Add eager freezing strategy to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach (actually, we always
use eager freezing in aggressive VACUUMs, though they are expected to be
much rarer now).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have aggressive/antiwraparound
autovacuums for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.
---
 src/include/access/heapam_xlog.h              |  8 +-
 src/include/commands/vacuum.h                 |  4 +
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++
 src/backend/access/heap/heapam.c              |  8 +-
 src/backend/access/heap/vacuumlazy.c          | 76 ++++++++++++++++---
 src/backend/commands/vacuum.c                 |  4 +
 src/backend/postmaster/autovacuum.c           | 10 +++
 src/backend/utils/misc/guc_tables.c           | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 31 ++++++--
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++
 13 files changed, 162 insertions(+), 23 deletions(-)

diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 24aab7bd2..ec0ea04fd 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -345,7 +345,11 @@ typedef struct xl_heap_freeze_tuple
  * pg_class tuple.
  *
  * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
- * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ * relminmxid_nofreeze_out must also be maintained.  If vacuumlazy.c caller
+ * opts to not execute freeze plans produced by heap_prepare_freeze_tuple for
+ * its own reasons, then new relfrozenxid and relminmxid values must reflect
+ * that that choice was made.  (This is only safe when 'freeze' is still unset
+ * after the last heap_prepare_freeze_tuple call for the page.)
  */
 typedef struct page_frozenxid_tracker
 {
@@ -356,7 +360,7 @@ typedef struct page_frozenxid_tracker
 	TransactionId relfrozenxid_out;
 	MultiXactId relminmxid_out;
 
-	/* Used by caller for '!freeze' pages */
+	/* Used by caller that opts not to freeze a '!freeze' page */
 	TransactionId relfrozenxid_nofreeze_out;
 	MultiXactId relminmxid_nofreeze_out;
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f..52379f819 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -256,6 +259,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 6458a9c27..69340cea4 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ad970e099..f6ea1eb93 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6437,7 +6437,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * WAL-log what we would need to do, and return true.  Return false if nothing
  * is to be changed.  In addition, set *totally_frozen to true if the tuple
  * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * more freezing will eventually be required (assuming page is to be frozen).
+ *
+ * Although this interface is primarily tuple-based, caller decides on whether
+ * or not to freeze the page as a whole.  We'll often help caller to prepare a
+ * complete "freeze plan" that it ultimately discards.  However, our caller
+ * doesn't always get to choose; it must freeze when xtrack.freeze is set
+ * here.  This ensures that any XIDs < limit_xid are never left behind.
  *
  * Caller must initialize xtrack fields for page as a whole before calling
  * here with first tuple for the page.  See page_frozenxid_tracker comments.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9f560b132..8fe0766ff 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -150,6 +150,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Proactively freeze all tuples on pages about to be set all-visible? */
+	bool		allvis_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -254,6 +256,7 @@ typedef struct LVSavedErrInfo
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -327,6 +330,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				eager_threshold,
 				all_visible,
 				all_frozen,
 				scanned_pages,
@@ -366,6 +370,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
 	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
 	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 *
+	 * Also determine our cutoff for applying the eager/all-visible freezing
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
+	 * even during non-aggressive VACUUMs.
 	 */
 	aggressive = vacuum_set_xid_limits(rel,
 									   params->freeze_min_age,
@@ -374,6 +382,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   params->multixact_freeze_table_age,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
+	eager_threshold = params->freeze_strategy_threshold < 0 ?
+		vacuum_freeze_strategy_threshold :
+		params->freeze_strategy_threshold;
 
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
@@ -526,7 +537,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
-	scanned_pages = lazy_scan_strategy(vacrel,
+	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
 									   all_visible, all_frozen);
 	if (verbose)
 		ereport(INFO,
@@ -1282,17 +1293,28 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing totally avoids work that turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ * Antiwraparound VACUUMs of append-only tables should generally be avoided.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * important that relfrozenxid advance in affected tables, which are larger.
+ * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
+ * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
+ * depending on the extra cost - we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1325,21 +1347,48 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
 	else if (vacrel->aggressive)
+	{
+		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
+	}
+	else if (rel_pages >= eager_threshold)
+	{
+		/*
+		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
+		 * GUC-based threshold for eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
+		 */
+		vacrel->allvis_freeze_strategy = true;
+		vacrel->skipallvis = false;
+	}
 	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		vacrel->allvis_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1851,8 +1900,15 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When ongoing VACUUM opted to use the all-visible freezing strategy we
+	 * freeze any page that will become all-visible, making it all-frozen
+	 * instead. (Actually, there are edge-cases where this might not result in
+	 * marking the page all-frozen in the visibility map, but that should have
+	 * only a negligible impact.)
 	 */
-	if (xtrack.freeze || tuples_frozen == 0)
+	if (xtrack.freeze || tuples_frozen == 0 ||
+		(vacrel->allvis_freeze_strategy && prunestate->all_visible))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ccde07de..b837e0331 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1e90b72b7..2e4dd4090 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab08793..87b41795d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2477,6 +2477,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c35..a409e6281 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 66312b53b..22471a81f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9140,6 +9140,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9148,9 +9163,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9227,10 +9244,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages with an older multixact ID.  The
+        default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c14b2010d..7e684d187 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1680,6 +1680,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.34.1

v5-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patch (application/octet-stream)
From e83db0ff598268a4acddaa32351d208093954236 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 23 Jul 2022 17:19:01 -0700
Subject: [PATCH v5 6/6] Size VACUUM's dead_items space using VM snapshot.

VACUUM knows precisely how many pages it will scan ahead of time from
its snapshot of the visibility map following recent work.  Apply that
information to size the dead_items space for TIDs more precisely (use
scanned_pages instead of rel_pages to cap the allocation).

This can make the memory allocation significantly smaller, without any
added risk of undersizing the array.  Since VACUUM's final scanned_pages
is fully predetermined (by the visibility map snapshot), there is no
question of interference from another backend that concurrently unsets
some heap page's visibility map bit.  Many details of how VACUUM will
process the target relation are "locked in" from the very beginning.
---
 src/backend/access/heap/vacuumlazy.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c1e7eae7e..baf2d942b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -293,7 +293,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -566,7 +567,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
@@ -3220,14 +3221,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3236,15 +3236,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3266,12 +3264,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
-- 
2.34.1

v5-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (application/octet-stream)
From fc78af1ec04ea23273d665e60d3c7a0e458285ab Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v5 2/6] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now our policy around
skipping all-visible pages is exactly the same condition as whether or
not it's safe to advance relfrozenxid later on; nothing is left to
chance.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
---
 src/include/access/visibilitymap.h      |   7 +
 src/backend/access/heap/vacuumlazy.c    | 342 +++++++++++++-----------
 src/backend/access/heap/visibilitymap.c | 162 +++++++++++
 3 files changed, 357 insertions(+), 154 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index abda286b7..9f560b132 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -146,8 +146,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -177,7 +179,8 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map used by lazy_scan_skip */
+	vmsnapshot *vmsnap;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -250,10 +253,11 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
+								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -316,7 +320,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
@@ -324,6 +327,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -369,7 +375,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
 
-	skipwithvm = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
@@ -377,7 +382,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * visibility map (even those set all-frozen)
 		 */
 		aggressive = true;
-		skipwithvm = false;
 	}
 
 	/*
@@ -402,20 +406,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -442,7 +432,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	/* Initialize skipallvis/skipallfrozen before lazy_scan_strategy call */
+	vacrel->skipallvis = !aggressive;
+	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -503,12 +495,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
@@ -521,7 +507,36 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
+	 * headers to be able to safely advance relfrozenxid later on.  There is
+	 * no good reason to scan any additional pages. (Actually we might opt to
+	 * skip all-visible pages.  Either way we won't scan pages for no reason.)
+	 *
+	 * Now that OldestXmin and rel_pages are acquired, acquire an immutable
+	 * snapshot of the visibility map as well.  lazy_scan_skip works off of
+	 * the vmsnap, not the authoritative VM, which can continue to change.
+	 * Pages that lazy_scan_heap will scan are fixed and known in advance.
+	 *
+	 * The exact number of pages that lazy_scan_heap will scan also depends on
+	 * our choice of skipping strategy.  VACUUM can either choose to skip any
+	 * all-visible pages lazily, or choose to scan those same pages instead.
+	 * Decide on a skipping strategy to determine final scanned_pages.
+	 */
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   all_visible, all_frozen);
+	if (verbose)
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -538,6 +553,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -584,12 +600,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -634,6 +649,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -661,10 +679,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -857,13 +871,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -877,42 +890,24 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = lazy_scan_skip(vacrel, 0, &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		if (blkno < next_block_to_scan)
+			continue;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = lazy_scan_skip(vacrel, blkno + 1,
+											&next_all_visible);
 
 		vacrel->scanned_pages++;
 
@@ -1122,10 +1117,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1153,12 +1147,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1197,7 +1189,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1290,47 +1282,121 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
  *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * Determines whether a non-aggressive VACUUM (where advancing relfrozenxid
+ * is optional) should skip all-visible pages, or scan them instead.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
 
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
+	if (!vacrel->skipallfrozen)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		return rel_pages;
+	}
+	else if (vacrel->aggressive)
+		Assert(!vacrel->skipallvis);
+	else
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		vacrel->skipallvis = nextra >= nextra_threshold;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+		return scanned_pages_skipallvis;
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* keep compiler quiet */
+}
+
+/*
+ *	lazy_scan_skip() -- get the next block to scan according to vmsnap.
+ *
+ * lazy_scan_heap() caller passes the next block in line.  We return the next
+ * block to scan.  Caller skips the blocks preceding returned block, if any.
+ *
+ * The all-visible status of the returned block is set in *all_visible, too.
+ * Block usually won't be all-visible (since it's unskippable), but it can be
+ * when next_block is rel's last page, or when DISABLE_PAGE_SKIPPING is in use.
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block, bool *all_visible)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				next_block_to_scan = next_block;
+
+	*all_visible = true;
+	while (next_block_to_scan < rel_pages)
+	{
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_block_to_scan);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
 			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
+			*all_visible = false;
 			break;
 		}
 
@@ -1341,58 +1407,26 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
-		if (next_unskippable_block == rel_pages - 1)
+		if (next_block_to_scan == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
+		next_block_to_scan++;
 	}
 
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
-	}
-
-	return next_unskippable_block;
+	return next_block_to_scan;
 }
 
 /*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index d62761728..08134b9ce 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	char		vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -368,6 +390,146 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snap_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of pages whose visibility map bit is concurrently unset;
+ * VACUUM prefers to leave such pages to be scanned by the next VACUUM.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) + BLCKSZ * nvmpages);
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
-- 
2.34.1

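To make the lazy_scan_strategy trade-off above easier to follow, here is a
minimal standalone sketch of the skipallvis decision.  This is illustration
only, not the patch's code: the 5% figure and the 32-page floor mirror
SKIPALLVIS_THRESHOLD_PAGES and the hard-coded minimum from the hunk above,
while the function name and the main() driver are invented.

/*
 * Minimal sketch of the skipallvis decision (illustration only, not patch
 * code).  Skipping all-visible pages is the default ("lazy") choice; the
 * extra all-visible-but-not-all-frozen pages are only scanned when their
 * number is small relative to the table.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* assumed to match SKIPALLVIS_THRESHOLD_PAGES (5% of rel_pages) */
#define SKIPALLVIS_THRESHOLD_PAGES	0.05

static bool
choose_skipallvis(BlockNumber rel_pages, BlockNumber all_visible,
				  BlockNumber all_frozen)
{
	BlockNumber scanned_skipallvis = rel_pages - all_visible;
	BlockNumber scanned_skipallfrozen = rel_pages - all_frozen;
	BlockNumber nextra = scanned_skipallfrozen - scanned_skipallvis;
	BlockNumber nextra_threshold;

	nextra_threshold = (BlockNumber) (rel_pages * SKIPALLVIS_THRESHOLD_PAGES);
	if (nextra_threshold < 32)
		nextra_threshold = 32;	/* floor for small tables */

	/* lazy (skip all-visible pages) only when eagerness costs too much */
	return nextra >= nextra_threshold;
}

int
main(void)
{
	/* 10000 pages, 9000 all-visible, 8900 all-frozen: 100 extra pages */
	printf("skipallvis: %d\n", choose_skipallvis(10000, 9000, 8900));
	/* same table, only 5000 all-frozen: 4000 extra pages, skip them */
	printf("skipallvis: %d\n", choose_skipallvis(10000, 9000, 5000));
	return 0;
}

The point is only that laziness is the default; the extra all-visible pages
get scanned when doing so is close to free relative to the size of the table.
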
Attachment: routine-vacuuming.html (text/html)
Attachment: v5-0001-Add-page-level-freezing-to-VACUUM.patch (application/octet-stream)
From 2105ce6eecb61783748c110f8cc5c3cf6f0bdd8f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v5 1/6] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit).  For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
---
 src/include/access/heapam.h          |   4 +-
 src/include/access/heapam_xlog.h     |  37 +++++-
 src/backend/access/heap/heapam.c     | 171 ++++++++++++++++-----------
 src/backend/access/heap/vacuumlazy.c |  95 +++++++++------
 4 files changed, 200 insertions(+), 107 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9dab35551..c9e2805f8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,8 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									TransactionId limit_xid, MultiXactId limit_multi,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 34220d93c..24aab7bd2 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -330,6 +330,38 @@ typedef struct xl_heap_freeze_tuple
 	uint8		frzflags;
 } xl_heap_freeze_tuple;
 
+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determining whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.
+ *
+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * Alternative "no freeze" variants (relfrozenxid_nofreeze_out and
+ * relminmxid_nofreeze_out) must also be maintained, for use with !freeze pages.
+ */
+typedef struct page_frozenxid_tracker
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze;
+
+	/* Values used when page is to be frozen based on freeze plans */
+	TransactionId relfrozenxid_out;
+	MultiXactId relminmxid_out;
+
+	/* Used by caller for '!freeze' pages */
+	TransactionId relfrozenxid_nofreeze_out;
+	MultiXactId relminmxid_nofreeze_out;
+
+} page_frozenxid_tracker;
+
 /*
  * This is what we need to know about a block being frozen during vacuum
  *
@@ -409,10 +441,11 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId relminmxid,
 									  TransactionId cutoff_xid,
 									  TransactionId cutoff_multi,
+									  TransactionId limit_xid,
+									  MultiXactId limit_multi,
 									  xl_heap_freeze_tuple *frz,
 									  bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  page_frozenxid_tracker *xtrack);
 extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
 									  xl_heap_freeze_tuple *frz);
 extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bd4d85041..ad970e099 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6439,26 +6439,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * will be totally frozen after these operations are performed and false if
  * more freezing will eventually be required.
  *
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * Caller must initialize xtrack fields for page as a whole before calling
+ * here with first tuple for the page.  See page_frozenxid_tracker comments.
+ *
+ * Caller must set frz->offset itself if heap_execute_freeze_tuple is called.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6471,34 +6460,46 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  TransactionId limit_xid, MultiXactId limit_multi,
 						  xl_heap_freeze_tuple *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  page_frozenxid_tracker *xtrack)
 {
 	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
+	/*
+	 * limit_xid *must* be <= cutoff_xid, to ensure that any XID older than it
+	 * can neither be running nor seen as running by any open transaction.
+	 * This ensures that we only freeze XIDs that are safe to freeze -- those
+	 * that are already unambiguously visible to everybody.
+	 *
+	 * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+	 * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
+	 */
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
+
 	frz->frzflags = 0;
 	frz->t_infomask2 = tuple->t_infomask2;
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen iff our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for relfrozenxid_out handling for already-frozen xmin */
+	}
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6507,8 +6508,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoff_xid);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
@@ -6522,8 +6523,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 
@@ -6534,7 +6535,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
+	 * Make sure to keep heap_tuple_would_freeze in sync with this.  It needs
+	 * to return true for any tuple that we would force to be frozen here.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6542,7 +6544,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = xtrack->relfrozenxid_out;
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
@@ -6561,8 +6563,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6590,10 +6592,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6621,10 +6623,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			if (MultiXactIdPrecedes(xid, xtrack->relminmxid_out))
+				xtrack->relminmxid_out = xid;
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 		}
 		else
 		{
@@ -6664,8 +6666,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 	else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
@@ -6681,6 +6683,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(changed);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6711,11 +6718,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we ignore the cutoff_xid and just always perform the
 		 * freeze operation.  The oldest release in which such a value can
 		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
+		 * was removed in PostgreSQL 9.0.
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
@@ -6729,18 +6732,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
 			changed = true;
+			xtrack->freeze = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen (provided caller executes freeze plan for the page)
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
+
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < limit_xid (and MXIDs < limit_multi) must never remain
+	 */
+	if (!xtrack->freeze && !(xmin_already_frozen && xmax_already_frozen))
+		xtrack->freeze =
+			heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+									&xtrack->relfrozenxid_nofreeze_out,
+									&xtrack->relminmxid_nofreeze_out);
+
 	return changed;
 }
 
@@ -6793,14 +6814,20 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	xl_heap_freeze_tuple frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	page_frozenxid_tracker dummy;
+
+	dummy.freeze = true;
+	dummy.relfrozenxid_out = cutoff_xid;
+	dummy.relminmxid_out = cutoff_multi;
+	dummy.relfrozenxid_nofreeze_out = cutoff_xid;
+	dummy.relminmxid_nofreeze_out = cutoff_multi;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
+										  cutoff_xid, cutoff_multi,
 										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+										  &dummy);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7226,17 +7253,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * heap_tuple_would_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * force freezing of any of the XID/MXID fields from the tuple, given the same
+ * limits.  We must also deal with dead tuples here, since (xmin, xmax, xvac)
+ * fields could be processed by pruning away the whole tuple instead of
+ * freezing.
+ *
+ * Note: VACUUM refers to limit_xid and limit_multi as "FreezeLimit" and
+ * "MultiXactCutoff" respectively.  These should not be confused with the
+ * absolute cutoffs for freezing.  We just determine whether caller's tuple
+ * and limits trigger heap_prepare_freeze_tuple to force freezing.
  *
  * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
  * like the heap_prepare_freeze_tuple arguments that they're based on.  We
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						TransactionId limit_xid, MultiXactId limit_multi,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
@@ -7250,7 +7283,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	{
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 
@@ -7267,7 +7300,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		/* xmax is a non-permanent XID */
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
@@ -7290,7 +7323,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
+		if (MultiXactIdPrecedes(multi, limit_multi))
 			would_freeze = true;
 
 		/* need to check whether any member of the mxact is old */
@@ -7303,7 +7336,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			Assert(TransactionIdIsNormal(xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(xid, limit_xid))
 				would_freeze = true;
 		}
 		if (nmembers > 0)
@@ -7317,7 +7350,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		{
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			would_freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472..abda286b7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoffs for freezing and pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -511,6 +512,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
@@ -1563,8 +1565,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	page_frozenxid_tracker xtrack;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
 
@@ -1580,8 +1582,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.freeze = false;
+	xtrack.relfrozenxid_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_out = vacrel->NewRelminMxid;
+	xtrack.relfrozenxid_nofreeze_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_nofreeze_out = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1634,27 +1639,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Our opinion of whether the page 'hastup' is
+			 * inherently race-prone.  It must be treated as unreliable by
+			 * caller anyway, so we might as well be slightly optimistic.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1786,11 +1787,13 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
 									  vacrel->FreezeLimit,
 									  vacrel->MultiXactCutoff,
 									  &frozen[tuples_frozen],
 									  &tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
 		{
 			/* Will execute freeze below */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1811,9 +1814,33 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.freeze || tuples_frozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_out;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_nofreeze_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_nofreeze_out;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
@@ -1821,7 +1848,7 @@ retry:
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
@@ -1853,7 +1880,7 @@ retry:
 		{
 			XLogRecPtr	recptr;
 
-			recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+			recptr = log_heap_freeze(rel, buf, vacrel->NewRelfrozenXid,
 									 frozen, tuples_frozen);
 			PageSetLSN(page, recptr);
 		}
@@ -1876,7 +1903,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1884,8 +1911,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1906,9 +1932,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1922,6 +1945,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
-- 
2.34.1

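As a rough illustration of the two-track bookkeeping that this patch adds to
lazy_scan_prune, here is a standalone sketch.  It is not the patch's code:
the names (TupleInfo, decide_page_freeze, the tracker variable) are invented,
MXIDs and the per-tuple freeze plans built by heap_prepare_freeze_tuple are
omitted, and plain "<" stands in for TransactionIdPrecedes, which handles XID
wraparound.  The shape of the decision is the only thing being shown: a page
is frozen in its entirety when any of its XIDs has crossed FreezeLimit, and a
separate "no freeze" relfrozenxid tracker is kept for pages left unfrozen.

/*
 * Sketch of lazy_scan_prune's page-level freeze decision (illustration only).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

typedef struct
{
	TransactionId xmin;			/* oldest unfrozen XID in the tuple */
	bool		eligible;		/* xmin < OldestXmin, so it can be frozen */
} TupleInfo;

/*
 * Freeze every eligible tuple on the page when at least one XID has crossed
 * FreezeLimit; otherwise freeze nothing.  Either way, track the oldest XID
 * that will remain unfrozen, for the benefit of relfrozenxid later on.
 */
static bool
decide_page_freeze(const TupleInfo *tuples, int ntuples,
				   TransactionId FreezeLimit, TransactionId *relfrozenxid_out)
{
	bool		force_freeze = false;
	TransactionId oldest_if_frozen = *relfrozenxid_out;	/* starts at OldestXmin */
	TransactionId oldest_if_unfrozen = *relfrozenxid_out;

	for (int i = 0; i < ntuples; i++)
	{
		if (tuples[i].xmin < FreezeLimit)
			force_freeze = true;	/* page cannot be left unfrozen */
		if (!tuples[i].eligible && tuples[i].xmin < oldest_if_frozen)
			oldest_if_frozen = tuples[i].xmin;	/* would survive freezing */
		if (tuples[i].xmin < oldest_if_unfrozen)
			oldest_if_unfrozen = tuples[i].xmin;	/* nothing frozen at all */
	}

	*relfrozenxid_out = force_freeze ? oldest_if_frozen : oldest_if_unfrozen;
	return force_freeze;
}

int
main(void)
{
	TransactionId tracker = 1000;	/* stand-in for OldestXmin */
	TupleInfo	page[] = {{400, true}, {950, true}};
	bool		frozen = decide_page_freeze(page, 2, 500, &tracker);

	/* XID 400 is below FreezeLimit 500, so the whole page gets frozen */
	printf("freeze=%d, relfrozenxid tracker=%u\n", frozen, (unsigned) tracker);
	return 0;
}
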
Attachment: v5-0005-Avoid-allocating-MultiXacts-during-VACUUM.patch (application/octet-stream)
From 55f525675e6cad81d28c2e309c64639a8922cb8e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v5 5/6] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit.  Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations.  We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle.  VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

FreezeMultiXactId() is now eager when it's cheap to process xmax, and
lazy when it's expensive/risky to process xmax (because an expensive
second pass that might result in allocating a new Multi is required).
We make a similar trade-off to the one made by lazy_scan_noprune() when
a cleanup lock isn't available on some heap page.  We can usually put
off freezing (for the time being) when it's inconvenient to proceed.  We
need only accept an older final relfrozenxid/relminmxid value to make
that safe, which is typically a good trade-off.

Note that MultiXactIds are processed eagerly in all cases by triggering
page-level freezing whenever FreezeMultiXactId() processes a Multi
(though not in the no-op processing case).  We don't do the same thing
with an XID based xmax.  This is closer to the historic behavior.
---
 src/backend/access/heap/heapam.c | 60 ++++++++++++++++++++++++--------
 1 file changed, 46 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f6ea1eb93..f86a29e98 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6119,11 +6119,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
  *
  * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ *
+ * Allocating new MultiXactIds during VACUUM is something that we always
+ * prefer to avoid.  Caller passes down all the context we need to avoid it
+ * on their behalf: cutoff_xid, cutoff_multi, limit_xid, and limit_multi.
+ *
+ * We must never leave behind XIDs/MXIDs from before the "limits" given.
+ * There is lots of leeway around XIDs/MXIDs >= caller's "limits", though.
+ * When it's possible to inexpensively process xmax right away, we're eager
+ * about it.  Otherwise we're lazy about it -- next time it might be cheap.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
+				  TransactionId limit_xid, MultiXactId limit_multi,
 				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
 {
 	TransactionId xid = InvalidTransactionId;
@@ -6216,13 +6226,17 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below the limit for Xids, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
+	 *
+	 * We use limit_xid for this (VACUUM's FreezeLimit), rather than using
+	 * cutoff_xid (VACUUM's OldestXmin).  We greatly prefer to avoid a second
+	 * pass over the Multi, especially when doing so results in allocating a
+	 * new replacement Multi.
 	 */
-
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6233,12 +6247,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP optimization */
 	for (i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		if (TransactionIdPrecedes(members[i].xid, limit_xid))
 		{
 			need_replace = true;
 			break;
@@ -6247,11 +6260,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			temp_xid_out = members[i].xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* The Multi itself must be >= limit_multi, too */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, limit_multi);
+
 	if (!need_replace)
 	{
 		/*
@@ -6268,6 +6280,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_would_freeze will indicate that the tuple must be frozen.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6367,7 +6382,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				/*
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
+				 * cutoff_xid is VACUUM's OldestXmin, which is also the
 				 * initial value used for top-level relfrozenxid_out tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
@@ -6537,7 +6552,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given limit for Xids.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6555,6 +6570,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi,
+									limit_xid, limit_multi,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6595,12 +6611,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * MultiXactId, to carry forward two or more original member XIDs.
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
+			 *
+			 * We only do this when we have no choice; heap_tuple_would_freeze
+			 * will definitely force the page to be frozen below (or would, if
+			 * we weren't about to trigger freezing for the page anyway).
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
 			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 xtrack->relfrozenxid_out));
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
 			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
@@ -6643,6 +6666,15 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(freeze_xmax);
 			Assert(!TransactionIdIsValid(newxmax));
 		}
+
+		/*
+		 * Trigger page level freezing to ensure that we remove MultiXacts at
+		 * the earliest opportunity when it's cheap to do so (when VACUUM
+		 * won't need to allocate a new Multi).  We even do this in the
+		 * FRM_RETURN_IS_MULTI case, though it's redundant there.
+		 */
+		if ((flags & FRM_NOOP) == 0)
+			xtrack->freeze = true;
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
-- 
2.34.1

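A condensed sketch of the "need_replace" test that this patch bases on
FreezeLimit/MultiXactCutoff rather than OldestXmin/OldestMxact.  Again this
is illustration only, not the patch's code: the member array is a bare XID
list, the function name is invented, and plain "<" stands in for the
wraparound-aware TransactionIdPrecedes/MultiXactIdPrecedes.

/*
 * Sketch of FreezeMultiXactId's new "need_replace" test (illustration only).
 * Replacing xmax (and possibly allocating a new Multi) is deferred unless a
 * member XID or the MultiXactId itself has crossed the freeze limits.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

static bool
multi_needs_replacement(MultiXactId multi,
						const TransactionId *members, int nmembers,
						TransactionId limit_xid, MultiXactId limit_multi)
{
	/* a member XID below FreezeLimit forces the expensive second pass */
	for (int i = 0; i < nmembers; i++)
	{
		if (members[i] < limit_xid)
			return true;
	}

	/* the Multi itself must also be >= MultiXactCutoff to be left in place */
	return multi < limit_multi;
}

int
main(void)
{
	TransactionId members[] = {1200, 1300};

	/* all members >= FreezeLimit 1000 and multi >= cutoff 50: leave it */
	printf("replace: %d\n",
		   multi_needs_replacement(80, members, 2, 1000, 50));
	/* one member below FreezeLimit 1250: must be processed now */
	printf("replace: %d\n",
		   multi_needs_replacement(80, members, 2, 1250, 50));
	return 0;
}
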
Attachment: v5-0004-Make-VACUUM-s-aggressive-behaviors-continuous.patch (application/octet-stream)
From 7fccc33244639618d532636dacc9248babf1001c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v5 4/6] Make VACUUM's aggressive behaviors continuous.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Before then, every lazy VACUUM
was "equally aggressive": each operation froze any tuples with XIDs that
fell before the age-wise cutoff.  And each table's relfrozenxid
was updated at the end.  In short, the previous behavior was much less
efficient, but did at least have one thing going for it: it was much
easier to understand at a high level.

VACUUM no longer applies a separate mode of operation (aggressive mode).
There are still antiwraparound autovacuums, but they're now little more
than another way that autovacuum.c can launch an autovacuum worker to
run VACUUM.  The same set of behaviors previously associated with
aggressive mode are retained, but now get applied selectively, on a
timeline attuned to the needs of the table.

The closer that a table's age gets to the autovacuum_freeze_max_age
cutoff, the less VACUUM will care about avoiding the cost of scanning
extra pages to advance relfrozenxid "early".  This new approach cares
about both costs (extra pages scanned) and benefits (the need for
relfrozenxid advancements), unlike the previous approach driven by
vacuum_freeze_table_age, which "escalated to aggressive mode" purely
based on a simple XID age cutoff.  The vacuum_freeze_table_age GUC is
now relegated to a compatibility option.  Its default value is now -1,
which is interpreted as "current value of autovacuum_freeze_max_age".

VACUUM will still advance relfrozenxid at roughly the same XID-age-wise
cadence as before with static tables, but can also advance relfrozenxid
much more frequently in tables where that happens to make sense.  In
practice many tables will tend to have relfrozenxid advanced by some
amount during every VACUUM, especially larger tables and very small
tables.

The emphasis is now on keeping each table's age reasonably recent over
time, across multiple successive VACUUM operations, while spreading out
the burden of freezing, avoiding big spikes.  Freezing is now primarily
treated as an overhead of long term storage of tuples in physical heap
pages.  There is less emphasis on the role freezing plays in preventing
the system from reaching the point of an xidStopLimit outage.

Now every VACUUM might need to wait for a cleanup lock, though few will.
It can only happen when waiting is required to advance relfrozenxid to no
less than halfway between the existing relfrozenxid and nextXID.  In general
there is no telling how long VACUUM might spend waiting for a cleanup
lock, so it's usually more useful to focus on keeping up with freezing
at the level of the whole table.  VACUUM can afford to set relfrozenxid
to a significantly older value in the short term, since there are now
more opportunities to advance relfrozenxid in the long term.
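(Illustrative aside, not part of the patch.)  One rough way to see where
existing tables sit on this continuum is to compare each table's
relfrozenxid age against autovacuum_freeze_max_age, which is the
governing cutoff when vacuum_freeze_table_age is left at its new default
of -1.  Sketch only, and it ignores the MultiXact side of the
calculation:

  SELECT c.oid::regclass AS table_name,
         age(c.relfrozenxid) AS xid_age,
         round(age(c.relfrozenxid)::numeric /
               current_setting('autovacuum_freeze_max_age')::numeric, 2)
           AS approx_antiwrapfrac
  FROM pg_class c
  WHERE c.relkind IN ('r', 'm')
  ORDER BY approx_antiwrapfrac DESC;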
---
 src/include/commands/vacuum.h                 |   7 +-
 src/backend/access/heap/vacuumlazy.c          | 224 +++---
 src/backend/access/transam/multixact.c        |   5 +-
 src/backend/commands/cluster.c                |  10 +-
 src/backend/commands/vacuum.c                 | 113 +--
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   4 +-
 doc/src/sgml/config.sgml                      |  80 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 718 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  27 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   5 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  30 +-
 src/test/regress/expected/reloptions.out      |   6 +-
 src/test/regress/sql/reloptions.sql           |   6 +-
 19 files changed, 628 insertions(+), 649 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 52379f819..a70df0218 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -290,7 +290,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_set_xid_limits(Relation rel,
+extern void vacuum_set_xid_limits(Relation rel,
 								  int freeze_min_age,
 								  int multixact_freeze_min_age,
 								  int freeze_table_age,
@@ -298,7 +298,10 @@ extern bool vacuum_set_xid_limits(Relation rel,
 								  TransactionId *oldestXmin,
 								  MultiXactId *oldestMxact,
 								  TransactionId *freezeLimit,
-								  MultiXactId *multiXactCutoff);
+								  MultiXactId *multiXactCutoff,
+								  TransactionId *minXid,
+								  MultiXactId *minMulti,
+								  double *antiwrapfrac);
 extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
 									  MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8fe0766ff..c1e7eae7e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,11 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages when using the lazy freezing strategy
+ * Thresholds that control whether VACUUM will skip any all-visible pages when
+ * using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
+#define SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES	0.15
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -144,9 +145,7 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
-	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	/* Skip (don't scan) all-visible pages? */
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
@@ -178,6 +177,9 @@ typedef struct LVRelState
 	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+	/* Earliest permissible NewRelfrozenXid/NewRelminMxid values */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
@@ -258,7 +260,8 @@ static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
 									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
-									  BlockNumber all_frozen);
+									  BlockNumber all_frozen,
+									  double antiwrapfrac);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
 								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
@@ -322,13 +325,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
-				FreezeLimit;
+				FreezeLimit,
+				MinXid;
 	MultiXactId OldestMxact,
-				MultiXactCutoff;
+				MultiXactCutoff,
+				MinMulti;
+	double		antiwrapfrac;
 	BlockNumber orig_rel_pages,
 				eager_threshold,
 				all_visible,
@@ -367,33 +372,33 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
 	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
-	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 * used to determine which XIDs/MultiXactIds will be frozen.
 	 *
 	 * Also determine our cutoff for applying the eager/all-visible freezing
-	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
-	 * even during non-aggressive VACUUMs.
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy.
 	 */
-	aggressive = vacuum_set_xid_limits(rel,
-									   params->freeze_min_age,
-									   params->multixact_freeze_min_age,
-									   params->freeze_table_age,
-									   params->multixact_freeze_table_age,
-									   &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
+	vacuum_set_xid_limits(rel,
+						  params->freeze_min_age,
+						  params->multixact_freeze_min_age,
+						  params->freeze_table_age,
+						  params->multixact_freeze_table_age,
+						  &OldestXmin, &OldestMxact,
+						  &FreezeLimit, &MultiXactCutoff,
+						  &MinXid, &MinMulti, &antiwrapfrac);
 	eager_threshold = params->freeze_strategy_threshold < 0 ?
 		vacuum_freeze_strategy_threshold :
 		params->freeze_strategy_threshold;
 
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		aggressive = true;
-	}
+	/*
+	 * Make sure that antiwraparound autovacuums always have the opportunity
+	 * to advance relfrozenxid to a value >= MinXid.
+	 *
+	 * This is needed so that antiwraparound autovacuums reliably advance
+	 * relfrozenxid to the satisfaction of autovacuum.c, even when the
+	 * autovacuum_freeze_max_age reloption (not GUC) triggered the autovacuum.
+	 */
+	if (params->is_wraparound)
+		antiwrapfrac = 1.0;
 
 	/*
 	 * Setup error traceback support for ereport() first.  The idea is to set
@@ -442,10 +447,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
 	/* Initialize skipallvis/skipallfrozen before lazy_scan_strategy call */
-	vacrel->skipallvis = !aggressive;
-	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
+	vacrel->skipallvis = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
+	vacrel->skipallfrozen = vacrel->skipallvis;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -515,6 +519,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->FreezeLimit = FreezeLimit;
 	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
 	vacrel->MultiXactCutoff = MultiXactCutoff;
+	/* MinXid limits final relfrozenxid's age (always <= FreezeLimit) */
+	vacrel->MinXid = MinXid;
+	/* MinMulti limits final relminmxid's age (always <= MultiXactCutoff) */
+	vacrel->MinMulti = MinMulti;
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
@@ -538,7 +546,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
 	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
-									   all_visible, all_frozen);
+									   all_visible, all_frozen,
+									   antiwrapfrac);
 	if (verbose)
 		ereport(INFO,
 				(errmsg("vacuuming \"%s.%s.%s\"",
@@ -599,25 +608,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
-										 vacrel->relfrozenxid,
-										 vacrel->NewRelfrozenXid));
+		   TransactionIdPrecedesOrEquals(MinXid, vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
-									   vacrel->relminmxid,
-									   vacrel->NewRelminMxid));
+		   MultiXactIdPrecedesOrEquals(MinMulti, vacrel->NewRelminMxid));
 	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
-		 * lazy_scan_strategy call determined it would skip all-visible pages
+		 * Must keep original relfrozenxid when the lazy_scan_strategy call
+		 * decided to skip all-visible pages
 		 */
-		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -693,23 +696,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
-			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
 			else
 			{
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+				Assert(IsAutoVacuumWorkerProcess());
+				if (params->is_wraparound)
+					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -1041,7 +1032,6 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
 			 */
-			Assert(vacrel->aggressive);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
 		}
@@ -1300,21 +1290,19 @@ lazy_scan_heap(LVRelState *vacrel)
  * On the other hand we eagerly freeze pages when that strategy spreads out
  * the burden of freezing over time.  Performance stability is important; no
  * one VACUUM operation should need to freeze disproportionately many pages.
- * Antiwraparound VACUUMs of append-only tables should generally be avoided.
  *
  * Also determines if the ongoing VACUUM operation should skip all-visible
- * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
- * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
- * important that relfrozenxid advance in affected tables, which are larger.
- * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
- * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
- * depending on the extra cost - we might need to scan only a few extra pages.
+ * pages when advancing relfrozenxid is still optional (before target rel has
+ * attained an age that forces an antiwraparound autovacuum).  Decision is
+ * based in part on caller's antiwrapfrac argument, which represents how close
+ * the table age is to forcing antiwraparound autovacuum.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
 lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
-				   BlockNumber all_visible, BlockNumber all_frozen)
+				   BlockNumber all_visible, BlockNumber all_frozen,
+				   double antiwrapfrac)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1357,21 +1345,15 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		Assert(vacrel->aggressive && !vacrel->skipallvis);
-		vacrel->allvis_freeze_strategy = true;
-		return rel_pages;
-	}
-	else if (vacrel->aggressive)
-	{
-		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
 		vacrel->allvis_freeze_strategy = true;
+		return rel_pages;
 	}
 	else if (rel_pages >= eager_threshold)
 	{
 		/*
-		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
-		 * GUC-based threshold for eager freezing.
+		 * VACUUM of table whose rel_pages now exceeds GUC-based threshold for
+		 * eager freezing.
 		 *
 		 * We always scan all-visible pages when the threshold is crossed, so
 		 * that relfrozenxid can be advanced.  There will typically be few or
@@ -1386,9 +1368,6 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		BlockNumber nextra,
 					nextra_threshold;
 
-		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
-		vacrel->allvis_freeze_strategy = false;
-
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1402,13 +1381,44 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		 * that way, so be lazy (just skip) unless the added cost is very low.
 		 * We opt for a skipallfrozen-only VACUUM when the number of extra
 		 * pages (extra scanned pages that are all-visible but not all-frozen)
-		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small),
+		 * provided relfrozenxid has yet to attain an age that consumes 50% of
+		 * the XID space available before the GUC cutoff for antiwraparound
+		 * autovacuum.  A higher threshold of 15% applies once relfrozenxid
+		 * grows older than that.
 		 */
 		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
-		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+
+		if (antiwrapfrac < 0.5)
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_THRESHOLD_PAGES;
+		else
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES;
+
 		nextra_threshold = Max(32, nextra_threshold);
 
-		vacrel->skipallvis = nextra >= nextra_threshold;
+		/*
+		 * We must advance relfrozenxid when it has already attained an age that
+		 * consumes >= 90% of the available XID space (or MXID space) before
+		 * the crossover point for antiwraparound autovacuum.
+		 *
+		 * Also use eager freezing strategy when we're past the "90% towards
+		 * wraparound" point, even though the table size is below the usual
+		 * eager_threshold table size cutoff.  The added cost is usually not
+		 * too great.  We may be able to fall into a pattern of continually
+		 * advancing relfrozenxid this way.
+		 */
+		if (antiwrapfrac < 0.9)
+		{
+			vacrel->skipallvis = nextra >= nextra_threshold;
+			vacrel->allvis_freeze_strategy = false;
+		}
+		else
+		{
+			vacrel->skipallvis = false;
+			vacrel->allvis_freeze_strategy = true;
+		}
 	}
 
 	/* Return the appropriate variant of scanned_pages */
@@ -2058,11 +2068,9 @@ retry:
  * operation left LP_DEAD items behind.  We'll at least collect any such items
  * in the dead_items array for removal from indexes.
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We may return false to indicate that a full cleanup lock is required for
+ * processing by lazy_scan_prune.  This is only necessary when VACUUM needs to
+ * freeze some tuple XIDs from one or more tuples on the page.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
@@ -2130,36 +2138,24 @@ lazy_scan_noprune(LVRelState *vacrel,
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
 		if (heap_tuple_would_freeze(tupleheader,
-									vacrel->FreezeLimit,
-									vacrel->MultiXactCutoff,
+									vacrel->MinXid,
+									vacrel->MinMulti,
 									&NewRelfrozenXid, &NewRelminMxid))
 		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
 			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
+			 * Tuple with XID < MinXid (or MXID < MinMulti)
 			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
+			 * VACUUM must always be able to advance rel's relfrozenxid and
+			 * relminmxid to minimum values.  The ongoing VACUUM won't be able
+			 * to do that unless it can freeze an XID (or MXID) from this
+			 * tuple now.
+			 *
+			 * The only safe option is to have caller perform processing of
+			 * this page using lazy_scan_prune.  Caller might have to wait a
+			 * while for a cleanup lock, but it can't be helped.
 			 */
+			vacrel->offnum = InvalidOffsetNumber;
+			return false;
 		}
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a7383f553..cda1e5a3d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2818,10 +2818,7 @@ ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
  * freeze table and the minimum freeze age based on the effective
  * autovacuum_multixact_freeze_max_age this function returns.  In the worst
  * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will try to freeze every multixact.
- *
- * It's possible that these thresholds should be user-tunable, but for now
- * we keep it simple.
+ * table will freeze every multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 1976a373e..f5bc3f61c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -823,9 +823,12 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TransactionId OldestXmin,
-				FreezeXid;
+				FreezeXid,
+				MinXid;
 	MultiXactId OldestMxact,
-				MultiXactCutoff;
+				MultiXactCutoff,
+				MinMulti;
+	double		antiwrapfrac;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -914,7 +917,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+						  &FreezeXid, &MultiXactCutoff, &MinXid, &MinMulti,
+						  &antiwrapfrac);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b837e0331..1157f4653 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -943,21 +943,25 @@ get_all_vacuum_rels(int options)
  * - oldestMxact is the Mxid below which MultiXacts are definitely not
  *   seen as visible by any running transaction.
  * - freezeLimit is the Xid below which all Xids are definitely replaced by
- *   FrozenTransactionId during aggressive vacuums.
+ *   FrozenTransactionId in heap pages that caller can cleanup lock.
  * - multiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
+ *   removed from Xmax in heap pages that caller can cleanup lock.
+ * - minXid is the earliest valid relfrozenxid value to set in pg_class.
+ * - minMulti is the earliest valid relminmxid value to set in pg_class.
+ * - antiwrapfrac is how close the table's age is to the point that autovacuum
+ *   will launch an antiwraparound autovacuum worker.
  *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
- * minimum).
+ * The antiwrapfrac value 1.0 represents the point at which autovacuum.c
+ * scheduling considers advancing relfrozenxid strictly necessary.  Values
+ * between 0.0 and 1.0 represent how close the table is to the point of
+ * mandatory relfrozenxid/relminmxid advancement (up to minXid/minMulti).
  *
  * oldestXmin and oldestMxact are the most recent values that can ever be
  * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
  * vacuumlazy.c caller later on.  These values should be passed when it turns
  * out that VACUUM will leave no unfrozen XIDs/MXIDs behind in the table.
  */
-bool
+void
 vacuum_set_xid_limits(Relation rel,
 					  int freeze_min_age,
 					  int multixact_freeze_min_age,
@@ -966,15 +970,20 @@ vacuum_set_xid_limits(Relation rel,
 					  TransactionId *oldestXmin,
 					  MultiXactId *oldestMxact,
 					  TransactionId *freezeLimit,
-					  MultiXactId *multiXactCutoff)
+					  MultiXactId *multiXactCutoff,
+					  TransactionId *minXid,
+					  MultiXactId *minMulti,
+					  double *antiwrapfrac)
 {
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
-	int			effective_multixact_freeze_max_age;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
+	int			effective_multixact_freeze_max_age,
+				relfrozenxid_age,
+				relminmxid_age;
 
 	/*
 	 * Acquire oldestXmin.
@@ -1065,8 +1074,8 @@ vacuum_set_xid_limits(Relation rel,
 		*multiXactCutoff = *oldestMxact;
 
 	/*
-	 * Done setting output parameters; check if oldestXmin or oldestMxact are
-	 * held back to an unsafe degree in passing
+	 * Check if oldestXmin or oldestMxact are held back to an unsafe degree in
+	 * passing
 	 */
 	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
 	if (!TransactionIdIsNormal(safeOldestXmin))
@@ -1086,48 +1095,64 @@ vacuum_set_xid_limits(Relation rel,
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
+	 * Work out how close we are to needing an antiwraparound VACUUM.
 	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/* Final antiwrapfrac can come from either XID or MXID table age */
+	relfrozenxid_age = Max(nextXID - rel->rd_rel->relfrozenxid, 1);
+	relminmxid_age = Max(nextMXID - rel->rd_rel->relminmxid, 1);
+	freeze_table_age = Max(freeze_table_age, 1);
+	multixact_freeze_table_age = Max(multixact_freeze_table_age, 1);
+	XIDFrac = (double) relfrozenxid_age / (double) freeze_table_age;
+	MXIDFrac = (double) relminmxid_age / (double) multixact_freeze_table_age;
+	*antiwrapfrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Pages that caller can cleanup lock immediately will never be left with
+	 * XIDs < freezeLimit (nor with MXIDs < multiXactCutoff).  Determine
+	 * values for a distinct set of cutoffs applied to pages that cannot be
+	 * immediately cleanup locked.  The cutoffs govern caller's wait behavior.
+	 *
+	 * It is safer to accept earlier final relfrozenxid and relminmxid values
+	 * than it would be to wait indefinitely for a cleanup lock.  Waiting for
+	 * a cleanup lock to freeze one heap page risks not freezing every other
+	 * eligible heap page.  Keeping up the momentum is what matters most.
+	 */
+	*minXid = nextXID - (freeze_table_age / 2);
+	if (!TransactionIdIsNormal(*minXid))
+		*minXid = FirstNormalTransactionId;
+	/* minXid must always be <= freezeLimit */
+	if (TransactionIdPrecedes(*freezeLimit, *minXid))
+		*minXid = *freezeLimit;
+
+	*minMulti = nextMXID - (multixact_freeze_table_age / 2);
+	if (*minMulti < FirstMultiXactId)
+		*minMulti = FirstMultiXactId;
+	/* minMulti must always be <= multiXactCutoff */
+	if (MultiXactIdPrecedes(*multiXactCutoff, *minMulti))
+		*minMulti = *multiXactCutoff;
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f58..b586b4aff 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -234,8 +234,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->n_dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table; however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 87b41795d..842d82f38 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2450,10 +2450,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2470,10 +2470,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a409e6281..544dcf57d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -692,11 +692,11 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
+#vacuum_freeze_table_age = -1
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
+#vacuum_multixact_freeze_table_age = -1
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 22471a81f..9388368a3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8210,7 +8210,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8399,7 +8399,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9123,20 +9123,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> performs antiwraparound vacuuming if the
+        table's <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        Antiwraparound vacuuming differs from regular vacuuming in
+        that it will reliably advance
+        <structfield>relfrozenxid</structfield> to a recent value,
+        even when <command>VACUUM</command> wouldn't usually deem it
+        necessary.  The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and
+         advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9172,9 +9183,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
-        that there is not an unreasonably short time between forced
+        that there is not an unreasonably short time between forced antiwraparound
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9220,19 +9231,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+        <command>VACUUM</command> performs antiwraparound vacuuming if
+        the table's <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+        field has reached the multixact age specified by this setting.
+        Antiwraparound vacuuming differs from regular vacuuming in
+        that it will reliably advance
+        <structfield>relminmxid</structfield> to a recent value, even
+        when <command>VACUUM</command> wouldn't usually deem it
+        necessary.  The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>
+        is used.  For more information see <xref
+         linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and
+         advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9251,7 +9271,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
-        so that there is not an unreasonably short time between forced
+        so that there is not an unreasonably short time between forced antiwraparound
         autovacuums.
         For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 554b3a75d..6e7ae4930 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing Tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long-term
+    dependencies on the transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long-term
+    storage.  Larger databases are often mostly composed of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID Address Space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32 bits wide, the
+     system is incapable of representing a
+     <emphasis>distance</emphasis> between two XIDs that exceeds
+     about 2 billion transactions.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
-    transactions older than this cutoff XID are guaranteed to have been frozen.
-    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,106 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with Transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively, it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand, <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well as
+     any page that <command>VACUUM</command> considers all visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing each page is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
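+    <para>
+     For illustration, the tables most likely to exceed the size
+     threshold (and so have their pages frozen eagerly) can be listed
+     along with their current <structfield>relfrozenxid</structfield>
+     age using only existing catalog functions; the reported sizes can
+     then be compared against the current <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> setting:
+    </para>
+<programlisting>
+-- list tables by main-fork size, with their relfrozenxid age
+SELECT c.oid::regclass AS table_name,
+       pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
+       age(c.relfrozenxid) AS xid_age
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm')
+ORDER BY pg_relation_size(c.oid) DESC;
+</programlisting>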
+    <para>
+     <xref linkend="guc-vacuum-freeze-min-age"/> and <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/> also act as
+     limits on the age of the final values that
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> can be set to.  Note that
+     a <command>VACUUM</command> using the lazy strategy is not
+     required to advance either field at all, though in practice it
+     will often do so.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +662,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
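+    <para>
+     As an illustration, the effective vacuum threshold of each table
+     can be approximated from the current settings and
+     <structname>pg_class</structname>.<structfield>reltuples</structfield>
+     (ignoring any per-table storage parameter overrides) with a query
+     such as:
+    </para>
+<programlisting>
+-- approximate per-table vacuum threshold, alongside current dead tuples
+SELECT c.oid::regclass AS table_name,
+       current_setting('autovacuum_vacuum_threshold')::float8
+         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
+         AS vacuum_threshold,
+       s.n_dead_tup AS dead_tuples
+FROM pg_class c
+JOIN pg_stat_user_tables s ON s.relid = c.oid;
+</programlisting>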
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when a smaller
+     table had <command>VACUUM</command> operations that lazily opted
+     not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
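+
+    <para>
+     As an illustration (ignoring any per-table storage parameter
+     overrides), how close each table is to being selected for an
+     anti-wraparound autovacuum can be estimated with a query such as:
+    </para>
+<programlisting>
+-- tables sorted by relfrozenxid age, with XIDs remaining before a
+-- forced autovacuum
+SELECT c.oid::regclass AS table_name,
+       age(c.relfrozenxid) AS xid_age,
+       mxid_age(c.relminmxid) AS mxid_age,
+       current_setting('autovacuum_freeze_max_age')::int - age(c.relfrozenxid)
+         AS xids_until_forced_autovacuum
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm')
+ORDER BY age(c.relfrozenxid) DESC;
+</programlisting>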
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker tasked with
+     processing a table whose age has grown very old.  It will also
+     happen during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
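+    <para>
+     For example, per-table overrides could be set as follows (the
+     table name and the values shown are purely illustrative):
+    </para>
+<programlisting>
+-- illustrative table name and settings only
+ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.05,
+                         autovacuum_freeze_max_age = 500000000);
+</programlisting>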
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 7e684d187..74a61abe2 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1501,7 +1501,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c582021d2..43ffbbbd3 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-min-age"/> and <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
@@ -154,13 +154,8 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>DISABLE_PAGE_SKIPPING</literal></term>
     <listitem>
      <para>
-      Normally, <command>VACUUM</command> will skip pages based on the <link
-      linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      Normally, <command>VACUUM</command> will skip pages based on the
+      <link linkend="vacuum-for-visibility-map">visibility map</link>.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -215,7 +210,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..fdc81a237 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,8 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples and force antiwraparound
+        mode.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +260,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..998adf526 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,18 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), without ever being
+# prepared to wait for a cleanup lock (we'll never wait on a cleanup
+# lock because autovacuum_freeze_max_age and vacuum_freeze_table_age use
+# their default settings).
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +78,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +94,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +105,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +118,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +130,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +138,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..9963b165f 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +127,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..5038dbeb3 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,7 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +71,7 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.34.1

#30 Justin Pryzby
pryzby@telsasoft.com
In reply to: Peter Geoghegan (#29)
Re: New strategies for freezing, advancing relfrozenxid early

Note that this fails under -fsanitize=align

Subject: [PATCH v5 2/6] Teach VACUUM to use visibility map snapshot.

performing post-bootstrap initialization ...
../src/backend/access/heap/visibilitymap.c:482:38: runtime error: load of misaligned address 0x5559e1352424 for type 'uint64', which requires 8 byte alignment

*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);

In reply to: Justin Pryzby (#30)
6 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Nov 10, 2022 at 7:44 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

performing post-bootstrap initialization ...
../src/backend/access/heap/visibilitymap.c:482:38: runtime error: load of misaligned address 0x5559e1352424 for type 'uint64', which requires 8 byte alignment

This issue is fixed in the attached revision, v6. I now avoid breaking
alignment-picky platforms in visibilitymap.c by using PGAlignedBlock
in the vm snapshot struct (this replaces the raw char buffer used in
earlier revisions).

Posting v6 will also keep CFTester happy. v5 no longer applies cleanly
due to conflicts caused by today's "Deduplicate freeze plans in freeze
WAL records" commit.

No other changes in v6 that are worth noting here.

Thanks
--
Peter Geoghegan

Attachments:

v6-0005-Avoid-allocating-MultiXacts-during-VACUUM.patch (application/x-patch)
From 51a863190f70c8baa6d04e3ffd06473843f3326d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v6 5/6] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit.  Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations.  We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle.  VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

FreezeMultiXactId() is now eager when it's cheap to process xmax, and
lazy when it's expensive/risky to process xmax (because an expensive
second pass that might result in allocating a new Multi is required).
We make a similar trade-off to the one made by lazy_scan_noprune() when
a cleanup lock isn't available on some heap page.  We can usually put
off freezing (for the time being) when it's inconvenient to proceed.  We
need only accept an older final relfrozenxid/relminmxid value to make
that safe, which is typically a good trade-off.

Note that MultiXactIds are processed eagerly in all cases by triggering
page-level freezing whenever FreezeMultiXactId() processes a Multi
(though not in the no-op processing case).  We don't do the same thing
with an XID based xmax.  This is closer to the historic behavior.
---
 src/backend/access/heap/heapam.c | 44 ++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a164fdb8..3dae17a9d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6122,11 +6122,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
  *
  * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ *
+ * Allocating new MultiXactIds during VACUUM is something that we always
+ * prefer to avoid.  Caller passes us down all the context required to do this
+ * on their behalf: cutoff_xid, cutoff_multi, limit_xid, and limit_multi.
+ *
+ * We must never leave behind XIDs/MXIDs from before the "limits" given.
+ * There is lots of leeway around XIDs/MXIDs >= caller's "limits", though.
+ * When it's possible to inexpensively process xmax right away, we're eager
+ * about it.  Otherwise we're lazy about it -- next time it might be cheap.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
+				  TransactionId limit_xid, MultiXactId limit_multi,
 				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
 {
 	TransactionId xid = InvalidTransactionId;
@@ -6219,13 +6229,17 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below the limit for Xids, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
+	 *
+	 * We use limit_xid for this (VACUUM's FreezeLimit), rather than using
+	 * cutoff_xid (VACUUM's OldestXmin).  We greatly prefer to avoid a second
+	 * pass over the Multi, especially when doing so results in allocating a
+	 * new replacement Multi.
 	 */
-
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6236,12 +6250,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP optimization */
 	for (i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		if (TransactionIdPrecedes(members[i].xid, limit_xid))
 		{
 			need_replace = true;
 			break;
@@ -6250,11 +6263,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			temp_xid_out = members[i].xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* The Multi itself must be >= limit_multi, too */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, limit_multi);
+
 	if (!need_replace)
 	{
 		/*
@@ -6271,6 +6283,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_would_freeze will indicate that the tuple must be frozen.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6370,7 +6385,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				/*
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
+				 * cutoff_xid is VACUUM's OldestXmin, which is also the
 				 * initial value used for top-level relfrozenxid_out tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
@@ -6538,7 +6553,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given limit for Xids.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6556,6 +6571,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi,
+									limit_xid, limit_multi,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
-- 
2.34.1

v6-0004-Make-VACUUM-s-aggressive-behaviors-continuous.patch (application/x-patch)
From f2066c8ca5ba1b6f31257a36bb3dd065ecb1e3d4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v6 4/6] Make VACUUM's aggressive behaviors continuous.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Before then, every lazy VACUUM
was "equally aggressive": each operation froze whatever tuples before
the age-wise cutoff needed to be frozen.  And each table's relfrozenxid
was updated at the end.  In short, the previous behavior was much less
efficient, but did at least have one thing going for it: it was much
easier to understand at a high level.

VACUUM no longer applies a separate mode of operation (aggressive mode).
There are still antiwraparound autovacuums, but they're now little more
than another way that autovacuum.c can launch an autovacuum worker to
run VACUUM.  The same set of behaviors previously associated with
aggressive mode are retained, but now get applied selectively, on a
timeline attuned to the needs of the table.

The closer that a table's age gets to the autovacuum_freeze_max_age
cutoff, the less VACUUM will care about avoiding the cost of scanning
extra pages to advance relfrozenxid "early".  This new approach cares
about both costs (extra pages scanned) and benefits (the need for
relfrozenxid advancements), unlike the previous approach driven by
vacuum_freeze_table_age, which "escalated to aggressive mode" purely
based on a simple XID age cutoff.  The vacuum_freeze_table_age GUC is
now relegated to a compatibility option.  Its default value is now -1,
which is interpreted as "current value of autovacuum_freeze_max_age".

VACUUM will still advance relfrozenxid at roughly the same XID-age-wise
cadence as before with static tables, but can also advance relfrozenxid
much more frequently in tables where that happens to make sense.  In
practice many tables will tend to have relfrozenxid advanced by some
amount during every VACUUM, especially larger tables and very small
tables.

The emphasis is now on keeping each table's age reasonably recent over
time, across multiple successive VACUUM operations, while spreading out
the burden of freezing, avoiding big spikes.  Freezing is now primarily
treated as an overhead of long term storage of tuples in physical heap
pages.  There is less emphasis on the role freezing plays in preventing
the system from reaching the point of an xidStopLimit outage.

Now every VACUUM might need to wait for a cleanup lock, though few will.
It can only happen when required to advance relfrozenxid to no less than
half way between the existing relfrozenxid and nextXID.  In general
there is no telling how long VACUUM might spend waiting for a cleanup
lock, so it's usually more useful to focus on keeping up with freezing
at the level of the whole table.  VACUUM can afford to set relfrozenxid
to a significantly older value in the short term, since there are now
more opportunities to advance relfrozenxid in the long term.
---
 src/include/commands/vacuum.h                 |   7 +-
 src/backend/access/heap/vacuumlazy.c          | 223 +++---
 src/backend/access/transam/multixact.c        |   5 +-
 src/backend/commands/cluster.c                |  10 +-
 src/backend/commands/vacuum.c                 | 113 +--
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   4 +-
 doc/src/sgml/config.sgml                      | 103 +--
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  27 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  30 +-
 src/test/regress/expected/reloptions.out      |   6 +-
 src/test/regress/sql/reloptions.sql           |   6 +-
 19 files changed, 638 insertions(+), 663 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 52379f819..a70df0218 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -290,7 +290,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_set_xid_limits(Relation rel,
+extern void vacuum_set_xid_limits(Relation rel,
 								  int freeze_min_age,
 								  int multixact_freeze_min_age,
 								  int freeze_table_age,
@@ -298,7 +298,10 @@ extern bool vacuum_set_xid_limits(Relation rel,
 								  TransactionId *oldestXmin,
 								  MultiXactId *oldestMxact,
 								  TransactionId *freezeLimit,
-								  MultiXactId *multiXactCutoff);
+								  MultiXactId *multiXactCutoff,
+								  TransactionId *minXid,
+								  MultiXactId *minMulti,
+								  double *antiwrapfrac);
 extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
 									  MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 278833077..97f3b83ac 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,11 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages when using the lazy freezing strategy
+ * Thresholds that control whether VACUUM will skip any all-visible pages when
+ * using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
+#define SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES	0.15
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -144,9 +145,7 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
-	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	/* Skip (don't scan) all-visible pages? */
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
@@ -178,6 +177,9 @@ typedef struct LVRelState
 	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+	/* Earliest permissible NewRelfrozenXid/NewRelminMxid values */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
@@ -258,7 +260,8 @@ static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
 									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
-									  BlockNumber all_frozen);
+									  BlockNumber all_frozen,
+									  double antiwrapfrac);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
 								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
@@ -322,13 +325,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
-				FreezeLimit;
+				FreezeLimit,
+				MinXid;
 	MultiXactId OldestMxact,
-				MultiXactCutoff;
+				MultiXactCutoff,
+				MinMulti;
+	double		antiwrapfrac;
 	BlockNumber orig_rel_pages,
 				eager_threshold,
 				all_visible,
@@ -367,33 +372,33 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
 	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
-	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 * used to determine which XIDs/MultiXactIds will be frozen.
 	 *
 	 * Also determine our cutoff for applying the eager/all-visible freezing
-	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
-	 * even during non-aggressive VACUUMs.
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy.
 	 */
-	aggressive = vacuum_set_xid_limits(rel,
-									   params->freeze_min_age,
-									   params->multixact_freeze_min_age,
-									   params->freeze_table_age,
-									   params->multixact_freeze_table_age,
-									   &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
+	vacuum_set_xid_limits(rel,
+						  params->freeze_min_age,
+						  params->multixact_freeze_min_age,
+						  params->freeze_table_age,
+						  params->multixact_freeze_table_age,
+						  &OldestXmin, &OldestMxact,
+						  &FreezeLimit, &MultiXactCutoff,
+						  &MinXid, &MinMulti, &antiwrapfrac);
 	eager_threshold = params->freeze_strategy_threshold < 0 ?
 		vacuum_freeze_strategy_threshold :
 		params->freeze_strategy_threshold;
 
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		aggressive = true;
-	}
+	/*
+	 * Make sure that antiwraparound autovacuums always have the opportunity
+	 * to advance relfrozenxid to a value >= MinXid.
+	 *
+	 * This is needed so that antiwraparound autovacuums reliably advance
+	 * relfrozenxid to the satisfaction of autovacuum.c, even when the
+	 * autovacuum_freeze_max_age reloption (not GUC) triggered the autovacuum.
+	 */
+	if (params->is_wraparound)
+		antiwrapfrac = 1.0;
 
 	/*
 	 * Setup error traceback support for ereport() first.  The idea is to set
@@ -442,10 +447,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
 	/* Initialize skipallvis/skipallfrozen before lazy_scan_strategy call */
-	vacrel->skipallvis = !aggressive;
-	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
+	vacrel->skipallvis = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
+	vacrel->skipallfrozen = vacrel->skipallvis;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -515,6 +519,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->FreezeLimit = FreezeLimit;
 	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
 	vacrel->MultiXactCutoff = MultiXactCutoff;
+	/* MinXid limits final relfrozenxid's age (always <= FreezeLimit) */
+	vacrel->MinXid = MinXid;
+	/* MinMulti limits final relminmxid's age (always <= MultiXactCutoff) */
+	vacrel->MinMulti = MinMulti;
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
@@ -538,7 +546,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
 	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
-									   all_visible, all_frozen);
+									   all_visible, all_frozen,
+									   antiwrapfrac);
 	if (verbose)
 		ereport(INFO,
 				(errmsg("vacuuming \"%s.%s.%s\"",
@@ -599,25 +608,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
-										 vacrel->relfrozenxid,
-										 vacrel->NewRelfrozenXid));
+		   TransactionIdPrecedesOrEquals(MinXid, vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
-									   vacrel->relminmxid,
-									   vacrel->NewRelminMxid));
+		   MultiXactIdPrecedesOrEquals(MinMulti, vacrel->NewRelminMxid));
 	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
-		 * lazy_scan_strategy call determined it would skip all-visible pages
+		 * Must keep original relfrozenxid when lazy_scan_strategy call
+		 * decided to skip all-visible pages
 		 */
-		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -693,23 +696,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
-			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
 			else
 			{
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+				Assert(IsAutoVacuumWorkerProcess());
+				if (params->is_wraparound)
+					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -1041,7 +1032,6 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
 			 */
-			Assert(vacrel->aggressive);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
 		}
@@ -1300,21 +1290,19 @@ lazy_scan_heap(LVRelState *vacrel)
  * On the other hand we eagerly freeze pages when that strategy spreads out
  * the burden of freezing over time.  Performance stability is important; no
  * one VACUUM operation should need to freeze disproportionately many pages.
- * Antiwraparound VACUUMs of append-only tables should generally be avoided.
  *
  * Also determines if the ongoing VACUUM operation should skip all-visible
- * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
- * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
- * important that relfrozenxid advance in affected tables, which are larger.
- * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
- * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
- * depending on the extra cost - we might need to scan only a few extra pages.
+ * pages when advancing relfrozenxid is still optional (before target rel has
+ * attained an age that forces an antiwraparound autovacuum).  Decision is
+ * based in part on caller's antiwrapfrac argument, which represents how close
+ * the table age is to forcing antiwraparound autovacuum.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
 lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
-				   BlockNumber all_visible, BlockNumber all_frozen)
+				   BlockNumber all_visible, BlockNumber all_frozen,
+				   double antiwrapfrac)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1357,21 +1345,15 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		Assert(vacrel->aggressive && !vacrel->skipallvis);
-		vacrel->allvis_freeze_strategy = true;
-		return rel_pages;
-	}
-	else if (vacrel->aggressive)
-	{
-		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
 		vacrel->allvis_freeze_strategy = true;
+		return rel_pages;
 	}
 	else if (rel_pages >= eager_threshold)
 	{
 		/*
-		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
-		 * GUC-based threshold for eager freezing.
+		 * VACUUM of table whose rel_pages now exceeds GUC-based threshold for
+		 * eager freezing.
 		 *
 		 * We always scan all-visible pages when the threshold is crossed, so
 		 * that relfrozenxid can be advanced.  There will typically be few or
@@ -1386,9 +1368,6 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		BlockNumber nextra,
 					nextra_threshold;
 
-		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
-		vacrel->allvis_freeze_strategy = false;
-
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1402,13 +1381,44 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		 * that way, so be lazy (just skip) unless the added cost is very low.
 		 * We opt for a skipallfrozen-only VACUUM when the number of extra
 		 * pages (extra scanned pages that are all-visible but not all-frozen)
-		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small)
+		 * if relfrozenxid has yet to attain an age that uses 50% of the XID
+		 * space available before the GUC cutoff for antiwraparound
+		 * autovacuum.  A more aggressive threshold of 15% is used when
+		 * relfrozenxid is older than that.
 		 */
 		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
-		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+
+		if (antiwrapfrac < 0.5)
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_THRESHOLD_PAGES;
+		else
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES;
+
 		nextra_threshold = Max(32, nextra_threshold);
 
-		vacrel->skipallvis = nextra >= nextra_threshold;
+		/*
+		 * We must advance relfrozenxid when it already attained an age that
+		 * consumes >= 90% of the available XID space (or MXID space) before
+		 * the crossover point for antiwraparound autovacuum.
+		 *
+		 * Also use eager freezing strategy when we're past the "90% towards
+		 * wraparound" point, even though the table size is below the usual
+		 * eager_threshold table size cutoff.  The added cost is usually not
+		 * too great.  We may be able to fall into a pattern of continually
+		 * advancing relfrozenxid this way.
+		 */
+		if (antiwrapfrac < 0.9)
+		{
+			vacrel->skipallvis = nextra >= nextra_threshold;
+			vacrel->allvis_freeze_strategy = false;
+		}
+		else
+		{
+			vacrel->skipallvis = false;
+			vacrel->allvis_freeze_strategy = true;
+		}
 	}
 
 	/* Return the appropriate variant of scanned_pages */
@@ -2023,11 +2033,9 @@ retry:
  * operation left LP_DEAD items behind.  We'll at least collect any such items
  * in the dead_items array for removal from indexes.
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We may return false to indicate that a full cleanup lock is required for
+ * processing by lazy_scan_prune.  This is only necessary when VACUUM needs to
+ * freeze some tuple XIDs from one or more tuples on the page.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
@@ -2095,36 +2103,23 @@ lazy_scan_noprune(LVRelState *vacrel,
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
 		if (heap_tuple_would_freeze(tupleheader,
-									vacrel->FreezeLimit,
-									vacrel->MultiXactCutoff,
+									vacrel->MinXid, vacrel->MinMulti,
 									&NewRelfrozenXid, &NewRelminMxid))
 		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
 			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
+			 * Tuple with XID < MinXid (or MXID < MinMulti)
 			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
+			 * VACUUM must always be able to advance rel's relfrozenxid and
+			 * relminmxid to minimum values.  The ongoing VACUUM won't be able
+			 * to do that unless it can freeze an XID (or MXID) from this
+			 * tuple now.
+			 *
+			 * The only safe option is to have caller perform processing of
+			 * this page using lazy_scan_prune.  Caller might have to wait a
+			 * while for a cleanup lock, but it can't be helped.
 			 */
+			vacrel->offnum = InvalidOffsetNumber;
+			return false;
 		}
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 204aa9504..ba575c5fd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2816,10 +2816,7 @@ ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
  * freeze table and the minimum freeze age based on the effective
  * autovacuum_multixact_freeze_max_age this function returns.  In the worst
  * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will try to freeze every multixact.
- *
- * It's possible that these thresholds should be user-tunable, but for now
- * we keep it simple.
+ * table will freeze every multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3b78a2f10..d2950fd6e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -824,9 +824,12 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TransactionId OldestXmin,
-				FreezeXid;
+				FreezeXid,
+				MinXid;
 	MultiXactId OldestMxact,
-				MultiXactCutoff;
+				MultiXactCutoff,
+				MinMulti;
+	double		antiwrapfrac;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -915,7 +918,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+						  &FreezeXid, &MultiXactCutoff, &MinXid, &MinMulti,
+						  &antiwrapfrac);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index df2bd53b9..5bdab6eb0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -943,21 +943,25 @@ get_all_vacuum_rels(int options)
  * - oldestMxact is the Mxid below which MultiXacts are definitely not
  *   seen as visible by any running transaction.
  * - freezeLimit is the Xid below which all Xids are definitely replaced by
- *   FrozenTransactionId during aggressive vacuums.
+ *   FrozenTransactionId in heap pages that caller can cleanup lock.
  * - multiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
+ *   removed from Xmax in heap pages that caller can cleanup lock.
+ * - minXid is the earliest valid relfrozenxid value to set in pg_class.
+ * - minMulti is the earliest valid relminmxid value to set in pg_class.
+ * - antiwrapfrac is how close the table's age is to the point that autovacuum
+ *   will launch an antiwraparound autovacuum worker.
  *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
- * minimum).
+ * The antiwrapfrac value 1.0 represents the point at which autovacuum.c
+ * scheduling considers advancing relfrozenxid strictly necessary.  Values
+ * between 0.0 and 1.0 represent how close the table is to the point of
+ * mandatory relfrozenxid/relminmxid advancement (up to minXid/minMulti).
  *
  * oldestXmin and oldestMxact are the most recent values that can ever be
  * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
  * vacuumlazy.c caller later on.  These values should be passed when it turns
  * out that VACUUM will leave no unfrozen XIDs/MXIDs behind in the table.
  */
-bool
+void
 vacuum_set_xid_limits(Relation rel,
 					  int freeze_min_age,
 					  int multixact_freeze_min_age,
@@ -966,15 +970,20 @@ vacuum_set_xid_limits(Relation rel,
 					  TransactionId *oldestXmin,
 					  MultiXactId *oldestMxact,
 					  TransactionId *freezeLimit,
-					  MultiXactId *multiXactCutoff)
+					  MultiXactId *multiXactCutoff,
+					  TransactionId *minXid,
+					  MultiXactId *minMulti,
+					  double *antiwrapfrac)
 {
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
-	int			effective_multixact_freeze_max_age;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
+	int			effective_multixact_freeze_max_age,
+				relfrozenxid_age,
+				relminmxid_age;
 
 	/*
 	 * Acquire oldestXmin.
@@ -1065,8 +1074,8 @@ vacuum_set_xid_limits(Relation rel,
 		*multiXactCutoff = *oldestMxact;
 
 	/*
-	 * Done setting output parameters; check if oldestXmin or oldestMxact are
-	 * held back to an unsafe degree in passing
+	 * Check if oldestXmin or oldestMxact are held back to an unsafe degree in
+	 * passing
 	 */
 	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
 	if (!TransactionIdIsNormal(safeOldestXmin))
@@ -1086,48 +1095,64 @@ vacuum_set_xid_limits(Relation rel,
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
+	 * Work out how close we are to needing an antiwraparound VACUUM.
 	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.   The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/* Final antiwrapfrac can come from either XID or MXID table age */
+	relfrozenxid_age = Max(nextXID - rel->rd_rel->relfrozenxid, 1);
+	relminmxid_age = Max(nextMXID - rel->rd_rel->relminmxid, 1);
+	freeze_table_age = Max(freeze_table_age, 1);
+	multixact_freeze_table_age = Max(multixact_freeze_table_age, 1);
+	XIDFrac = (double) relfrozenxid_age / (double) freeze_table_age;
+	MXIDFrac = (double) relminmxid_age / (double) multixact_freeze_table_age;
+	*antiwrapfrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Pages that caller can cleanup lock immediately will never be left with
+	 * XIDs < freezeLimit (nor with MXIDs < multiXactCutoff).  Determine
+	 * values for a distinct set of cutoffs applied to pages that cannot be
+	 * immediately cleanup locked. The cutoffs govern caller's wait behavior.
+	 *
+	 * It is safer to accept earlier final relfrozenxid and relminmxid values
+	 * than it would be to wait indefinitely for a cleanup lock.  Waiting for
+	 * a cleanup lock to freeze one heap page risks not freezing every other
+	 * eligible heap page.  Keeping up the momentum is what matters most.
+	 */
+	*minXid = nextXID - (freeze_table_age / 2);
+	if (!TransactionIdIsNormal(*minXid))
+		*minXid = FirstNormalTransactionId;
+	/* minXid must always be <= freezeLimit */
+	if (TransactionIdPrecedes(*freezeLimit, *minXid))
+		*minXid = *freezeLimit;
+
+	*minMulti = nextMXID - (multixact_freeze_table_age / 2);
+	if (*minMulti < FirstMultiXactId)
+		*minMulti = FirstMultiXactId;
+	/* minMulti must always be <= multiXactCutoff */
+	if (MultiXactIdPrecedes(*multiXactCutoff, *minMulti))
+		*minMulti = *multiXactCutoff;
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f58..b586b4aff 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -234,8 +234,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->n_dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table, however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5ca4a71d7..4dd70c334 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2456,10 +2456,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2476,10 +2476,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a409e6281..544dcf57d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -692,11 +692,11 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
+#vacuum_freeze_table_age = -1
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
+#vacuum_multixact_freeze_table_age = -1
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 109cc4727..4e39a42fe 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8215,7 +8215,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8404,7 +8404,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9120,31 +9120,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
-      <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
-      <indexterm>
-       <primary><varname>vacuum_freeze_table_age</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
-       </para>
-      </listitem>
-     </varlistentry>
-
      <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
       <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
       <indexterm>
@@ -9160,6 +9135,39 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
+      <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_table_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-xid-space"/>.
+       </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9179,7 +9187,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9225,19 +9233,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9249,10 +9265,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that
-        <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages with an older multixact ID.  The
-        default is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
+        should use to decide whether to trigger freezing of pages with
+        an older multixact ID.  The default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 554b3a75d..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly composed of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID address space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32-bits, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     Transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
-    transactions older than this cutoff XID are guaranteed to have been frozen.
-    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced whenever <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space using rules analogous to those used for
+     transaction IDs.  Many of the XID-based settings that influence
+     <command>VACUUM</command>'s behavior have direct MultiXactId
+     analogs.  A convenient way to examine information about the
+     MultiXactId address space is to execute queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
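+    <para>
+     The <function>mxid_age()</function> values reported by these
+     queries can be compared against settings such as <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/> to judge how
+     close each table is to requiring an anti-wraparound autovacuum
+     for MultiXactId reasons.
+    </para>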
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively, it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with different workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.
+     At other times <command>VACUUM</command> will freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
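+     <para>
+      For example, to see these details for a hypothetical table named
+      <literal>mytable</literal>, one can run:
+<programlisting>
+VACUUM (VERBOSE) mytable;
+</programlisting>
+     </para>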
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for every page that is
+     eligible to be frozen under the lazy criteria, as well as every
+     page that <command>VACUUM</command> considers to be all visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     only need to be processed by <command>VACUUM</command> once.
+     The total overhead of eager freezing is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
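+    <para>
+     As an illustration, and assuming that the setting can be adjusted
+     for a single session (like other <literal>vacuum_*</literal>
+     settings), the eager strategy can be requested for a hypothetical
+     table named <literal>mytable</literal>, regardless of its size,
+     by running:
+<programlisting>
+SET vacuum_freeze_strategy_threshold = 0;
+VACUUM mytable;
+</programlisting>
+     The <command>VACUUM</command> command's <literal>FREEZE</literal>
+     option also sets this parameter to zero (along with <xref
+      linkend="guc-vacuum-freeze-min-age"/>).
+    </para>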
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure that the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     autovacuum must run <command>VACUUM</command> specifically to
+     advance <structfield>relfrozenxid</structfield>, because no other
+     <command>VACUUM</command> has been triggered for some time.  In
+     practice most individual tables will consistently have reasonably
+     recent values as a result of routine vacuuming that cleans up old
+     row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
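+    <para>
+     For example, with the default
+     <varname>autovacuum_vacuum_threshold</varname> of 50 and the
+     default <varname>autovacuum_vacuum_scale_factor</varname> of 0.2,
+     a table containing 10000 tuples is vacuumed once the number of
+     obsoleted tuples exceeds:
+<programlisting>
+vacuum threshold = 50 + 0.2 * 10000 = 2050
+</programlisting>
+    </para>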
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
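+    <para>
+     Similarly, with the default
+     <varname>autovacuum_vacuum_insert_threshold</varname> of 1000 and
+     the default
+     <varname>autovacuum_vacuum_insert_scale_factor</varname> of 0.2,
+     the same 10000 tuple table is vacuumed once the number of tuples
+     inserted since the last vacuum exceeds:
+<programlisting>
+vacuum insert threshold = 1000 + 0.2 * 10000 = 3000
+</programlisting>
+    </para>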
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
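+    <para>
+     For example, with the default
+     <varname>autovacuum_analyze_threshold</varname> of 50 and the
+     default <varname>autovacuum_analyze_scale_factor</varname> of 0.1,
+     the same 10000 tuple table is analyzed once the number of tuples
+     inserted, updated, or deleted since the last
+     <command>ANALYZE</command> exceeds:
+<programlisting>
+analyze threshold = 50 + 0.1 * 10000 = 1050
+</programlisting>
+    </para>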
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when earlier
+     <command>VACUUM</command> operations against a smaller table
+     lazily opted not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
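+    <para>
+     The age of a table's <structfield>relfrozenxid</structfield> can
+     be monitored with queries that mirror the MultiXactId examples
+     shown earlier, for example:
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       age(c.relfrozenxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, age(datfrozenxid) FROM pg_database;
+</programlisting>
+     Comparing these ages against
+     <varname>autovacuum_freeze_max_age</varname> shows how close each
+     table is to requiring an anti-wraparound autovacuum.
+    </para>
+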
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any antiwraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c582021d2..43ffbbbd3 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-min-age"/> and <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
@@ -154,13 +154,8 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>DISABLE_PAGE_SKIPPING</literal></term>
     <listitem>
      <para>
-      Normally, <command>VACUUM</command> will skip pages based on the <link
-      linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      Normally, <command>VACUUM</command> will skip pages based on the
+      <link linkend="vacuum-for-visibility-map">visibility map</link>.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -215,7 +210,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..998adf526 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,18 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), without ever being
+# prepared to wait for a cleanup lock (we'll never wait on a cleanup
+# lock because autovacuum_freeze_max_age and vacuum_freeze_table_age use
+# their default settings).
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +78,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +94,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +105,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +118,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +130,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +138,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..9963b165f 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +127,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..5038dbeb3 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,7 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +71,7 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.34.1

v6-0001-Add-page-level-freezing-to-VACUUM.patchapplication/x-patch; name=v6-0001-Add-page-level-freezing-to-VACUUM.patchDownload
From 352867c5027fae6194ab1c6480cd326963e201b1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.

Teach VACUUM to decide whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit).  For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
---
 src/include/access/heapam.h          |  42 +++++-
 src/backend/access/heap/heapam.c     | 199 +++++++++++++++++----------
 src/backend/access/heap/vacuumlazy.c |  95 ++++++++-----
 3 files changed, 222 insertions(+), 114 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ebe723abb..ea709bf1b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -112,6 +112,38 @@ typedef struct HeapTupleFreeze
 	OffsetNumber offset;
 } HeapTupleFreeze;
 
+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determining whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.
+ *
+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * The relfrozenxid_nofreeze_out and relminmxid_nofreeze_out fields are
+ * alternative "no freeze" variants that must also be maintained for pages
+ * that end up not being frozen (!freeze pages).
+ */
+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze;
+
+	/* Values used when page is to be frozen based on freeze plans */
+	TransactionId relfrozenxid_out;
+	MultiXactId relminmxid_out;
+
+	/* Used by caller for '!freeze' pages */
+	TransactionId relfrozenxid_nofreeze_out;
+	MultiXactId relminmxid_nofreeze_out;
+
+} HeapPageFreeze;
+
 /* ----------------
  *		function prototypes for heap access method
  *
@@ -180,17 +212,17 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId relfrozenxid, TransactionId relminmxid,
 									  TransactionId cutoff_xid, TransactionId cutoff_multi,
+									  TransactionId limit_xid, MultiXactId limit_multi,
 									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  HeapPageFreeze *xtrack);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-										 TransactionId FreezeLimit,
+										 TransactionId OldestXmin,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									TransactionId limit_xid, MultiXactId limit_multi,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 807a09d36..2e9b860b3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6444,26 +6444,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  *
  * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
  * returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * will execute freezing for caller's page as a whole.  Caller should also
+ * initialize xtrack fields for the page as a whole before calling here with
+ * the first tuple for the page.  See HeapPageFreeze comments.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6473,34 +6461,46 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  TransactionId limit_xid, MultiXactId limit_multi,
 						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  HeapPageFreeze *xtrack)
 {
 	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
+	/*
+	 * limit_xid *must* be <= cutoff_xid, to ensure that any XID older than it
+	 * can neither be running nor seen as running by any open transaction.
+	 * This ensures that we only freeze XIDs that are safe to freeze -- those
+	 * that are already unambiguously visible to everybody.
+	 *
+	 * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+	 * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)
+	 */
+	Assert(TransactionIdPrecedesOrEquals(limit_xid, cutoff_xid));
+	Assert(MultiXactIdPrecedesOrEquals(limit_multi, cutoff_multi));
+
 	frz->frzflags = 0;
 	frz->t_infomask2 = tuple->t_infomask2;
 	frz->t_infomask = tuple->t_infomask;
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen iff our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for relfrozenxid_out handling for already-frozen xmin */
+	}
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6509,8 +6509,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoff_xid);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
@@ -6524,8 +6524,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 
@@ -6536,7 +6536,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
+	 * Make sure to keep heap_tuple_would_freeze in sync with this.  It needs
+	 * to return true for any tuple that we would force to be frozen here.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6544,7 +6545,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = xtrack->relfrozenxid_out;
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
@@ -6563,8 +6564,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
+			if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6592,10 +6596,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, xtrack->relminmxid_out));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6623,20 +6630,32 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->relfrozenxid_out));
+			if (MultiXactIdPrecedes(xid, xtrack->relminmxid_out))
+				xtrack->relminmxid_out = xid;
+			xtrack->relfrozenxid_out = mxid_oldest_xid_out;
 		}
 		else
 		{
 			/*
 			 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
 			 * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+			 *
+			 * Note: heap_tuple_would_freeze() might not insist that this xmax
+			 * be frozen now, but we always freeze Multis proactively.
 			 */
 			Assert(freeze_xmax);
 			Assert(!TransactionIdIsValid(newxmax));
 		}
+
+		/*
+		 * Trigger page level freezing to ensure that we reliably process
+		 * MultiXacts as instructed by FreezeMultiXactId() in all cases.
+		 * There is no way to opt out of this, since FreezeMultiXactId()
+		 * doesn't provide for that.
+		 */
+		if ((flags & FRM_NOOP) == 0)
+			xtrack->freeze = true;
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
@@ -6666,8 +6685,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
 		}
 	}
 	else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
@@ -6683,6 +6702,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(changed);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6713,11 +6737,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we ignore the cutoff_xid and just always perform the
 		 * freeze operation.  The oldest release in which such a value can
 		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
+		 * was removed in PostgreSQL 9.0.
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
@@ -6731,18 +6751,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
 			changed = true;
+			xtrack->freeze = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen (provided caller executes freeze plan for the page)
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
+
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < limit_xid (and MXIDs < limit_multi) must never remain
+	 */
+	if (!xtrack->freeze && !(xmin_already_frozen && xmax_already_frozen))
+		xtrack->freeze =
+			heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+									&xtrack->relfrozenxid_nofreeze_out,
+									&xtrack->relminmxid_nofreeze_out);
+
 	return changed;
 }
 
@@ -6786,13 +6824,13 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
  */
 void
 heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-							 TransactionId FreezeLimit,
+							 TransactionId OldestXmin,
 							 HeapTupleFreeze *tuples, int ntuples)
 {
 	Page		page = BufferGetPage(buffer);
 
 	Assert(ntuples > 0);
-	Assert(TransactionIdIsValid(FreezeLimit));
+	Assert(TransactionIdIsValid(OldestXmin));
 
 	START_CRIT_SECTION();
 
@@ -6822,11 +6860,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 
 		/*
 		 * latestRemovedXid describes the latest processed XID, whereas
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
+		 * OldestXmin is the first XID not frozen by VACUUM.  Back up caller's
+		 * OldestXmin to avoid false conflicts.
 		 */
-		latestRemovedXid = FreezeLimit;
+		latestRemovedXid = OldestXmin;
 		TransactionIdRetreat(latestRemovedXid);
 
 		xlrec.latestRemovedXid = latestRemovedXid;
@@ -6868,14 +6905,20 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	HeapTupleFreeze frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	HeapPageFreeze dummy;
+
+	dummy.freeze = true;
+	dummy.relfrozenxid_out = cutoff_xid;
+	dummy.relminmxid_out = cutoff_multi;
+	dummy.relfrozenxid_nofreeze_out = cutoff_xid;
+	dummy.relminmxid_nofreeze_out = cutoff_multi;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
+										  cutoff_xid, cutoff_multi,
 										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+										  &dummy);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7301,17 +7344,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * heap_tuple_would_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * force freezing of any of the XID/MXID fields from the tuple, given the same
+ * limits.  We must also deal with dead tuples here, since (xmin, xmax, xvac)
+ * fields could be processed by pruning away the whole tuple instead of
+ * freezing.
+ *
+ * Note: VACUUM refers to limit_xid and limit_multi as "FreezeLimit" and
+ * "MultiXactCutoff" respectively.  These should not be confused with the
+ * absolute cutoffs for freezing.  We just determine whether caller's tuple
+ * and limits trigger heap_prepare_freeze_tuple to force freezing.
  *
  * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
  * like the heap_prepare_freeze_tuple arguments that they're based on.  We
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						TransactionId limit_xid, MultiXactId limit_multi,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
@@ -7325,7 +7374,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	{
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 
@@ -7342,7 +7391,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		/* xmax is a non-permanent XID */
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, limit_xid))
 			would_freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
@@ -7365,7 +7414,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
+		if (MultiXactIdPrecedes(multi, limit_multi))
 			would_freeze = true;
 
 		/* need to check whether any member of the mxact is old */
@@ -7378,7 +7427,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 			Assert(TransactionIdIsNormal(xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(xid, limit_xid))
 				would_freeze = true;
 		}
 		if (nmembers > 0)
@@ -7392,7 +7441,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		{
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			would_freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 834ab83a0..26a4784f3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoffs for freezing and pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -511,6 +512,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
@@ -1563,8 +1565,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	HeapPageFreeze xtrack;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
@@ -1580,8 +1582,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.freeze = false;
+	xtrack.relfrozenxid_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_out = vacrel->NewRelminMxid;
+	xtrack.relfrozenxid_nofreeze_out = vacrel->NewRelfrozenXid;
+	xtrack.relminmxid_nofreeze_out = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1634,27 +1639,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Our assessment of whether the page 'hastup' is
+			 * inherently race-prone.  It must be treated as unreliable by
+			 * caller anyway, so we might as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1782,11 +1783,13 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
 									  vacrel->FreezeLimit,
 									  vacrel->MultiXactCutoff,
 									  &frozen[tuples_frozen],
 									  &tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
 		{
 			/* Save prepared freeze plan for later */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1807,9 +1810,33 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.freeze || tuples_frozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_out;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_nofreeze_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_nofreeze_out;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
@@ -1817,12 +1844,12 @@ retry:
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
 		/* Execute all freeze plans for page as a single atomic action */
-		heap_freeze_execute_prepared(vacrel->rel, buf, vacrel->FreezeLimit,
+		heap_freeze_execute_prepared(vacrel->rel, buf, vacrel->OldestXmin,
 									 frozen, tuples_frozen);
 	}
 
@@ -1841,7 +1868,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1849,8 +1876,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1871,9 +1897,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1887,6 +1910,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
-- 
2.34.1

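To make the new control flow in lazy_scan_prune easier to follow, here is a
minimal standalone sketch of how the two pairs of tracked values and the
page-level 'freeze' flag interact.  This is not the patch's code -- the type
and function names below are made up for illustration, and the bookkeeping is
heavily simplified (only xmin is considered):

/*
 * Minimal sketch of lazy_scan_prune's page-level freezing decision.
 * Names and bookkeeping are illustrative simplifications, not the patch's
 * actual data structures.  Build with: cc sketch.c
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t xid_t;

typedef struct PageFreezeTrack
{
	bool	freeze;				/* page must be frozen (XID < FreezeLimit seen) */
	xid_t	relfrozenxid_out;	/* oldest XID left behind if we freeze the page */
	xid_t	relfrozenxid_nofreeze_out;	/* oldest XID left behind if we don't */
} PageFreezeTrack;

/* Per-tuple bookkeeping, loosely analogous to heap_prepare_freeze_tuple */
static bool
prepare_freeze_tuple(xid_t xmin, xid_t oldest_xmin, xid_t freeze_limit,
					 PageFreezeTrack *track)
{
	bool	will_freeze = (xmin < oldest_xmin); /* eligible for freezing */

	if (xmin < freeze_limit)
		track->freeze = true;	/* freezing the page becomes mandatory */

	/* XIDs that are about to be frozen needn't hold back relfrozenxid_out */
	if (!will_freeze && xmin < track->relfrozenxid_out)
		track->relfrozenxid_out = xmin;
	/* The "no freeze" variant assumes every XID is left in place */
	if (xmin < track->relfrozenxid_nofreeze_out)
		track->relfrozenxid_nofreeze_out = xmin;

	return will_freeze;
}

int
main(void)
{
	const xid_t oldest_xmin = 1000;		/* VACUUM's OldestXmin */
	const xid_t freeze_limit = 500;		/* VACUUM's FreezeLimit */
	xid_t		page_xmins[] = {450, 800, 990}; /* tuples on one heap page */
	PageFreezeTrack track = {false, oldest_xmin, oldest_xmin};
	int			tuples_frozen = 0;

	for (int i = 0; i < 3; i++)
		if (prepare_freeze_tuple(page_xmins[i], oldest_xmin, freeze_limit, &track))
			tuples_frozen++;

	/* Freeze when forced to, or when there is nothing to freeze anyway */
	if (track.freeze || tuples_frozen == 0)
		printf("freeze page, relfrozenxid floor %u\n",
			   (unsigned) track.relfrozenxid_out);
	else
		printf("leave page unfrozen, relfrozenxid floor %u\n",
			   (unsigned) track.relfrozenxid_nofreeze_out);
	return 0;
}
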
v6-0003-Add-eager-freezing-strategy-to-VACUUM.patch
From 4f5969932451869f0f28295933c28de49a22fdf2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v6 3/6] Add eager freezing strategy to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach (actually, we always
use eager freezing in aggressive VACUUMs, though they are expected to be
much rarer now).

When the eager strategy is in use, lazy_scan_prune will trigger freezing of
a page's tuples at the point that it notices that the page will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have aggressive/antiwraparound
autovacuums for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.
---
 src/include/access/heapam.h                   |  8 +-
 src/include/commands/vacuum.h                 |  4 +
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++
 src/backend/access/heap/heapam.c              | 11 ++-
 src/backend/access/heap/vacuumlazy.c          | 76 ++++++++++++++++---
 src/backend/commands/vacuum.c                 |  4 +
 src/backend/postmaster/autovacuum.c           | 10 +++
 src/backend/utils/misc/guc_tables.c           | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 31 ++++++--
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++
 13 files changed, 164 insertions(+), 24 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ea709bf1b..cfe8eb39e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -127,7 +127,11 @@ typedef struct HeapTupleFreeze
  * pg_class tuple.
  *
  * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
- * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ * relminmxid_nofreeze_out must also be maintained.  If vacuumlazy.c caller
+ * opts to not execute freeze plans produced by heap_prepare_freeze_tuple for
+ * its own reasons, then new relfrozenxid and relminmxid values must reflect
+ * that choice.  (This is only safe when 'freeze' is still unset after the
+ * final heap_prepare_freeze_tuple call for the page.)
  */
 typedef struct HeapPageFreeze
 {
@@ -138,7 +142,7 @@ typedef struct HeapPageFreeze
 	TransactionId relfrozenxid_out;
 	MultiXactId relminmxid_out;
 
-	/* Used by caller for '!freeze' pages */
+	/* Used by caller that opts not to freeze a '!freeze' page */
 	TransactionId relfrozenxid_nofreeze_out;
 	MultiXactId relminmxid_nofreeze_out;
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f..52379f819 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -256,6 +259,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2e9b860b3..6a164fdb8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6440,7 +6440,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * WAL-log what we would need to do, and return true.  Return false if nothing
  * is to be changed.  In addition, set *totally_frozen to true if the tuple
  * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * more freezing will eventually be required (assuming page is to be frozen).
+ *
+ * Although this interface is primarily tuple-based, caller decides on whether
+ * or not to freeze the page as a whole.  We'll often help caller to prepare a
+ * complete "freeze plan" that it ultimately discards.  However, our caller
+ * doesn't always get to choose; it must freeze when xtrack.freeze is set
+ * here.  This ensures that any XIDs < limit_xid are never left behind.
  *
  * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
  * returned true for when called.  A later heap_freeze_execute_prepared call
@@ -6652,7 +6658,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * Trigger page level freezing to ensure that we reliably process
 		 * MultiXacts as instructed by FreezeMultiXactId() in all cases.
 		 * There is no way to opt out of this, since FreezeMultiXactId()
-		 * doesn't provide for that.
+		 * doesn't provide for that. (It helps us with relfrozenxid_out, not
+		 * with relfrozenxid_nofreeze_out.)
 		 */
 		if ((flags & FRM_NOOP) == 0)
 			xtrack->freeze = true;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0e919d697..278833077 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -150,6 +150,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Proactively freeze all tuples on pages about to be set all-visible? */
+	bool		allvis_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -254,6 +256,7 @@ typedef struct LVSavedErrInfo
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -327,6 +330,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				eager_threshold,
 				all_visible,
 				all_frozen,
 				scanned_pages,
@@ -366,6 +370,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
 	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
 	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 *
+	 * Also determine our cutoff for applying the eager/all-visible freezing
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
+	 * even during non-aggressive VACUUMs.
 	 */
 	aggressive = vacuum_set_xid_limits(rel,
 									   params->freeze_min_age,
@@ -374,6 +382,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   params->multixact_freeze_table_age,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
+	eager_threshold = params->freeze_strategy_threshold < 0 ?
+		vacuum_freeze_strategy_threshold :
+		params->freeze_strategy_threshold;
 
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
@@ -526,7 +537,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
-	scanned_pages = lazy_scan_strategy(vacrel,
+	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
 									   all_visible, all_frozen);
 	if (verbose)
 		ereport(INFO,
@@ -1282,17 +1293,28 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing avoids work that later turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ * Antiwraparound VACUUMs of append-only tables should generally be avoided.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * important that relfrozenxid advance in affected tables, which are larger.
+ * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
+ * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
+ * depending on the extra cost - we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1325,21 +1347,48 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
 	else if (vacrel->aggressive)
+	{
+		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
+	}
+	else if (rel_pages >= eager_threshold)
+	{
+		/*
+		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
+		 * GUC-based threshold for eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
+		 */
+		vacrel->allvis_freeze_strategy = true;
+		vacrel->skipallvis = false;
+	}
 	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		vacrel->allvis_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1847,8 +1896,15 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When ongoing VACUUM opted to use the all-visible freezing strategy we
+	 * freeze any page that will become all-visible, making it all-frozen
+	 * instead. (Actually, there are edge-cases where this might not result in
+	 * marking the page all-frozen in the visibility map, but that should have
+	 * only a negligible impact.)
 	 */
-	if (xtrack.freeze || tuples_frozen == 0)
+	if (xtrack.freeze || tuples_frozen == 0 ||
+		(vacrel->allvis_freeze_strategy && prunestate->all_visible))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3c8ea2147..df2bd53b9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 601834d4b..72be67da0 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 836b49484..5ca4a71d7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2483,6 +2483,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c35..a409e6281 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd50ea8e4..109cc4727 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9145,6 +9145,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to use its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9153,9 +9168,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9232,10 +9249,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages with an older multixact ID.  The
+        default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.34.1

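For a concrete sense of how the two decisions fit together, here is a
standalone sketch of the strategy selection described above.  The constants
mirror the defaults, but the types and function are made up for illustration
and the last-page adjustment is ignored:

/*
 * Standalone sketch of the freezing/skipping strategy decisions (illustrative
 * names, not the patch's lazy_scan_strategy).  Build with: cc strategy.c
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ						8192
#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */

typedef uint32_t blkno_t;

typedef struct Strategy
{
	bool	eager_freeze;	/* freeze pages that will become all-visible */
	bool	skipallvis;		/* skip all-visible (not just all-frozen) pages */
} Strategy;

static Strategy
choose_strategy(blkno_t rel_pages, blkno_t all_visible, blkno_t all_frozen,
				blkno_t freeze_strategy_threshold)
{
	Strategy	s = {false, false};

	if (rel_pages >= freeze_strategy_threshold)
	{
		/* Large table: freeze eagerly, and scan all-visible pages too */
		s.eager_freeze = true;
		s.skipallvis = false;
	}
	else
	{
		/*
		 * Small table: freeze lazily.  Only scan all-visible pages (allowing
		 * relfrozenxid advancement) when the extra cost is small.
		 */
		blkno_t		nextra = all_visible - all_frozen;
		blkno_t		nextra_threshold =
			(blkno_t) (rel_pages * SKIPALLVIS_THRESHOLD_PAGES);

		if (nextra_threshold < 32)
			nextra_threshold = 32;
		s.skipallvis = (nextra >= nextra_threshold);
	}
	return s;
}

int
main(void)
{
	/* Default GUC value: tables over 4GB use the eager freezing strategy */
	const blkno_t threshold = (blkno_t) ((4ULL * 1024 * 1024 * 1024) / BLCKSZ);
	Strategy	small = choose_strategy(10000, 9000, 8990, threshold);
	Strategy	large = choose_strategy(1000000, 900000, 890000, threshold);

	printf("small table: eager_freeze=%d skipallvis=%d\n",
		   small.eager_freeze, small.skipallvis);
	printf("large table: eager_freeze=%d skipallvis=%d\n",
		   large.eager_freeze, large.skipallvis);
	return 0;
}
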
v6-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch
From 8f3b6237affda15101ffb0b88787bfd6bb92e32f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v6 2/6] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now our policy around
skipping all-visible pages is exactly the same condition as whether or
not it's safe to advance relfrozenxid later on; nothing is left to
chance.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
---
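Before the full diff, a rough stand-in for the idea: the snapshot is just an
immutable copy of the VM's status bits, taken once after OldestXmin is
established, that lazy_scan_skip consults instead of the live map.  The struct
and helpers below are made-up illustrations (the real patch copies whole VM
pages into palloc'd local buffers), not the visibilitymap_snap implementation:

/*
 * Simplified stand-in for a visibility map snapshot.  One status byte per
 * heap block, illustrative names only.  Build with: cc vmsnap.c
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SNAP_ALL_VISIBLE	0x01
#define SNAP_ALL_FROZEN		0x02

typedef struct vmsnapshot
{
	uint32_t	rel_pages;
	uint8_t	   *status;			/* one status byte per heap block */
} vmsnapshot;

/* Take the snapshot once, before the heap scan begins */
static vmsnapshot *
vmsnap_acquire(const uint8_t *live_vm, uint32_t rel_pages)
{
	vmsnapshot *snap = malloc(sizeof(vmsnapshot));

	snap->rel_pages = rel_pages;
	snap->status = malloc(rel_pages);
	memcpy(snap->status, live_vm, rel_pages);	/* immutable copy */
	return snap;
}

/* Lookups see the state as of the start of VACUUM, not the live VM */
static uint8_t
vmsnap_status(const vmsnapshot *snap, uint32_t blkno)
{
	return blkno < snap->rel_pages ? snap->status[blkno] : 0;
}

static void
vmsnap_release(vmsnapshot *snap)
{
	free(snap->status);
	free(snap);
}

int
main(void)
{
	uint8_t		live_vm[4] = {SNAP_ALL_VISIBLE | SNAP_ALL_FROZEN,
							  SNAP_ALL_VISIBLE, 0, SNAP_ALL_VISIBLE};
	vmsnapshot *snap = vmsnap_acquire(live_vm, 4);

	/* A concurrent update clears a bit in the authoritative VM ... */
	live_vm[1] = 0;

	/* ... but the snapshot, and hence the set of pages to scan, is unchanged */
	printf("block 1 snapshot status: 0x%02x\n", vmsnap_status(snap, 1));

	vmsnap_release(snap);
	return 0;
}
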
 src/include/access/visibilitymap.h      |   7 +
 src/backend/access/heap/vacuumlazy.c    | 342 +++++++++++++-----------
 src/backend/access/heap/visibilitymap.c | 164 ++++++++++++
 3 files changed, 359 insertions(+), 154 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 26a4784f3..0e919d697 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -146,8 +146,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -177,7 +179,8 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map used by lazy_scan_skip */
+	vmsnapshot *vmsnap;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -250,10 +253,11 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
+								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -316,7 +320,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
@@ -324,6 +327,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -369,7 +375,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
 
-	skipwithvm = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
@@ -377,7 +382,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * visibility map (even those set all-frozen)
 		 */
 		aggressive = true;
-		skipwithvm = false;
 	}
 
 	/*
@@ -402,20 +406,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -442,7 +432,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	/* Initialize skipallvis/skipallfrozen before lazy_scan_strategy call */
+	vacrel->skipallvis = !aggressive;
+	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -503,12 +495,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
@@ -521,7 +507,36 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
+	 * headers to be able to safely advance relfrozenxid later on.  There is
+	 * no good reason to scan any additional pages. (Actually we might opt to
+	 * skip all-visible pages.  Either way we won't scan pages for no reason.)
+	 *
+	 * Now that OldestXmin and rel_pages are acquired, acquire an immutable
+	 * snapshot of the visibility map as well.  lazy_scan_skip works off of
+	 * the vmsnap, not the authoritative VM, which can continue to change.
+	 * Pages that lazy_scan_heap will scan are fixed and known in advance.
+	 *
+	 * The exact number of pages that lazy_scan_heap will scan also depends on
+	 * our choice of skipping strategy.  VACUUM can either choose to skip any
+	 * all-visible pages lazily, or choose to scan those same pages instead.
+	 * Decide on a skipping strategy to determine final scanned_pages.
+	 */
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   all_visible, all_frozen);
+	if (verbose)
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -538,6 +553,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -584,12 +600,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -634,6 +649,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -661,10 +679,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -857,13 +871,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -877,42 +890,24 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = lazy_scan_skip(vacrel, 0, &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		if (blkno < next_block_to_scan)
+			continue;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = lazy_scan_skip(vacrel, blkno + 1,
+											&next_all_visible);
 
 		vacrel->scanned_pages++;
 
@@ -1122,10 +1117,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1153,12 +1147,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1197,7 +1189,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1290,47 +1282,121 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
  *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * Determines if the ongoing VACUUM operation should skip all-visible pages
+ * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
 
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
+	if (!vacrel->skipallfrozen)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		return rel_pages;
+	}
+	else if (vacrel->aggressive)
+		Assert(!vacrel->skipallvis);
+	else
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		vacrel->skipallvis = nextra >= nextra_threshold;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+		return scanned_pages_skipallvis;
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* keep compiler quiet */
+}
+
+/*
+ *	lazy_scan_skip() -- get the next block to scan according to vmsnap.
+ *
+ * lazy_scan_heap() caller passes the next block in line.  We return the next
+ * block to scan.  Caller skips the blocks preceding returned block, if any.
+ *
+ * The all-visible status of the returned block is set in *all_visible, too.
+ * Block usually won't be all-visible (since it's unskippable), but it can be
+ * when next_block is rel's last page and when DISABLE_PAGE_SKIPPING is in use.
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block, bool *all_visible)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				next_block_to_scan = next_block;
+
+	*all_visible = true;
+	while (next_block_to_scan < rel_pages)
+	{
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_block_to_scan);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
 			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
+			*all_visible = false;
 			break;
 		}
 
@@ -1341,58 +1407,26 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
-		if (next_unskippable_block == rel_pages - 1)
+		if (next_block_to_scan == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
+		next_block_to_scan++;
 	}
 
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
-	}
-
-	return next_unskippable_block;
+	return next_block_to_scan;
 }
 
 /*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..cfe3cf9b6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	PGAlignedBlock vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -373,6 +395,148 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snap_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of concurrently unset heap pages.  VACUUM prefers to leave
+ * them to be scanned during the next VACUUM operation.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) +
+					sizeof(PGAlignedBlock) * nvmpages);
+
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages->data + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
-- 
2.34.1

Attachment: v6-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patch (application/x-patch)
From 29a8a0d067030a1ffdaddaeca2ef2f8a2c9eef94 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 23 Jul 2022 17:19:01 -0700
Subject: [PATCH v6 6/6] Size VACUUM's dead_items space using VM snapshot.

VACUUM knows precisely how many pages it will scan ahead of time from
its snapshot of the visibility map following recent work.  Apply that
information to size the dead_items space for TIDs more precisely (use
scanned_pages instead of rel_pages to cap the allocation).

This can make the memory allocation significantly smaller, without any
added risk of undersizing the array.  Since VACUUM's final scanned_pages
is fully predetermined (by the visibility map snapshot), there is no
question of interference from another backend that concurrently unsets
some heap page's visibility map bit.  Many details of how VACUUM will
process the target relation are "locked in" from the very beginning.
---
 src/backend/access/heap/vacuumlazy.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 97f3b83ac..e3039ce63 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -293,7 +293,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -566,7 +567,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
@@ -3184,14 +3185,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3200,15 +3200,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3230,12 +3228,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
-- 
2.34.1

#32 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#31)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2022-11-15 19:02:12 -0800, Peter Geoghegan wrote:

From 352867c5027fae6194ab1c6480cd326963e201b1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields. OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit). For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
---
src/include/access/heapam.h | 42 +++++-
src/backend/access/heap/heapam.c | 199 +++++++++++++++++----------
src/backend/access/heap/vacuumlazy.c | 95 ++++++++-----
3 files changed, 222 insertions(+), 114 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ebe723abb..ea709bf1b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -112,6 +112,38 @@ typedef struct HeapTupleFreeze
OffsetNumber offset;
} HeapTupleFreeze;
+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determing whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.

Perhaps this could say something like "what the oldest extant XID/MXID
currently is and what it would be if we decide to freeze the page" or such?

+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all

"VACUUM caller's heap rel." could stand to be rephrased.

+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
+ * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ */

relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

I think it might be better to just always maintain the nofreeze state.

+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze;

s/freeze/freeze_required/?

+	/* Values used when page is to be frozen based on freeze plans */
+	TransactionId relfrozenxid_out;
+	MultiXactId relminmxid_out;
+
+	/* Used by caller for '!freeze' pages */
+	TransactionId relfrozenxid_nofreeze_out;
+	MultiXactId relminmxid_nofreeze_out;
+
+} HeapPageFreeze;
+

Given the number of parameters to heap_prepare_freeze_tuple, why don't we pass
in more of them in via HeapPageFreeze?

/* ----------------
*		function prototypes for heap access method
*
@@ -180,17 +212,17 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
+									  TransactionId limit_xid, MultiXactId limit_multi,
HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  HeapPageFreeze *xtrack);

What does 'xtrack' stand for? Xid Tracking?

* VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
* returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * will execute freezing for caller's page as a whole.  Caller should also
+ * initialize xtrack fields for page as a whole before calling here with first
+ * tuple for the page.  See page_frozenxid_tracker comments.

s/should/need to/?

page_frozenxid_tracker appears to be a dangling pointer.

+	 * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+	 * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)

Hm. Perhaps we should just rename them if it requires this kind of
explanation? They're really not good names.

@@ -6524,8 +6524,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
else
{
/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = xid;
}
}

Could use TransactionIdOlder().

@@ -6563,8 +6564,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
*/
Assert(!freeze_xmax);
Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+										   &xtrack->relfrozenxid_nofreeze_out,
+										   &xtrack->relminmxid_nofreeze_out));
+			if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+				xtrack->relfrozenxid_out = newxmax;

Perhaps the Assert(heap_tuple_would_freeze()) bit could be handled once at the
end of the routine, for all paths?

@@ -6731,18 +6751,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
else
frz->frzflags |= XLH_FREEZE_XVAC;

-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
changed = true;
+			xtrack->freeze = true;
}
}

Oh - I totally didn't realize that ->freeze is an out parameter. Seems a bit
odd to have the other fields suffixed with _out but not this one?

@@ -6786,13 +6824,13 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
*/
void
heap_freeze_execute_prepared(Relation rel, Buffer buffer,
- TransactionId FreezeLimit,
+ TransactionId OldestXmin,
HeapTupleFreeze *tuples, int ntuples)
{
Page page = BufferGetPage(buffer);

Assert(ntuples > 0);
-	Assert(TransactionIdIsValid(FreezeLimit));
+	Assert(TransactionIdIsValid(OldestXmin));

START_CRIT_SECTION();

@@ -6822,11 +6860,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,

/*
* latestRemovedXid describes the latest processed XID, whereas
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
+		 * OldestXmin is the first XID not frozen by VACUUM.  Back up caller's
+		 * OldestXmin to avoid false conflicts.
*/
-		latestRemovedXid = FreezeLimit;
+		latestRemovedXid = OldestXmin;
TransactionIdRetreat(latestRemovedXid);

xlrec.latestRemovedXid = latestRemovedXid;

Won't using OldestXmin instead of FreezeLimit potentially cause additional
conflicts? Is there any reason to not compute an accurate value?

@@ -1634,27 +1639,23 @@ retry:
continue;
}

-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
if (ItemIdIsDead(itemid))
{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether the page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
continue;
}

What does this have to do with the rest of the commit? And why are we doing
this?

@@ -1782,11 +1783,13 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[tuples_frozen],
&tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
{
/* Save prepared freeze plan for later */
frozen[tuples_frozen++].offset = offnum;
@@ -1807,9 +1810,33 @@ retry:
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
*/
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.freeze || tuples_frozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.

Seems quite confusing to enter a block with described as "We're freezing the
page." when we're not freezing anything (tuples_frozen == 0).

+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.relfrozenxid_out;
+		vacrel->NewRelminMxid = xtrack.relminmxid_out;
+		freeze_all_eligible = true;

I don't really get what freeze_all_eligible is trying to do.

#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
{
TransactionId cutoff;
bool		all_frozen;
@@ -1849,8 +1876,7 @@ retry:
if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
Assert(false);

Not related to this change, but why isn't this just
Assert(heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))?

From 8f3b6237affda15101ffb0b88787bfd6bb92e32f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v6 2/6] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This should include a description of the memory usage effects.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip. The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").

Why is it an advantage for the number of pages to not increase?

It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins. That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior). Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.

Why?

VACUUM will now either scan
every all-visible page, or none at all. This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.

The main goal according to bf136cf6 was to avoid defeating OS readahead, so I
think it should be mentioned here.

To me this is something that ought to be changed separately from the rest of
this commit.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work). For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().

HEAPBLOCKS_PER_PAGE is 32672 with the defaults. The maximum relation size is
2**32 - 1 blocks. So the max FSM size is 131458 pages, a bit more than 1GB. Is
that correct?

For large relations that are already nearly all-frozen this does add a
noticable amount of overhead, whether spilled to disk or not. Of course
they're also not going to be vacuumed super often, but ...

Perhaps worth turning the VM into a range based description for the snapshot,
given it's a readonly datastructure in local memory? And we don't necessarily
need the all-frozen and all-visible in memory, one should suffice? We don't
even need random access, so it could easily be allocated incrementally, rather
than one large allocation.

Hard to imagine anybody having a multi-TB table without "runs" of
all-visible/all-frozen. I don't think it'd be worth worrying about patterns
that'd be inefficient in a range representation.

+	/*
+	 * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
+	 * headers to be able to safely advance relfrozenxid later on.  There is
+	 * no good reason to scan any additional pages. (Actually we might opt to
+	 * skip all-visible pages.  Either way we won't scan pages for no reason.)
+	 *
+	 * Now that OldestXmin and rel_pages are acquired, acquire an immutable
+	 * snapshot of the visibility map as well.  lazy_scan_skip works off of
+	 * the vmsnap, not the authoritative VM, which can continue to change.
+	 * Pages that lazy_scan_heap will scan are fixed and known in advance.

Hm. It's a bit sad to compute the snapshot after determining OldestXmin.

We probably should refresh OldestXmin periodically. That won't allow us to get
a more aggressive relfrozenxid, but it'd allow to remove more gunk.

+	 *
+	 * The exact number of pages that lazy_scan_heap will scan also depends on
+	 * our choice of skipping strategy.  VACUUM can either choose to skip any
+	 * all-visible pages lazily, or choose to scan those same pages instead.

What does it mean to "skip lazily"?

+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.

This part of the comment seems like it actually belongs further down?

+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */

Hm - why would those bits already be set?

+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}

From 4f5969932451869f0f28295933c28de49a22fdf2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v6 3/6] Add eager freezing strategy to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate. Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption). Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.

What's the logic behind a hard threshold? Suddenly freezing everything on a
huge relation seems problematic. I realize that never getting all that far
behind is part of the theory, but I don't think that's always going to work.

Wouldn't a better strategy be to freeze a percentage of the relation on every
non-aggressive vacuum? That way the amount of work for an eventual aggressive
vacuum will shrink, without causing individual vacuums to take extremely long.

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead. We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

The other thing that I think would be good to use is a) whether the page is
already in s_b, and b) whether the page already is dirty. The cost of freezing
shrinks significantly if it doesn't cause an additional read + write. And that
additional IO is IMO one of the major concerns with freezing much more
aggressively in OLTPish workloads where a lot of the rows won't ever get old
enough to need freezing.

From f2066c8ca5ba1b6f31257a36bb3dd065ecb1e3d4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v6 4/6] Make VACUUM's aggressive behaviors continuous.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4. Before then, every lazy VACUUM
was "equally aggressive": each operation froze whatever tuples before
the age-wise cutoff needed to be frozen. And each table's relfrozenxid
was updated at the end. In short, the previous behavior was much less
efficient, but did at least have one thing going for it: it was much
easier to understand at a high level.

VACUUM no longer applies a separate mode of operation (aggressive mode).
There are still antiwraparound autovacuums, but they're now little more
than another way that autovacuum.c can launch an autovacuum worker to
run VACUUM.

The most significant aspect of anti-wrap autovacuums right now is that they
don't auto-cancel. Is that still used? If so, what's the threshold?

IME one of the most common reasons for autovac not keeping up is that the
application occasionally acquires conflicting locks on one of the big
tables. Before reaching anti-wrap age all autovacuums on that table get
cancelled before it gets to update relfrozenxid. Once in that situation
autovac really focusses only on that relation...

Now every VACUUM might need to wait for a cleanup lock, though few will.
It can only happen when required to advance relfrozenxid to no less than
half way between the existing relfrozenxid and nextXID.

Where's that "halfway" bit coming from?

Isn't "half way between the relfrozenxid and nextXID" a problem for instances
with longrunning transactions? Wouldn't this mean waiting for every page if
relfrozenxid can't be advanced much because of a longrunning query or such?

From 51a863190f70c8baa6d04e3ffd06473843f3326d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v6 5/6] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit. Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations. We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle. VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

Strictly speaking that's not quite true, you can also drop/truncate tables ;)

Greetings,

Andres Freund

In reply to: Andres Freund (#32)
6 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Nov 15, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:

Subject: [PATCH v6 1/6] Add page-level freezing to VACUUM.

Attached is v7, which incorporates much of your feedback. Thanks for the review!

+/*
+ * State used by VACUUM to track what the oldest extant XID/MXID will become
+ * when determining whether and how to freeze a page's heap tuples via calls to
+ * heap_prepare_freeze_tuple.

Perhaps this could say something like "what the oldest extant XID/MXID
currently is and what it would be if we decide to freeze the page" or such?

Fixed.

+ * The relfrozenxid_out and relminmxid_out fields are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all

"VACUUM caller's heap rel." could stand to be rephrased.

Fixed.

+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each heap_prepare_freeze_tuple call pushes back relfrozenxid_out and/or
+ * relminmxid_out as needed to avoid unsafe values in rel's authoritative
+ * pg_class tuple.
+ *
+ * Alternative "no freeze" variants of relfrozenxid_nofreeze_out and
+ * relminmxid_nofreeze_out must also be maintained for !freeze pages.
+ */

relfrozenxid_nofreeze_out isn't really a "no freeze variant" :)

Why not? I think that that's exactly what it is. We maintain these
alternative "oldest extant XID" values so that vacuumlazy.c's
lazy_scan_prune function can "opt out" of freezing. This is exactly
the same as what we do in lazy_scan_noprune, both conceptually and at
the implementation level.

I think it might be better to just always maintain the nofreeze state.

Not sure. Even if there is very little to gain in cycles, needlessly
maintaining the "nofreeze" cutoffs is still a pure waste that can easily
be avoided. So it just feels natural to not waste those cycles -- it may
even make the design clearer.

+typedef struct HeapPageFreeze
+{
+     /* Is heap_prepare_freeze_tuple caller required to freeze page? */
+     bool            freeze;

s/freeze/freeze_required/?

Fixed.

Given the number of parameters to heap_prepare_freeze_tuple, why don't we pass
in more of them in via HeapPageFreeze?

HeapPageFreeze is supposed to be mutable state used for one single
page, though. Seems like we should use a separate immutable struct for
this instead.

I've already prototyped a dedicated immutable "cutoffs" struct, which
is instantiated exactly once per VACUUM. Seems like a good approach to
me. The immutable state can be shared by heapam.c's
heap_prepare_freeze_tuple(), vacuumlazy.c, and even
vacuum_set_xid_limits() -- so everybody can work off of the same
struct directly. Will try to get that into shape for the next
revision.

What does 'xtrack' stand for? Xid Tracking?

Yes.

* VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
* returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * will execute freezing for caller's page as a whole.  Caller should also
+ * initialize xtrack fields for page as a whole before calling here with first
+ * tuple for the page.  See page_frozenxid_tracker comments.

s/should/need to/?

Changed it to "must".

page_frozenxid_tracker appears to be a dangling pointer.

I think that you mean that the code comments reference an obsolete
type name -- fixed.

+      * VACUUM calls limit_xid "FreezeLimit", and cutoff_xid "OldestXmin".
+      * (limit_multi is "MultiXactCutoff", and cutoff_multi "OldestMxact".)

Hm. Perhaps we should just rename them if it requires this kind of
explanation? They're really not good names.

Agreed -- this can be taken care of as part of using a new VACUUM
operation level struct that is passed as immutable state, which I went
into a moment ago. That centralizes the definitions, which makes it
far easier to understand which cutoff is which. For now I've kept the
names as they were.

Could use TransactionIdOlder().

I suppose, but the way I've done it feels a bit more natural to me,
and appears more often elsewhere. Not sure.

@@ -6563,8 +6564,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
*/
Assert(!freeze_xmax);
Assert(TransactionIdIsValid(newxmax));
-                     if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-                             *relfrozenxid_out = newxmax;
+                     Assert(heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+                                                                                &xtrack->relfrozenxid_nofreeze_out,
+                                                                                &xtrack->relminmxid_nofreeze_out));
+                     if (TransactionIdPrecedes(newxmax, xtrack->relfrozenxid_out))
+                             xtrack->relfrozenxid_out = newxmax;

Perhaps the Assert(heap_tuple_would_freeze()) bit could be handled once at the
end of the routine, for all paths?

The problem with that is that we cannot Assert() when we're removing a
Multi via FRM_INVALIDATE_XMAX processing in certain cases (I tried it
this way myself, and the assertion fails there). This can happen when
the call to FreezeMultiXactId() for the xmax determined that we should
do FRM_INVALIDATE_XMAX processing for the xmax due to the Multi being
"isLockOnly" and preceding "OldestVisibleMXactId[MyBackendId])". Which
is relatively common.

I fixed this by moving the assert further down, while still only
checking the FRM_RETURN_IS_XID and FRM_RETURN_IS_MULTI cases.

Oh - I totally didn't realize that ->freeze is an out parameter. Seems a bit
odd to have the other fields suffixed with _out but not this one?

Fixed this by not having an "_out" suffix for any of these mutable
fields from HeapPageFreeze. Now everything is consistent. (The "_out"
convention is totally redundant, now that we have the HeapPageFreeze
struct, which makes it obvious that it is all mutable state.)

Won't using OldestXmin instead of FreezeLimit potentially cause additional
conflicts? Is there any reason to not compute an accurate value?

This is a concern that I share. I was hoping that I'd be able to get
away with using OldestXmin just for this, because it's simpler that
way. But I had my doubts about it already.

I wonder why it's correct to use FreezeLimit for this on HEAD, though.
What about those FRM_INVALIDATE_XMAX cases that I just mentioned we
couldn't Assert() on? That case effectively removes XIDs that might be
well after FreezeLimit. Granted it might be safe in practice, but it's
far from obvious why it is safe.

Perhaps we can fix this in a not-too-invasive way by reusing
LVPagePruneState.visibility_cutoff_xid for FREEZE_PAGE conflicts (not
just VISIBLE conflicts) in cases where that was possible (while still
using OldestXmin as a fallback in much rarer cases). In practice we're
only triggering freezing eagerly because the page is already expected
to be set all-visible (the whole point is that we'd prefer if it was
set all-frozen instead of all-visible).

(I've not done this in v7, but it's on my TODO list.)

Note that the patch already maintains
LVPagePruneState.visibility_cutoff_xid when there are some LP_DEAD
items on the page, because we temporarily ignore those LP_DEAD items
when considering the eager freezing stuff ...

if (ItemIdIsDead(itemid))
{

deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
- prunestate->has_lpdead_items = true;
continue;
}

What does this have to do with the rest of the commit? And why are we doing
this?

... which is what you're asking about here.

The eager freezing strategy triggers page-level freezing for any page
that is about to become all-visible, so that it can be set all-frozen
instead. But that's not entirely straightforward when there happens to
be some LP_DEAD items on the heap page. There are really two ways that
a page can become all-visible during VACUUM, and we want to account
for that here. With eager freezing we want the pages to become
all-frozen instead of just all-visible, regardless of the heap pass
(first or second) in which the page gets set all-visible (and perhaps
even all-frozen).

The comments that you mention were moved around a bit in passing.

Note that we still set prunestate->all_visible to false inside
lazy_scan_prune when we see remaining LP_DEAD stub items. We just do
it later on, after we've decided on freezing stuff. (Obviously it
wouldn't be okay to return to lazy_scan_heap without unsetting
prunestate->all_visible if there are LP_DEAD items.)

Seems quite confusing to enter a block with described as "We're freezing the
page." when we're not freezing anything (tuples_frozen == 0).

I don't really get what freeze_all_eligible is trying to do.

freeze_all_eligible (and the "tuples_frozen == 0" behavior) are both
there because we can mark a page as all-frozen in the VM without
freezing any of its tuples first. When that happens, we must make sure
that "prunestate->all_frozen" is set to true, so that we'll actually
set the all-frozen bit. At the same time, we need to be careful about
the case where we *could* set the page all-frozen if we decided to
freeze all eligible tuples -- we need to handle the case where we
choose against freezing (and so can't set the all-frozen bit in the
VM, and so must actually set "prunestate->all_frozen" to false).

This is all kinda tricky because we're simultaneously dealing with the
actual state of the page, and the anticipated state of the page in the
near future. Closely related concepts, but distinct in important ways.
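
Maybe a simplified, self-contained model of the control flow helps.  This is
only a sketch of what I've described above -- field and function names are
approximations, and the "opt out" branch in particular is my paraphrase of
the intent rather than the exact patch code:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

typedef struct
{
	bool		freeze;			/* page-level freezing triggered? */
	TransactionId relfrozenxid_out; /* ratchet used when freezing */
	MultiXactId relminmxid_out;
	TransactionId relfrozenxid_nofreeze_out;	/* ratchet used when not */
	MultiXactId relminmxid_nofreeze_out;
} PageFreezeTracker;

typedef struct
{
	TransactionId NewRelfrozenXid;
	MultiXactId NewRelminMxid;
} VacState;

/* Returns whether the page may be marked all-frozen in the visibility map */
static bool
decide_page_freezing(VacState *vacrel, const PageFreezeTracker *xtrack,
					 int *tuples_frozen)
{
	if (xtrack->freeze || *tuples_frozen == 0)
	{
		/* Freeze the page (possibly a no-op, when zero tuples are eligible) */
		vacrel->NewRelfrozenXid = xtrack->relfrozenxid_out;
		vacrel->NewRelminMxid = xtrack->relminmxid_out;
		return true;			/* freeze_all_eligible */
	}

	/* Opt out of freezing: ratchet the "no freeze" cutoffs instead */
	vacrel->NewRelfrozenXid = xtrack->relfrozenxid_nofreeze_out;
	vacrel->NewRelminMxid = xtrack->relminmxid_nofreeze_out;
	*tuples_frozen = 0;			/* throw away the prepared freeze plans */
	return false;				/* all-frozen bit must not be set in the VM */
}

Either way, the page only gets to keep prunestate->all_frozen when the first
branch is taken, which is what freeze_all_eligible is there to communicate.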

#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
-     if (prunestate->all_visible)
+     if (prunestate->all_visible && lpdead_items == 0)
{
TransactionId cutoff;
bool            all_frozen;
@@ -1849,8 +1876,7 @@ retry:
if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
Assert(false);

Not related to this change, but why isn't this just
Assert(heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))?

It's just a matter of personal preference. I prefer to have a clear
block of related code that contains multiple related assertions. You
would probably have declared PG_USED_FOR_ASSERTS_ONLY variables at the
top of lazy_scan_prune instead. FWIW if you did it the other way the
assertion would actually have to include a "!prunestate->all_visible"
test that short circuits the heap_page_is_all_visible() call from the
Assert().

Subject: [PATCH v6 2/6] Teach VACUUM to use visibility map snapshot.

This should include a description of the memory usage effects.

The visibilitymap.c side of this is the least worked out part of the
patch series, by far. I have deliberately put off work on the data
structure itself, preferring to focus on the vacuumlazy.c side of
things for the time being. But I still agree -- fixed by acknowledging
that that particular aspect of resource management is unresolved.

I did have an open TODO before in the commit message, which is now
improved based on your feedback: it now fully owns the fact that we
really ignore the impact on memory usage right now. Just because that
part is very WIP (much more so than every other part).

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip. The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").

Why is it an advantage for the number of pages to not increase?

The commit message goes into that immediately after the last line that
you quoted. :-)

Having an immutable structure will help us, both in the short term,
for this particular project, and in the long term, for other VACUUM
enhancements.

We need to have something that drives the cost model in vacuumlazy.c
for the skipping strategy stuff -- we need to have advanced
information about costs that drive the decision making process. Thanks
to VM snapshots, the cost model is able to reason about the cost of
relfrozenxid advancement precisely, in terms of "extra" scanned_pages
implied by advancing relfrozenxid during this VACUUM. That level of
precision is pretty nice IMV. It's not strictly necessary, but it's
nice to be able to make a precise, accurate comparison between the two
skipping strategies.
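
To make that concrete (numbers invented purely for illustration): with
rel_pages = 1,000,000, of which 900,000 are all-visible and 880,000 are
all-frozen, the "extra" cost of scanning every all-visible page is
900,000 - 880,000 = 20,000 additional scanned pages. That's below the 5%
threshold used by lazy_scan_strategy (50,000 pages here), so VACUUM scans
those pages and becomes able to advance relfrozenxid. Had the gap been
(say) 80,000 pages, it would have skipped the all-visible pages instead,
and left relfrozenxid advancement for some later VACUUM.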

Did you happen to look at the 6th and final patch? It's trivial, but
can have a big impact. It sizes dead_items while capping its size
based on scanned_pages, not based on rel_pages. That's obviously
guaranteed to be correct. Note also that the 2nd patch teaches VACUUM
VERBOSE to report the final number of scanned_pages right at the
start, before scanning anything -- so it's a useful basis for much
better progress reporting in pg_stat_progress_vacuum. Stuff like that
also becomes very easy with VM snapshots.

Then there is the more ambitious stuff, that's not in scope for this
project. Example: Perhaps Sawada-san will be able to take the concept
of visibility map snapshots, and combine it with his Radix tree design
-- which could presumably benefit from advanced knowledge of which
pages can be scanned. This is information that is reliable, by
definition. In fact I think that it would make a lot of sense for this
visibility map snapshot data structure to be exactly the same
structure used to store dead_items. They really are kind of the same
thing. The design can reason precisely about which heap pages can ever
end up having any LP_DEAD items. (It's already trivial to use the VM
snapshot infrastructure as a precheck cache for dead_items lookups.)

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.

Why?

We can advance relfrozenxid because it's cheap to do so, or because
it's urgent (according to autovacuum_freeze_max_age). This is kind of
true on HEAD already due to the autovacuum_freeze_max_age "escalate to
aggressive" thing -- but we can do much better than that. Why not
decide to advance relfrozenxid when (say) it's only *starting* to get
urgent, provided it happens to be relatively cheap (though not dirt cheap)?
We make relfrozenxid advancement a deliberate decision that weighs
*all* available information, and has a sense of the needs of the table
over time.

The user experience is important here. Going back to a model where
there is really just one kind of lazy VACUUM makes a lot of sense. We
should have much more approximate guarantees about relfrozenxid
advancement, since that's what gives us the flexibility to find a
cheaper (or more stable) way of keeping up over time. It matters that
we keep up over time, but it doesn't matter if we fall behind on
relfrozenxid advancement -- at least not if we don't also fall behind
on the work of freezing physical heap pages.

VACUUM will now either scan
every all-visible page, or none at all. This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.

The main goal according to bf136cf6 was to avoid defeating OS readahead, so I
think it should be mentioned here.

Agreed. Fixed.

To me this is something that ought to be changed separately from the rest of
this commit.

Maybe, but I'd say it depends on the final approach taken -- the
visibilitymap.c aspects of the patch are the least settled. I am
seriously considering adding prefetching to the vm snapshot structure,
which would make it very much a direct replacement for
SKIP_PAGES_THRESHOLD.

Separately, I'm curious about what you think of VM snapshots from an
aio point of view. Seems like it would be ideal for prefetching for
aio?

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work). For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().

HEAPBLOCKS_PER_PAGE is 32672 with the defaults. The maximum relation size is
2**32 - 1 blocks. So the max FSM size is 131458 pages, a bit more than 1GB. Is
that correct?

I think that you meant "max VM size". That sounds correct to me.
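
FWIW, here is a quick back-of-the-envelope check (a standalone sketch that
assumes the stock 8kB BLCKSZ and the usual 24 byte heap page header -- not
code from the patch):

#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	const uint64_t blcksz = 8192;			/* stock BLCKSZ */
	const uint64_t mapsize = blcksz - 24;	/* usable bytes per VM page */
	const uint64_t heapblocks_per_page = mapsize * 4;	/* 2 VM bits per heap block */
	const uint64_t max_heap_blocks = UINT32_MAX;	/* ~2^32 - 1 heap blocks */
	const uint64_t vm_pages =
		(max_heap_blocks + heapblocks_per_page - 1) / heapblocks_per_page;

	printf("HEAPBLOCKS_PER_PAGE = %" PRIu64 "\n", heapblocks_per_page);	/* 32672 */
	printf("max VM pages        = %" PRIu64 "\n", vm_pages);	/* 131458 */
	printf("max snapshot bytes  = %" PRIu64 "\n", vm_pages * blcksz);	/* ~1.08e9 */
	return 0;
}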

For large relations that are already nearly all-frozen this does add a
noticable amount of overhead, whether spilled to disk or not. Of course
they're also not going to be vacuumed super often, but ...

I wouldn't be surprised if the patch didn't work with relations that
approach 32 TiB in size. As I said, the visibilitymap.c data structure
is the least worked out piece of the project.

Perhaps worth turning the VM into a range based description for the snapshot,
given it's a readonly datastructure in local memory? And we don't necessarily
need the all-frozen and all-visible in memory, one should suffice? We don't
even need random access, so it could easily be allocated incrementally, rather
than one large allocation.

Definitely think that we should do simple run-length encoding, stuff
like that. Just as long as it allows vacuumlazy.c to work off of a
true snapshot, with scanned_pages known right from the start. The
consumer side of things has been my focus so far.
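
To illustrate the general idea (a hypothetical sketch only -- none of these
names exist in the patch; it just reuses PostgreSQL's BlockNumber/uint8
typedefs):

#include "postgres.h"

typedef struct vmsnaprun
{
	BlockNumber start;			/* first heap block covered by this run */
	BlockNumber nblocks;		/* number of heap blocks in the run */
	uint8		mapbits;		/* VISIBILITYMAP_* bits shared by the whole run */
} vmsnaprun;

typedef struct vmsnapshot_rle
{
	int			nruns;
	vmsnaprun	runs[FLEXIBLE_ARRAY_MEMBER];
} vmsnapshot_rle;

/*
 * Sequential lookup: VACUUM only ever asks about heap blocks in ascending
 * order, so a simple cursor (*runidx) suffices -- no random access needed.
 */
static uint8
vmsnap_rle_status(vmsnapshot_rle *snap, int *runidx, BlockNumber heapBlk)
{
	while (*runidx < snap->nruns)
	{
		vmsnaprun  *run = &snap->runs[*runidx];

		if (heapBlk < run->start)
			return 0;			/* gap between runs: not all-visible */
		if (heapBlk < run->start + run->nblocks)
			return run->mapbits;
		(*runidx)++;			/* step past this run and keep looking */
	}

	return 0;					/* past the last run: not all-visible */
}

Gaps between runs (and anything past the last run) read as "not
all-visible/not all-frozen", so only the all-visible stretches of a mostly
frozen multi-TB table would need to be stored at all.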

Hm. It's a bit sad to compute the snapshot after determining OldestXmin.

We probably should refresh OldestXmin periodically. That won't allow us to get
a more aggressive relfrozenxid, but it'd allow to remove more gunk.

That may well be a good idea, but I think that it's also a good idea
to just not scan heap pages that we know won't have XIDs < OldestXmin
(OldestXmin at the start of the VACUUM). That visibly makes the
problem of "recently dead" tuples that cannot be cleaned up a lot
better, without requiring that we do anything with OldestXmin.

I also think that there is something to be said for not updating the
FSM for pages that were all-visible at the beginning of the VACUUM
operation. VACUUM is currently quite happy to update the FSM with its
own confused idea about how much free space there really is on heap
pages with recently dead (dead but not yet removable) tuples. That's
really bad, but really subtle.

What does it mean to "skip lazily"?

Skipping even all-visible pages, prioritizing avoiding work over
advancing relfrozenxid. This is a cost-based decision. As I mentioned
a moment ago, that's one immediate use of VM snapshots (it gives us
precise information to base our decision on, that simply *cannot*
become invalid later on).

+             /*
+              * Visibility map page copied to local buffer for caller's snapshot.
+              * Caller requires an exact count of all-visible and all-frozen blocks
+              * in the heap relation.  Handle that now.

This part of the comment seems like it actually belongs further down?

No, it just looks a bit like that because of the "truncate in-memory
VM" code stanza. It's actually the right order.

+              * Must "truncate" our local copy of the VM to avoid incorrectly
+              * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+              * this by clearing irrelevant bits on the last VM page copied.
+              */

Hm - why would those bits already be set?

No real reason, we "truncate" like this defensively. This will
probably look quite different before too long.

Subject: [PATCH v6 3/6] Add eager freezing strategy to VACUUM.

What's the logic behind a hard threshold? Suddenly freezing everything on a
huge relation seems problematic. I realize that never getting all that far
behind is part of the theory, but I don't think that's always going to work.

It's a vast improvement on what we do currently, especially in
append-only tables.

There is simply no limit on how many physical heap pages will have to
be frozen when there is an aggressive mode VACUUM. It could be
terabytes, since table age predicts precisely nothing about costs.
With the patch we have a useful limit for the first time, one that
uses physical units (the only kind of units that make any sense).

Admittedly we should really have special instrumentation that reports
when VACUUM must do "catch up freezing" because the
vacuum_freeze_strategy_threshold cutoff has been crossed for the first
time, to help users make better choices in this area. And maybe
vacuum_freeze_strategy_threshold should be lower by default, so it's
not as noticeable. (The GUC partly exists as a compatibility option, a
bridge to the old lazy behavior.)
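
To illustrate how a cutoff like vacuum_freeze_strategy_threshold gets
applied (this isn't the patch's code; the function names are invented,
and I'm assuming purely for illustration that the threshold is
expressed in megabytes with the default 8KB block size), the up-front
strategy choice boils down to a comparison in physical units:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

#define BLCKSZ 8192             /* assume default 8KB heap pages */

/* Convert a threshold expressed in megabytes into heap pages */
static BlockNumber
threshold_mb_to_pages(uint64_t threshold_mb)
{
    return (BlockNumber) ((threshold_mb * 1024 * 1024) / BLCKSZ);
}

/*
 * Once the table itself has grown past the physical threshold, freeze
 * eagerly in every VACUUM, so that no single "catch up" VACUUM ever has
 * to freeze an unbounded number of pages all at once.
 */
static bool
use_eager_freeze_strategy(BlockNumber rel_pages, uint64_t threshold_mb)
{
    return rel_pages >= threshold_mb_to_pages(threshold_mb);
}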

Freezing just became approximately 5x cheaper with the freeze plan
deduplication work (commit 9e540599). To say nothing about how
vacuuming indexes became a lot cheaper in recent releases. So to some
extent we can afford to be more proactive here. There are some very
nonlinear cost profiles involved here due to write amplification
effects. So having a strong bias against write amplification seems
totally reasonable to me -- we can potentially "get it wrong" and
still come out ahead, because we at least had the right idea about
costs.

I don't deny that there are clear downsides, though. I am convinced
that it's worth it -- performance stability is what users actually
complain about in almost all cases. Why should performance stability
be 100% free?

Wouldn't a better strategy be to freeze a percentage of the relation on every
non-aggressive vacuum? That way the amount of work for an eventual aggressive
vacuum will shrink, without causing individual vacuums to take extremely long.

I think that it's better to avoid aggressive mode altogether. By
committing to advancing relfrozenxid by *some* amount in ~all VACUUMs
against larger tables, we can notice when we don't actually need to do
very much freezing to keep relfrozenxid current, due to workload
characteristics. It depends on workload, of course. But if we don't
try to do this we'll never notice that it's possible to do it.

Why should we necessarily need to freeze very much, after a while? Why
shouldn't most newly frozen pages stay frozen ~forever after a little
while?

The other thing that I think would be good to use is a) whether the page is
already in s_b, and b) whether the page is already dirty. The cost of freezing
shrinks significantly if it doesn't cause an additional read + write. And that
additional IO is IMO one of the major concerns with freezing much more
aggressively in OLTPish workloads where a lot of the rows won't ever get old
enough to need freezing.

Maybe, but I think that systematic effects are more important. We
freeze eagerly during this VACUUM in part because it makes
relfrozenxid advancement possible in the next VACUUM.

Note that eager freezing doesn't freeze the page unless it's already
going to set it all-visible. That's another way in which we ameliorate
the problem of freezing when it makes little sense to -- even with the
eager freezing strategy, we *don't* freeze heap pages where it
obviously makes little sense to. Which makes a huge difference on its
own.
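
As a rough sketch of that page-level trigger (the struct and function
here are invented for illustration, not taken from the patch), the
eager strategy only piggybacks on pages that pruning is about to mark
all-visible anyway:

#include <stdbool.h>

/* Outcome of pruning one heap page (illustrative, not the patch's struct) */
typedef struct PagePruneResult
{
    bool        all_visible;    /* page will be marked all-visible in the VM */
    bool        all_frozen;     /* every remaining tuple is already frozen */
} PagePruneResult;

/*
 * Even under the eager strategy, only freeze a page when we are about to
 * set it all-visible anyway (and there is actually something left to
 * freeze), so freezing rides along with work VACUUM was already doing.
 */
static bool
should_freeze_page(const PagePruneResult *prunestate, bool eager_strategy)
{
    return eager_strategy && prunestate->all_visible && !prunestate->all_frozen;
}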

There is good reason to believe that most individual heap pages are
very cold data, even in OLTP apps. To a large degree Postgres is
successful because it is good at inexpensively storing data that will
possibly never be accessed:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/

Speaking of OLTP apps:

in many cases VACUUM will prune just to remove one or two heap-only
tuples, maybe even generating an FPI in the process. But the removed
tuple wasn't actually doing any harm -- an opportunistic prune could
have done the same thing later on, once we'd built up some more
garbage tuples. So the only reason to prune is to freeze the page. And
yet right now we don't recognize this, so we don't freeze the page to
get *some* benefit out of the arguably needless prune. This is quite
common, in fact.

The most significant aspect of anti-wrap autovacuums right now is that they
don't auto-cancel. Is that still used? If so, what's the threshold?

This patch set doesn't change anything about antiwraparound
autovacuums -- though it does completely eliminate aggressive mode (so
it's a little like Postgres 8.4).

There is a separate thread discussing the antiwraparound side of this, actually:

/messages/by-id/CAH2-Wz=S-R_2rO49Hm94Nuvhu9_twRGbTm6uwDRmRu-Sqn_t3w@mail.gmail.com

I think that I will need to invent a new type of autovacuum that's
similar to antiwraparound autovacuum, but without its special
no-auto-cancellation behavior -- that is more or less a prerequisite
to committing this patch series. We can accept some risk of relfrozenxid
falling behind if that doesn't create any real risk of antiwraparound
autovacuums.

We can retain antiwraparound autovacuum, which should kick in only
when the new kind of autovacuum has failed to advance relfrozenxid,
having had the opportunity. Maybe antiwraparound autovacuum should be
triggered when age(relfrozenxid) is twice the value of
autovacuum_freeze_max_age. The new kind of autovacuum would trigger at
the same table age that triggers antiwraparound autovacuum with the
current design.

So antiwraparound autovacuum would work in the same way, but would be
much less common -- even for totally static tables. We'd at least be
sure that the auto cancellation behavior was *proportionate* to the
problem at hand, because we'll always have tried and ultimately failed
to advance relfrozenxid without activating the auto cancellation
behavior. We wouldn't trigger a very disruptive behavior routinely,
without any very good reason.
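
Putting the proposed triggering rule into a small sketch (the enum and
function are invented for illustration; only autovacuum_freeze_max_age
is a real GUC):

#include <stdint.h>

typedef enum AutoVacuumTrigger
{
    AV_NONE,
    AV_TABLE_AGE,               /* new kind: ordinary auto-cancellation applies */
    AV_ANTIWRAPAROUND           /* old kind: auto-cancellation is disabled */
} AutoVacuumTrigger;

/*
 * The new table-age autovacuum fires at autovacuum_freeze_max_age, where
 * antiwraparound autovacuum fires today.  The non-cancellable
 * antiwraparound autovacuum only fires if relfrozenxid still hasn't been
 * advanced by the time table age reaches twice that setting.
 */
static AutoVacuumTrigger
choose_autovacuum_trigger(uint64_t relfrozenxid_age,
                          uint64_t autovacuum_freeze_max_age)
{
    if (relfrozenxid_age >= 2 * autovacuum_freeze_max_age)
        return AV_ANTIWRAPAROUND;
    if (relfrozenxid_age >= autovacuum_freeze_max_age)
        return AV_TABLE_AGE;
    return AV_NONE;
}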

Now every VACUUM might need to wait for a cleanup lock, though few will.
It can only happen when required to advance relfrozenxid to no less than
half way between the existing relfrozenxid and nextXID.

Where's that "halfway" bit coming from?

We don't use FreezeLimit within lazy_scan_noprune in the patch that
gets rid of aggressive mode VACUUM. We use something called minXid in
its place. So there's a different timeline for freezing (even for
tables where we always use the lazy freezing strategy).

The new minXid cutoff (used by lazy_scan_noprune) comes from this
point in vacuum_set_xid_limits():

+ *minXid = nextXID - (freeze_table_age / 2);
+ if (!TransactionIdIsNormal(*minXid))
+     *minXid = FirstNormalTransactionId;

So that's what I meant by "half way".

(Note that minXid is guaranteed to be <= FreezeLimit, which is itself
guaranteed to be <= OldestXmin, no matter what.)
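
To show how minXid changes the cleanup lock decision, here's a tiny
sketch (again, not the patch's code; xid_precedes stands in for the
real TransactionIdPrecedes, which handles XID wraparound properly):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Stand-in for TransactionIdPrecedes; ignores XID wraparound for brevity */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return a < b;
}

/*
 * When no cleanup lock can be had, the page can be processed without
 * pruning as long as nothing on it precedes minXid -- relfrozenxid just
 * ends up somewhat older.  Only genuinely old XIDs force VACUUM to wait.
 */
static bool
must_wait_for_cleanup_lock(TransactionId oldest_xid_on_page,
                           TransactionId minXid)
{
    return xid_precedes(oldest_xid_on_page, minXid);
}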

Isn't "half way between the relfrozenxid and nextXID" a problem for instances
with longrunning transactions?

Should we do less relfrozenxid advancement because there is a long
running transaction, though? It's obviously seriously bad when things
are blocked by a long running transaction, but I don't see the
connection between that and how we wait for cleanup locks. Waiting for
cleanup locks is always really, really bad, and can be avoided in
almost all cases.

I suspect that I still haven't been aggressive enough in how minXid is
set, BTW -- we should be avoiding waiting for a cleanup lock like the
plague. So "half way" isn't enough. Maybe we should have a LOG message
in cases where it actually proves necessary to wait, because it's just
asking for trouble (at least when we're running in autovacuum).

Wouldn't this mean that we wait for every page if relfrozenxid can't be
advanced much because of a long-running query or such?

Old XIDs always start out as young XIDs. Which we're now quite willing
to freeze when conditions look good.

Page-level freezing always freezes all eligible XIDs on the page when
triggered, no matter what the details may be. This means that the
oldest XID on a heap page is more or less always an XID that's after
whatever OldestXmin was for the last VACUUM that ran and froze the
page, whenever that happened, and regardless of what mix of XID ages
was on the page at that time.

As a consequence, lone XIDs that are far older than other XIDs on the
same page become much rarer than what you'd see with the current
design -- they have to "survive" multiple VACUUMs, not just one
VACUUM. The best predictor of XID age becomes the time that VACUUM
last froze the page as a whole -- so workload characteristics and
natural variations are much, much less likely to lead to problems from
waiting for cleanup locks. (Of course it also helps that we'll try
really hard to do that, and almost always prefer lazy_scan_noprune
processing.)

There is some sense in which we're trying to create a virtuous cycle
here. If we are always in a position to advance relfrozenxid by *some*
amount each VACUUM, however small, then we will have many individual
opportunities (spaced out over multiple VACUUM operations) to freeze
tuples on any heap pages that (for whatever reason) are harder to get
a cleanup lock on, and then catch up on relfrozenxid by a huge amount
whenever we "get lucky". We have to "keep an open mind" to ever have
any chance of "getting lucky" in this sense, though.

VACUUM is the only
mechanism that can claw back MultixactId space, so allowing VACUUM to
consume MultiXactId space (for any reason) adds to the risk that the
system will trigger the multiStopLimit wraparound protection mechanism.

Strictly speaking that's not quite true, you can also drop/truncate tables ;)

Fixed.

--
Peter Geoghegan

Attachments:

v7-0005-Avoid-allocating-MultiXacts-during-VACUUM.patch (application/octet-stream)
From 0d716d557045ab642b400f41d32d0f3a5feeb349 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 31 Jul 2022 13:53:19 -0700
Subject: [PATCH v7 5/6] Avoid allocating MultiXacts during VACUUM.

Pass down vacuumlazy.c's OldestXmin cutoff to FreezeMultiXactId(), and
teach it the difference between OldestXmin and FreezeLimit.  Use this
high-level context to intelligently avoid allocating new MultiXactIds
during VACUUM operations.  We should always prefer to avoid allocating
new MultiXacts during VACUUM on general principle.  VACUUM is the only
mechanism that can claw back MultixactId space (barring extreme measures
like DROP TABLE), so allowing VACUUM to consume MultiXactId space adds
to the risk that the system will trigger the multiStopLimit wraparound
protection mechanism.

FreezeMultiXactId() is now eager when it's cheap to process xmax, and
lazy when it's expensive/risky to process xmax (because an expensive
second pass that might result in allocating a new Multi is required).
We make a similar trade-off to the one made by lazy_scan_noprune() when
a cleanup lock isn't available on some heap page.  We can usually put
off freezing (for the time being) when it's inconvenient to proceed.  We
need only accept an older final relfrozenxid/relminmxid value to make
that safe, which is typically a good trade-off.

Note that MultiXactIds are processed eagerly in all cases by triggering
page-level freezing whenever FreezeMultiXactId() processes a Multi
(though not in the no-op processing case).  We don't do the same thing
with an XID based xmax.  This is closer to the historic behavior.
---
 src/backend/access/heap/heapam.c | 53 +++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0d1c2affc..069f107cb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6122,11 +6122,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
  *
  * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ *
+ * Allocating new MultiXactIds during VACUUM is something that we always
+ * prefer to avoid.  Caller passes us down all the context required to do this
+ * on their behalf: cutoff_xid, cutoff_multi, limit_xid, and limit_multi.
+ *
+ * We must never leave behind XIDs/MXIDs from before the "limits" given.
+ * There is lots of leeway around XIDs/MXIDs >= caller's "limits", though.
+ * When it's possible to inexpensively process xmax right away, we're eager
+ * about it.  Otherwise we're lazy about it -- next time it might be cheap.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
 				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
+				  TransactionId limit_xid, MultiXactId limit_multi,
 				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
 {
 	TransactionId xid = InvalidTransactionId;
@@ -6219,13 +6229,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below the limit for Xids, so we
 	 * need to walk the whole members array to figure out what to do, if
-	 * anything.
+	 * anything
 	 */
-
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6236,12 +6243,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP optimization */
 	for (i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		if (TransactionIdPrecedes(members[i].xid, limit_xid))
 		{
 			need_replace = true;
 			break;
@@ -6250,11 +6256,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			temp_xid_out = members[i].xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* The Multi itself must be >= limit_multi, too */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, limit_multi);
+
 	if (!need_replace)
 	{
 		/*
@@ -6271,6 +6276,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_would_freeze will indicate that the tuple must be frozen.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6370,7 +6378,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				/*
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
+				 * cutoff_xid is VACUUM's OldestXmin, which is also the
 				 * initial value used for top-level NewRelfrozenXid tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
@@ -6526,7 +6534,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given limit for Xids.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6544,6 +6552,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
 									cutoff_xid, cutoff_multi,
+									limit_xid, limit_multi,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6644,7 +6653,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * with NoFreezeNewRelfrozenXid.)
 		 */
 		if ((flags & FRM_NOOP) == 0)
+		{
 			xtrack->FreezeRequired = true;
+
+			/*
+			 * Verify that FreezeMultiXactId() only indicates that we must set
+			 * xmax to a new Multi (or to a preexisting Xid from an updater)
+			 * when it had no choice (not without violating the rule requiring
+			 * lazy_scan_prune respects FreezeLimit/MultiXactCutoff cutoffs).
+			 */
+			if ((flags & (FRM_RETURN_IS_XID | FRM_RETURN_IS_MULTI)) != 0)
+				Assert(heap_tuple_would_freeze(tuple,
+											   limit_xid, limit_multi,
+											   &xtrack->NoFreezeNewRelfrozenXid,
+											   &xtrack->NoFreezeNewRelminMxid));
+		}
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
-- 
2.34.1

v7-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch (application/octet-stream)
From 154033703b882846226055e475b5589dfd63e156 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v7 2/6] Teach VACUUM to use visibility map snapshot.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now our policy around
skipping all-visible pages is exactly the same condition as whether or
not it's safe to advance relfrozenxid later on; nothing is left to
chance.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
Note in particular that this completely ignores the impact of allocating
large buffers when vacuuming large tables.

XXX: Commit bf136cf6 was also concerned about triggering readahead as a
primitive form of prefetching.  Do we also need to add explicit I/O
prefetching hints to make up for what may have been lost?
---
 src/include/access/visibilitymap.h      |   7 +
 src/backend/access/heap/vacuumlazy.c    | 342 +++++++++++++-----------
 src/backend/access/heap/visibilitymap.c | 164 ++++++++++++
 3 files changed, 359 insertions(+), 154 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9c84f8397..328562b61 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -146,8 +146,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -177,7 +179,8 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map used by lazy_scan_skip */
+	vmsnapshot *vmsnap;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -250,10 +253,11 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
+								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -316,7 +320,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
@@ -324,6 +327,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -369,7 +375,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
 
-	skipwithvm = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
 		/*
@@ -377,7 +382,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * visibility map (even those set all-frozen)
 		 */
 		aggressive = true;
-		skipwithvm = false;
 	}
 
 	/*
@@ -402,20 +406,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -442,7 +432,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	/* Initialize skipallvis/skipallfrozen before lazy_scan_strategy call */
+	vacrel->skipallvis = !aggressive;
+	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -503,12 +495,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
@@ -521,7 +507,36 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
+	 * headers to be able to safely advance relfrozenxid later on.  There is
+	 * no good reason to scan any additional pages. (Actually we might opt to
+	 * skip all-visible pages.  Either way we won't scan pages for no reason.)
+	 *
+	 * Now that OldestXmin and rel_pages are acquired, acquire an immutable
+	 * snapshot of the visibility map as well.  lazy_scan_skip works off of
+	 * the vmsnap, not the authoritative VM, which can continue to change.
+	 * Pages that lazy_scan_heap will scan are fixed and known in advance.
+	 *
+	 * The exact number of pages that lazy_scan_heap will scan also depends on
+	 * our choice of skipping strategy.  VACUUM can either choose to skip any
+	 * all-visible pages lazily, or choose to scan those same pages instead.
+	 * Decide on a skipping strategy to determine final scanned_pages.
+	 */
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   all_visible, all_frozen);
+	if (verbose)
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -538,6 +553,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -584,12 +600,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
 									   vacrel->relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -634,6 +649,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -661,10 +679,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -857,13 +871,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -877,42 +890,24 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = lazy_scan_skip(vacrel, 0, &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		if (blkno < next_block_to_scan)
+			continue;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = lazy_scan_skip(vacrel, blkno + 1,
+											&next_all_visible);
 
 		vacrel->scanned_pages++;
 
@@ -1122,10 +1117,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1153,12 +1147,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1197,7 +1189,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1290,47 +1282,121 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
  *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * Determines if the ongoing VACUUM operation should skip all-visible pages
+ * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
 
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
+	if (!vacrel->skipallfrozen)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		return rel_pages;
+	}
+	else if (vacrel->aggressive)
+		Assert(!vacrel->skipallvis);
+	else
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		vacrel->skipallvis = nextra >= nextra_threshold;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+		return scanned_pages_skipallvis;
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* keep compiler quiet */
+}
+
+/*
+ *	lazy_scan_skip() -- get the next block to scan according to vmsnap.
+ *
+ * lazy_scan_heap() caller passes the next block in line.  We return the next
+ * block to scan.  Caller skips the blocks preceding returned block, if any.
+ *
+ * The all-visible status of the returned block is set in *all_visible, too.
+ * Block usually won't be all-visible (since it's unskippable), but it can be
+ * when next_block is rel's last page and when DISABLE_PAGE_SKIPPING in use.
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block, bool *all_visible)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				next_block_to_scan = next_block;
+
+	*all_visible = true;
+	while (next_block_to_scan < rel_pages)
+	{
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_block_to_scan);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
 			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
+			*all_visible = false;
 			break;
 		}
 
@@ -1341,58 +1407,26 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
-		if (next_unskippable_block == rel_pages - 1)
+		if (next_block_to_scan == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
+		next_block_to_scan++;
 	}
 
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
-	}
-
-	return next_unskippable_block;
+	return next_block_to_scan;
 }
 
 /*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..cfe3cf9b6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	PGAlignedBlock vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -373,6 +395,148 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snapshot_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of concurrently unset heap pages.  VACUUM prefers to leave
+ * them to be scanned during the next VACUUM operation.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) +
+					sizeof(PGAlignedBlock) * nvmpages);
+
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages->data + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
-- 
2.34.1

v7-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patch (application/octet-stream)
From 22727994dd5d068055038bbd3b3770add8f2e0fa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 23 Jul 2022 17:19:01 -0700
Subject: [PATCH v7 6/6] Size VACUUM's dead_items space using VM snapshot.

VACUUM knows precisely how many pages it will scan ahead of time from
its snapshot of the visibility map following recent work.  Apply that
information to size the dead_items space for TIDs more precisely (use
scanned_pages instead of rel_pages to cap the allocation).

This can make the memory allocation significantly smaller, without any
added risk of undersizing the array.  Since VACUUM's final scanned_pages
is fully predetermined (by the visibility map snapshot), there is no
question of interference from another backend that concurrently unsets
some heap page's visibility map bit.  Many details of how VACUUM will
process the target relation are "locked in" from the very beginning.
---
 src/backend/access/heap/vacuumlazy.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c5a653db0..6493f4bef 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -293,7 +293,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -566,7 +567,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
@@ -3184,14 +3185,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3200,15 +3200,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3230,12 +3228,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
-- 
2.34.1

v7-0004-Make-VACUUM-s-aggressive-behaviors-continuous.patch (application/octet-stream)
From 6d0565fb2294e3cbb5c843fb384423f7831dd13d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v7 4/6] Make VACUUM's aggressive behaviors continuous.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Before then, every lazy VACUUM
was "equally aggressive": each operation froze whatever tuples before
the age-wise cutoff needed to be frozen.  And each table's relfrozenxid
was updated at the end.  In short, the previous behavior was much less
efficient, but did at least have one thing going for it: it was much
easier to understand at a high level.

VACUUM no longer applies a separate mode of operation (aggressive mode).
There are still antiwraparound autovacuums, but they're now little more
than another way that autovacuum.c can launch an autovacuum worker to
run VACUUM.  The same set of behaviors previously associated with
aggressive mode are retained, but now get applied selectively, on a
timeline attuned to the needs of the table.

The closer that a table's age gets to the autovacuum_freeze_max_age
cutoff, the less VACUUM will care about avoiding the cost of scanning
extra pages to advance relfrozenxid "early".  This new approach cares
about both costs (extra pages scanned) and benefits (the need for
relfrozenxid advancement), unlike the previous approach driven by
vacuum_freeze_table_age, which "escalated to aggressive mode" purely
based on a simple XID age cutoff.  The vacuum_freeze_table_age GUC is
now relegated to a compatibility option.  Its default value is now -1,
which is interpreted as "current value of autovacuum_freeze_max_age".

VACUUM will still advance relfrozenxid at roughly the same XID-age-wise
cadence as before with static tables, but can also advance relfrozenxid
much more frequently in tables where that happens to make sense.  In
practice many tables will tend to have relfrozenxid advanced by some
amount during every VACUUM, especially larger tables and very small
tables.

The emphasis is now on keeping each table's age reasonably recent over
time, across multiple successive VACUUM operations, while spreading out
the burden of freezing, avoiding big spikes.  Freezing is now primarily
treated as an overhead of long term storage of tuples in physical heap
pages.  There is less emphasis on the role freezing plays in preventing
the system from reaching the point of an xidStopLimit outage.

Now every VACUUM might need to wait for a cleanup lock, though few will.
It can only happen when required to advance relfrozenxid to no less than
halfway between the existing relfrozenxid and nextXID.  In general
there is no telling how long VACUUM might spend waiting for a cleanup
lock, so it's usually more useful to focus on keeping up with freezing
at the level of the whole table.  VACUUM can afford to set relfrozenxid
to a significantly older value in the short term, since there are now
more opportunities to advance relfrozenxid in the long term.
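
As a rough summary of the new decision rule (an editorial sketch, not code
from the patch; the names and thresholds come from the v7-0004 hunks below,
while the example numbers are made up):

    #include <stdio.h>

    /* 1.0 means the table has reached the antiwraparound trigger point */
    static double
    table_age_fraction(double relfrozenxid_age, double freeze_table_age,
                       double relminmxid_age, double multixact_freeze_table_age)
    {
        double xidfrac = relfrozenxid_age / freeze_table_age;
        double mxidfrac = relminmxid_age / multixact_freeze_table_age;

        return (xidfrac > mxidfrac) ? xidfrac : mxidfrac;
    }

    int
    main(void)
    {
        /* e.g. default autovacuum_freeze_max_age, table halfway to it */
        double frac = table_age_fraction(100000000, 200000000,
                                         1000000, 400000000);

        if (frac < 0.5)
            printf("lazy: skip all-visible pages unless the extra pages "
                   "to scan are under 5%% of rel_pages\n");
        else if (frac < 0.9)
            printf("midpoint: skip all-visible pages only when the extra "
                   "pages to scan are 15%% of rel_pages or more\n");
        else
            printf("forced: scan all-visible pages and use the eager "
                   "freezing strategy\n");
        return 0;
    }
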
---
 src/include/commands/vacuum.h                 |   7 +-
 src/backend/access/heap/vacuumlazy.c          | 223 +++---
 src/backend/access/transam/multixact.c        |   5 +-
 src/backend/commands/cluster.c                |  10 +-
 src/backend/commands/vacuum.c                 | 113 +--
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   4 +-
 doc/src/sgml/config.sgml                      | 103 +--
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  27 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  30 +-
 src/test/regress/expected/reloptions.out      |   6 +-
 src/test/regress/sql/reloptions.sql           |   6 +-
 19 files changed, 638 insertions(+), 663 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 52379f819..a70df0218 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -290,7 +290,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_set_xid_limits(Relation rel,
+extern void vacuum_set_xid_limits(Relation rel,
 								  int freeze_min_age,
 								  int multixact_freeze_min_age,
 								  int freeze_table_age,
@@ -298,7 +298,10 @@ extern bool vacuum_set_xid_limits(Relation rel,
 								  TransactionId *oldestXmin,
 								  MultiXactId *oldestMxact,
 								  TransactionId *freezeLimit,
-								  MultiXactId *multiXactCutoff);
+								  MultiXactId *multiXactCutoff,
+								  TransactionId *minXid,
+								  MultiXactId *minMulti,
+								  double *antiwrapfrac);
 extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
 									  MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 7a719ee7b..c5a653db0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,11 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages when using the lazy freezing strategy
+ * Thresholds that control whether VACUUM will skip any all-visible pages when
+ * using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
+#define SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES	0.15
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -144,9 +145,7 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
-	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	/* Skip (don't scan) all-visible pages? */
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
@@ -178,6 +177,9 @@ typedef struct LVRelState
 	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+	/* Earliest permissible NewRelfrozenXid/NewRelminMxid values */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
@@ -258,7 +260,8 @@ static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
 									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
-									  BlockNumber all_frozen);
+									  BlockNumber all_frozen,
+									  double antiwrapfrac);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
 								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
@@ -322,13 +325,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
 				frozenxid_updated,
 				minmulti_updated;
 	TransactionId OldestXmin,
-				FreezeLimit;
+				FreezeLimit,
+				MinXid;
 	MultiXactId OldestMxact,
-				MultiXactCutoff;
+				MultiXactCutoff,
+				MinMulti;
+	double		antiwrapfrac;
 	BlockNumber orig_rel_pages,
 				eager_threshold,
 				all_visible,
@@ -367,33 +372,33 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
 	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
-	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 * used to determine which XIDs/MultiXactIds will be frozen.
 	 *
 	 * Also determine our cutoff for applying the eager/all-visible freezing
-	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
-	 * even during non-aggressive VACUUMs.
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy.
 	 */
-	aggressive = vacuum_set_xid_limits(rel,
-									   params->freeze_min_age,
-									   params->multixact_freeze_min_age,
-									   params->freeze_table_age,
-									   params->multixact_freeze_table_age,
-									   &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
+	vacuum_set_xid_limits(rel,
+						  params->freeze_min_age,
+						  params->multixact_freeze_min_age,
+						  params->freeze_table_age,
+						  params->multixact_freeze_table_age,
+						  &OldestXmin, &OldestMxact,
+						  &FreezeLimit, &MultiXactCutoff,
+						  &MinXid, &MinMulti, &antiwrapfrac);
 	eager_threshold = params->freeze_strategy_threshold < 0 ?
 		vacuum_freeze_strategy_threshold :
 		params->freeze_strategy_threshold;
 
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		aggressive = true;
-	}
+	/*
+	 * Make sure that antiwraparound autovacuums always have the opportunity
+	 * to advance relfrozenxid to a value >= MinXid.
+	 *
+	 * This is needed so that antiwraparound autovacuums reliably advance
+	 * relfrozenxid to the satisfaction of autovacuum.c, even when the
+	 * autovacuum_freeze_max_age reloption (not GUC) triggered the autovacuum.
+	 */
+	if (params->is_wraparound)
+		antiwrapfrac = 1.0;
 
 	/*
 	 * Setup error traceback support for ereport() first.  The idea is to set
@@ -442,10 +447,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
 	/* Initialize skipallvis/skipallfrozen before lazy_scan_strategy call */
-	vacrel->skipallvis = !aggressive;
-	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
+	vacrel->skipallvis = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
+	vacrel->skipallfrozen = vacrel->skipallvis;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -515,6 +519,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->FreezeLimit = FreezeLimit;
 	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
 	vacrel->MultiXactCutoff = MultiXactCutoff;
+	/* MinXid limits final relfrozenxid's age (always <= FreezeLimit) */
+	vacrel->MinXid = MinXid;
+	/* MinMulti limits final relminmxid's age (always <= MultiXactCutoff) */
+	vacrel->MinMulti = MinMulti;
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = OldestXmin;
 	vacrel->NewRelminMxid = OldestMxact;
@@ -538,7 +546,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
 	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
-									   all_visible, all_frozen);
+									   all_visible, all_frozen,
+									   antiwrapfrac);
 	if (verbose)
 		ereport(INFO,
 				(errmsg("vacuuming \"%s.%s.%s\"",
@@ -599,25 +608,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
-										 vacrel->relfrozenxid,
-										 vacrel->NewRelfrozenXid));
+		   TransactionIdPrecedesOrEquals(MinXid, vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
-									   vacrel->relminmxid,
-									   vacrel->NewRelminMxid));
+		   MultiXactIdPrecedesOrEquals(MinMulti, vacrel->NewRelminMxid));
 	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
-		 * lazy_scan_strategy call determined it would skip all-visible pages
+		 * Must keep original relfrozenxid when lazy_scan_strategy call
+		 * decided to skip all-visible pages
 		 */
-		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -693,23 +696,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
-			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
 			else
 			{
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+				Assert(IsAutoVacuumWorkerProcess());
+				if (params->is_wraparound)
+					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -1041,7 +1032,6 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
 			 */
-			Assert(vacrel->aggressive);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
 		}
@@ -1300,21 +1290,19 @@ lazy_scan_heap(LVRelState *vacrel)
  * On the other hand we eagerly freeze pages when that strategy spreads out
  * the burden of freezing over time.  Performance stability is important; no
  * one VACUUM operation should need to freeze disproportionately many pages.
- * Antiwraparound VACUUMs of append-only tables should generally be avoided.
  *
  * Also determines if the ongoing VACUUM operation should skip all-visible
- * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
- * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
- * important that relfrozenxid advance in affected tables, which are larger.
- * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
- * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
- * depending on the extra cost - we might need to scan only a few extra pages.
+ * pages when advancing relfrozenxid is still optional (before target rel has
+ * attained an age that forces an antiwraparound autovacuum).  Decision is
+ * based in part on caller's antiwrapfrac argument, which represents how close
+ * the table age is to forcing antiwraparound autovacuum.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
 lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
-				   BlockNumber all_visible, BlockNumber all_frozen)
+				   BlockNumber all_visible, BlockNumber all_frozen,
+				   double antiwrapfrac)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1357,21 +1345,15 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		Assert(vacrel->aggressive && !vacrel->skipallvis);
-		vacrel->allvis_freeze_strategy = true;
-		return rel_pages;
-	}
-	else if (vacrel->aggressive)
-	{
-		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
 		vacrel->allvis_freeze_strategy = true;
+		return rel_pages;
 	}
 	else if (rel_pages >= eager_threshold)
 	{
 		/*
-		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
-		 * GUC-based threshold for eager freezing.
+		 * VACUUM of table whose rel_pages now exceeds GUC-based threshold for
+		 * eager freezing.
 		 *
 		 * We always scan all-visible pages when the threshold is crossed, so
 		 * that relfrozenxid can be advanced.  There will typically be few or
@@ -1386,9 +1368,6 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		BlockNumber nextra,
 					nextra_threshold;
 
-		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
-		vacrel->allvis_freeze_strategy = false;
-
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1402,13 +1381,44 @@ lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
 		 * that way, so be lazy (just skip) unless the added cost is very low.
 		 * We opt for a skipallfrozen-only VACUUM when the number of extra
 		 * pages (extra scanned pages that are all-visible but not all-frozen)
-		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small)
+		 * if relfrozenxid has yet to attain an age that uses 50% of the XID
+		 * space available before the GUC cutoff for antiwraparound
+		 * autovacuum.  A more aggressive threshold of 15% is used when
+		 * relfrozenxid is older than that.
 		 */
 		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
-		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+
+		if (antiwrapfrac < 0.5)
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_THRESHOLD_PAGES;
+		else
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES;
+
 		nextra_threshold = Max(32, nextra_threshold);
 
-		vacrel->skipallvis = nextra >= nextra_threshold;
+		/*
+		 * We must advance relfrozenxid when it already attained an age that
+		 * consumes >= 90% of the available XID space (or MXID space) before
+		 * the crossover point for antiwraparound autovacuum.
+		 *
+		 * Also use eager freezing strategy when we're past the "90% towards
+		 * wraparound" point, even though the table size is below the usual
+		 * eager_threshold table size cutoff.  The added cost is usually not
+		 * too great.  We may be able to fall into a pattern of continually
+		 * advancing relfrozenxid this way.
+		 */
+		if (antiwrapfrac < 0.9)
+		{
+			vacrel->skipallvis = nextra >= nextra_threshold;
+			vacrel->allvis_freeze_strategy = false;
+		}
+		else
+		{
+			vacrel->skipallvis = false;
+			vacrel->allvis_freeze_strategy = true;
+		}
 	}
 
 	/* Return the appropriate variant of scanned_pages */
@@ -2023,11 +2033,9 @@ retry:
  * operation left LP_DEAD items behind.  We'll at least collect any such items
  * in the dead_items array for removal from indexes.
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We may return false to indicate that a full cleanup lock is required for
+ * processing by lazy_scan_prune.  This is only necessary when VACUUM needs to
+ * freeze some tuple XIDs from one or more tuples on the page.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
@@ -2095,36 +2103,23 @@ lazy_scan_noprune(LVRelState *vacrel,
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
 		if (heap_tuple_would_freeze(tupleheader,
-									vacrel->FreezeLimit,
-									vacrel->MultiXactCutoff,
+									vacrel->MinXid, vacrel->MinMulti,
 									&NewRelfrozenXid, &NewRelminMxid))
 		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
 			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
+			 * Tuple with XID < MinXid (or MXID < MinMulti)
 			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
+			 * VACUUM must always be able to advance rel's relfrozenxid and
+			 * relminmxid to minimum values.  The ongoing VACUUM won't be able
+			 * to do that unless it can freeze an XID (or MXID) from this
+			 * tuple now.
+			 *
+			 * The only safe option is to have caller perform processing of
+			 * this page using lazy_scan_prune.  Caller might have to wait a
+			 * while for a cleanup lock, but it can't be helped.
 			 */
+			vacrel->offnum = InvalidOffsetNumber;
+			return false;
 		}
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 204aa9504..ba575c5fd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2816,10 +2816,7 @@ ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
  * freeze table and the minimum freeze age based on the effective
  * autovacuum_multixact_freeze_max_age this function returns.  In the worst
  * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will try to freeze every multixact.
- *
- * It's possible that these thresholds should be user-tunable, but for now
- * we keep it simple.
+ * table will freeze every multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 3b78a2f10..d2950fd6e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -824,9 +824,12 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TransactionId OldestXmin,
-				FreezeXid;
+				FreezeXid,
+				MinXid;
 	MultiXactId OldestMxact,
-				MultiXactCutoff;
+				MultiXactCutoff,
+				MinMulti;
+	double		antiwrapfrac;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -915,7 +918,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+						  &FreezeXid, &MultiXactCutoff, &MinXid, &MinMulti,
+						  &antiwrapfrac);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index df2bd53b9..5bdab6eb0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -943,21 +943,25 @@ get_all_vacuum_rels(int options)
  * - oldestMxact is the Mxid below which MultiXacts are definitely not
  *   seen as visible by any running transaction.
  * - freezeLimit is the Xid below which all Xids are definitely replaced by
- *   FrozenTransactionId during aggressive vacuums.
+ *   FrozenTransactionId in heap pages that caller can cleanup lock.
  * - multiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
+ *   removed from Xmax in heap pages that caller can cleanup lock.
+ * - minXid is the earliest valid relfrozenxid value to set in pg_class.
+ * - minMulti is the earliest valid relminmxid value to set in pg_class.
+ * - antiwrapfrac is how close the table's age is to the point that autovacuum
+ *   will launch an antiwraparound autovacuum worker.
  *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
- * minimum).
+ * The antiwrapfrac value 1.0 represents the point that autovacuum.c
+ * scheduling considers advancing relfrozenxid strictly necessary.  Values
+ * between 0.0 and 1.0 represent how close the table is to the point of
+ * mandatory relfrozenxid/relminmxid advancement (up to minXid/minMulti).
  *
  * oldestXmin and oldestMxact are the most recent values that can ever be
  * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
  * vacuumlazy.c caller later on.  These values should be passed when it turns
  * out that VACUUM will leave no unfrozen XIDs/MXIDs behind in the table.
  */
-bool
+void
 vacuum_set_xid_limits(Relation rel,
 					  int freeze_min_age,
 					  int multixact_freeze_min_age,
@@ -966,15 +970,20 @@ vacuum_set_xid_limits(Relation rel,
 					  TransactionId *oldestXmin,
 					  MultiXactId *oldestMxact,
 					  TransactionId *freezeLimit,
-					  MultiXactId *multiXactCutoff)
+					  MultiXactId *multiXactCutoff,
+					  TransactionId *minXid,
+					  MultiXactId *minMulti,
+					  double *antiwrapfrac)
 {
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
-	int			effective_multixact_freeze_max_age;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
+	int			effective_multixact_freeze_max_age,
+				relfrozenxid_age,
+				relminmxid_age;
 
 	/*
 	 * Acquire oldestXmin.
@@ -1065,8 +1074,8 @@ vacuum_set_xid_limits(Relation rel,
 		*multiXactCutoff = *oldestMxact;
 
 	/*
-	 * Done setting output parameters; check if oldestXmin or oldestMxact are
-	 * held back to an unsafe degree in passing
+	 * Check if oldestXmin or oldestMxact are held back to an unsafe degree in
+	 * passing
 	 */
 	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
 	if (!TransactionIdIsNormal(safeOldestXmin))
@@ -1086,48 +1095,64 @@ vacuum_set_xid_limits(Relation rel,
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
+	 * Work out how close we are to needing an antiwraparound VACUUM.
 	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.   The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/* Final antiwrapfrac can come from either XID or MXID table age */
+	relfrozenxid_age = Max(nextXID - rel->rd_rel->relfrozenxid, 1);
+	relminmxid_age = Max(nextMXID - rel->rd_rel->relminmxid, 1);
+	freeze_table_age = Max(freeze_table_age, 1);
+	multixact_freeze_table_age = Max(multixact_freeze_table_age, 1);
+	XIDFrac = (double) relfrozenxid_age / (double) freeze_table_age;
+	MXIDFrac = (double) relminmxid_age / (double) multixact_freeze_table_age;
+	*antiwrapfrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Pages that caller can cleanup lock immediately will never be left with
+	 * XIDs < freezeLimit (nor with MXIDs < multiXactCutoff).  Determine
+	 * values for a distinct set of cutoffs applied to pages that cannot be
+	 * immediately cleanup locked. The cutoffs govern caller's wait behavior.
+	 *
+	 * It is safer to accept earlier final relfrozenxid and relminmxid values
+	 * than it would be to wait indefinitely for a cleanup lock.  Waiting for
+	 * a cleanup lock to freeze one heap page risks not freezing every other
+	 * eligible heap page.  Keeping up the momentum is what matters most.
+	 */
+	*minXid = nextXID - (freeze_table_age / 2);
+	if (!TransactionIdIsNormal(*minXid))
+		*minXid = FirstNormalTransactionId;
+	/* minXid must always be <= freezeLimit */
+	if (TransactionIdPrecedes(*freezeLimit, *minXid))
+		*minXid = *freezeLimit;
+
+	*minMulti = nextMXID - (multixact_freeze_table_age / 2);
+	if (*minMulti < FirstMultiXactId)
+		*minMulti = FirstMultiXactId;
+	/* minMulti must always be <= multiXactCutoff */
+	if (MultiXactIdPrecedes(*multiXactCutoff, *minMulti))
+		*minMulti = *multiXactCutoff;
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f58..b586b4aff 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -234,8 +234,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->n_dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table, however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5ca4a71d7..4dd70c334 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2456,10 +2456,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2476,10 +2476,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a409e6281..544dcf57d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -692,11 +692,11 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
+#vacuum_freeze_table_age = -1
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
+#vacuum_multixact_freeze_table_age = -1
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 109cc4727..4e39a42fe 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8215,7 +8215,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8404,7 +8404,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9120,31 +9120,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
-      <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
-      <indexterm>
-       <primary><varname>vacuum_freeze_table_age</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
-       </para>
-      </listitem>
-     </varlistentry>
-
      <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
       <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
       <indexterm>
@@ -9160,6 +9135,39 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
+      <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_table_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-xid-space"/>.
+       </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9179,7 +9187,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9225,19 +9233,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9249,10 +9265,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that
-        <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages with an older multixact ID.  The
-        default is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
+        should use to decide whether to trigger freezing of pages with
+        an older multixact ID.  The default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 554b3a75d..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly comprised of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID address space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32-bits, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     Transaction Id address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
-    transactions older than this cutoff XID are guaranteed to have been frozen.
-    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space using rules analogous to those used for
+     transaction IDs.  Many of the XID-based settings that influence
+     <command>VACUUM</command>'s behavior have direct MultiXactId
+     analogs.  A convenient way to examine information about the
+     MultiXactId address space is to execute queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively, it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with different workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand, <command>VACUUM</command> can sometimes freeze
+     many individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily
+     when a lazy strategy is likely to avoid unnecessary work
+     altogether.  Tables whose heap relation on-disk size is less than
+     <xref linkend="guc-vacuum-freeze-strategy-threshold"/> at the
+     start of <command>VACUUM</command> have page freezing triggered
+     based on <quote>lazy</quote> criteria: a page is only frozen when
+     it contains one or more XIDs that have attained an age greater
+     than <xref linkend="guc-vacuum-freeze-min-age"/>, or one or more
+     MXIDs that have attained an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well
+     as any page that <command>VACUUM</command> considers all-visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     Freezing a page eagerly is expected to cost slightly more in the
+     short term, but much less in the long term, at least on average.
+     Eager freezing also limits the accumulation of unfrozen pages,
+     which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
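+    <para>
+     As a rough illustration (ignoring any per-table
+     <literal>autovacuum_freeze_strategy_threshold</literal> settings),
+     the following query compares each table's main-fork size against
+     the current <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+     setting:
+<programlisting>
+SELECT c.oid::regclass AS table_name,
+       pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
+       current_setting('vacuum_freeze_strategy_threshold') AS eager_threshold
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+</programlisting>
+    </para>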
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     <command>VACUUM</command> must be run by autovacuum specifically
+     for the purpose of advancing <structfield>relfrozenxid</structfield>,
+     because no <command>VACUUM</command> has been triggered for some
+     time.  In practice most individual tables will consistently have
+     somewhat recent <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> values through routine
+     vacuuming to clean up old row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
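+    <para>
+     For example, with the default settings
+     (<varname>autovacuum_vacuum_threshold</varname> = 50 and
+     <varname>autovacuum_vacuum_scale_factor</varname> = 0.2), a table
+     containing 1,000,000 tuples is vacuumed once
+     50 + 0.2 * 1,000,000 = 200,050 tuples have been obsoleted by
+     <command>UPDATE</command> or <command>DELETE</command> operations.
+    </para>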
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
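+    <para>
+     For example, with the default settings
+     (<varname>autovacuum_vacuum_insert_threshold</varname> = 1000 and
+     <varname>autovacuum_vacuum_insert_scale_factor</varname> = 0.2),
+     a table containing 1,000,000 tuples is vacuumed once
+     1000 + 0.2 * 1,000,000 = 201,000 tuples have been inserted since
+     the last <command>VACUUM</command>.
+    </para>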
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
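+    <para>
+     For example, with the default settings
+     (<varname>autovacuum_analyze_threshold</varname> = 50 and
+     <varname>autovacuum_analyze_scale_factor</varname> = 0.1), a table
+     containing 1,000,000 tuples is analyzed once
+     50 + 0.1 * 1,000,000 = 100,050 tuples have been inserted, updated,
+     or deleted since the last <command>ANALYZE</command>.
+    </para>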
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when earlier
+     <command>VACUUM</command> operations against a smaller table
+     lazily opted not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound autovacuums might occur more often than this.
+    </para>
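+    <para>
+     As a rough illustration (ignoring any per-table storage parameter
+     overrides), the following query shows how close each table is to
+     the XID-based triggering threshold:
+<programlisting>
+SELECT c.oid::regclass AS table_name,
+       age(c.relfrozenxid) AS xid_age,
+       current_setting('autovacuum_freeze_max_age')::integer AS max_age
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm')
+ORDER BY age(c.relfrozenxid) DESC;
+</programlisting>
+    </para>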
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker that is
+     tasked with processing a table whose
+     <structfield>relfrozenxid</structfield> or
+     <structfield>relminmxid</structfield> is very old.  It will also
+     happen during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c582021d2..43ffbbbd3 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-min-age"/> and <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
@@ -154,13 +154,8 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>DISABLE_PAGE_SKIPPING</literal></term>
     <listitem>
      <para>
-      Normally, <command>VACUUM</command> will skip pages based on the <link
-      linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      Normally, <command>VACUUM</command> will skip pages based on the
+      <link linkend="vacuum-for-visibility-map">visibility map</link>.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -215,7 +210,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..998adf526 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,18 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin).  It will never wait for a
+# cleanup lock, since autovacuum_freeze_max_age and vacuum_freeze_table_age
+# use their default settings.
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +78,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +94,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +105,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +118,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +130,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +138,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..9963b165f 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +127,7 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..5038dbeb3 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,7 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +71,7 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+VACUUM reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.34.1

Attachment: v7-0003-Add-eager-freezing-strategy-to-VACUUM.patch (application/octet-stream)
From 88d05bffac2bca6439025953dd6dcd58f8e7976f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v7 3/6] Add eager freezing strategy to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
freezing is deemed appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach.

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have aggressive/antiwraparound
autovacuums for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice, tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.
---
 src/include/access/heapam.h                   |  6 ++
 src/include/commands/vacuum.h                 |  4 +
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++
 src/backend/access/heap/heapam.c              | 11 ++-
 src/backend/access/heap/vacuumlazy.c          | 76 ++++++++++++++++---
 src/backend/commands/vacuum.c                 |  4 +
 src/backend/postmaster/autovacuum.c           | 10 +++
 src/backend/utils/misc/guc_tables.c           | 11 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 31 ++++++--
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++
 13 files changed, 164 insertions(+), 22 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ddc81c0d1..2105907a8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -130,6 +130,12 @@ typedef struct HeapTupleFreeze
  * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
  * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
  * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ *
+ * When the 'FreezeRequired' flag isn't set after all tuples are examined, the
+ * final choice on freezing is made by VACUUM itself.  We keep open the option
+ * to freeze or not freeze (a decision that VACUUM makes based on performance
+ * considerations) by maintaining an alternative set of "no freeze" variants
+ * of our relfrozenxid/relminmxid trackers.
  */
 typedef struct HeapPageFreeze
 {
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f..52379f819 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -256,6 +259,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using the eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 10de1adc3..0d1c2affc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6440,7 +6440,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * WAL-log what we would need to do, and return true.  Return false if nothing
  * is to be changed.  In addition, set *totally_frozen to true if the tuple
  * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * more freezing will eventually be required (assuming the page is to be frozen).
+ *
+ * Although this interface is primarily tuple-based, the caller decides whether
+ * or not to freeze the page as a whole.  We'll often help the caller prepare a
+ * complete "freeze plan" that it ultimately discards.  However, our caller
+ * doesn't always get to choose; it must freeze the page when
+ * xtrack.FreezeRequired is set here.  This ensures that any XIDs < limit_xid
+ * are never left behind.
  *
  * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
  * returned true for when called.  A later heap_freeze_execute_prepared call
@@ -6634,7 +6640,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * Trigger page level freezing to ensure that we reliably process
 		 * MultiXacts as instructed by FreezeMultiXactId() in all cases.
 		 * There is no way to opt out of this, since FreezeMultiXactId()
-		 * doesn't provide for that.
+		 * doesn't provide for that. (It helps us with NewRelfrozenXid, not
+		 * with NoFreezeNewRelfrozenXid.)
 		 */
 		if ((flags & FRM_NOOP) == 0)
 			xtrack->FreezeRequired = true;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 328562b61..7a719ee7b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -150,6 +150,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Proactively freeze all tuples on pages about to be set all-visible? */
+	bool		allvis_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -254,6 +256,7 @@ typedef struct LVSavedErrInfo
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
 static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber eager_threshold,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -327,6 +330,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	MultiXactId OldestMxact,
 				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
+				eager_threshold,
 				all_visible,
 				all_frozen,
 				scanned_pages,
@@ -366,6 +370,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
 	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
 	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 *
+	 * Also determine our cutoff for applying the eager/all-visible freezing
+	 * strategy.  If rel_pages is larger than this cutoff we use the strategy,
+	 * even during non-aggressive VACUUMs.
 	 */
 	aggressive = vacuum_set_xid_limits(rel,
 									   params->freeze_min_age,
@@ -374,6 +382,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 									   params->multixact_freeze_table_age,
 									   &OldestXmin, &OldestMxact,
 									   &FreezeLimit, &MultiXactCutoff);
+	eager_threshold = params->freeze_strategy_threshold < 0 ?
+		vacuum_freeze_strategy_threshold :
+		params->freeze_strategy_threshold;
 
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 	{
@@ -526,7 +537,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
-	scanned_pages = lazy_scan_strategy(vacrel,
+	scanned_pages = lazy_scan_strategy(vacrel, eager_threshold,
 									   all_visible, all_frozen);
 	if (verbose)
 		ereport(INFO,
@@ -1282,17 +1293,28 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off freezing
+ * avoids work that would have turned out to be unnecessary.  On the other
+ * hand, we eagerly freeze pages when that strategy spreads out the burden of
+ * freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ * Antiwraparound VACUUMs of append-only tables should generally be avoided.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * When VACUUM freezes eagerly it always also scans pages eagerly, since it's
+ * important that relfrozenxid advance in affected tables, which are larger.
+ * When VACUUM freezes lazily it might make sense to scan pages lazily (skip
+ * all-visible pages) or eagerly (be capable of relfrozenxid advancement),
+ * depending on the extra cost; we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, BlockNumber eager_threshold,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1325,21 +1347,48 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
 	if (!vacrel->skipallfrozen)
 	{
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		Assert(vacrel->aggressive && !vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
 		return rel_pages;
 	}
 	else if (vacrel->aggressive)
+	{
+		/* Always freeze all-visible pages during aggressive VACUUMs */
 		Assert(!vacrel->skipallvis);
+		vacrel->allvis_freeze_strategy = true;
+	}
+	else if (rel_pages >= eager_threshold)
+	{
+		/*
+		 * Non-aggressive VACUUM of table whose rel_pages now exceeds
+		 * GUC-based threshold for eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
+		 */
+		vacrel->allvis_freeze_strategy = true;
+		vacrel->skipallvis = false;
+	}
 	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* Non-aggressive VACUUM of small table -- use lazy freeze strategy */
+		vacrel->allvis_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1847,8 +1896,15 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When the ongoing VACUUM opted to use the all-visible freezing strategy,
+	 * we freeze any page that will become all-visible, making it all-frozen
+	 * instead. (Actually, there are edge cases where this might not result in
+	 * marking the page all-frozen in the visibility map, but that should have
+	 * only a negligible impact.)
 	 */
-	if (xtrack.FreezeRequired || tuples_frozen == 0)
+	if (xtrack.FreezeRequired || tuples_frozen == 0 ||
+		(vacrel->allvis_freeze_strategy && prunestate->all_visible))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3c8ea2147..df2bd53b9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 601834d4b..72be67da0 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 836b49484..5ca4a71d7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2483,6 +2483,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c35..a409e6281 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bd50ea8e4..109cc4727 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9145,6 +9145,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to use its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9153,9 +9168,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9232,10 +9249,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages with an older multixact ID.  The
+        default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.34.1

v7-0001-Add-page-level-freezing-to-VACUUM.patch (application/octet-stream)
From 89a9779ff68c77f150f701caffc750fae7410781 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v7 1/6] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

This approach decouples the question of _how_ VACUUM could/will freeze a
given heap page (which of its XIDs are eligible to be frozen) from the
question of whether it actually makes sense to do so right now.

Just adding page-level freezing does not change all that much on its
own: VACUUM will still typically freeze very lazily, since we're only
forcing freezing of all of a page's eligible tuples when we decide to
freeze at least one (on the basis of XID age and FreezeLimit).  For now
VACUUM still freezes everything almost as lazily as it always has.
Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.
---
 src/include/access/heapam.h          |  49 +++++-
 src/backend/access/heap/heapam.c     | 234 +++++++++++++++------------
 src/backend/access/heap/vacuumlazy.c |  95 +++++++----
 3 files changed, 233 insertions(+), 145 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 810baaf9d..ddc81c0d1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -112,6 +112,40 @@ typedef struct HeapTupleFreeze
 	OffsetNumber offset;
 } HeapTupleFreeze;
 
+/*
+ * State used by VACUUM to track the details of freezing all eligible tuples
+ * on a given heap page.
+ *
+ * VACUUM prepares freeze plans for each page via heap_prepare_freeze_tuple
+ * calls (every tuple with storage gets its own call).  This page-level freeze
+ * state is updated across each call, which ultimately determines whether or
+ * not freezing the page is required. (VACUUM freezes the page via a call to
+ * heap_freeze_execute_prepared, which freezes using prepared freeze plans.)
+ *
+ * Aside from the basic question of whether or not freezing will go ahead, the
+ * state also tracks the oldest extant XID/MXID in the table as a whole, for
+ * the purposes of advancing relfrozenxid/relminmxid values in pg_class later
+ * on.  Each heap_prepare_freeze_tuple call pushes NewRelfrozenXid and/or
+ * NewRelminMxid back as required to avoid unsafe final pg_class values.  Any
+ * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
+ * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
+ * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ */
+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		FreezeRequired;
+
+	/* Values used when heap_freeze_execute_prepared is called for page */
+	TransactionId NewRelfrozenXid;
+	MultiXactId NewRelminMxid;
+
+	/* Used by callers that choose to not freeze the page */
+	TransactionId NoFreezeNewRelfrozenXid;
+	MultiXactId NoFreezeNewRelminMxid;
+
+} HeapPageFreeze;
+
 /* ----------------
  *		function prototypes for heap access method
  *
@@ -180,19 +214,20 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  TransactionId relfrozenxid, TransactionId relminmxid,
 									  TransactionId cutoff_xid, TransactionId cutoff_multi,
+									  TransactionId limit_xid, MultiXactId limit_multi,
 									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  HeapPageFreeze *xtrack);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-										 TransactionId FreezeLimit,
+										 TransactionId OldestXmin,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
-									TransactionId *relfrozenxid_out,
-									MultiXactId *relminmxid_out);
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									TransactionId MustFreezeLimit,
+									MultiXactId MustFreezeMultiLimit,
+									TransactionId *NewRelfrozenXid,
+									MultiXactId *NewRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
 
 extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d18c5ca6f..10de1adc3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6371,7 +6371,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				 * Running locker cannot possibly be older than the cutoff.
 				 *
 				 * The cutoff is <= VACUUM's OldestXmin, which is also the
-				 * initial value used for top-level relfrozenxid_out tracking
+				 * initial value used for top-level NewRelfrozenXid tracking
 				 * state.  A running locker cannot be older than VACUUM's
 				 * OldestXmin, either, so we don't need a temp_xid_out step.
 				 */
@@ -6444,26 +6444,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  *
  * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
  * returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * will execute freezing for caller's page as a whole.  Caller must initialize
+ * xtrack fields for page as a whole before calling here with first tuple for
+ * the page.
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6473,14 +6461,15 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  TransactionId relfrozenxid, TransactionId relminmxid,
 						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  TransactionId limit_xid, MultiXactId limit_multi,
 						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  HeapPageFreeze *xtrack)
 {
 	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
 	frz->frzflags = 0;
@@ -6489,18 +6478,17 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen iff our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for NewRelfrozenXid handling for already-frozen xmin */
+	}
 	else
 	{
 		if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6509,8 +6497,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoff_xid);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
@@ -6523,9 +6511,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		}
 		else
 		{
-			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			/* xmin to remain unfrozen.  Could push back NewRelfrozenXid. */
+			if (TransactionIdPrecedes(xid, xtrack->NewRelfrozenXid))
+				xtrack->NewRelfrozenXid = xid;
 		}
 	}
 
@@ -6536,7 +6524,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
+	 * Make sure to keep heap_tuple_would_freeze in sync with this.  It needs
+	 * to return true for any tuple that we would force to be frozen here.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6544,7 +6533,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = xtrack->NewRelfrozenXid;
 
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
 									relfrozenxid, relminmxid,
@@ -6558,13 +6547,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax will become an updater Xid (original MultiXact's updater
 			 * member Xid will be carried forward as a simple Xid in Xmax).
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
+			 * Might have to ratchet back NewRelfrozenXid here, though never
+			 * NewRelminMxid.
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			if (TransactionIdPrecedes(newxmax, xtrack->NewRelfrozenXid))
+				xtrack->NewRelfrozenXid = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6587,15 +6576,15 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax is an old MultiXactId that we have to replace with a new
 			 * MultiXactId, to carry forward two or more original member XIDs.
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
+			 * Might have to ratchet back NewRelfrozenXid here, though never
+			 * NewRelminMxid.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, xtrack->NewRelminMxid));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->NewRelfrozenXid));
+			xtrack->NewRelfrozenXid = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6617,26 +6606,38 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		{
 			/*
 			 * xmax is a MultiXactId, and nothing about it changes for now.
-			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+			 * Might have to ratchet back NewRelminMxid, NewRelfrozenXid, or
 			 * both together.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 xtrack->NewRelfrozenXid));
+			if (MultiXactIdPrecedes(xid, xtrack->NewRelminMxid))
+				xtrack->NewRelminMxid = xid;
+			xtrack->NewRelfrozenXid = mxid_oldest_xid_out;
 		}
 		else
 		{
 			/*
 			 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
-			 * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+			 * Won't have to ratchet back NewRelminMxid or NewRelfrozenXid.
+			 *
+			 * Note: heap_tuple_would_freeze() might not insist that this xmax
+			 * be frozen now, but we always freeze Multis proactively.
 			 */
 			Assert(freeze_xmax);
 			Assert(!TransactionIdIsValid(newxmax));
 		}
+
+		/*
+		 * Trigger page level freezing to ensure that we reliably process
+		 * MultiXacts as instructed by FreezeMultiXactId() in all cases.
+		 * There is no way to opt out of this, since FreezeMultiXactId()
+		 * doesn't provide for that.
+		 */
+		if ((flags & FRM_NOOP) == 0)
+			xtrack->FreezeRequired = true;
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
@@ -6661,13 +6662,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						 errmsg_internal("cannot freeze committed xmax %u",
 										 xid)));
 			freeze_xmax = true;
-			/* No need for relfrozenxid_out handling, since we'll freeze xmax */
+			/* No need for NewRelfrozenXid handling, since we'll freeze xmax */
 		}
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, xtrack->NewRelfrozenXid))
+				xtrack->NewRelfrozenXid = xid;
 		}
 	}
 	else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
@@ -6675,7 +6676,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		freeze_xmax = false;
 		xmax_already_frozen = true;
-		/* No need for relfrozenxid_out handling for already-frozen xmax */
+		/* No need for NewRelfrozenXid handling for already-frozen xmax */
 	}
 	else
 		ereport(ERROR,
@@ -6683,6 +6684,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(changed);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6709,18 +6715,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		xid = HeapTupleHeaderGetXvac(tuple);
 
-		/*
-		 * For Xvac, we ignore the cutoff_xid and just always perform the
-		 * freeze operation.  The oldest release in which such a value can
-		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
-		 */
+		/* For Xvac, we always freeze proactively */
 		if (TransactionIdIsNormal(xid))
 		{
+			Assert(TransactionIdPrecedes(xid, cutoff_xid));
+
 			/*
 			 * If a MOVED_OFF tuple is not dead, the xvac transaction must
 			 * have failed; whereas a non-dead MOVED_IN tuple must mean the
@@ -6731,18 +6730,39 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
+
+			/*
+			 * Force freezing any page with an xvac to keep things simple.
+			 * This allows totally_frozen tracking to ignore xvac.
+			 */
 			changed = true;
+			xtrack->FreezeRequired = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen (provided caller executes freeze plan for the page)
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
+
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < limit_xid (and MXIDs < limit_multi) must never remain
+	 */
+	if (!xtrack->FreezeRequired &&
+		!(xmin_already_frozen && xmax_already_frozen))
+	{
+		xtrack->FreezeRequired =
+			heap_tuple_would_freeze(tuple, limit_xid, limit_multi,
+									&xtrack->NoFreezeNewRelfrozenXid,
+									&xtrack->NoFreezeNewRelminMxid);
+	}
+
 	return changed;
 }
 
@@ -6786,13 +6806,13 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
  */
 void
 heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-							 TransactionId FreezeLimit,
+							 TransactionId OldestXmin,
 							 HeapTupleFreeze *tuples, int ntuples)
 {
 	Page		page = BufferGetPage(buffer);
 
 	Assert(ntuples > 0);
-	Assert(TransactionIdIsNormal(FreezeLimit));
+	Assert(TransactionIdIsNormal(OldestXmin));
 
 	START_CRIT_SECTION();
 
@@ -6821,11 +6841,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 		nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
 
 		/*
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
+		 * OldestXmin is the first XID not frozen by VACUUM.  Back up caller's
+		 * OldestXmin to avoid false conflicts.
 		 */
-		snapshotConflictHorizon = FreezeLimit;
+		snapshotConflictHorizon = OldestXmin;
 		TransactionIdRetreat(snapshotConflictHorizon);
 
 		xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
@@ -6867,14 +6886,20 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	HeapTupleFreeze frz;
 	bool		do_freeze;
 	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	HeapPageFreeze dummy;
+
+	dummy.FreezeRequired = true;
+	dummy.NewRelfrozenXid = cutoff_xid;
+	dummy.NewRelminMxid = cutoff_multi;
+	dummy.NoFreezeNewRelfrozenXid = cutoff_xid;
+	dummy.NoFreezeNewRelminMxid = cutoff_multi;
 
 	do_freeze = heap_prepare_freeze_tuple(tuple,
 										  relfrozenxid, relminmxid,
 										  cutoff_xid, cutoff_multi,
+										  cutoff_xid, cutoff_multi,
 										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+										  &dummy);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7304,15 +7329,16 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
  * could be processed by pruning away the whole tuple instead of freezing.
  *
- * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
- * like the heap_prepare_freeze_tuple arguments that they're based on.  We
- * never freeze here, which makes tracking the oldest extant XID/MXID simple.
+ * The *NewRelfrozenXid and *NewRelminMxid input/output arguments work just
+ * like the similar fields from the FreezeCutoffs struct.  We never freeze
+ * here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
-						TransactionId *relfrozenxid_out,
-						MultiXactId *relminmxid_out)
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						TransactionId MustFreezeLimit,
+						MultiXactId MustFreezeMultiLimit,
+						TransactionId *NewRelfrozenXid,
+						MultiXactId *NewRelminMxid)
 {
 	TransactionId xid;
 	MultiXactId multi;
@@ -7322,9 +7348,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (TransactionIdIsNormal(xid))
 	{
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+			*NewRelfrozenXid = xid;
+		if (TransactionIdPrecedes(xid, MustFreezeLimit))
 			would_freeze = true;
 	}
 
@@ -7339,9 +7365,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	if (TransactionIdIsNormal(xid))
 	{
 		/* xmax is a non-permanent XID */
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+			*NewRelfrozenXid = xid;
+		if (TransactionIdPrecedes(xid, MustFreezeLimit))
 			would_freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
@@ -7351,8 +7377,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
 	{
 		/* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NewRelminMxid))
+			*NewRelminMxid = multi;
 		/* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
 		would_freeze = true;
 	}
@@ -7362,9 +7388,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		MultiXactMember *members;
 		int			nmembers;
 
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
+		if (MultiXactIdPrecedes(multi, *NewRelminMxid))
+			*NewRelminMxid = multi;
+		if (MultiXactIdPrecedes(multi, MustFreezeMultiLimit))
 			would_freeze = true;
 
 		/* need to check whether any member of the mxact is old */
@@ -7375,9 +7401,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		{
 			xid = members[i].xid;
 			Assert(TransactionIdIsNormal(xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+				*NewRelfrozenXid = xid;
+			if (TransactionIdPrecedes(xid, MustFreezeLimit))
 				would_freeze = true;
 		}
 		if (nmembers > 0)
@@ -7389,9 +7415,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		xid = HeapTupleHeaderGetXvac(tuple);
 		if (TransactionIdIsNormal(xid))
 		{
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+				*NewRelfrozenXid = xid;
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			would_freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 834ab83a0..9c84f8397 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
 
 	/* VACUUM operation's cutoffs for freezing and pruning */
 	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+	/* Limits on the age of the oldest unfrozen XID and MXID */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -511,6 +512,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->OldestXmin = OldestXmin;
+	vacrel->OldestMxact = OldestMxact;
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
 	vacrel->FreezeLimit = FreezeLimit;
@@ -1563,8 +1565,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	HeapPageFreeze xtrack;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
@@ -1580,8 +1582,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.FreezeRequired = false;
+	xtrack.NewRelfrozenXid = vacrel->NewRelfrozenXid;
+	xtrack.NewRelminMxid = vacrel->NewRelminMxid;
+	xtrack.NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+	xtrack.NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1634,27 +1639,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether the page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1782,11 +1783,13 @@ retry:
 		if (heap_prepare_freeze_tuple(tuple.t_data,
 									  vacrel->relfrozenxid,
 									  vacrel->relminmxid,
+									  vacrel->OldestXmin,
+									  vacrel->OldestMxact,
 									  vacrel->FreezeLimit,
 									  vacrel->MultiXactCutoff,
 									  &frozen[tuples_frozen],
 									  &tuple_totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &xtrack))
 		{
 			/* Save prepared freeze plan for later */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1807,9 +1810,33 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (xtrack.FreezeRequired || tuples_frozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = xtrack.NewRelfrozenXid;
+		vacrel->NewRelminMxid = xtrack.NewRelminMxid;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = xtrack.NoFreezeNewRelfrozenXid;
+		vacrel->NewRelminMxid = xtrack.NoFreezeNewRelminMxid;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
@@ -1817,12 +1844,12 @@ retry:
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
 		/* Execute all freeze plans for page as a single atomic action */
-		heap_freeze_execute_prepared(vacrel->rel, buf, vacrel->FreezeLimit,
+		heap_freeze_execute_prepared(vacrel->rel, buf, vacrel->OldestXmin,
 									 frozen, tuples_frozen);
 	}
 
@@ -1841,7 +1868,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1849,8 +1876,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1871,9 +1897,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1887,6 +1910,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
-- 
2.34.1

In reply to: Peter Geoghegan (#33)
6 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Fri, Nov 18, 2022 at 5:06 PM Peter Geoghegan <pg@bowt.ie> wrote:

I've already prototyped a dedicated immutable "cutoffs" struct, which
is instantiated exactly once per VACUUM. Seems like a good approach to
me. The immutable state can be shared by heapam.c's
heap_prepare_freeze_tuple(), vacuumlazy.c, and even
vacuum_set_xid_limits() -- so everybody can work off of the same
struct directly. Will try to get that into shape for the next
revision.

Attached is v8.

Notable improvement over v7:

* As anticipated on November 18th, this revision adds a new refactoring
commit/patch, which adds a struct containing fields like FreezeLimit
and OldestXmin; vacuumlazy.c uses that struct to pass the information
to heap_prepare_freeze_tuple().

This refactoring makes everything easier to understand -- it's a
significant structural improvement.
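
Roughly, the new struct looks like this (a sketch only -- the actual
VacuumCutoffs definition in commands/vacuum.h has a few more fields,
and a later patch in the series adds freeze_strategy_threshold to it):

#include "postgres.h"           /* TransactionId, MultiXactId */

/*
 * Immutable per-VACUUM cutoffs, established once by
 * vacuum_set_xid_limits() and then treated as read-only by
 * vacuumlazy.c and heap_prepare_freeze_tuple().
 */
struct VacuumCutoffs
{
    /* XIDs/MXIDs older than these are eligible to be frozen */
    TransactionId OldestXmin;
    MultiXactId OldestMxact;

    /* XIDs/MXIDs older than these force page-level freezing */
    TransactionId FreezeLimit;
    MultiXactId MultiXactCutoff;
};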

* The changes intended to avoid allocating a new Multi during VACUUM
no longer appear in their own commit. That was squashed/combined with
the earlier page-level freezing commit.

This is another structural improvement.

The FreezeMultiXactId() changes were never really an optimization, and
I shouldn't have explained them that way. They are only needed to
avoid MultiXactId related regressions that page-level freezing would
otherwise cause. Doing these changes in the page-level freezing patch
makes that far clearer.

* Fixes an issue with snapshotConflictHorizon values for FREEZE_PAGE
records, where earlier revisions could generate more false recovery
conflicts than HEAD does.

In other words, v8 addresses a concern that you (Andres) had in your
review of v6, here:

Won't using OldestXmin instead of FreezeLimit potentially cause additional
conflicts? Is there any reason to not compute an accurate value?

As anticipated, it is possible to generate valid FREEZE_PAGE
snapshotConflictHorizon using LVPagePruneState.visibility_cutoff_xid
in almost all cases -- so we should avoid almost all false recovery
conflicts. Granted, my approach here only works when the page will
become eligible to mark all-frozen (otherwise we can't trust
LVPagePruneState.visibility_cutoff_xid and have to fall back on
OldestXmin), but that's not really a problem: in practice, page-level
freezing is supposed to find a way to freeze pages as a group, or not
at all, so falling back on OldestXmin should be very rare.

I could be more precise about generating a FREEZE_PAGE
snapshotConflictHorizon than this, but that didn't seem worth the
added complexity (I'd prefer to be able to ignore MultiXacts/xmax for
this stuff). I'm pretty sure that the new v8 approach is more than
good enough. It's actually an improvement on HEAD, where
snapshotConflictHorizon is derived from FreezeLimit, an approach with
the same basic problem as deriving snapshotConflictHorizon from
OldestXmin. Namely: using FreezeLimit is a poor proxy for what we
really want to use, which is a cutoff that comes from the specific
latest XID in some specific tuple header on the page we're freezing.
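
To spell out the rule (a sketch only, not the patch's exact code -- the
helper and its arguments are stand-ins for the corresponding prunestate
and cutoff state in vacuumlazy.c):

#include "postgres.h"
#include "access/transam.h"     /* TransactionIdRetreat() */

/*
 * Pick the snapshotConflictHorizon for a FREEZE_PAGE record.
 * all_frozen_after_freeze stands in for "the page will be marked
 * all-frozen in the VM once its freeze plans are executed".
 */
static TransactionId
freeze_conflict_horizon(bool all_frozen_after_freeze,
                        TransactionId visibility_cutoff_xid,
                        TransactionId OldestXmin)
{
    TransactionId horizon;

    if (all_frozen_after_freeze)
    {
        /* Common case: trust the page's own precise cutoff */
        horizon = visibility_cutoff_xid;
    }
    else
    {
        /* Rare fallback; back up by one to avoid false conflicts */
        horizon = OldestXmin;
        TransactionIdRetreat(horizon);
    }

    return horizon;
}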

There are no remaining blockers to commit for the first two patches
from v8 (the two patches that add page-level freezing). I think that
I'll be able to commit page-level freezing in a matter of weeks, in
fact. All specific outstanding concerns about page-level freezing have
been addressed.

I believe that page-level freezing is uncontroversial. Unlike later
patches in the series, it changes nothing user-facing about VACUUM --
nothing very high level. Having the freeze plan deduplication work
added by commit 9e540599 helps here. The focus is on WAL overhead over
time, and page-level freezing can almost be understood as a mechanical
improvement to freezing that keeps costs down over time.

--
Peter Geoghegan

Attachments:

v8-0004-Add-eager-freezing-strategy-to-VACUUM.patch (application/x-patch)
From 78d6a5c1f0c71b7c3e2ff925a6d8efa36e12a8a4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 15:13:27 -0700
Subject: [PATCH v8 4/6] Add eager freezing strategy to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  Use of the eager strategy (an
alternative to the classic lazy freezing strategy) is controlled by a
new GUC, vacuum_freeze_strategy_threshold (and an associated
autovacuum_* reloption).  Tables whose rel_pages are >= the cutoff will
have VACUUM use the eager freezing strategy.  Otherwise we use the lazy
freezing strategy, which is the classic approach.

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).

If and when a smaller table (a table that uses lazy freezing at first)
grows past the table size threshold, the next VACUUM against the table
shouldn't have to do too much extra freezing to catch up when we perform
eager freezing for the first time (the table still won't be very large).
Once VACUUM has caught up, the amount of work required in each VACUUM
operation should be roughly proportionate to the number of new pages, at
least with a pure append-only table.

In summary, we try to get the benefit of the lazy freezing strategy,
without ever allowing VACUUM to fall uncomfortably far behind.  In
particular, we avoid accumulating an excessive number of unfrozen
all-visible pages in any one table.  This approach is often enough to
keep relfrozenxid recent, but we still have antiwraparound autovacuums
for tables where it doesn't work out that way.

Note that freezing strategy is distinct from (though related to) the
strategy for skipping pages with the visibility map.  In practice tables
that use eager freezing always eagerly scan all-visible pages (they
prioritize advancing relfrozenxid), partly because we expect few or no
all-visible pages there (at least during the second or subsequent VACUUM
that uses eager freezing).  When VACUUM uses the classic/lazy freezing
strategy, VACUUM will also scan pages eagerly (i.e. it will scan any
all-visible pages and only skip all-frozen pages) when the added cost is
relatively low.

This is preparation for an upcoming commit that completely removes
aggressive mode VACUUMs.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/heapam.h                   |  6 ++
 src/include/commands/vacuum.h                 | 10 ++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 ++++
 src/backend/access/heap/heapam.c              | 13 ++++-
 src/backend/access/heap/vacuumlazy.c          | 55 ++++++++++++++++---
 src/backend/commands/vacuum.c                 | 16 +++++-
 src/backend/postmaster/autovacuum.c           | 10 ++++
 src/backend/utils/misc/guc_tables.c           | 11 ++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 31 ++++++++---
 doc/src/sgml/maintenance.sgml                 |  6 +-
 doc/src/sgml/ref/create_table.sgml            | 14 +++++
 13 files changed, 164 insertions(+), 21 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ca4fab970..57d824740 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -131,6 +131,12 @@ typedef struct HeapTupleFreeze
  * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
  * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
  * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ *
+ * When 'freeze_required' flag isn't set after all tuples are examined, the
+ * final choice on freezing is made by VACUUM itself.  We keep open the option
+ * to freeze or not freeze (a decision that VACUUM makes based on performance
+ * considerations) by maintaining an alternative set of "no freeze" variants
+ * of our relfrozenxid/relminmxid trackers in heap_prepare_freeze_tuple.
  */
 typedef struct HeapPageFreeze
 {
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 02289f42e..122fb93e2 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -272,6 +275,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) for triggering eager/all-visible freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -295,6 +304,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 45cdc1ae8..caa34bd35 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6466,7 +6466,7 @@ FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the FreezeLimit and/or MultiXactCutoff cutoffs.  If so,
+ * are older than the OldestXmin and/or OldestMxact freeze cutoffs.  If so,
  * setup enough state (in the *frz output argument) to later execute and
  * WAL-log what caller needs to do for the tuple, and return true.  Return
  * false if nothing can be changed about the tuple right now.
@@ -6478,8 +6478,15 @@ FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
  *
  * VACUUM caller must assemble HeapTupleFreeze freeze plan entries for every
  * tuple that we returned true for, and call heap_freeze_execute_prepared to
- * execute freezing.  Caller must initialize pagefrz fields for page as a
- * whole before first call here for each heap page.
+ * execute freezing for the page as a whole.  Caller must initialize pagefrz
+ * fields for page as a whole before first call here for each heap page.
+ *
+ * VACUUM caller decides on whether or not to freeze the page as a whole.
+ * We'll often prepare freeze plans for a page that caller just discards.
+ * However, VACUUM doesn't always get to make a choice; it must freeze when
+ * pagefrz.freeze_required is set, to ensure that any XIDs < FreezeLimit (and
+ * MXIDs < MultiXactCutoff) can never be left behind.  We make sure that
+ * VACUUM always follows that rule.
  *
  * We sometimes force freezing of xmax MultiXactId values long before it is
  * strictly necessary to do so just to ensure the FreezeLimit postcondition.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 145d9f24f..a1984d68e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -110,7 +110,7 @@
 
 /*
  * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages
+ * all-visible pages when using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
@@ -154,6 +154,8 @@ typedef struct LVRelState
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -1244,10 +1246,21 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine skipping strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
- * Determines if the ongoing VACUUM operation should skip all-visible pages
- * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing totally avoids work that turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.  Performance stability is important; no
+ * one VACUUM operation should need to freeze disproportionately many pages.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages when advancing relfrozenxid is optional.  When VACUUM freezes eagerly
+ * it always also scans pages eagerly, since it's important that relfrozenxid
+ * advance in affected tables, which are larger.  When VACUUM freezes lazily
+ * it might make sense to scan pages lazily (skip all-visible pages) or
+ * eagerly (be capable of relfrozenxid advancement), depending on the extra
+ * cost - we might need to scan only a few extra pages.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
@@ -1287,17 +1300,35 @@ lazy_scan_strategy(LVRelState *vacrel,
 		scanned_pages_skipallfrozen++;
 
 	/*
-	 * Okay, now we have all the information we need to decide on a strategy
+	 * Okay, now we have all the information we need to decide on a strategy.
+	 *
+	 * We use the all-visible/eager freezing strategy when a threshold
+	 * controlled by the freeze_strategy_threshold GUC/reloption is crossed.
+	 * VACUUM won't accumulate any unfrozen all-visible pages over time in
+	 * tables above the threshold.  The system won't fall behind on freezing.
 	 */
+	if (rel_pages >= vacrel->cutoffs.freeze_strategy_threshold)
 	{
 		/*
-		 * TODO: Add code for eager freezing strategy here in next commit
+		 * VACUUM of table whose rel_pages now exceeds GUC-based threshold for
+		 * eager freezing.
+		 *
+		 * We always scan all-visible pages when the threshold is crossed, so
+		 * that relfrozenxid can be advanced.  There will typically be few or
+		 * no all-visible pages (only all-frozen) in the table anyway, at
+		 * least after the first VACUUM that exceeds the threshold.
 		 */
+		vacrel->eager_freeze_strategy = true;
+		vacrel->skipallvis = false;
 	}
+	else
 	{
 		BlockNumber nextra,
 					nextra_threshold;
 
+		/* VACUUM of small table -- use lazy freeze strategy */
+		vacrel->eager_freeze_strategy = false;
+
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1806,8 +1837,18 @@ retry:
 	 *
 	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
+	 *
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will become all-visible, making it all-frozen instead.
+	 * (Actually, the all-visible/eager freezing strategy doesn't quite work
+	 * that way.  It triggers freezing for pages that it sees will thereby be
+	 * set all-frozen in the VM immediately afterwards -- a stricter test.
+	 * Some pages that can be set all-visible cannot also be set all-frozen,
+	 * even after freezing, due to the presence of lock-only MultiXactIds.)
 	 */
-	if (pagefrz.freeze_required || tuples_frozen == 0)
+	if (pagefrz.freeze_required || tuples_frozen == 0 ||
+		(prunestate->all_visible && prunestate->all_frozen &&
+		 vacrel->eager_freeze_strategy))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0fb211845..ffa8eac12 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -943,7 +947,8 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -960,6 +965,7 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/*
 	 * Acquire OldestXmin.
@@ -1070,6 +1076,14 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 				 errhint("Close open transactions soon to avoid wraparound problems.\n"
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Assert that all cutoff invariants hold.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 601834d4b..72be67da0 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 349dd6a53..d3c8ae87d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2503,6 +2503,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM uses its eager freezing strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c35..a409e6281 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 24b1624ba..9c5861bd7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9145,6 +9145,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9153,9 +9168,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9232,10 +9249,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages with an older multixact ID.  The
+        default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.38.1

Attachment: v8-0006-Size-VACUUM-s-dead_items-space-using-VM-snapshot.patch (application/x-patch)
From 5c05352c5bf675c278bc983f9c7711d60c8aa629 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 23 Jul 2022 17:19:01 -0700
Subject: [PATCH v8 6/6] Size VACUUM's dead_items space using VM snapshot.

Following recent work, VACUUM knows precisely how many pages it will
scan ahead of time, from its snapshot of the visibility map.  Apply that
information to size the dead_items space for TIDs more precisely (use
scanned_pages instead of rel_pages to cap the allocation).

This can make the memory allocation significantly smaller, without any
added risk of undersizing the array.  Since VACUUM's final scanned_pages
is fully predetermined (by the visibility map snapshot), there is no
question of interference from another backend that concurrently unsets
some heap page's visibility map bit.  Many details of how VACUUM will
process the target relation are "locked in" from the very beginning.
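
To put rough numbers on the saving, here is a stand-alone sketch (not
patch code; it assumes the default 8KB block size, where
MaxHeapTuplesPerPage is 291 and a TID takes 6 bytes, and it ignores the
small dead_items header):

    #include <stdio.h>

    #define Min(a, b) ((a) < (b) ? (a) : (b))

    int
    main(void)
    {
        long    mem_limit = (64L * 1024 * 1024) / 6;    /* 64MB of 6-byte TIDs */
        long    rel_pages = 1000000;        /* ~8GB table */
        long    scanned_pages = 10000;      /* per the VM snapshot */
        long    old_items = Min(mem_limit, rel_pages * 291);
        long    new_items = Min(mem_limit, scanned_pages * 291);

        printf("old cap: %ld TIDs (~%ld MB)\n", old_items,
               old_items * 6 / (1024 * 1024));
        printf("new cap: %ld TIDs (~%ld MB)\n", new_items,
               new_items * 6 / (1024 * 1024));
        return 0;
    }

With those (invented) numbers the allocation shrinks from the full 64MB
to under 17MB, without any risk of undersizing the array, since
scanned_pages is fixed by the VM snapshot.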

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CAH2-Wzn9MquY1=msQUaS9Rj0HMGfgZisCCoVdc38T=AZM_ZV9w@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e6c2ff89f..acd27b447 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -279,7 +279,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -507,7 +508,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
@@ -3146,14 +3147,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3162,15 +3162,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3192,12 +3190,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
-- 
2.38.1

Attachment: v8-0003-Teach-VACUUM-to-use-visibility-map-snapshots.patch (application/x-patch)
From 4940359df2c386eeca0e9975d3861c8c2342f0f4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v8 3/6] Teach VACUUM to use visibility map snapshots.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.

This has significant advantages over the previous approach of using the
authoritative VM fork to decide on which pages to skip.  The number of
heap pages processed will no longer increase when some other backend
concurrently modifies a skippable page, since VACUUM will continue to
see the page as skippable (which is correct because the page really is
still skippable "relative to VACUUM's OldestXmin cutoff").  It also
gives VACUUM reliable information about how many pages will be scanned,
before its physical heap scan even begins.  That makes it easier to
model the costs that VACUUM incurs using a top-down, up-front approach.
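
As a minimal sketch of the new API's lifecycle (variable names here are
illustrative; the real caller is heap_vacuum_rel, in the diff below):

    vmsnapshot *vmsnap;
    BlockNumber all_visible,
                all_frozen;

    /* copy the VM once, counting all-visible/all-frozen pages as we go */
    vmsnap = visibilitymap_snap(rel, rel_pages, &all_visible, &all_frozen);

    for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
    {
        uint8       mapbits = visibilitymap_snap_status(vmsnap, blkno);

        if (mapbits & VISIBILITYMAP_ALL_FROZEN)
            continue;           /* skippable, regardless of strategy */
        /* ... prune and freeze blkno ... */
    }

    /* free the local copy (a temp file in later revisions, per the TODO) */
    visibilitymap_snap_release(vmsnap);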

Non-aggressive VACUUMs now make an up-front choice about VM skipping
strategy: they decide whether to prioritize early advancement of
relfrozenxid (eager behavior) over avoiding work by skipping all-visible
pages (lazy behavior).  Nothing about the details of how lazy_scan_prune
freezes changes just yet, though a later commit will add the concept of
freezing strategies.

Non-aggressive VACUUMs now explicitly commit to (or decide against)
early relfrozenxid advancement up-front.  VACUUM will now either scan
every all-visible page, or none at all.  This replaces lazy_scan_skip's
SKIP_PAGES_THRESHOLD behavior, which was intended to enable early
relfrozenxid advancement (see commit bf136cf6), but left many of the
details to chance.  It was possible that a single all-visible page
located in a range of all-frozen blocks would render it unsafe to
advance relfrozenxid later on; lazy_scan_skip just didn't have enough
high-level context about the table as a whole.  Now the decision to skip
all-visible pages coincides exactly with the decision on whether it is
safe to advance relfrozenxid later on; nothing is left to chance.
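
Concretely, the new up-front policy reduces to a single test (lightly
simplified from lazy_scan_strategy in the diff below):

    /* extra pages scanned if we decline to skip all-visible pages */
    nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
    nextra_threshold = Max(32, (double) rel_pages * 0.05);

    /* skip all-visible pages only when scanning them costs too much */
    vacrel->skipallvis = nextra >= nextra_threshold &&
        vacrel->skipallfrozen && !vacrel->aggressive;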

Note that DISABLE_PAGE_SKIPPING no longer forces aggressive mode.  A
later commit will completely remove the concept of aggressive mode
VACUUM, but things work out a bit simpler if we do this part now.

TODO: We don't spill VM snapshots to disk just yet (resource management
aspects of VM snapshots still need work).  For now a VM snapshot is just
a copy of the VM pages stored in local buffers allocated by palloc().
Note in particular that this completely ignores the impact of allocating
large buffers when vacuuming large tables.

XXX: Commit bf136cf6 was also concerned about triggering readahead as a
primitive form of prefetching.  Do we also need to add I/O explicit I/O
prefetching hints to make up for what may have been lost?
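
One possible answer (purely a sketch, not implemented here; the prefetch
distance of 32 is arbitrary) would be to issue explicit prefetch
requests for the blocks that the VM snapshot already tells us we will
scan, along these lines:

    BlockNumber prefetch_block = next_block_to_scan;

    /* ask the kernel to start reading the next ~32 blocks we'll scan */
    for (int i = 0; i < 32 && prefetch_block < rel_pages; i++)
    {
        bool        all_visible;

        PrefetchBuffer(vacrel->rel, MAIN_FORKNUM, prefetch_block);
        prefetch_block = lazy_scan_skip(vacrel, prefetch_block + 1,
                                        &all_visible);
    }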

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h      |   7 +
 src/include/commands/vacuum.h           |   2 +-
 src/backend/access/heap/vacuumlazy.c    | 352 +++++++++++++-----------
 src/backend/access/heap/visibilitymap.c | 164 +++++++++++
 doc/src/sgml/ref/vacuum.sgml            |   9 +-
 5 files changed, 363 insertions(+), 171 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..49f197bb3 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,9 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +38,10 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+									  BlockNumber *all_visible, BlockNumber *all_frozen);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
+extern uint8 visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 43ee24b12..02289f42e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 
 /*
  * Values used by index_cleanup and truncate params.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9753b6b08..145d9f24f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,10 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Threshold that controls whether non-aggressive VACUUMs will skip any
+ * all-visible pages
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -150,8 +150,10 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
+	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	bool		skipallvis;
+	/* Skip (don't scan) all-frozen pages? */
+	bool		skipallfrozen;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -168,7 +170,8 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map used by lazy_scan_skip */
+	vmsnapshot *vmsnap;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -241,10 +244,11 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  BlockNumber all_visible,
+									  BlockNumber all_frozen);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel,
+								  BlockNumber next_block, bool *all_visible);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -307,11 +311,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	bool		verbose,
 				instrument,
 				aggressive,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	struct VacuumCutoffs cutoffs;
 	BlockNumber orig_rel_pages,
+				all_visible,
+				all_frozen,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -348,17 +354,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	aggressive = vacuum_set_xid_limits(rel, params, &cutoffs);
 
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		aggressive = true;
-		skipwithvm = false;
-	}
-
 	/*
 	 * Setup error traceback support for ereport() first.  The idea is to set
 	 * up an error context callback to display additional information on any
@@ -381,20 +376,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
@@ -422,7 +403,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
 	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
+	vacrel->skipallvis = false; /* arbitrary initial value */
+	/* skipallfrozen indicates DISABLE_PAGE_SKIPPING to lazy_scan_strategy */
+	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -478,12 +461,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->cutoffs = cutoffs;
@@ -491,7 +468,35 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
+
+	/*
+	 * VACUUM must scan all pages that might have XIDs < OldestXmin in tuple
+	 * headers to be able to safely advance relfrozenxid later on.  There is
+	 * no good reason to scan any additional pages. (Actually we might opt to
+	 * skip all-visible pages.  Either way we won't scan pages for no reason.)
+	 *
+	 * Now that OldestXmin and rel_pages are acquired, acquire an immutable
+	 * snapshot of the visibility map as well.  lazy_scan_skip works off of
+	 * the vmsnap, not the authoritative VM, which can continue to change.
+	 * Pages that lazy_scan_heap will scan are fixed and known in advance.
+	 *
+	 * The exact number of pages that lazy_scan_heap will scan also depends on
+	 * our choice of skipping strategy.  VACUUM can either choose to skip any
+	 * all-visible pages lazily, or choose to scan those same pages instead.
+	 * Decide on a skipping strategy to determine final scanned_pages.
+	 */
+	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
+										&all_visible, &all_frozen);
+	scanned_pages = lazy_scan_strategy(vacrel, all_visible, all_frozen);
+	if (verbose)
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -508,6 +513,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -554,12 +560,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(aggressive ? cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
+		 * lazy_scan_strategy call determined it would skip all-visible pages
 		 */
 		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -604,6 +609,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -631,10 +639,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -829,13 +833,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -849,42 +852,24 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = lazy_scan_skip(vacrel, 0, &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		if (blkno < next_block_to_scan)
+			continue;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = lazy_scan_skip(vacrel, blkno + 1,
+											&next_all_visible);
 
 		vacrel->scanned_pages++;
 
@@ -1094,10 +1079,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1125,12 +1109,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1169,7 +1151,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1262,47 +1244,123 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *	lazy_scan_strategy() -- Determine skipping strategy.
  *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * Determines if the ongoing VACUUM operation should skip all-visible pages
+ * for non-aggressive VACUUMs, where advancing relfrozenxid is optional.
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_strategy(LVRelState *vacrel,
+				   BlockNumber all_visible,
+				   BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
+	uint8		mapbits;
 
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
+	Assert(vacrel->scanned_pages == 0);
+	Assert(rel_pages >= all_visible && all_visible >= all_frozen);
+
+	/*
+	 * First figure out the final scanned_pages for each of the skipping
+	 * policies that lazy_scan_skip might end up using: skipallvis (skip both
+	 * all-frozen and all-visible) and skipallfrozen (just skip all-frozen).
+	 */
+	scanned_pages_skipallvis = rel_pages - all_visible;
+	scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * Even if the last page is skippable, it will still get scanned because
+	 * of lazy_scan_skip's "try to set nonempty_pages for last page" rule.
+	 * Reconcile that rule with what vmsnap says about the last page now.
+	 *
+	 * When vmsnap thinks that we will be skipping the last page (we won't),
+	 * increment scanned_pages to compensate.  Otherwise change nothing.
+	 */
+	mapbits = visibilitymap_snap_status(vacrel->vmsnap, rel_pages - 1);
+	if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
+		scanned_pages_skipallvis++;
+	if (mapbits & VISIBILITYMAP_ALL_FROZEN)
+		scanned_pages_skipallfrozen++;
+
+	/*
+	 * Okay, now we have all the information we need to decide on a strategy
+	 */
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
+		/*
+		 * TODO: Add code for eager freezing strategy here in next commit
+		 */
+	}
+	{
+		BlockNumber nextra,
+					nextra_threshold;
+
+		/*
+		 * Decide on whether or not we'll skip all-visible pages.
+		 *
+		 * In general, VACUUM doesn't necessarily have to freeze anything to
+		 * be able to advance relfrozenxid and/or relminmxid by a significant
+		 * number of XIDs/MXIDs.  The oldest tuples might turn out to have
+		 * been deleted since VACUUM last ran, or this VACUUM might find that
+		 * there simply are no MultiXacts that even need to be considered.
+		 *
+		 * It's hard to predict whether this VACUUM operation will work out
+		 * that way, so be lazy (just skip) unless the added cost is very low.
+		 * We opt for a skipallfrozen-only VACUUM when the number of extra
+		 * pages (extra scanned pages that are all-visible but not all-frozen)
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 */
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+		nextra_threshold = Max(32, nextra_threshold);
+
+		/* Only skipallvis when DISABLE_PAGE_SKIPPING not in use */
+		vacrel->skipallvis = nextra >= nextra_threshold &&
+			vacrel->skipallfrozen && !vacrel->aggressive;
+	}
+
+	/* Return the appropriate variant of scanned_pages */
+	if (vacrel->skipallvis)
+	{
+		Assert(!vacrel->aggressive);
+		Assert(vacrel->skipallfrozen);
+		return scanned_pages_skipallvis;
+	}
+	if (vacrel->skipallfrozen)
+		return scanned_pages_skipallfrozen;
+
+	return rel_pages;			/* DISABLE_PAGE_SKIPPING */
+}
+
+/*
+ *	lazy_scan_skip() -- get the next block to scan according to vmsnap.
+ *
+ * lazy_scan_heap() caller passes the next block in line.  We return the next
+ * block to scan.  Caller skips the blocks preceding returned block, if any.
+ *
+ * The all-visible status of the returned block is set in *all_visible, too.
+ * Block usually won't be all-visible (since it's unskippable), but it can be
+ * when next_block is rel's last page, or when DISABLE_PAGE_SKIPPING is used.
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, BlockNumber next_block, bool *all_visible)
+{
+	BlockNumber rel_pages = vacrel->rel_pages,
+				next_block_to_scan = next_block;
+
+	*all_visible = true;
+	while (next_block_to_scan < rel_pages)
+	{
+		uint8		mapbits = visibilitymap_snap_status(vacrel->vmsnap,
+														next_block_to_scan);
 
 		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
 		{
 			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
+			*all_visible = false;
 			break;
 		}
 
@@ -1313,58 +1371,26 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 		 * lock on rel to attempt a truncation that fails anyway, just because
 		 * there are tuples on the last page (it is likely that there will be
 		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
 		 */
-		if (next_unskippable_block == rel_pages - 1)
+		if (next_block_to_scan == rel_pages - 1)
 			break;
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
+		Assert(vacrel->skipallfrozen || !vacrel->skipallvis);
+		if (!vacrel->skipallfrozen)
 			break;
 
 		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+		 * Don't skip all-visible pages when lazy_scan_strategy determined
+		 * that it was more important for this VACUUM to advance relfrozenxid
 		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
+		if (!vacrel->skipallvis && (mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+			break;
 
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
+		next_block_to_scan++;
 	}
 
-	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
-	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
-	}
-
-	return next_unskippable_block;
+	return next_block_to_scan;
 }
 
 /*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..cfe3cf9b6 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,9 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap - get read-only snapshot of visibility map
+ *		visibilitymap_snap_release - release previously acquired snapshot
+ *		visibilitymap_snap_status - get status of bits from vm snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +55,9 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset (only tuples deleted before its
+ * OldestXmin cutoff are considered dead).
  *
  * LOCKING
  *
@@ -124,6 +130,22 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Snapshot of visibility map at a point in time.
+ *
+ * TODO This is currently always just a palloc()'d buffer -- give more thought
+ * to resource management (at a minimum add spilling to temp file).
+ */
+struct vmsnapshot
+{
+	/* Snapshot may contain zero or more visibility map pages */
+	BlockNumber nvmpages;
+
+	/* Copy of VM pages from the time that visibilitymap_snap() was called */
+	PGAlignedBlock vmpages[FLEXIBLE_ARRAY_MEMBER];
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
@@ -373,6 +395,148 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap - get read-only snapshot of visibility map
+ *
+ * Initializes caller's snapshot, allocating memory in caller's memory context.
+ * Caller can use visibilitymap_snap_status to get the status of individual
+ * heap pages at the point that we were called.
+ *
+ * Used by VACUUM to determine which pages it must scan up front.  This avoids
+ * useless scans of heap pages whose VM bit is concurrently unset.  VACUUM
+ * prefers to leave such pages to be scanned by the next VACUUM operation.
+ *
+ * rel_pages is the current size of the heap relation.
+ */
+vmsnapshot *
+visibilitymap_snap(Relation rel, BlockNumber rel_pages,
+				   BlockNumber *all_visible, BlockNumber *all_frozen)
+{
+	BlockNumber nvmpages,
+				mapBlockLast;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap %s %d", RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	*all_visible = 0;
+	*all_frozen = 0;
+	mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages);
+	nvmpages = mapBlockLast + 1;
+	vmsnap = palloc(offsetof(vmsnapshot, vmpages) +
+					sizeof(PGAlignedBlock) * nvmpages);
+
+	vmsnap->nvmpages = nvmpages;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Page		localvmpage;
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		localvmpage = vmsnap->vmpages->data + mapBlock * BLCKSZ;
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(localvmpage, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/*
+		 * Visibility map page copied to local buffer for caller's snapshot.
+		 * Caller requires an exact count of all-visible and all-frozen blocks
+		 * in the heap relation.  Handle that now.
+		 *
+		 * Must "truncate" our local copy of the VM to avoid incorrectly
+		 * counting heap pages >= rel_pages as all-visible/all-frozen.  Handle
+		 * this by clearing irrelevant bits on the last VM page copied.
+		 */
+		map = PageGetContents(localvmpage);
+		if (mapBlock == mapBlockLast)
+		{
+			/* byte and bit for first heap page not to be scanned by VACUUM */
+			uint32		truncByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			uint8		truncOffset = HEAPBLK_TO_OFFSET(rel_pages);
+
+			if (truncByte != 0 || truncOffset != 0)
+			{
+				/* Clear any bits set for heap pages >= rel_pages */
+				MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
+				map[truncByte] &= (1 << truncOffset) - 1;
+			}
+
+			/* Now it's safe to tally bits from this final VM page below */
+		}
+
+		/* Tally the all-visible and all-frozen counts from this page */
+		umap = (uint64 *) map;
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			*all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			*all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+	}
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Just frees the memory allocated by visibilitymap_snap for now (presumably
+ * this will need to release temp files in later revisions of the patch)
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	pfree(vmsnap);
+}
+
+/*
+ *	visibilitymap_snap_status - get status of bits from vm snapshot
+ *
+ * Are all tuples on heapBlk visible to all or are marked frozen, according
+ * to caller's snapshot of the visibility map?
+ */
+uint8
+visibilitymap_snap_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+	char	   *map;
+	uint8		result;
+	Page		localvmpage;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_status %d", heapBlk);
+#endif
+
+	/* If we didn't copy the VM page, assume heapBlk not all-visible */
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	localvmpage = ((Page) vmsnap->vmpages) + mapBlock * BLCKSZ;
+	map = PageGetContents(localvmpage);
+
+	result = ((map[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+	return result;
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c582021d2..78e35abb9 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -154,13 +154,8 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>DISABLE_PAGE_SKIPPING</literal></term>
     <listitem>
      <para>
-      Normally, <command>VACUUM</command> will skip pages based on the <link
-      linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      Normally, <command>VACUUM</command> will skip pages based on the
+      <link linkend="vacuum-for-visibility-map">visibility map</link>.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
-- 
2.38.1

Attachment: v8-0001-Refactor-how-VACUUM-passes-around-its-XID-cutoffs.patch (application/x-patch)
From 9e36bb144fa2bc050f1ad2eb203fe089f6603255 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 19 Nov 2022 16:37:53 -0800
Subject: [PATCH v8 1/6] Refactor how VACUUM passes around its XID cutoffs.

Use a dedicated struct for the XID/MXID cutoffs used by VACUUM, such as
FreezeLimit and OldestXmin.  This state is initialized in vacuum.c, and
then passed around (via const pointers) by code from vacuumlazy.c to
external freezing related routines like heap_prepare_freeze_tuple.
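
For illustration, the call site change amounts to the following
(argument names abbreviated here; see the header diff below for the
exact prototypes):

    /* before: four separate XID/MXID cutoff arguments */
    heap_prepare_freeze_tuple(tuple.t_data,
                              relfrozenxid, relminmxid,
                              FreezeLimit, MultiXactCutoff,
                              &frz, &totally_frozen,
                              &NewRelfrozenXid, &NewRelminMxid);

    /* after: every cutoff travels together in one immutable struct */
    heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
                              &frz, &totally_frozen,
                              &NewRelfrozenXid, &NewRelminMxid);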

Also simplify some of the logic for dealing with frozen xmin in
heap_prepare_freeze_tuple: add dedicated "xmin_already_frozen" state to
clearly distinguish xmin XIDs that we're going to freeze from those that
were already frozen from before.  This makes its xmin handling code
symmetrical with its xmax handling code.  This is preparation for an
upcoming commit that adds page level freezing.

Also refactor the control flow within FreezeMultiXactId(), while adding
stricter sanity checks.  We now test OldestXmin directly (instead than
using FreezeLimit as an inexact proxy for OldestXmin).  Also promote an
assertion to detect multiple updater XIDs within a single multi into a
new "can't happen" error.  This is also in preparation for the page
level freezing commit, which will need to cede control of page level
freezing to FreezeMultiXactId() with pages that have MultiXactIds that
might need to be frozen.  This helps to preserve the historic eager
freezing behavior used when processing MultiXactIds, while still doing
lazy processing for MultiXactIds where eager processing happens to be
expensive.
---
 src/include/access/heapam.h          |  18 +-
 src/include/commands/vacuum.h        |  44 ++-
 src/backend/access/heap/heapam.c     | 497 +++++++++++++--------------
 src/backend/access/heap/vacuumlazy.c | 125 +++----
 src/backend/commands/cluster.c       |  25 +-
 src/backend/commands/vacuum.c        |  78 ++---
 6 files changed, 395 insertions(+), 392 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 810baaf9d..abc3a1f34 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -38,6 +38,7 @@
 
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
+struct VacuumCutoffs;
 
 #define MaxLockTupleMode	LockTupleExclusive
 
@@ -178,21 +179,20 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
-									  TransactionId relfrozenxid, TransactionId relminmxid,
-									  TransactionId cutoff_xid, TransactionId cutoff_multi,
+									  const struct VacuumCutoffs *cutoffs,
 									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  TransactionId *NewRelFrozenXid,
+									  MultiXactId *NewRelminMxid);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 										 TransactionId FreezeLimit,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
-							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
-									TransactionId *relfrozenxid_out,
-									MultiXactId *relminmxid_out);
+							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
+extern bool heap_tuple_should_freeze(HeapTupleHeader tuple,
+									 const struct VacuumCutoffs *cutoffs,
+									 TransactionId *NewRelfrozenXid,
+									 MultiXactId *NewRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
 
 extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index b63751c46..43ee24b12 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,45 @@ typedef struct VacuumParams
 	int			nworkers;
 } VacuumParams;
 
+/*
+ * VacuumCutoffs is immutable state that describes the cutoffs used by VACUUM.
+ * Established at the beginning of each VACUUM operation.
+ */
+struct VacuumCutoffs
+{
+	/*
+	 * Existing pg_class fields at start of VACUUM (used for sanity checks)
+	 */
+	TransactionId relfrozenxid;
+	MultiXactId relminmxid;
+
+	/*
+	 * OldestXmin is the Xid below which tuples deleted by any xact (that
+	 * committed) should be considered DEAD, not just RECENTLY_DEAD.
+	 *
+	 * OldestMxact is the Mxid below which MultiXacts are definitely not seen
+	 * as visible by any running transaction.
+	 *
+	 * OldestXmin and OldestMxact are also the most recent values that can
+	 * ever be passed to vac_update_relstats() as frozenxid and minmulti
+	 * arguments at the end of VACUUM.  These same values should be passed
+	 * when it turns out that VACUUM will leave no unfrozen XIDs/MXIDs behind
+	 * in the table.
+	 */
+	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
+
+	/*
+	 * FreezeLimit is the Xid below which all Xids are definitely replaced by
+	 * FrozenTransactionId in heap pages that VACUUM can cleanup lock.
+	 *
+	 * MultiXactCutoff is the value below which all MultiXactIds are
+	 * definitely removed from Xmax in heap pages VACUUM can cleanup lock.
+	 */
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
+};
+
 /*
  * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
  */
@@ -287,10 +326,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *minmulti_updated,
 								bool in_outer_xact);
 extern bool vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-								  TransactionId *OldestXmin,
-								  MultiXactId *OldestMxact,
-								  TransactionId *FreezeLimit,
-								  MultiXactId *MultiXactCutoff);
+								  struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
 									  MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 747db5037..74b3a459e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -52,6 +52,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -6125,12 +6126,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
-				  TransactionId relfrozenxid, TransactionId relminmxid,
-				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
-				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
+				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
+				  TransactionId *mxid_oldest_xid_out)
 {
-	TransactionId xid = InvalidTransactionId;
-	int			i;
+	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
 	int			nmembers;
 	bool		need_replace;
@@ -6153,12 +6152,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		*flags |= FRM_INVALIDATE_XMAX;
 		return InvalidTransactionId;
 	}
-	else if (MultiXactIdPrecedes(multi, relminmxid))
+	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
-								 multi, relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoff_multi))
+								 multi, cutoffs->relminmxid)));
+	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6171,39 +6170,39 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoff_multi)));
+									 multi, cutoffs->MultiXactCutoff)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
 			*flags |= FRM_INVALIDATE_XMAX;
-			xid = InvalidTransactionId;
+			newxmax = InvalidTransactionId;
 		}
 		else
 		{
-			/* replace multi by update xid */
-			xid = MultiXactIdGetUpdateXid(multi, t_infomask);
+			/* replace multi with single XID for its updater */
+			newxmax = MultiXactIdGetUpdateXid(multi, t_infomask);
 
 			/* wasn't only a lock, xid needs to be valid */
-			Assert(TransactionIdIsValid(xid));
+			Assert(TransactionIdIsValid(newxmax));
 
-			if (TransactionIdPrecedes(xid, relfrozenxid))
+			if (TransactionIdPrecedes(newxmax, cutoffs->relfrozenxid))
 				ereport(ERROR,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
-										 xid, relfrozenxid)));
+										 newxmax, cutoffs->relfrozenxid)));
 
 			/*
-			 * If the xid is older than the cutoff, it has to have aborted,
-			 * otherwise the tuple would have gotten pruned away.
+			 * If the new xmax xid is older than OldestXmin, it has to have
+			 * aborted, otherwise the tuple would have been pruned away
 			 */
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(newxmax, cutoffs->OldestXmin))
 			{
-				if (TransactionIdDidCommit(xid))
+				if (TransactionIdDidCommit(newxmax))
 					ereport(ERROR,
 							(errcode(ERRCODE_DATA_CORRUPTED),
-							 errmsg_internal("cannot freeze committed update xid %u", xid)));
+							 errmsg_internal("cannot freeze committed update xid %u", newxmax)));
 				*flags |= FRM_INVALIDATE_XMAX;
-				xid = InvalidTransactionId;
+				newxmax = InvalidTransactionId;
 			}
 			else
 			{
@@ -6215,17 +6214,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
 		 * when no Xids will remain
 		 */
-		return xid;
+		return newxmax;
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below FreezeLimit xid cutoff, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
 	 */
-
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6236,12 +6232,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
 	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
-	for (i = 0; i < nmembers; i++)
+	for (int i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		TransactionId xid = members[i].xid;
+
+		Assert(!TransactionIdPrecedes(xid, cutoffs->relfrozenxid));
+
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
 			need_replace = true;
 			break;
@@ -6251,7 +6250,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
+	 * In the simplest case, there is no member older than FreezeLimit; we can
 	 * keep the existing MultiXactId as-is, avoiding a more expensive second
 	 * pass over the multi
 	 */
@@ -6279,110 +6278,98 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	update_committed = false;
 	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
 
-	for (i = 0; i < nmembers; i++)
+	/*
+	 * Determine whether to keep each member txid, or to ignore it instead
+	 */
+	for (int i = 0; i < nmembers; i++)
 	{
-		/*
-		 * Determine whether to keep this member or ignore it.
-		 */
-		if (ISUPDATE_from_mxstatus(members[i].status))
+		TransactionId xid = members[i].xid;
+		MultiXactStatus mstatus = members[i].status;
+
+		Assert(TransactionIdIsValid(xid));
+		Assert(!TransactionIdPrecedes(xid, cutoffs->relfrozenxid));
+
+		if (!ISUPDATE_from_mxstatus(mstatus))
 		{
-			TransactionId txid = members[i].xid;
-
-			Assert(TransactionIdIsValid(txid));
-			if (TransactionIdPrecedes(txid, relfrozenxid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("found update xid %u from before relfrozenxid %u",
-										 txid, relfrozenxid)));
-
 			/*
-			 * It's an update; should we keep it?  If the transaction is known
-			 * aborted or crashed then it's okay to ignore it, otherwise not.
-			 * Note that an updater older than cutoff_xid cannot possibly be
-			 * committed, because HeapTupleSatisfiesVacuum would have returned
-			 * HEAPTUPLE_DEAD and we would not be trying to freeze the tuple.
-			 *
-			 * As with all tuple visibility routines, it's critical to test
-			 * TransactionIdIsInProgress before TransactionIdDidCommit,
-			 * because of race conditions explained in detail in
-			 * heapam_visibility.c.
+			 * Locker XID (not updater XID).  We only keep lockers that are
+			 * still running.
 			 */
-			if (TransactionIdIsCurrentTransactionId(txid) ||
-				TransactionIdIsInProgress(txid))
-			{
-				Assert(!TransactionIdIsValid(update_xid));
-				update_xid = txid;
-			}
-			else if (TransactionIdDidCommit(txid))
-			{
-				/*
-				 * The transaction committed, so we can tell caller to set
-				 * HEAP_XMAX_COMMITTED.  (We can only do this because we know
-				 * the transaction is not running.)
-				 */
-				Assert(!TransactionIdIsValid(update_xid));
-				update_committed = true;
-				update_xid = txid;
-			}
-			else
-			{
-				/*
-				 * Not in progress, not committed -- must be aborted or
-				 * crashed; we can ignore it.
-				 */
-			}
-
-			/*
-			 * Since the tuple wasn't totally removed when vacuum pruned, the
-			 * update Xid cannot possibly be older than the xid cutoff. The
-			 * presence of such a tuple would cause corruption, so be paranoid
-			 * and check.
-			 */
-			if (TransactionIdIsValid(update_xid) &&
-				TransactionIdPrecedes(update_xid, cutoff_xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("found update xid %u from before xid cutoff %u",
-										 update_xid, cutoff_xid)));
-
-			/*
-			 * We determined that this is an Xid corresponding to an update
-			 * that must be retained -- add it to new members list for later.
-			 *
-			 * Also consider pushing back temp_xid_out, which is needed when
-			 * we later conclude that a new multi is required (i.e. when we go
-			 * on to set FRM_RETURN_IS_MULTI for our caller because we also
-			 * need to retain a locker that's still running).
-			 */
-			if (TransactionIdIsValid(update_xid))
+			if (TransactionIdIsCurrentTransactionId(xid) ||
+				TransactionIdIsInProgress(xid))
 			{
 				newmembers[nnewmembers++] = members[i];
-				if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-					temp_xid_out = members[i].xid;
+				has_lockers = true;
+
+				/*
+				 * Cannot possibly be older than VACUUM's OldestXmin, so we
+				 * don't need a NewRelfrozenXid step here
+				 */
+				Assert(TransactionIdPrecedesOrEquals(cutoffs->OldestXmin, xid));
 			}
+
+			continue;
+		}
+
+		/*
+		 * Updater XID (not locker XID).  Should we keep it?
+		 *
+		 * Since the tuple wasn't totally removed when vacuum pruned, the
+		 * update Xid cannot possibly be older than OldestXmin cutoff. The
+		 * presence of such a tuple would cause corruption, so be paranoid and
+		 * check.
+		 */
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("found update xid %u from before removable cutoff %u",
+									 xid, cutoffs->OldestXmin)));
+		if (TransactionIdIsValid(update_xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("multixact %u has two or more updating members",
+									 multi),
+					 errdetail_internal("First updater XID=%u second updater XID=%u.",
+										update_xid, xid)));
+
+		/*
+		 * If the transaction is known aborted or crashed then it's okay to
+		 * ignore it, otherwise not.
+		 *
+		 * As with all tuple visibility routines, it's critical to test
+		 * TransactionIdIsInProgress before TransactionIdDidCommit, because of
+		 * race conditions explained in detail in heapam_visibility.c.
+		 */
+		if (TransactionIdIsCurrentTransactionId(xid) ||
+			TransactionIdIsInProgress(xid))
+			update_xid = xid;
+		else if (TransactionIdDidCommit(xid))
+		{
+			/*
+			 * The transaction committed, so we can tell caller to set
+			 * HEAP_XMAX_COMMITTED.  (We can only do this because we know the
+			 * transaction is not running.)
+			 */
+			update_committed = true;
+			update_xid = xid;
 		}
 		else
 		{
-			/* We only keep lockers if they are still running */
-			if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
-				TransactionIdIsInProgress(members[i].xid))
-			{
-				/*
-				 * Running locker cannot possibly be older than the cutoff.
-				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
-				 * initial value used for top-level relfrozenxid_out tracking
-				 * state.  A running locker cannot be older than VACUUM's
-				 * OldestXmin, either, so we don't need a temp_xid_out step.
-				 */
-				Assert(TransactionIdIsNormal(members[i].xid));
-				Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
-				Assert(!TransactionIdPrecedes(members[i].xid,
-											  *mxid_oldest_xid_out));
-				newmembers[nnewmembers++] = members[i];
-				has_lockers = true;
-			}
+			/*
+			 * Not in progress, not committed -- must be aborted or crashed;
+			 * we can ignore it.
+			 */
+			continue;
 		}
+
+		/*
+		 * We determined that this is an Xid corresponding to an update that
+		 * must be retained -- add it to new members list for later.  Also
+		 * consider pushing back mxid_oldest_xid_out.
+		 */
+		newmembers[nnewmembers++] = members[i];
+		if (TransactionIdPrecedes(xid, temp_xid_out))
+			temp_xid_out = xid;
 	}
 
 	pfree(members);
@@ -6395,7 +6382,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* nothing worth keeping!? Tell caller to remove the whole thing */
 		*flags |= FRM_INVALIDATE_XMAX;
-		xid = InvalidTransactionId;
+		newxmax = InvalidTransactionId;
 		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
@@ -6411,7 +6398,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		*flags |= FRM_RETURN_IS_XID;
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
-		xid = update_xid;
+		newxmax = update_xid;
 		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
 	}
 	else
@@ -6421,14 +6408,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * one, to set as new Xmax in the tuple.  The oldest surviving member
 		 * might push back mxid_oldest_xid_out.
 		 */
-		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
 		*mxid_oldest_xid_out = temp_xid_out;
 	}
 
 	pfree(newmembers);
 
-	return xid;
+	return newxmax;
 }
 
 /*
@@ -6450,19 +6437,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
+ * The *NewRelFrozenXid and *NewRelminMxid arguments are the current target
  * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
  * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
  * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
  * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
+ * Each call here pushes back *NewRelFrozenXid and/or *NewRelminMxid as needed
+ * to avoid unsafe final values in rel's authoritative pg_class tuple.
  *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
@@ -6471,16 +6452,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  */
 bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
-						  TransactionId relfrozenxid, TransactionId relminmxid,
-						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  const struct VacuumCutoffs *cutoffs,
 						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  TransactionId *NewRelFrozenXid,
+						  MultiXactId *NewRelminMxid)
 {
-	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		frzplan_set = false;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin,
+				freeze_xmax;
 	TransactionId xid;
 
 	frz->frzflags = 0;
@@ -6489,54 +6470,51 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen when our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+	{
+		freeze_xmin = false;
+		xmin_already_frozen = true;
+		/* No need for NewRelfrozenXid handling for already-frozen xmin */
+	}
 	else
 	{
-		if (TransactionIdPrecedes(xid, relfrozenxid))
+		if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
-									 xid, relfrozenxid)));
+									 xid, cutoffs->relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoff_xid)));
+										 xid, cutoffs->FreezeLimit)));
 
 			frz->t_infomask |= HEAP_XMIN_FROZEN;
-			changed = true;
+			frzplan_set = true;
 		}
 		else
 		{
-			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			/* xmin to remain unfrozen.  Could push back NewRelfrozenXid. */
+			if (TransactionIdPrecedes(xid, *NewRelFrozenXid))
+				*NewRelFrozenXid = xid;
 		}
 	}
 
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given FreezeLimit.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
-	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 
@@ -6545,11 +6523,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		/* Raw xmax is a MultiXactId */
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
+		TransactionId mxid_oldest_xid_out = *NewRelFrozenXid;
 
-		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
-									relfrozenxid, relminmxid,
-									cutoff_xid, cutoff_multi,
+		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
 									&flags, &mxid_oldest_xid_out);
 
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6559,13 +6535,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax will become an updater Xid (original MultiXact's updater
 			 * member Xid will be carried forward as a simple Xid in Xmax).
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
+			 * Might have to ratchet back NewRelfrozenXid here, though never
+			 * NewRelminMxid.
 			 */
 			Assert(!freeze_xmax);
 			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
+			if (TransactionIdPrecedes(newxmax, *NewRelFrozenXid))
+				*NewRelFrozenXid = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6578,7 +6554,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			frz->xmax = newxmax;
 			if (flags & FRM_MARK_COMMITTED)
 				frz->t_infomask |= HEAP_XMAX_COMMITTED;
-			changed = true;
+			frzplan_set = true;
 		}
 		else if (flags & FRM_RETURN_IS_MULTI)
 		{
@@ -6588,15 +6564,15 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax is an old MultiXactId that we have to replace with a new
 			 * MultiXactId, to carry forward two or more original member XIDs.
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
+			 * Might have to ratchet back NewRelfrozenXid here, though never
+			 * NewRelminMxid.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, *NewRelminMxid));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 *NewRelFrozenXid));
+			*NewRelFrozenXid = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6612,28 +6588,28 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 			frz->xmax = newxmax;
 
-			changed = true;
+			frzplan_set = true;
 		}
 		else if (flags & FRM_NOOP)
 		{
 			/*
 			 * xmax is a MultiXactId, and nothing about it changes for now.
-			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+			 * Might have to ratchet back NewRelminMxid, NewRelfrozenXid, or
 			 * both together.
 			 */
 			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
+												 *NewRelFrozenXid));
+			if (MultiXactIdPrecedes(xid, *NewRelminMxid))
+				*NewRelminMxid = xid;
+			*NewRelFrozenXid = mxid_oldest_xid_out;
 		}
 		else
 		{
 			/*
 			 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
-			 * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+			 * Won't have to ratchet back NewRelminMxid or NewRelfrozenXid.
 			 */
 			Assert(freeze_xmax);
 			Assert(!TransactionIdIsValid(newxmax));
@@ -6642,13 +6618,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	else if (TransactionIdIsNormal(xid))
 	{
 		/* Raw xmax is normal XID */
-		if (TransactionIdPrecedes(xid, relfrozenxid))
+		if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
-									 xid, relfrozenxid)));
+									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
 			/*
 			 * If we freeze xmax, make absolutely sure that it's not an XID
@@ -6663,13 +6639,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						 errmsg_internal("cannot freeze committed xmax %u",
 										 xid)));
 			freeze_xmax = true;
-			/* No need for relfrozenxid_out handling, since we'll freeze xmax */
+			/* No need for NewRelfrozenXid handling, since we'll freeze xmax */
 		}
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, *NewRelFrozenXid))
+				*NewRelFrozenXid = xid;
 		}
 	}
 	else if (!TransactionIdIsValid(xid))
@@ -6678,14 +6654,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		Assert((tuple->t_infomask & HEAP_XMAX_IS_MULTI) == 0);
 		freeze_xmax = false;
 		xmax_already_frozen = true;
-		/* No need for relfrozenxid_out handling for already-frozen xmax */
+		/* No need for NewRelfrozenXid handling for already-frozen xmax */
 	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
+				 errmsg_internal("found raw xmax %u (infomask 0x%04x) not invalid and not multi",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+		Assert(frzplan_set);
+	}
 	if (freeze_xmax)
 	{
 		Assert(!xmax_already_frozen);
@@ -6701,7 +6682,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		frz->t_infomask |= HEAP_XMAX_INVALID;
 		frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
 		frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
-		changed = true;
+		frzplan_set = true;
 	}
 
 	/*
@@ -6713,17 +6694,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		xid = HeapTupleHeaderGetXvac(tuple);
 
 		/*
-		 * For Xvac, we ignore the cutoff_xid and just always perform the
-		 * freeze operation.  The oldest release in which such a value can
-		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
+		 * For Xvac, we always freeze proactively.  This allows totally_frozen
+		 * tracking to ignore xvac.
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
+			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+			Assert(TransactionIdPrecedes(xid, cutoffs->OldestXmin));
+
 			/*
 			 * If a MOVED_OFF tuple is not dead, the xvac transaction must
 			 * have failed; whereas a non-dead MOVED_IN tuple must mean the
@@ -6734,19 +6712,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			else
 				frz->frzflags |= XLH_FREEZE_XVAC;
 
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
+			/* Set XMIN_COMMITTED defensively */
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
-			changed = true;
+			frzplan_set = true;
 		}
 	}
 
-	*totally_frozen = (xmin_frozen &&
+	/*
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen
+	 */
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
-	return changed;
+
+	return frzplan_set;
 }
 
 /*
@@ -6865,19 +6845,25 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 bool
 heap_freeze_tuple(HeapTupleHeader tuple,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
-				  TransactionId cutoff_xid, TransactionId cutoff_multi)
+				  TransactionId FreezeLimit, TransactionId MultiXactCutoff)
 {
 	HeapTupleFreeze frz;
 	bool		do_freeze;
-	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	bool		totally_frozen;
+	struct VacuumCutoffs cutoffs;
+	TransactionId NewRelfrozenXid = FreezeLimit;
+	MultiXactId NewRelminMxid = MultiXactCutoff;
 
-	do_freeze = heap_prepare_freeze_tuple(tuple,
-										  relfrozenxid, relminmxid,
-										  cutoff_xid, cutoff_multi,
-										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+	cutoffs.relfrozenxid = relfrozenxid;
+	cutoffs.relminmxid = relminmxid;
+	cutoffs.OldestXmin = FreezeLimit;
+	cutoffs.OldestMxact = MultiXactCutoff;
+	cutoffs.FreezeLimit = FreezeLimit;
+	cutoffs.MultiXactCutoff = MultiXactCutoff;
+
+	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
+										  &frz, &totally_frozen,
+										  &NewRelfrozenXid, &NewRelminMxid);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7300,35 +7286,41 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
 }
 
 /*
- * heap_tuple_would_freeze
+ * heap_tuple_should_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
  * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
  * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
  * could be processed by pruning away the whole tuple instead of freezing.
  *
- * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
+ * The *NewRelfrozenXid and *NewRelminMxid input/output arguments work just
  * like the heap_prepare_freeze_tuple arguments that they're based on.  We
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
-						TransactionId *relfrozenxid_out,
-						MultiXactId *relminmxid_out)
+heap_tuple_should_freeze(HeapTupleHeader tuple,
+						 const struct VacuumCutoffs *cutoffs,
+						 TransactionId *NewRelfrozenXid,
+						 MultiXactId *NewRelminMxid)
 {
+	TransactionId MustFreezeLimit;
+	MultiXactId MustFreezeMultiLimit;
 	TransactionId xid;
 	MultiXactId multi;
-	bool		would_freeze = false;
+	bool		freeze = false;
+
+	MustFreezeLimit = cutoffs->FreezeLimit;
+	MustFreezeMultiLimit = cutoffs->MultiXactCutoff;
 
 	/* First deal with xmin */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (TransactionIdIsNormal(xid))
 	{
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
-			would_freeze = true;
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+		if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+			*NewRelfrozenXid = xid;
+		if (TransactionIdPrecedes(xid, MustFreezeLimit))
+			freeze = true;
 	}
 
 	/* Now deal with xmax */
@@ -7341,11 +7333,12 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 	if (TransactionIdIsNormal(xid))
 	{
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 		/* xmax is a non-permanent XID */
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
-			would_freeze = true;
+		if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+			*NewRelfrozenXid = xid;
+		if (TransactionIdPrecedes(xid, MustFreezeLimit))
+			freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
 	{
@@ -7354,10 +7347,10 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 	else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
 	{
 		/* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NewRelminMxid))
+			*NewRelminMxid = multi;
 		/* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
-		would_freeze = true;
+		freeze = true;
 	}
 	else
 	{
@@ -7365,10 +7358,11 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		MultiXactMember *members;
 		int			nmembers;
 
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
-			would_freeze = true;
+		Assert(MultiXactIdPrecedesOrEquals(cutoffs->relminmxid, multi));
+		if (MultiXactIdPrecedes(multi, *NewRelminMxid))
+			*NewRelminMxid = multi;
+		if (MultiXactIdPrecedes(multi, MustFreezeMultiLimit))
+			freeze = true;
 
 		/* need to check whether any member of the mxact is old */
 		nmembers = GetMultiXactIdMembers(multi, &members, false,
@@ -7377,11 +7371,11 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		for (int i = 0; i < nmembers; i++)
 		{
 			xid = members[i].xid;
-			Assert(TransactionIdIsNormal(xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
-				would_freeze = true;
+			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+			if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+				*NewRelfrozenXid = xid;
+			if (TransactionIdPrecedes(xid, MustFreezeLimit))
+				freeze = true;
 		}
 		if (nmembers > 0)
 			pfree(members);
@@ -7392,14 +7386,15 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		xid = HeapTupleHeaderGetXvac(tuple);
 		if (TransactionIdIsNormal(xid))
 		{
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
-			would_freeze = true;
+			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+			if (TransactionIdPrecedes(xid, *NewRelfrozenXid))
+				*NewRelfrozenXid = xid;
+			/* heap_prepare_freeze_tuple forces xvac freezing */
+			freeze = true;
 		}
 	}
 
-	return would_freeze;
+	return freeze;
 }
 
 /*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7e..b3668e57b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,6 +144,10 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
+	/* Buffer access strategy and parallel vacuum state */
+	BufferAccessStrategy bstrategy;
+	ParallelVacuumState *pvs;
+
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -158,21 +162,9 @@ typedef struct LVRelState
 	bool		do_index_cleanup;
 	bool		do_rel_truncate;
 
-	/* Buffer access strategy and parallel vacuum state */
-	BufferAccessStrategy bstrategy;
-	ParallelVacuumState *pvs;
-
-	/* rel's initial relfrozenxid and relminmxid */
-	TransactionId relfrozenxid;
-	MultiXactId relminmxid;
-	double		old_live_tuples;	/* previous value of pg_class.reltuples */
-
 	/* VACUUM operation's cutoffs for freezing and pruning */
-	TransactionId OldestXmin;
+	struct VacuumCutoffs cutoffs;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
-	TransactionId FreezeLimit;
-	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
@@ -318,10 +310,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
-	TransactionId OldestXmin,
-				FreezeLimit;
-	MultiXactId OldestMxact,
-				MultiXactCutoff;
+	struct VacuumCutoffs cutoffs;
 	BlockNumber orig_rel_pages,
 				new_rel_pages,
 				new_rel_allvisible;
@@ -354,14 +343,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 								  RelationGetRelid(rel));
 
 	/*
-	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
-	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
-	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
+	 * Get cutoffs that determine which deleted tuples are considered DEAD,
+	 * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze
 	 */
-	aggressive = vacuum_set_xid_limits(rel, params, &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
+	aggressive = vacuum_set_xid_limits(rel, params, &cutoffs);
 
 	skipwithvm = true;
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -415,6 +400,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->rel = rel;
 	vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
 					 &vacrel->indrels);
+	vacrel->bstrategy = bstrategy;
 	if (instrument && vacrel->nindexes > 0)
 	{
 		/* Copy index names used by instrumentation (not error reporting) */
@@ -459,11 +445,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		Assert(params->index_cleanup == VACOPTVALUE_AUTO);
 	}
 
-	vacrel->bstrategy = bstrategy;
-	vacrel->relfrozenxid = rel->rd_rel->relfrozenxid;
-	vacrel->relminmxid = rel->rd_rel->relminmxid;
-	vacrel->old_live_tuples = rel->rd_rel->reltuples;
-
 	/* Initialize page counters explicitly (be tidy) */
 	vacrel->scanned_pages = 0;
 	vacrel->removed_pages = 0;
@@ -505,15 +486,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * frozen) during its scan.
 	 */
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
-	vacrel->OldestXmin = OldestXmin;
+	vacrel->cutoffs = cutoffs;
 	vacrel->vistest = GlobalVisTestFor(rel);
-	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
-	vacrel->FreezeLimit = FreezeLimit;
-	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
-	vacrel->MultiXactCutoff = MultiXactCutoff;
 	/* Initialize state used to track oldest extant XID/MXID */
-	vacrel->NewRelfrozenXid = OldestXmin;
-	vacrel->NewRelminMxid = OldestMxact;
+	vacrel->NewRelfrozenXid = cutoffs.OldestXmin;
+	vacrel->NewRelminMxid = cutoffs.OldestMxact;
 	vacrel->skippedallvis = false;
 
 	/*
@@ -569,13 +546,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
 	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
 	 */
-	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
-										 vacrel->relfrozenxid,
+	Assert(vacrel->NewRelfrozenXid == cutoffs.OldestXmin ||
+		   TransactionIdPrecedesOrEquals(aggressive ? cutoffs.FreezeLimit :
+										 vacrel->cutoffs.relfrozenxid,
 										 vacrel->NewRelfrozenXid));
-	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
-									   vacrel->relminmxid,
+	Assert(vacrel->NewRelminMxid == cutoffs.OldestMxact ||
+		   MultiXactIdPrecedesOrEquals(aggressive ? cutoffs.MultiXactCutoff :
+									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
 	if (vacrel->skippedallvis)
 	{
@@ -702,20 +679,22 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 								 _("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
 								 (long long) vacrel->missed_dead_tuples,
 								 vacrel->missed_dead_pages);
-			diff = (int32) (ReadNextTransactionId() - OldestXmin);
+			diff = (int32) (ReadNextTransactionId() - cutoffs.OldestXmin);
 			appendStringInfo(&buf,
 							 _("removable cutoff: %u, which was %d XIDs old when operation ended\n"),
-							 OldestXmin, diff);
+							 cutoffs.OldestXmin, diff);
 			if (frozenxid_updated)
 			{
-				diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+				diff = (int32) (vacrel->NewRelfrozenXid -
+								vacrel->cutoffs.relfrozenxid);
 				appendStringInfo(&buf,
 								 _("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
 								 vacrel->NewRelfrozenXid, diff);
 			}
 			if (minmulti_updated)
 			{
-				diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+				diff = (int32) (vacrel->NewRelminMxid -
+								vacrel->cutoffs.relminmxid);
 				appendStringInfo(&buf,
 								 _("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
 								 vacrel->NewRelminMxid, diff);
@@ -1610,7 +1589,7 @@ retry:
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
-		bool		tuple_totally_frozen;
+		bool		totally_frozen;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1666,7 +1645,8 @@ retry:
 		 * since heap_page_prune() looked.  Handle that here by restarting.
 		 * (See comments at the top of function for a full explanation.)
 		 */
-		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+									   buf);
 
 		if (unlikely(res == HEAPTUPLE_DEAD))
 			goto retry;
@@ -1723,7 +1703,8 @@ retry:
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					if (!TransactionIdPrecedes(xmin,
+											   vacrel->cutoffs.OldestXmin))
 					{
 						prunestate->all_visible = false;
 						break;
@@ -1774,13 +1755,8 @@ retry:
 		prunestate->hastup = true;	/* page makes rel truncation unsafe */
 
 		/* Tuple with storage -- consider need to freeze */
-		if (heap_prepare_freeze_tuple(tuple.t_data,
-									  vacrel->relfrozenxid,
-									  vacrel->relminmxid,
-									  vacrel->FreezeLimit,
-									  vacrel->MultiXactCutoff,
-									  &frozen[tuples_frozen],
-									  &tuple_totally_frozen,
+		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
+									  &frozen[tuples_frozen], &totally_frozen,
 									  &NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Save prepared freeze plan for later */
@@ -1791,7 +1767,7 @@ retry:
 		 * If tuple is not frozen (and not about to become frozen) then caller
 		 * had better not go on to set this page's VM bit
 		 */
-		if (!tuple_totally_frozen)
+		if (!totally_frozen)
 			prunestate->all_frozen = false;
 	}
 
@@ -1817,7 +1793,8 @@ retry:
 		vacrel->frozen_pages++;
 
 		/* Execute all freeze plans for page as a single atomic action */
-		heap_freeze_execute_prepared(vacrel->rel, buf, vacrel->FreezeLimit,
+		heap_freeze_execute_prepared(vacrel->rel, buf,
+									 vacrel->cutoffs.FreezeLimit,
 									 frozen, tuples_frozen);
 	}
 
@@ -1972,10 +1949,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_would_freeze(tupleheader,
-									vacrel->FreezeLimit,
-									vacrel->MultiXactCutoff,
-									&NewRelfrozenXid, &NewRelminMxid))
+		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
+									 &NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
 			if (vacrel->aggressive)
@@ -2010,7 +1985,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+										 buf))
 		{
 			case HEAPTUPLE_DELETE_IN_PROGRESS:
 			case HEAPTUPLE_LIVE:
@@ -2274,6 +2250,7 @@ static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
 	bool		allindexes = true;
+	double		old_live_tuples = vacrel->rel->rd_rel->reltuples;
 
 	Assert(vacrel->nindexes > 0);
 	Assert(vacrel->do_index_vacuuming);
@@ -2297,9 +2274,9 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			Relation	indrel = vacrel->indrels[idx];
 			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-			vacrel->indstats[idx] =
-				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
-									  vacrel);
+			vacrel->indstats[idx] = lazy_vacuum_one_index(indrel, istat,
+														  old_live_tuples,
+														  vacrel);
 
 			if (lazy_check_wraparound_failsafe(vacrel))
 			{
@@ -2312,7 +2289,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	else
 	{
 		/* Outsource everything to parallel variant */
-		parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, vacrel->old_live_tuples,
+		parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, old_live_tuples,
 											vacrel->num_index_scans);
 
 		/*
@@ -2581,15 +2558,15 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 static bool
 lazy_check_wraparound_failsafe(LVRelState *vacrel)
 {
-	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
-	Assert(MultiXactIdIsValid(vacrel->relminmxid));
+	Assert(TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid));
+	Assert(MultiXactIdIsValid(vacrel->cutoffs.relminmxid));
 
 	/* Don't warn more than once per VACUUM */
 	if (vacrel->failsafe_active)
 		return true;
 
-	if (unlikely(vacuum_xid_failsafe_check(vacrel->relfrozenxid,
-										   vacrel->relminmxid)))
+	if (unlikely(vacuum_xid_failsafe_check(vacrel->cutoffs.relfrozenxid,
+										   vacrel->cutoffs.relminmxid)))
 	{
 		vacrel->failsafe_active = true;
 
@@ -3246,7 +3223,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+										 buf))
 		{
 			case HEAPTUPLE_LIVE:
 				{
@@ -3265,7 +3243,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					if (!TransactionIdPrecedes(xmin,
+											   vacrel->cutoffs.OldestXmin))
 					{
 						all_visible = false;
 						*all_frozen = false;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 07e091bb8..6cfea04a9 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -824,10 +824,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	VacuumParams params;
-	TransactionId OldestXmin,
-				FreezeXid;
-	MultiXactId OldestMxact,
-				MultiXactCutoff;
+	struct VacuumCutoffs cutoffs;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -916,23 +913,24 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_set_xid_limits(OldHeap, &params, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+	vacuum_set_xid_limits(OldHeap, &params, &cutoffs);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
 	 * backwards, so take the max.
 	 */
 	if (TransactionIdIsValid(OldHeap->rd_rel->relfrozenxid) &&
-		TransactionIdPrecedes(FreezeXid, OldHeap->rd_rel->relfrozenxid))
-		FreezeXid = OldHeap->rd_rel->relfrozenxid;
+		TransactionIdPrecedes(cutoffs.FreezeLimit,
+							  OldHeap->rd_rel->relfrozenxid))
+		cutoffs.FreezeLimit = OldHeap->rd_rel->relfrozenxid;
 
 	/*
 	 * MultiXactCutoff, similarly, shouldn't go backwards either.
 	 */
 	if (MultiXactIdIsValid(OldHeap->rd_rel->relminmxid) &&
-		MultiXactIdPrecedes(MultiXactCutoff, OldHeap->rd_rel->relminmxid))
-		MultiXactCutoff = OldHeap->rd_rel->relminmxid;
+		MultiXactIdPrecedes(cutoffs.MultiXactCutoff,
+							OldHeap->rd_rel->relminmxid))
+		cutoffs.MultiXactCutoff = OldHeap->rd_rel->relminmxid;
 
 	/*
 	 * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
@@ -971,13 +969,14 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * values (e.g. because the AM doesn't use freezing).
 	 */
 	table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
-									OldestXmin, &FreezeXid, &MultiXactCutoff,
+									cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+									&cutoffs.MultiXactCutoff,
 									&num_tuples, &tups_vacuumed,
 									&tups_recently_dead);
 
 	/* return selected values to caller, get set as relfrozenxid/minmxid */
-	*pFreezeXid = FreezeXid;
-	*pCutoffMulti = MultiXactCutoff;
+	*pFreezeXid = cutoffs.FreezeLimit;
+	*pCutoffMulti = cutoffs.MultiXactCutoff;
 
 	/* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
 	NewHeap->rd_toastoid = InvalidOid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b5d0ac161..0fb211845 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -933,30 +933,11 @@ get_all_vacuum_rels(int options)
  *
  * The target relation and VACUUM parameters are our inputs.
  *
- * Our output parameters are:
- * - OldestXmin is the Xid below which tuples deleted by any xact (that
- *   committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - OldestMxact is the Mxid below which MultiXacts are definitely not
- *   seen as visible by any running transaction.
- * - FreezeLimit is the Xid below which all Xids are definitely frozen or
- *   removed during aggressive vacuums.
- * - MultiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
- *
- * OldestXmin and OldestMxact are the most recent values that can ever be
- * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
- * vacuumlazy.c caller later on.  These values should be passed when it turns
- * out that VACUUM will leave no unfrozen XIDs/MXIDs behind in the table.
+ * Output parameters are the cutoffs that VACUUM caller should use.
  */
 bool
 vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-					  TransactionId *OldestXmin, MultiXactId *OldestMxact,
-					  TransactionId *FreezeLimit, MultiXactId *MultiXactCutoff)
+					  struct VacuumCutoffs *cutoffs)
 {
 	int			freeze_min_age,
 				multixact_freeze_min_age,
@@ -970,6 +951,10 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 				safeOldestMxact,
 				aggressiveMXIDCutoff;
 
+	/* Determining table age details  */
+	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
+	cutoffs->relminmxid = rel->rd_rel->relminmxid;
+
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
@@ -987,14 +972,14 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	 * that only one vacuum process can be working on a particular table at
 	 * any time, and that each vacuum is always an independent transaction.
 	 */
-	*OldestXmin = GetOldestNonRemovableTransactionId(rel);
+	cutoffs->OldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 	if (OldSnapshotThresholdActive())
 	{
 		TransactionId limit_xmin;
 		TimestampTz limit_ts;
 
-		if (TransactionIdLimitedForOldSnapshots(*OldestXmin, rel,
+		if (TransactionIdLimitedForOldSnapshots(cutoffs->OldestXmin, rel,
 												&limit_xmin, &limit_ts))
 		{
 			/*
@@ -1004,15 +989,15 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 			 * frequency), but would still be a significant improvement.
 			 */
 			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
-			*OldestXmin = limit_xmin;
+			cutoffs->OldestXmin = limit_xmin;
 		}
 	}
 
-	Assert(TransactionIdIsNormal(*OldestXmin));
+	Assert(TransactionIdIsNormal(cutoffs->OldestXmin));
 
 	/* Acquire OldestMxact */
-	*OldestMxact = GetOldestMultiXactId();
-	Assert(MultiXactIdIsValid(*OldestMxact));
+	cutoffs->OldestMxact = GetOldestMultiXactId();
+	Assert(MultiXactIdIsValid(cutoffs->OldestMxact));
 
 	/* Acquire next XID/next MXID values used to apply age-based settings */
 	nextXID = ReadNextTransactionId();
@@ -1030,12 +1015,12 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	Assert(freeze_min_age >= 0);
 
 	/* Compute FreezeLimit, being careful to generate a normal XID */
-	*FreezeLimit = nextXID - freeze_min_age;
-	if (!TransactionIdIsNormal(*FreezeLimit))
-		*FreezeLimit = FirstNormalTransactionId;
+	cutoffs->FreezeLimit = nextXID - freeze_min_age;
+	if (!TransactionIdIsNormal(cutoffs->FreezeLimit))
+		cutoffs->FreezeLimit = FirstNormalTransactionId;
 	/* FreezeLimit must always be <= OldestXmin */
-	if (TransactionIdPrecedes(*OldestXmin, *FreezeLimit))
-		*FreezeLimit = *OldestXmin;
+	if (TransactionIdPrecedes(cutoffs->OldestXmin, cutoffs->FreezeLimit))
+		cutoffs->FreezeLimit = cutoffs->OldestXmin;
 
 	/*
 	 * Compute the multixact age for which freezing is urgent.  This is
@@ -1057,16 +1042,16 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	Assert(multixact_freeze_min_age >= 0);
 
 	/* Compute MultiXactCutoff, being careful to generate a valid value */
-	*MultiXactCutoff = nextMXID - multixact_freeze_min_age;
-	if (*MultiXactCutoff < FirstMultiXactId)
-		*MultiXactCutoff = FirstMultiXactId;
+	cutoffs->MultiXactCutoff = nextMXID - multixact_freeze_min_age;
+	if (cutoffs->MultiXactCutoff < FirstMultiXactId)
+		cutoffs->MultiXactCutoff = FirstMultiXactId;
 	/* MultiXactCutoff must always be <= OldestMxact */
-	if (MultiXactIdPrecedes(*OldestMxact, *MultiXactCutoff))
-		*MultiXactCutoff = *OldestMxact;
+	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
+		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
 	/*
-	 * Done setting output parameters; check if OldestXmin or OldestMxact are
-	 * held back to an unsafe degree in passing
+	 * Check if OldestXmin or OldestMxact are held back to an unsafe degree in
+	 * passing
 	 */
 	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
 	if (!TransactionIdIsNormal(safeOldestXmin))
@@ -1074,20 +1059,29 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	safeOldestMxact = nextMXID - effective_multixact_freeze_max_age;
 	if (safeOldestMxact < FirstMultiXactId)
 		safeOldestMxact = FirstMultiXactId;
-	if (TransactionIdPrecedes(*OldestXmin, safeOldestXmin))
+	if (TransactionIdPrecedes(cutoffs->OldestXmin, safeOldestXmin))
 		ereport(WARNING,
 				(errmsg("cutoff for removing and freezing tuples is far in the past"),
 				 errhint("Close open transactions soon to avoid wraparound problems.\n"
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
-	if (MultiXactIdPrecedes(*OldestMxact, safeOldestMxact))
+	if (MultiXactIdPrecedes(cutoffs->OldestMxact, safeOldestMxact))
 		ereport(WARNING,
 				(errmsg("cutoff for freezing multixacts is far in the past"),
 				 errhint("Close open transactions soon to avoid wraparound problems.\n"
 						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
+	 * Assert that all cutoff invariants hold.
 	 *
+	 * We omit relfrozenxid and relminmxid assertions here because there are
+	 * edge cases that allow OldestXmin to go slightly backwards.  This is
+	 * okay because vac_update_relstats() won't allow either to go backwards.
+	 */
+	Assert(TransactionIdPrecedesOrEquals(cutoffs->FreezeLimit,
+										 cutoffs->OldestXmin));
+	Assert(MultiXactIdPrecedesOrEquals(cutoffs->MultiXactCutoff,
+									   cutoffs->OldestMxact));
+	/*
 	 * Determine the table freeze age to use: as specified by the caller, or
 	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
 	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-- 
2.38.1

Attachment: v8-0005-Make-VACUUM-s-aggressive-behaviors-continuous.patch (application/x-patch)
From 1ff3564ecb457ac17332077a6135a505ea50d5d2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v8 5/6] Make VACUUM's aggressive behaviors continuous.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Before then, every lazy VACUUM
was "equally aggressive": each operation froze whatever tuples before
the age-wise cutoff needed to be frozen.  And each table's relfrozenxid
was updated at the end.  In short, the previous behavior was much less
efficient, but did at least have one thing going for it: it was much
easier to understand at a high level.

VACUUM no longer applies a separate mode of operation (aggressive mode).
There are still antiwraparound autovacuums, but they're now little more
than another way that autovacuum.c can launch an autovacuum worker to
run VACUUM.  The same set of behaviors previously associated with
aggressive mode are retained, but now get applied selectively, on a
timeline attuned to the needs of the table.

The closer that a table's age gets to the autovacuum_freeze_max_age
cutoff, the less VACUUM will care about avoiding the cost of scanning
extra pages to advance relfrozenxid "early".  This new approach cares
about both costs (extra pages scanned) and benefits (the need for
relfrozenxid advancement), unlike the previous approach driven by
vacuum_freeze_table_age, which "escalated to aggressive mode" purely
based on a simple XID age cutoff.  The vacuum_freeze_table_age GUC is
now relegated to a compatibility option.  Its default value is now -1,
which is interpreted as "current value of autovacuum_freeze_max_age".

VACUUM will still advance relfrozenxid at roughly the same XID-age-wise
cadence as before with static tables, but can also advance relfrozenxid
much more frequently in tables where that happens to make sense.  In
practice many tables will tend to have relfrozenxid advanced by some
amount during every VACUUM, especially larger tables and very small
tables.

The emphasis is now on keeping each table's age reasonably recent over
time, across multiple successive VACUUM operations, while spreading out
the burden of freezing, avoiding big spikes.  Freezing is now primarily
treated as an overhead of long term storage of tuples in physical heap
pages.  There is less emphasis on the role freezing plays in preventing
the system from reaching the point of an xidStopLimit outage.

Now every VACUUM might need to wait for a cleanup lock, though few will.
It can only happen when required to advance relfrozenxid to no less than
half way between the existing relfrozenxid and nextXID.  In general
there is no telling how long VACUUM might spend waiting for a cleanup
lock, so it's usually more useful to focus on keeping up with freezing
at the level of the whole table.  VACUUM can afford to set relfrozenxid
to a significantly older value in the short term, since there are now
more opportunities to advance relfrozenxid in the long term.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/access/heapam.h                   |   2 +-
 src/include/commands/vacuum.h                 |  12 +-
 src/backend/access/heap/heapam.c              |  24 +-
 src/backend/access/heap/vacuumlazy.c          | 163 ++--
 src/backend/access/transam/multixact.c        |   5 +-
 src/backend/commands/cluster.c                |   3 +-
 src/backend/commands/vacuum.c                 | 138 ++--
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      | 103 +--
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  18 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  30 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 21 files changed, 656 insertions(+), 634 deletions(-)
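
As a quick orientation before the diff itself: the old
aggressive/non-aggressive split collapses into a single per-VACUUM
number, antiwrapfrac.  The sketch below is not part of the patch -- it
just mirrors the thresholds used by the smaller-table branch of
lazy_scan_strategy and by vacuum_set_xid_limits, ignoring
DISABLE_PAGE_SKIPPING, reloptions, and the larger-table eager strategy:

/*
 * Illustrative only: plain-C summary of how antiwrapfrac drives the
 * strategy decisions added by this patch (made-up values throughout).
 */
#include <stdio.h>

static const char *
describe_strategy(double antiwrapfrac)
{
	if (antiwrapfrac < 0.5)
		return "skip all-visible pages unless extra scan cost < 5% of rel_pages";
	if (antiwrapfrac < 0.9)
		return "skip all-visible pages only when extra scan cost >= 15% of rel_pages";
	if (antiwrapfrac < 1.0)
		return "scan all-visible pages, freeze eagerly, advance relfrozenxid";
	return "antiwraparound autovacuum: relfrozenxid advancement is mandatory";
}

int
main(void)
{
	double		table_ages[] = {0.25, 0.60, 0.95, 1.00};

	for (int i = 0; i < 4; i++)
		printf("antiwrapfrac %.2f -> %s\n",
			   table_ages[i], describe_strategy(table_ages[i]));
	return 0;
}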

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 57d824740..961c8f76b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -229,7 +229,7 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
 extern bool heap_tuple_should_freeze(HeapTupleHeader tuple,
-									 const struct VacuumCutoffs *cutoffs,
+									 const struct VacuumCutoffs *cutoffs, bool MinCutoffs,
 									 TransactionId *NewRelfrozenXid,
 									 MultiXactId *NewRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 122fb93e2..a4ed7674e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -276,6 +276,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) for triggering eager/all-visible freezing strategy
@@ -335,8 +342,9 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-								  struct VacuumCutoffs *cutoffs);
+extern void vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
+								  struct VacuumCutoffs *cutoffs,
+								  double *antiwrapfrac);
 extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
 									  MultiXactId relminmxid);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index caa34bd35..c1608b1ec 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6303,7 +6303,7 @@ FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
 	 */
 	axid = cutoffs->OldestXmin;
 	amxid = cutoffs->OldestMxact;
-	Assert(heap_tuple_should_freeze(tuple, cutoffs, &axid, &amxid));
+	Assert(heap_tuple_should_freeze(tuple, cutoffs, false, &axid, &amxid));
 
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6784,7 +6784,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									   xmax_already_frozen))
 	{
 		pagefrz->freeze_required =
-			heap_tuple_should_freeze(tuple, cutoffs,
+			heap_tuple_should_freeze(tuple, cutoffs, false,
 									 &pagefrz->NoFreezeNewRelfrozenXid,
 									 &pagefrz->NoFreezeNewRelminMxid);
 	}
@@ -7350,13 +7350,19 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
  * could be processed by pruning away the whole tuple instead of freezing.
  *
+ * Callers that specify 'MinCutoffs=false' have us apply the same FreezeLimit
+ * and MultiXactCutoff cutoffs used in heap_prepare_freeze_tuple.  Otherwise
+ * we use MinXid and MinMulti cutoffs, which are earlier cutoffs that VACUUM
+ * must always advance relfrozenxid/relminmxid up to, even when that means
+ * that it has to wait on a cleanup lock.
+ *
  * The *NewRelfrozenXid and *NewRelminMxid input/output arguments work just
  * like the similar fields from the FreezeCutoffs struct.  We never freeze
  * here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
 heap_tuple_should_freeze(HeapTupleHeader tuple,
-						 const struct VacuumCutoffs *cutoffs,
+						 const struct VacuumCutoffs *cutoffs, bool MinCutoffs,
 						 TransactionId *NewRelfrozenXid,
 						 MultiXactId *NewRelminMxid)
 {
@@ -7366,8 +7372,16 @@ heap_tuple_should_freeze(HeapTupleHeader tuple,
 	MultiXactId multi;
 	bool		freeze = false;
 
-	MustFreezeLimit = cutoffs->FreezeLimit;
-	MustFreezeMultiLimit = cutoffs->MultiXactCutoff;
+	if (!MinCutoffs)
+	{
+		MustFreezeLimit = cutoffs->FreezeLimit;
+		MustFreezeMultiLimit = cutoffs->MultiXactCutoff;
+	}
+	else
+	{
+		MustFreezeLimit = cutoffs->MinXid;
+		MustFreezeMultiLimit = cutoffs->MinMulti;
+	}
 
 	/* First deal with xmin */
 	xid = HeapTupleHeaderGetXmin(tuple);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a1984d68e..e6c2ff89f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -109,10 +109,11 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Threshold that controls whether non-aggressive VACUUMs will skip any
- * all-visible pages when using the lazy freezing strategy
+ * Thresholds that control whether VACUUM will skip any all-visible pages when
+ * using the lazy freezing strategy
  */
 #define SKIPALLVIS_THRESHOLD_PAGES	0.05	/* i.e. 5% of rel_pages */
+#define SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES	0.15
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -148,9 +149,7 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
-	/* Skip (don't scan) all-visible pages? (must be !aggressive) */
+	/* Skip (don't scan) all-visible pages? */
 	bool		skipallvis;
 	/* Skip (don't scan) all-frozen pages? */
 	bool		skipallfrozen;
@@ -246,7 +245,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel, double antiwrapfrac,
 									  BlockNumber all_visible,
 									  BlockNumber all_frozen);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel,
@@ -312,10 +311,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
 				frozenxid_updated,
 				minmulti_updated;
 	struct VacuumCutoffs cutoffs;
+	double		antiwrapfrac;
 	BlockNumber orig_rel_pages,
 				all_visible,
 				all_frozen,
@@ -354,7 +353,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * Get cutoffs that determine which deleted tuples are considered DEAD,
 	 * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze
 	 */
-	aggressive = vacuum_set_xid_limits(rel, params, &cutoffs);
+	vacuum_set_xid_limits(rel, params, &cutoffs, &antiwrapfrac);
 
 	/*
 	 * Setup error traceback support for ereport() first.  The idea is to set
@@ -404,7 +403,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
 	vacrel->skipallvis = false; /* arbitrary initial value */
 	/* skipallfrozen indicates DISABLE_PAGE_SKIPPING to lazy_scan_strategy */
 	vacrel->skipallfrozen = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
@@ -489,7 +487,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 */
 	vacrel->vmsnap = visibilitymap_snap(rel, orig_rel_pages,
 										&all_visible, &all_frozen);
-	scanned_pages = lazy_scan_strategy(vacrel, all_visible, all_frozen);
+	scanned_pages = lazy_scan_strategy(vacrel, antiwrapfrac,
+									   all_visible, all_frozen);
 	if (verbose)
 		ereport(INFO,
 				(errmsg("vacuuming \"%s.%s.%s\"",
@@ -550,25 +549,21 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->skipallvis)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM whose
-		 * lazy_scan_strategy call determined it would skip all-visible pages
+		 * Must keep original relfrozenxid when lazy_scan_strategy call
+		 * decided to skip all-visible pages
 		 */
-		Assert(!aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -644,23 +639,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
-			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
 			else
 			{
-				if (aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+				Assert(IsAutoVacuumWorkerProcess());
+				if (params->is_wraparound)
+					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -994,7 +977,6 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
 			 */
-			Assert(vacrel->aggressive);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
 		}
@@ -1255,19 +1237,16 @@ lazy_scan_heap(LVRelState *vacrel)
  * one VACUUM operation should need to freeze disproportionately many pages.
  *
  * Also determines if the ongoing VACUUM operation should skip all-visible
- * pages when advancing relfrozenxid is optional.  When VACUUM freezes eagerly
- * it always also scans pages eagerly, since it's important that relfrozenxid
- * advance in affected tables, which are larger.  When VACUUM freezes lazily
- * it might make sense to scan pages lazily (skip all-visible pages) or
- * eagerly (be capable of relfrozenxid advancement), depending on the extra
- * cost - we might need to scan only a few extra pages.
+ * pages when advancing relfrozenxid is still optional (before target rel has
+ * attained an age that forces an antiwraparound autovacuum).  Decision is
+ * based in part on caller's antiwrapfrac argument, which represents how close
+ * the table age is to forcing antiwraparound autovacuum.
  *
  * Returns final scanned_pages for the VACUUM operation.
  */
 static BlockNumber
-lazy_scan_strategy(LVRelState *vacrel,
-				   BlockNumber all_visible,
-				   BlockNumber all_frozen)
+lazy_scan_strategy(LVRelState *vacrel, double antiwrapfrac,
+				   BlockNumber all_visible, BlockNumber all_frozen)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				scanned_pages_skipallvis,
@@ -1326,9 +1305,6 @@ lazy_scan_strategy(LVRelState *vacrel,
 		BlockNumber nextra,
 					nextra_threshold;
 
-		/* VACUUM of small table -- use lazy freeze strategy */
-		vacrel->eager_freeze_strategy = false;
-
 		/*
 		 * Decide on whether or not we'll skip all-visible pages.
 		 *
@@ -1342,21 +1318,51 @@ lazy_scan_strategy(LVRelState *vacrel,
 		 * that way, so be lazy (just skip) unless the added cost is very low.
 		 * We opt for a skipallfrozen-only VACUUM when the number of extra
 		 * pages (extra scanned pages that are all-visible but not all-frozen)
-		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small).
+		 * is less than 5% of rel_pages (or 32 pages when rel_pages is small)
+		 * if relfrozenxid has yet to attain an age that uses 50% of the XID
+		 * space available before the GUC cutoff for antiwraparound
+		 * autovacuum.  A more aggressive threshold of 15% is used when
+		 * relfrozenxid is older than that.
 		 */
 		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
-		nextra_threshold = (double) rel_pages * SKIPALLVIS_THRESHOLD_PAGES;
+
+		if (antiwrapfrac < 0.5)
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_THRESHOLD_PAGES;
+		else
+			nextra_threshold = (double) rel_pages *
+				SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES;
+
 		nextra_threshold = Max(32, nextra_threshold);
 
-		/* Only skipallvis when DISABLE_PAGE_SKIPPING not in use */
-		vacrel->skipallvis = nextra >= nextra_threshold &&
-			vacrel->skipallfrozen && !vacrel->aggressive;
+		/*
+		 * We must advance relfrozenxid when it already attained an age that
+		 * consumes >= 90% of the available XID space (or MXID space) before
+		 * the crossover point for antiwraparound autovacuum.
+		 *
+		 * Also use eager freezing strategy when we're past the "90% towards
+		 * wraparound" point, even though the table size is below the usual
+		 * eager_threshold table size cutoff.  The added cost is usually not
+		 * too great.  We may be able to fall into a pattern of continually
+		 * advancing relfrozenxid this way.
+		 */
+		if (antiwrapfrac < 0.9)
+		{
+			/* Only skipallvis when DISABLE_PAGE_SKIPPING not in use */
+			vacrel->skipallvis = nextra >= nextra_threshold &&
+				vacrel->skipallfrozen;
+			vacrel->eager_freeze_strategy = false;
+		}
+		else
+		{
+			vacrel->skipallvis = false;
+			vacrel->eager_freeze_strategy = true;
+		}
 	}
 
 	/* Return the appropriate variant of scanned_pages */
 	if (vacrel->skipallvis)
 	{
-		Assert(!vacrel->aggressive);
 		Assert(vacrel->skipallfrozen);
 		return scanned_pages_skipallvis;
 	}
@@ -1983,11 +1989,13 @@ retry:
  * operation left LP_DEAD items behind.  We'll at least collect any such items
  * in the dead_items array for removal from indexes.
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We may return false to indicate that a full cleanup lock is required for
+ * processing by lazy_scan_prune.  This is only necessary when VACUUM needs to
+ * freeze some tuple XIDs from one or more tuples on the page.  This should
+ * only happen when multiple successive VACUUM operations all fail to get a
+ * cleanup lock on the same heap page (assuming default or at least typical
+ * freeze settings).  Waiting for a cleanup lock should be avoided unless it's
+ * the only way to advance relfrozenxid by enough to satisfy autovacuum.c.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
@@ -2054,35 +2062,24 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
+		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs, true,
 									 &NewRelfrozenXid, &NewRelminMxid))
 		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
 			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
+			 * Tuple with XID < MinXid (or MXID < MinMulti)
 			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
+			 * VACUUM must always be able to advance rel's relfrozenxid and
+			 * relminmxid to minimum values.  The ongoing VACUUM won't be able
+			 * to do that unless it can freeze an XID (or MXID) from this
+			 * tuple now.
+			 *
+			 * The only safe option is to have caller perform processing of
+			 * this page using lazy_scan_prune.  Caller might have to wait a
+			 * long time for a cleanup lock, which can be very disruptive, but
+			 * it can't be helped.
 			 */
+			vacrel->offnum = InvalidOffsetNumber;
+			return false;
 		}
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
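
A worked example of the skipallvis decision above, with made-up numbers;
the constants mirror the ones defined at the top of vacuumlazy.c, and only
the smaller-table branch (DISABLE_PAGE_SKIPPING not in use) is modeled:

#include <stdbool.h>
#include <stdio.h>

#define SKIPALLVIS_THRESHOLD_PAGES			0.05
#define SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES	0.15

int
main(void)
{
	unsigned	rel_pages = 100000; /* hypothetical table size in blocks */
	unsigned	nextra = 8000;	/* all-visible but not all-frozen pages */
	double		fracs[] = {0.30, 0.75};

	for (int i = 0; i < 2; i++)
	{
		double		antiwrapfrac = fracs[i];
		unsigned	nextra_threshold;
		bool		skipallvis;

		if (antiwrapfrac < 0.5)
			nextra_threshold = rel_pages * SKIPALLVIS_THRESHOLD_PAGES;	/* 5000 */
		else
			nextra_threshold = rel_pages * SKIPALLVIS_MIDPOINT_THRESHOLD_PAGES; /* 15000 */
		if (nextra_threshold < 32)
			nextra_threshold = 32;

		skipallvis = (nextra >= nextra_threshold);
		printf("antiwrapfrac=%.2f threshold=%u skipallvis=%d\n",
			   antiwrapfrac, nextra_threshold, skipallvis);
	}
	return 0;
}

So with 8,000 extra all-visible (but not all-frozen) pages in a 100,000
page table, VACUUM stays lazy and skips them while the table is young,
but starts scanning them -- and therefore advancing relfrozenxid -- once
the table is past the 0.5 midpoint.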
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 204aa9504..ba575c5fd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2816,10 +2816,7 @@ ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
  * freeze table and the minimum freeze age based on the effective
  * autovacuum_multixact_freeze_max_age this function returns.  In the worst
  * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will try to freeze every multixact.
- *
- * It's possible that these thresholds should be user-tunable, but for now
- * we keep it simple.
+ * table will freeze every multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 6cfea04a9..c2f708e33 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -825,6 +825,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	VacuumParams params;
 	struct VacuumCutoffs cutoffs;
+	double		antiwrapfrac;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -913,7 +914,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_set_xid_limits(OldHeap, &params, &cutoffs);
+	vacuum_set_xid_limits(OldHeap, &params, &cutoffs, &antiwrapfrac);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ffa8eac12..cd0684f44 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -937,24 +937,33 @@ get_all_vacuum_rels(int options)
  *
  * The target relation and VACUUM parameters are our inputs.
  *
- * Output parameters are the cutoffs that VACUUM caller should use.
+ * Output parameters are the cutoffs that VACUUM caller should use, and
+ * antiwrapfrac, which indicates how close the table is to requiring that
+ * autovacuum.c launch an antiwraparound autovacuum.
+ *
+ * The antiwrapfrac value 1.0 represents the point that autovacuum.c
+ * scheduling considers advancing relfrozenxid strictly necessary.  Values
+ * between 0.0 and 1.0 represent how close the table is to the point of
+ * mandatory relfrozenxid/relminmxid advancement.
  */
-bool
+void
 vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-					  struct VacuumCutoffs *cutoffs)
+					  struct VacuumCutoffs *cutoffs, double *antiwrapfrac)
 {
 	int			freeze_min_age,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
 				effective_multixact_freeze_max_age,
-				freeze_strategy_threshold;
+				freeze_strategy_threshold,
+				relfrozenxid_age,
+				relminmxid_age;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Determining table age details  */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1084,6 +1093,74 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
+	/*
+	 * Work out how close we are to needing an antiwraparound VACUUM.
+	 *
+	 * Determine the table freeze age to use: as specified by the caller, or
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
+	 */
+	if (freeze_table_age < 0)
+		freeze_table_age = vacuum_freeze_table_age;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
+
+	/*
+	 * Similar to the above, determine the table freeze age to use for
+	 * multixacts: as specified by the caller, or the value of the
+	 * vacuum_multixact_freeze_table_age GUC.   The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
+	 */
+	if (multixact_freeze_table_age < 0)
+		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
+
+	/* Final antiwrapfrac can come from either XID or MXID table age */
+	relfrozenxid_age = Max(nextXID - rel->rd_rel->relfrozenxid, 1);
+	relminmxid_age = Max(nextMXID - rel->rd_rel->relminmxid, 1);
+	freeze_table_age = Max(freeze_table_age, 1);
+	multixact_freeze_table_age = Max(multixact_freeze_table_age, 1);
+	XIDFrac = (double) relfrozenxid_age / (double) freeze_table_age;
+	MXIDFrac = (double) relminmxid_age / (double) multixact_freeze_table_age;
+	*antiwrapfrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		*antiwrapfrac = 1.0;
+
+	/*
+	 * Pages that caller can cleanup lock immediately will never be left with
+	 * XIDs < FreezeLimit (nor with MXIDs < MultiXactCutoff).  Determine
+	 * values for a distinct set of cutoffs applied to pages that cannot be
+	 * immediately cleanup locked.  The cutoffs govern caller's wait behavior.
+	 *
+	 * It is safer to accept earlier final relfrozenxid and relminmxid values
+	 * than it would be to wait indefinitely for a cleanup lock.  Waiting for
+	 * a cleanup lock to freeze one heap page risks not freezing every other
+	 * eligible heap page.  Keeping up the momentum is what matters most.
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age / 2);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age / 2);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Assert that all cutoff invariants hold.
 	 *
@@ -1095,47 +1172,10 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 										 cutoffs->OldestXmin));
 	Assert(MultiXactIdPrecedesOrEquals(cutoffs->MultiXactCutoff,
 									   cutoffs->OldestMxact));
-	/*
-	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
-	 */
-	if (freeze_table_age < 0)
-		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
-
-	/*
-	 * Similar to the above, determine the table freeze age to use for
-	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
-	 */
-	if (multixact_freeze_table_age < 0)
-		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
-
-	/* Non-aggressive VACUUM */
-	return false;
+	Assert(TransactionIdPrecedesOrEquals(cutoffs->MinXid,
+										 cutoffs->FreezeLimit));
+	Assert(MultiXactIdPrecedesOrEquals(cutoffs->MinMulti,
+									   cutoffs->MultiXactCutoff));
 }
 
 /*
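
Back-of-the-envelope arithmetic for the new cutoffs (again not from the
patch: made-up XID values, the default autovacuum_freeze_max_age of 200
million with vacuum_freeze_table_age left at -1, and wraparound-aware
arithmetic ignored):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t	nextXID = 500000000;	/* hypothetical */
	uint32_t	relfrozenxid = 350000000;	/* hypothetical */
	int			freeze_table_age = 200000000;	/* = autovacuum_freeze_max_age */

	uint32_t	relfrozenxid_age = nextXID - relfrozenxid;	/* 150 million */
	double		antiwrapfrac = (double) relfrozenxid_age / freeze_table_age;
	uint32_t	MinXid = nextXID - (freeze_table_age / 2);	/* 400 million */

	/* prints: antiwrapfrac=0.75 MinXid=400000000 */
	printf("antiwrapfrac=%.2f MinXid=%u\n", antiwrapfrac, (unsigned) MinXid);
	return 0;
}

At 0.75 the table is past the 0.5 midpoint, so lazy_scan_strategy uses the
15% skipallvis threshold, but it hasn't reached 0.9 yet, so relfrozenxid
advancement remains optional.  When relfrozenxid is advanced, the new value
must be >= MinXid, giving the ordering the new assertions enforce:
MinXid <= FreezeLimit <= OldestXmin (and MinMulti <= MultiXactCutoff <=
OldestMxact).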
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f92e16e7a..12dfafd67 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->n_dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table, however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index d3c8ae87d..939e4b5aa 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2476,10 +2476,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2496,10 +2496,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a409e6281..01f37b493 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -659,6 +659,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -692,11 +699,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9c5861bd7..938603283 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8215,7 +8215,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8404,7 +8404,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9120,31 +9120,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
-      <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
-      <indexterm>
-       <primary><varname>vacuum_freeze_table_age</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
-       </para>
-      </listitem>
-     </varlistentry>
-
      <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
       <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
       <indexterm>
@@ -9160,6 +9135,39 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
+      <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_table_age</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-xid-space"/>.
+       </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9179,7 +9187,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9225,19 +9233,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9249,10 +9265,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in multixacts) that
-        <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages with an older multixact ID.  The
-        default is 5 million multixacts.
+        Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
+        should use to decide whether to trigger freezing of pages with
+        an older multixact ID.  The default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 554b3a75d..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing Tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly composed of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID Address Space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32 bits, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
-    transactions older than this cutoff XID are guaranteed to have been frozen.
-    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by applying rules analogous to those used for
+     transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs.  A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand, <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for all pages that are
+     eligible to be frozen under the lazy criteria, as well as any
+     page that <command>VACUUM</command> will set all visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of eager freezing is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     autovacuum must run <command>VACUUM</command> specifically for
+     the purpose of advancing <structfield>relfrozenxid</structfield>,
+     because no other <command>VACUUM</command> has been triggered for
+     some time.  In practice most individual tables will consistently
+     have fairly recent values, since routine vacuuming to clean up
+     old row versions also advances them.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; these
+     are semi-accurate counts updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (They are only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when earlier
+     <command>VACUUM</command> operations on a smaller table lazily
+     opted not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixacts members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker processing
+     a table whose <structfield>relfrozenxid</structfield> is very old.
+     It will also happen during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
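
(Illustrative aside, not part of the patch.)  To make the two triggering
formulas above concrete, here is a rough sketch that estimates each
table's trigger points from pg_class.reltuples, reading the live GUC
values rather than hard-coding the defaults.  It deliberately ignores
per-table storage parameter overrides, so treat the output as
approximate:

SELECT c.oid::regclass AS table_name,
       -- vacuum threshold = base threshold + scale factor * reltuples
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8
           * c.reltuples AS vacuum_threshold,
       -- vacuum insert threshold = base insert threshold + insert scale factor * reltuples
       current_setting('autovacuum_vacuum_insert_threshold')::bigint
         + current_setting('autovacuum_vacuum_insert_scale_factor')::float8
           * c.reltuples AS vacuum_insert_threshold
FROM pg_class c
WHERE c.relkind IN ('r', 'm');
-- Note: reltuples is -1 for tables that have never been vacuumed or analyzed.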
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 78e35abb9..43ffbbbd3 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-min-age"/> and <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
@@ -210,7 +210,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..927410258 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,18 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), without ever being
+# prepared to wait for a cleanup lock (we'll never wait on a cleanup
+# lock because the separate MinXid cutoff for waiting will still be
+# well before FreezeLimit, given our default autovacuum_freeze_max_age).
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +78,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +94,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +105,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +118,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +130,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +138,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1
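
(Illustrative aside, not part of the patch.)  To see which tables would
currently land on the eager side of the freezing strategy cutoff
described in the maintenance.sgml changes above, something like the
following works; the '4GB' figure is only a placeholder, since nothing
is assumed here about vacuum_freeze_strategy_threshold's default or
units:

SELECT c.oid::regclass AS table_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS main_fork_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
  -- pg_relation_size() measures the heap's main fork, matching the
  -- "heap relation on-disk size" wording used by the new doc text
  AND pg_relation_size(c.oid) >= pg_size_bytes('4GB')
ORDER BY pg_relation_size(c.oid) DESC;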

Attachment: v8-0002-Add-page-level-freezing-to-VACUUM.patch (application/x-patch)
From 8701fdd939766362254600401f2d66fd0af007e6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v8 2/6] Add page-level freezing to VACUUM.

Teach VACUUM to decide whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

FreezeMultiXactId() now uses both FreezeLimit and OldestXmin to decide
how to process MultiXacts (not just FreezeLimit).  We always prefer to
avoid allocating new MultiXacts during VACUUM on general principle.
Page-level freezing can be triggered and use a maximally aggressive XID
cutoff to freeze XIDs (OldestXmin), while using a less aggressive XID
cutoff (FreezeLimit) to determine whether or not members from a Multi
need to be frozen expensively.  VACUUM will process Multis very eagerly
when it's cheap to do so, and very lazily when it's expensive to do so.

We can choose when and how to freeze Multixacts provided we never leave
behind a Multi that's < MultiXactCutoff, or a Multi with one or more XID
members < FreezeLimit.  Provided VACUUM's NewRelfrozenXid/NewRelminMxid
tracking account for all this, we are free to choose what to do about
each Multi based on the costs and the benefits.  VACUUM should be just
as capable of avoiding an expensive second pass over each Multi (which
must check the commit status of each member XID) as it was before, even
when page-level freezing is triggered on many pages with recently
allocated MultiXactIds.

Later work will teach VACUUM to apply an alternative eager freezing
strategy that triggers page-level freezing earlier, based on additional
criteria.  This commit improves the cost profile of freezing by building
on the freeze plan deduplication optimization added by commit 9e540599.
The high level user facing design of VACUUM hasn't really changed just
yet.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/heapam.h          |  39 +++-
 src/backend/access/heap/heapam.c     | 298 ++++++++++++++++-----------
 src/backend/access/heap/vacuumlazy.c | 106 +++++++---
 3 files changed, 282 insertions(+), 161 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index abc3a1f34..ca4fab970 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -113,6 +113,40 @@ typedef struct HeapTupleFreeze
 	OffsetNumber offset;
 } HeapTupleFreeze;
 
+/*
+ * State used by VACUUM to track the details of freezing all eligible tuples
+ * on a given heap page.
+ *
+ * VACUUM prepares freeze plans for each page via heap_prepare_freeze_tuple
+ * calls (every tuple with storage gets its own call).  This page-level freeze
+ * state is updated across each call, which ultimately determines whether or
+ * not freezing the page is required. (VACUUM freezes the page via a call to
+ * heap_freeze_execute_prepared, which freezes using prepared freeze plans.)
+ *
+ * Aside from the basic question of whether or not freezing will go ahead, the
+ * state also tracks the oldest extant XID/MXID in the table as a whole, for
+ * the purposes of advancing relfrozenxid/relminmxid values in pg_class later
+ * on.  Each heap_prepare_freeze_tuple call pushes NewRelfrozenXid and/or
+ * NewRelminMxid back as required to avoid unsafe final pg_class values.  Any
+ * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
+ * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
+ * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ */
+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze_required;
+
+	/* Values used when heap_freeze_execute_prepared is called for page */
+	TransactionId NewRelfrozenXid;
+	MultiXactId NewRelminMxid;
+
+	/* "No freeze" variants used when page freezing doesn't take place */
+	TransactionId NoFreezeNewRelfrozenXid;
+	MultiXactId NoFreezeNewRelminMxid;
+
+} HeapPageFreeze;
+
 /* ----------------
  *		function prototypes for heap access method
  *
@@ -181,10 +215,9 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  const struct VacuumCutoffs *cutoffs,
 									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *NewRelFrozenXid,
-									  MultiXactId *NewRelminMxid);
+									  HeapPageFreeze *pagefrz);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-										 TransactionId FreezeLimit,
+										 TransactionId snapshotConflictHorizon,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 74b3a459e..45cdc1ae8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6102,9 +6102,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		MultiXactId.
  *
  * "flags" is an output value; it's used to tell caller what to do on return.
- *
- * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
- * extant Xid within any Multixact that will remain after freezing executes.
+ * "pagefrz" is an input/output value, used to manage page level freezing.
  *
  * Possible values that we can set in "flags":
  * FRM_NOOP
@@ -6119,16 +6117,34 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
  *
- * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
- * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ * Caller delegates control of page freezing to us.  In practice we always
+ * force freezing of caller's page unless FRM_NOOP processing is indicated.
+ * We help caller ensure that XIDs < FreezeLimit and MXIDs < MultiXactCutoff
+ * can never be left behind.  We freely choose when and how to process each
+ * Multi, without ever violating the cutoff invariants for freezing.
  *
- * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ * It's useful to remove Multis on a proactive timeline (relative to freezing
+ * XIDs) to keep MultiXact member SLRU buffer misses to a minimum.  It can also
+ * be cheaper for us in the short run, since eager processing lets VACUUM
+ * itself avoid SLRU buffer misses.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set, though only
+ * when FreezeLimit and/or MultiXactCutoff cutoffs leave us with no choice.
+ * This can usually be put off, which is often enough to avoid it altogether.
+ *
+ * NB: Caller must maintain "no freeze" NewRelfrozenXid/NewRelminMxid variants
+ * using heap_tuple_should_freeze when we haven't forced page-level freezing.
+ *
+ * NB: Caller should avoid needlessly calling heap_tuple_should_freeze when we
+ * have already forced page-level freezing, since that might incur the same
+ * SLRU buffer misses that we specifically intended to avoid by freezing.
  */
 static TransactionId
-FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
+FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
 				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
-				  TransactionId *mxid_oldest_xid_out)
+				  HeapPageFreeze *pagefrz)
 {
+	uint16		t_infomask = tuple->t_infomask;
 	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
 	int			nmembers;
@@ -6138,7 +6154,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	bool		has_lockers;
 	TransactionId update_xid;
 	bool		update_committed;
-	TransactionId temp_xid_out;
+	TransactionId NewRelfrozenXid = pagefrz->NewRelfrozenXid;
+	TransactionId axid PG_USED_FOR_ASSERTS_ONLY;
+	MultiXactId amxid PG_USED_FOR_ASSERTS_ONLY;
 
 	*flags = 0;
 
@@ -6150,14 +6168,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Ensure infomask bits are appropriately set/reset */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, cutoffs->relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
+	else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6170,7 +6190,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoffs->MultiXactCutoff)));
+									 multi, cutoffs->OldestMxact)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
@@ -6206,14 +6226,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			}
 			else
 			{
+				if (TransactionIdPrecedes(newxmax, NewRelfrozenXid))
+					NewRelfrozenXid = newxmax;
 				*flags |= FRM_RETURN_IS_XID;
 			}
 		}
 
-		/*
-		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
-		 * when no Xids will remain
-		 */
+		pagefrz->NewRelfrozenXid = NewRelfrozenXid;
+		pagefrz->freeze_required = true;
 		return newxmax;
 	}
 
@@ -6229,11 +6249,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Nothing worth keeping */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	NewRelfrozenXid = pagefrz->NewRelfrozenXid; /* init for FRM_NOOP */
 	for (int i = 0; i < nmembers; i++)
 	{
 		TransactionId xid = members[i].xid;
@@ -6242,26 +6264,31 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
+			/* Can't violate the FreezeLimit invariant */
 			need_replace = true;
 			break;
 		}
-		if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-			temp_xid_out = members[i].xid;
+		if (TransactionIdPrecedes(xid, NewRelfrozenXid))
+			NewRelfrozenXid = xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than FreezeLimit; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* Can't violate the MultiXactCutoff invariant, either */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
+
 	if (!need_replace)
 	{
 		/*
-		 * When mxid_oldest_xid_out gets pushed back here it's likely that the
-		 * update Xid was the oldest member, but we don't rely on that
+		 * FRM_NOOP case is the only one where we don't force page-level
+		 * freezing (see header comments).
+		 *
+		 * Might have to ratchet back NewRelminMxid, NewRelfrozenXid, or both
+		 * together.
 		 */
 		*flags |= FRM_NOOP;
-		*mxid_oldest_xid_out = temp_xid_out;
+		pagefrz->NewRelfrozenXid = NewRelfrozenXid;
+		if (MultiXactIdPrecedes(multi, pagefrz->NewRelminMxid))
+			pagefrz->NewRelminMxid = multi;
 		pfree(members);
 		return multi;
 	}
@@ -6270,13 +6297,20 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_should_freeze will indicate that the tuple should be frozen.
 	 */
+	axid = cutoffs->OldestXmin;
+	amxid = cutoffs->OldestMxact;
+	Assert(heap_tuple_should_freeze(tuple, cutoffs, &axid, &amxid));
+
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
 	has_lockers = false;
 	update_xid = InvalidTransactionId;
 	update_committed = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
+	NewRelfrozenXid = pagefrz->NewRelfrozenXid; /* init for second pass */
 
 	/*
 	 * Determine whether to keep each member txid, or to ignore it instead
@@ -6365,11 +6399,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		/*
 		 * We determined that this is an Xid corresponding to an update that
 		 * must be retained -- add it to new members list for later.  Also
-		 * consider pushing back mxid_oldest_xid_out.
+		 * consider pushing back NewRelfrozenXid tracker.
 		 */
 		newmembers[nnewmembers++] = members[i];
-		if (TransactionIdPrecedes(xid, temp_xid_out))
-			temp_xid_out = xid;
+		if (TransactionIdPrecedes(xid, NewRelfrozenXid))
+			NewRelfrozenXid = xid;
 	}
 
 	pfree(members);
@@ -6380,10 +6414,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 */
 	if (nnewmembers == 0)
 	{
-		/* nothing worth keeping!? Tell caller to remove the whole thing */
+		/*
+		 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.  Won't
+		 * have to ratchet back NewRelfrozenXid or NewRelminMxid.
+		 */
 		*flags |= FRM_INVALIDATE_XMAX;
 		newxmax = InvalidTransactionId;
-		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
+
+		Assert(pagefrz->NewRelfrozenXid == NewRelfrozenXid);
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
 	{
@@ -6399,22 +6437,28 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
 		newxmax = update_xid;
-		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
+
+		/* Might have already pushed back NewRelfrozenXid with update_xid */
+		Assert(TransactionIdPrecedesOrEquals(NewRelfrozenXid, update_xid));
 	}
 	else
 	{
 		/*
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.  The oldest surviving member
-		 * might push back mxid_oldest_xid_out.
+		 * might have already pushed back NewRelfrozenXid.
 		 */
 		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
-		*mxid_oldest_xid_out = temp_xid_out;
+
+		/* Never need to push back NewRelminMxid when newxmax is new multi */
+		Assert(MultiXactIdPrecedesOrEquals(cutoffs->OldestMxact, newxmax));
 	}
 
 	pfree(newmembers);
 
+	pagefrz->NewRelfrozenXid = NewRelfrozenXid;
+	pagefrz->freeze_required = true;
 	return newxmax;
 }
 
@@ -6422,29 +6466,33 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID and cutoff MultiXactId.  If so,
+ * are older than the FreezeLimit and/or MultiXactCutoff cutoffs.  If so,
  * setup enough state (in the *frz output argument) to later execute and
- * WAL-log what we would need to do, and return true.  Return false if nothing
- * is to be changed.  In addition, set *totally_frozen to true if the tuple
- * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * WAL-log what caller needs to do for the tuple, and return true.  Return
+ * false if nothing can be changed about the tuple right now.
  *
- * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
- * returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * Also sets *totally_frozen to true if the tuple will be totally frozen once
+ * caller executes returned freeze plan (or if the tuple was already totally
+ * frozen by an earlier VACUUM).  This indicates that there are no remaining
+ * XIDs or MultiXactIds that will need to be processed by a future VACUUM.
+ *
+ * VACUUM caller must assemble HeapTupleFreeze freeze plan entries for every
+ * tuple that we returned true for, and call heap_freeze_execute_prepared to
+ * execute freezing.  Caller must initialize pagefrz fields for page as a
+ * whole before first call here for each heap page.
+ *
+ * We sometimes force freezing of xmax MultiXactId values long before it is
+ * strictly necessary to do so just to ensure the FreezeLimit postcondition.
+ * It's worth processing MultiXactIds proactively when it is cheap to do so,
+ * and it's convenient to make that happen by piggy-backing it on the "force
+ * freezing" mechanism.  Conversely, we sometimes delay freezing MultiXactIds
+ * because it is expensive right now (though only when it's still possible to
+ * do so without violating the FreezeLimit/MultiXactCutoff postcondition).
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *NewRelFrozenXid and *NewRelminMxid arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *NewRelFrozenXid and/or *NewRelminMxid as needed
- * to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6454,8 +6502,7 @@ bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  const struct VacuumCutoffs *cutoffs,
 						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *NewRelFrozenXid,
-						  MultiXactId *NewRelminMxid)
+						  HeapPageFreeze *pagefrz)
 {
 	bool		frzplan_set = false;
 	bool		xmin_already_frozen = false,
@@ -6471,7 +6518,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Process xmin, while keeping track of whether it's already frozen, or
-	 * will become frozen when our freeze plan is executed by caller (could be
+	 * will become frozen iff our freeze plan is executed by caller (could be
 	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
@@ -6489,59 +6536,66 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
 		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoffs->FreezeLimit)));
+										 xid, cutoffs->OldestXmin)));
 
 			frz->t_infomask |= HEAP_XMIN_FROZEN;
 			frzplan_set = true;
 		}
 		else
 		{
-			/* xmin to remain unfrozen.  Could push back NewRelfrozenXid. */
-			if (TransactionIdPrecedes(xid, *NewRelFrozenXid))
-				*NewRelFrozenXid = xid;
+			/* No need for NewRelfrozenXid handling for non-eligible xmin */
+			Assert(TransactionIdPrecedesOrEquals(pagefrz->NewRelfrozenXid,
+												 cutoffs->OldestXmin));
 		}
 	}
 
-	/*
-	 * Process xmax.  To thoroughly examine the current Xmax value we need to
-	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given FreezeLimit.  In that case, those values might need
-	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
-	 * it out --- if there's a live updater Xid, it needs to be kept.
-	 */
+	/* Now process xmax */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
-
 	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
 		/* Raw xmax is a MultiXactId */
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *NewRelFrozenXid;
-
-		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
-									&flags, &mxid_oldest_xid_out);
 
+		/*
+		 * We will either remove xmax completely (in the "freeze_xmax" path),
+		 * process xmax by modifying xmax in some other way, or perform no-op
+		 * xmax processing (which must still manage NewRelfrozenXid and
+		 * NewRelminMxid safety, often by accessing the multi's member XIDs).
+		 *
+		 * The only rule is that the FreezeLimit/MultiXactCutoff invariant
+		 * must never be violated.  FreezeMultiXactId decides on the rest.
+		 */
+		newxmax = FreezeMultiXactId(xid, tuple, cutoffs, &flags, pagefrz);
 		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
 
-		if (flags & FRM_RETURN_IS_XID)
+		if (flags & FRM_NOOP)
+		{
+			/*
+			 * xmax is a MultiXactId, and nothing about it changes for now.
+			 * This is the only case where 'freeze_required' won't have been
+			 * set for us by FreezeMultiXactId.
+			 */
+			Assert(!freeze_xmax);
+			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+			Assert(!MultiXactIdPrecedes(newxmax, pagefrz->NewRelminMxid));
+		}
+		else if (flags & FRM_RETURN_IS_XID)
 		{
 			/*
 			 * xmax will become an updater Xid (original MultiXact's updater
 			 * member Xid will be carried forward as a simple Xid in Xmax).
-			 * Might have to ratchet back NewRelfrozenXid here, though never
-			 * NewRelminMxid.
 			 */
 			Assert(!freeze_xmax);
-			Assert(TransactionIdIsValid(newxmax));
-			if (TransactionIdPrecedes(newxmax, *NewRelFrozenXid))
-				*NewRelFrozenXid = newxmax;
+			Assert(pagefrz->freeze_required);
+			Assert(!TransactionIdPrecedes(newxmax, pagefrz->NewRelfrozenXid));
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6564,15 +6618,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax is an old MultiXactId that we have to replace with a new
 			 * MultiXactId, to carry forward two or more original member XIDs.
-			 * Might have to ratchet back NewRelfrozenXid here, though never
-			 * NewRelminMxid.
 			 */
 			Assert(!freeze_xmax);
+			Assert(pagefrz->freeze_required);
 			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *NewRelminMxid));
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *NewRelFrozenXid));
-			*NewRelFrozenXid = mxid_oldest_xid_out;
+			Assert(!MultiXactIdPrecedes(newxmax, pagefrz->NewRelminMxid));
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6585,33 +6635,18 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			GetMultiXactIdHintBits(newxmax, &newbits, &newbits2);
 			frz->t_infomask |= newbits;
 			frz->t_infomask2 |= newbits2;
-
 			frz->xmax = newxmax;
 
 			frzplan_set = true;
 		}
-		else if (flags & FRM_NOOP)
-		{
-			/*
-			 * xmax is a MultiXactId, and nothing about it changes for now.
-			 * Might have to ratchet back NewRelminMxid, NewRelfrozenXid, or
-			 * both together.
-			 */
-			Assert(!freeze_xmax);
-			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *NewRelFrozenXid));
-			if (MultiXactIdPrecedes(xid, *NewRelminMxid))
-				*NewRelminMxid = xid;
-			*NewRelFrozenXid = mxid_oldest_xid_out;
-		}
 		else
 		{
 			/*
-			 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
-			 * Won't have to ratchet back NewRelminMxid or NewRelfrozenXid.
+			 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.  We
+			 * will "freeze xmax", in the strictest sense.
 			 */
 			Assert(freeze_xmax);
+			Assert(pagefrz->freeze_required);
 			Assert(!TransactionIdIsValid(newxmax));
 		}
 	}
@@ -6624,7 +6659,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
 		{
 			/*
 			 * If we freeze xmax, make absolutely sure that it's not an XID
@@ -6644,8 +6679,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			freeze_xmax = false;
-			if (TransactionIdPrecedes(xid, *NewRelFrozenXid))
-				*NewRelFrozenXid = xid;
+			/* No need for NewRelfrozenXid handling for non-eligible xmax */
+			Assert(TransactionIdPrecedesOrEquals(pagefrz->NewRelfrozenXid,
+												 cutoffs->OldestXmin));
 		}
 	}
 	else if (!TransactionIdIsValid(xid))
@@ -6716,16 +6752,36 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
 			frz->t_infomask |= HEAP_XMIN_COMMITTED;
 			frzplan_set = true;
+			pagefrz->freeze_required = true;
 		}
 	}
 
 	/*
 	 * Determine if this tuple is already totally frozen, or will become
-	 * totally frozen
+	 * totally frozen (provided caller executes freeze plan for the page)
 	 */
 	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
 
+	/*
+	 * Force vacuumlazy.c to freeze page when avoiding it would violate the
+	 * rule that XIDs < FreezeLimit (and MXIDs < MultiXactCutoff) must never
+	 * remain.
+	 *
+	 * We have to do this even when we have no freeze plan for caller's tuple,
+	 * since "no freeze" tracking is still required (unless we already know
+	 * that freezing the page will go ahead, in which case we can skip it and
+	 * just rely on "freeze" NewRelfrozenXid tracking).
+	 */
+	if (!pagefrz->freeze_required && !(xmin_already_frozen &&
+									   xmax_already_frozen))
+	{
+		pagefrz->freeze_required =
+			heap_tuple_should_freeze(tuple, cutoffs,
+									 &pagefrz->NoFreezeNewRelfrozenXid,
+									 &pagefrz->NoFreezeNewRelminMxid);
+	}
+
 	return frzplan_set;
 }
 
@@ -6769,13 +6825,12 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
  */
 void
 heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-							 TransactionId FreezeLimit,
+							 TransactionId snapshotConflictHorizon,
 							 HeapTupleFreeze *tuples, int ntuples)
 {
 	Page		page = BufferGetPage(buffer);
 
 	Assert(ntuples > 0);
-	Assert(TransactionIdIsNormal(FreezeLimit));
 
 	START_CRIT_SECTION();
 
@@ -6798,19 +6853,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 		int			nplans;
 		xl_heap_freeze_page xlrec;
 		XLogRecPtr	recptr;
-		TransactionId snapshotConflictHorizon;
 
 		/* Prepare deduplicated representation for use in WAL record */
 		nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
 
-		/*
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
-		 */
-		snapshotConflictHorizon = FreezeLimit;
-		TransactionIdRetreat(snapshotConflictHorizon);
-
 		xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
 		xlrec.nplans = nplans;
 
@@ -6851,8 +6897,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	bool		do_freeze;
 	bool		totally_frozen;
 	struct VacuumCutoffs cutoffs;
-	TransactionId NewRelfrozenXid = FreezeLimit;
-	MultiXactId NewRelminMxid = MultiXactCutoff;
+	HeapPageFreeze pagefrz;
 
 	cutoffs.relfrozenxid = relfrozenxid;
 	cutoffs.relminmxid = relminmxid;
@@ -6861,9 +6906,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 
+	pagefrz.freeze_required = true;
+	pagefrz.NewRelfrozenXid = FreezeLimit;
+	pagefrz.NewRelminMxid = MultiXactCutoff;
+	pagefrz.NoFreezeNewRelfrozenXid = FreezeLimit;
+	pagefrz.NoFreezeNewRelminMxid = MultiXactCutoff;
+
 	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
-										  &frz, &totally_frozen,
-										  &NewRelfrozenXid, &NewRelminMxid);
+										  &frz, &totally_frozen, &pagefrz);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7294,8 +7344,8 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * could be processed by pruning away the whole tuple instead of freezing.
  *
  * The *NewRelfrozenXid and *NewRelminMxid input/output arguments work just
- * like the heap_prepare_freeze_tuple arguments that they're based on.  We
- * never freeze here, which makes tracking the oldest extant XID/MXID simple.
+ * like the similar fields from the FreezeCutoffs struct.  We never freeze
+ * here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
 heap_tuple_should_freeze(HeapTupleHeader tuple,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b3668e57b..9753b6b08 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1537,8 +1537,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	HeapPageFreeze pagefrz;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
@@ -1554,8 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.freeze_required = false;
+	pagefrz.NewRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.NewRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1608,27 +1611,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Our notion of whether the page 'hastup' is inherently
+			 * race-prone; it must be treated as unreliable by caller anyway,
+			 * so we might as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1757,7 +1756,7 @@ retry:
 		/* Tuple with storage -- consider need to freeze */
 		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
 									  &frozen[tuples_frozen], &totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+									  &pagefrz))
 		{
 			/* Save prepared freeze plan for later */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1778,23 +1777,62 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (pagefrz.freeze_required || tuples_frozen == 0)
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (might be zero eligible tuples).
+		 */
+		vacrel->NewRelfrozenXid = pagefrz.NewRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.NewRelminMxid;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* Not freezing this page, so use alternative cutoffs */
+		vacrel->NewRelfrozenXid = pagefrz.NoFreezeNewRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.NoFreezeNewRelminMxid;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
-	 * first (arbitrary)
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		TransactionId snapshotConflictHorizon;
+
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
+		/*
+		 * We can use the latest xmin cutoff (which is generally used for 'VM
+		 * set' conflicts) as our cutoff for freeze conflicts when the whole
+		 * page is eligible to become all-frozen in the VM once frozen by us.
+		 * Otherwise use a conservative cutoff (just back up from OldestXmin).
+		 */
+		if (prunestate->all_visible && prunestate->all_frozen)
+			snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+		else
+		{
+			snapshotConflictHorizon = vacrel->cutoffs.OldestXmin;
+			TransactionIdRetreat(snapshotConflictHorizon);
+		}
+
 		/* Execute all freeze plans for page as a single atomic action */
 		heap_freeze_execute_prepared(vacrel->rel, buf,
-									 vacrel->cutoffs.FreezeLimit,
+									 snapshotConflictHorizon,
 									 frozen, tuples_frozen);
 	}
 
@@ -1813,7 +1851,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1821,8 +1859,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1843,9 +1880,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1859,6 +1893,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
-- 
2.38.1

#35Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#34)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2022-11-23 15:06:52 -0800, Peter Geoghegan wrote:

Attached is v8.

The docs don't build:
https://cirrus-ci.com/task/5456939761532928
[20:00:58.203] postgres.sgml:52: element link: validity error : IDREF attribute linkend references an unknown ID "vacuum-for-wraparound"

Greetings,

Andres Freund

In reply to: Andres Freund (#35)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Dec 6, 2022 at 10:42 AM Andres Freund <andres@anarazel.de> wrote:

The docs don't build:
https://cirrus-ci.com/task/5456939761532928
[20:00:58.203] postgres.sgml:52: element link: validity error : IDREF attribute linkend references an unknown ID "vacuum-for-wraparound"

Thanks for pointing this out. FWIW it is a result of Bruce's recent
addition of the transaction processing chapter to the docs.

My intention is to post v9 later in the week, which will fix the doc
build, and a lot more besides that. If you are planning on doing
another round of review, I'd suggest that you hold off until then. v9
will have structural improvements that will likely make it easier to
understand all the steps leading up to removing aggressive mode
completely. It'll be easier to relate each local step/patch to the
bigger picture for VACUUM.

v9 will also address some of the concerns you raised in your review
that weren't covered by v8, especially about the VM snapshotting
infrastructure. But also your concerns about the transition from lazy
strategies to eager strategies. The "catch up freezing" performed by
the first VACUUM operation run against a table that just exceeded the
GUC-controlled table size threshold will have far more limited impact,
because the burden of freezing will be spread out across multiple
VACUUM operations. The big idea behind the patch series is, of course,
to relieve users from having to think about a special type of VACUUM
that does much more freezing than other recent VACUUMs against the
same table, so it is important to avoid accidentally allowing any
behavior that looks like the ghost of aggressive VACUUM.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#36)
5 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Dec 6, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

v9 will also address some of the concerns you raised in your review
that weren't covered by v8, especially about the VM snapshotting
infrastructure. But also your concerns about the transition from lazy
strategies to eager strategies.

Attached is v9. Highlights:

* VM snapshot infrastructure now spills using temp files when required
(only in larger tables).

v9 is the first version that has a credible approach to resource
management, which was something I put off until recently. We only use
a fixed amount of memory now, which should be acceptable from the
viewpoint of VACUUM resource management. The temp files use the
BufFile infrastructure in a relatively straightforward way.
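
To give a rough idea of what "relatively straightforward" means here,
the spill path looks something like the sketch below. To be clear,
this is only an illustration, not code from the patch: the VMSnapshot
struct and the vmsnap_* functions are invented names (initialization
and the fully in-memory case are omitted), while
BufFileCreateTemp/BufFileWrite/BufFileSeek/BufFileRead are the
existing BufFile interface.

#include "postgres.h"
#include "storage/block.h"
#include "storage/buffile.h"

typedef struct VMSnapshot
{
	BlockNumber rel_pages;		/* heap pages covered by the snapshot */
	BufFile    *spill;			/* spilled VM pages (NULL until needed) */
	BlockNumber cached_vmpage;	/* which VM page is cached below */
	char		cached[BLCKSZ]; /* most recently read VM page */
} VMSnapshot;

/* Append one VM page to the spill file (pages are written out in order) */
static void
vmsnap_spill_page(VMSnapshot *vmsnap, char *contents)
{
	if (vmsnap->spill == NULL)
		vmsnap->spill = BufFileCreateTemp(false);
	BufFileWrite(vmsnap->spill, contents, BLCKSZ);
}

/* Read back one spilled VM page, with a trivial one-page cache */
static char *
vmsnap_read_page(VMSnapshot *vmsnap, BlockNumber vmpage)
{
	if (vmsnap->cached_vmpage != vmpage)
	{
		BufFileSeek(vmsnap->spill, 0, (off_t) vmpage * BLCKSZ, SEEK_SET);
		BufFileRead(vmsnap->spill, vmsnap->cached, BLCKSZ);
		vmsnap->cached_vmpage = vmpage;
	}
	return vmsnap->cached;
}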

* VM snapshot infrastructure now uses explicit prefetching.

Our approach is straightforward, and perhaps even obvious: we prefetch
at the point that VACUUM requests the next block in line. There is a
configurable prefetch distance, controlled by
maintenance_io_concurrency. We "stage" a couple of thousand
BlockNumbers in VACUUM's vmsnap by bulk-reading from the vmsnap's
local copy of the visibility map -- these staged blocks are returned
to VACUUM to scan, with interlaced prefetching of later blocks from
the same local BlockNumber array.
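
In other words, something along these lines. Again, just a sketch of
the shape of it: the VMScanState struct and the vmsnap_* names are
invented for illustration, while PrefetchBuffer() and
maintenance_io_concurrency are the existing interfaces being leaned
on.

#include "postgres.h"
#include "storage/bufmgr.h"		/* PrefetchBuffer, maintenance_io_concurrency */
#include "utils/rel.h"

#define VMSNAP_STAGED_BLOCKS	2048

typedef struct VMScanState
{
	BlockNumber staged[VMSNAP_STAGED_BLOCKS];	/* bulk-read from vmsnap */
	int			nstaged;		/* valid entries in staged[] */
	int			next;			/* next staged block to hand to VACUUM */
	int			prefetched;		/* staged blocks already prefetched */
} VMScanState;

/* Refill staged[] by bulk-reading the vmsnap's local copy of the VM */
static void vmsnap_stage_blocks(VMScanState *state);

/*
 * Hand VACUUM the next heap block to scan, keeping up to
 * maintenance_io_concurrency staged blocks "in flight" ahead of it.
 */
static BlockNumber
vmsnap_next_block(Relation rel, VMScanState *state)
{
	if (state->next >= state->nstaged)
	{
		state->next = state->prefetched = 0;
		vmsnap_stage_blocks(state);
		if (state->nstaged == 0)
			return InvalidBlockNumber;	/* no more blocks to scan */
	}

	while (state->prefetched < state->nstaged &&
		   state->prefetched < state->next + maintenance_io_concurrency)
		PrefetchBuffer(rel, MAIN_FORKNUM, state->staged[state->prefetched++]);

	return state->staged[state->next++];
}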

The addition of prefetching ought to be enough to avoid regressions
that might otherwise result from the removal of SKIP_PAGES_THRESHOLD
from vacuumlazy.c (see commit bf136cf6 from around the time the
visibility map first went in for the full context). While I definitely
need to do more performance validation work around prefetching
(especially on high latency network-attached storage), I imagine that
it won't be too hard to get into shape for commit. It's certainly not
committable yet, but it's vastly better than v8.

The visibility map snapshot interface (presented by visibilitymap.h)
also changed in v9, mostly to support prefetching. We now have an
iterator style interface (so vacuumlazy.c cannot request random
access). This iterator interface is implemented by visibilitymap.c
using logic similar to the current lazy_scan_skip() logic from
vacuumlazy.c (which is gone).

All told, visibilitymap.c knows quite a bit more than it used to about
high level requirements from vacuumlazy.c. For example it has explicit
awareness of VM skipping strategies.
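
Paraphrasing, the visibilitymap.h side of it now looks roughly like
this. These declarations are approximations for the purposes of
discussion: only visibilitymap_snap_strategy and the VMSNAP_SKIP_*
strategy names are taken directly from the patch, while the other
names and signatures may well differ in the real thing.

/* How aggressively should VACUUM's vmsnap skip pages? */
typedef enum vmstrategy
{
	VMSNAP_SKIP_NONE,			/* scan every page (DISABLE_PAGE_SKIPPING) */
	VMSNAP_SKIP_ALL_FROZEN,		/* only skip all-frozen pages (eager) */
	VMSNAP_SKIP_ALL_VISIBLE		/* skip all-visible pages too (lazy) */
} vmstrategy;

/* Acquire an immutable snapshot of rel's visibility map */
extern vmsnapshot *visibilitymap_snap_acquire(Relation rel,
											  BlockNumber rel_pages);
/* Tell the snapshot which skipping strategy VACUUM settled on */
extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap,
										vmstrategy strat);
/* Iterator: next block to scan, or InvalidBlockNumber when done */
extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap);
extern void visibilitymap_snap_release(vmsnapshot *vmsnap);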

* Page-level freezing commit now freezes a page whenever VACUUM
detects that pruning ran and generated an FPI.

Following a suggestion by Andres, page-level freezing is now always
triggered when pruning needs an FPI. Note that this optimization gets
applied regardless of freezing strategy (unless you turn off
full_page_writes, I suppose).

This optimization is added by the second patch
(v9-0002-Add-page-level-freezing-to-VACUUM.patch).
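
The detection can be kept very simple; it amounts to checking the WAL
usage instrumentation counters on either side of pruning, roughly like
this (a sketch of the idea rather than the exact code from the patch;
pgWalUsage/wal_fpi come from the existing instrumentation in
executor/instrument.h, while the surrounding variable names are just
for illustration):

	int64		fpi_before = pgWalUsage.wal_fpi;

	/* ... heap_page_prune() runs against this page here ... */

	/*
	 * If pruning just emitted a full-page image for this page, freezing it
	 * in the same VACUUM is comparatively cheap: the page was dirtied and
	 * WAL-logged anyway, and the freeze record can piggy-back on that work.
	 */
	if (pgWalUsage.wal_fpi > fpi_before)
		pagefrz.freeze_required = true;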

* Fixed the doc build.

* Much improved criteria for deciding on freezing and vmsnap skipping
strategies in vacuumlazy.c lazy_scan_strategy function -- improved
"cost model".

VACUUM should now give users a far smoother "transition" from lazy
processing to eager processing. A table that starts out small (smaller
than vacuum_freeze_strategy_threshold), but gradually grows, and
eventually becomes fairly large (perhaps to a multiple of
vacuum_freeze_strategy_threshold in size) will now experience a far
more gradual transition, with catch-up freezing spread out across multiple
VACUUM operations. We avoid big jumps in the overhead of freezing,
where one particular VACUUM operation does all required "catch-up
freezing" in one go.

My approach is to "stagger" the timeline for switching freezing
strategy and vmsnap skipping strategy. We now change over from lazy to
eager freezing strategy when the table size threshold (controlled by
vacuum_freeze_strategy_threshold) is first crossed, just like in v8.
But unlike v8, v9 will switch over to eager skipping in some later
VACUUM operation (barring edge cases). This is implemented in a fairly
simple way: we now apply a "separate" threshold that is based on
vacuum_freeze_strategy_threshold: a threshold that's *twice* the
current value of the vacuum_freeze_strategy_threshold GUC/reloption
threshold.
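
Expressed as code, the decision ends up looking roughly like the
following. This is a simplification: the real lazy_scan_strategy also
weighs table age and handles DISABLE_PAGE_SKIPPING, and "threshold"
here just stands in for the vacuum_freeze_strategy_threshold
GUC/reloption value (in heap blocks).

	/* Freezing strategy: eager page-level freezing for larger tables */
	vacrel->eager_freeze_strategy = (rel_pages >= threshold);

	/* Skipping strategy: only stop skipping all-visible pages later on */
	if (rel_pages >= threshold * 2)
		vacrel->vmstrat = VMSNAP_SKIP_ALL_FROZEN;	/* scan all-visible pages */
	else
		vacrel->vmstrat = VMSNAP_SKIP_ALL_VISIBLE;	/* skip all-visible pages */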

My approach of "staggering" multiple distinct behaviors to avoid
having them all kick in during the same VACUUM operation isn't new to
v9. The behavior around waiting for cleanup locks (added by
v9-0005-Finish-removing-aggressive-mode-VACUUM.patch) is another
example of the same general idea.

In general I think that VACUUM shouldn't switch to more aggressive
behaviors all at the same time, in the same VACUUM. Each distinct
aggressive behavior has totally different properties, so there is no
reason why VACUUM should start to apply each and every one of them at
the same time. Some "aggressive" behaviors have the potential to make
things quite a lot worse, in fact. The cure must not be worse than the
disease.

* Related to the previous item (about the "cost model" that chooses a
strategy), we now have a much more sophisticated approach when it
comes to when and how we decide to advance relfrozenxid in smaller
tables (tables whose size is < vacuum_freeze_strategy_threshold). This
improves things for tables that start out small and stay small: tables
where we're unlikely to want to advance relfrozenxid in every single
VACUUM (better to be lazy with such a table), but where we still want
to be clever about advancing relfrozenxid "opportunistically".

The way that VACUUM weighs both table age and the added cost of
relfrozenxid advancement is more sophisticated in v9. The goal is to
make it more likely that VACUUM will stumble upon opportunities to
advance relfrozenxid when it happens to be cheap, which can happen for
many reasons, all of which have a great deal to do with workload
characteristics.

As in v8, v9 makes VACUUM willing to advance relfrozenxid without
concern for table age, whenever it notices that the cost of doing so
happens to be very cheap (in practice this means that the number of
"extra" heap pages scanned is < 5% of rel_pages). However, in v9 we
now go further by scaling this threshold through interpolation, based
on table age.

We have the same "5% of rel_pages" threshold when table age is less
than half way towards the point that autovacuum.c will launch an
antiwraparound autovacuum -- when we still have only minimal concern
about table age. But the rel_pages-wise threshold starts to grow once
table age gets past that "half way towards antiwrap AV" point. We
interpolate the rel_pages-wise threshold using a new approach in v9.

At first the rel_pages-wise threshold grows quite slowly (relative to
the rate at which table age approaches the point of forcing an
antiwraparound AV). For example, when we're 60% of the way towards
needing an antiwraparound AV, and VACUUM runs, we'll eagerly advance
relfrozenxid provided that the "extra" cost of doing so happens to be
less than ~22% of rel_pages. It "accelerates" from there (assuming
fixed rel_pages).
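
To make the shape of that concrete, here is one curve that is
consistent with the numbers quoted above (5% of rel_pages below the
halfway point, roughly 22% at 60% of the way). To be clear, this is
purely illustrative -- it is not necessarily the exact interpolation
used by v9 -- but it shows the kind of thing that tableagefrac feeds
into:

/*
 * Illustrative only: the "extra pages scanned" threshold (as a fraction
 * of rel_pages) below which VACUUM opportunistically advances
 * relfrozenxid, as a function of tableagefrac (0.0 means table age is no
 * concern at all, 1.0 means antiwraparound autovacuum triggers now).
 */
static double
relfrozenxid_advance_threshold(double tableagefrac)
{
	if (tableagefrac < 0.5)
		return 0.05;			/* minimal concern: only when very cheap */
	if (tableagefrac >= 1.0)
		return 1.0;				/* advancing relfrozenxid is now mandatory */

	/* ramp from 5% at the halfway point towards 90% as age becomes pressing */
	return 0.05 + ((tableagefrac - 0.5) / 0.5) * (0.90 - 0.05);
}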

VACUUM will now tend to take advantage of individual table
characteristics that make it relatively cheap to advance relfrozenxid.
Bear in mind that these characteristics are not fixed for the same
table. The "extra" cost of advancing relfrozenxid during this VACUUM
(whether measured in absolute terms, or as a proportion of the net
amount of work just to do simple vacuuming) just isn't predictable
with real workloads. Especially not with the FPI opportunistic
freezing stuff from the second patch (the "freeze when heap pruning
gets an FPI" thing) in place. We should expect significant "natural
variation" among tables, and within the same table over time -- this
is a good thing.

For example, imagine a table that experiences a bunch of random
deletes, which leads to a VACUUM that must visit most heap pages (say
85% of rel_pages). Let's suppose that those deletes are a once-off
thing. The added cost of advancing relfrozenxid in the next VACUUM
still isn't trivial (assuming the remaining 15% of pages are
all-visible). But it is probably still worth doing if table age is at
least starting to become a concern. It might actually be a lot cheaper
to advance relfrozenxid early.

* Numerous structural improvements, lots of code polishing.

The patches have been reordered in a way that should make review a bit
easier. Now the commit messages are written in a way that clearly
anticipates the removal of aggressive mode VACUUM, which the last
patch actually finishes. Most of the earlier commits are presented as
preparation for completely removing aggressive mode VACUUM.

The first patch (which refactors how VACUUM passes around cutoffs like
FreezeLimit and OldestXmin by using a dedicated struct) is much
improved. heap_prepare_freeze_tuple() now takes a more explicit
approach to tracking what needs to happen for the tuple's freeze plan.
This allowed me to pepper it with defensive assertions. It's also a
lot clearer IMV. For example, we now have separate freeze_xmax and
replace_xmax tracker variables.

The second patch in the series (the page-level freezing patch) is also
much improved. I'm much happier with the way that
heap_prepare_freeze_tuple() now explicitly delegates control of
page-level freezing to FreezeMultiXactId() in v9, for example.

Note that I squashed the patch that taught VACUUM to size dead_items
using scanned_pages into the main visibility map patch
(v9-0004-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch). That's why
there are only 5 patches (down from 6) in v9.

--
Peter Geoghegan

Attachments:

v9-0005-Finish-removing-aggressive-mode-VACUUM.patch (application/x-patch)
From 8a093dc52e355c4fc6de86adacbefa679cb6a4da Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v9 5/5] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient (especially after 9.6's commit fd31cd26),
its naive approach had one notable advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

VACUUM now places particular emphasis on performance stability.  The
burden of freezing physical heap pages is now more or less spread out as
much as possible.  Each table's age will now tend to follow what VACUUM
does, rather than having VACUUM's behavior driven by table age.  The
table age tail no longer wags the VACUUM dog.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of waiting for a
cleanup lock in the event of not being able to get one right away (to
make sure that older XIDs get frozen during the ongoing VACUUM).  All
that changes is the cutoffs -- the timeline.  We use new, dedicated
cutoffs for this, rather than applying FreezeLimit/MultiXactCutoff.

FreezeLimit is now only used when deciding what we want to do about
freezing on a page that has already been cleanup locked.  The new
cutoffs (MinXid and MinMulti) are typically far earlier than FreezeLimit
or MultiXactCutoff.  In fact, they'll often use an XID that's even older
than the target rel's existing relfrozenxid, which means that VACUUM
cannot possibly end up waiting for a cleanup lock.  We don't need an
explicit aggressive mode to decide VACUUM's policy on waiting.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now understands the importance of never falling too far
behind on the work of freezing physical heap pages at the level of the
whole table.  Prior to Postgres 16, VACUUM tended to do all freezing and
relfrozenxid advancement in aggressive mode, especially in large tables.
Aggressive VACUUM had to advance the table's relfrozenxid by relatively
many XIDs (up to FreezeLimit, not just up to MinXid) because table age
was more or less treated as a proxy for freeze debt.  It would therefore
have been risky for aggressive VACUUM to "squander" any opportunity at
advancing relfrozenxid (by accepting a much older final value, say).
But since we now freeze much more eagerly, opportunities to advance
relfrozenxid (at least by some small amount) are much more plentiful.

VACUUM now tends to get on with freezing every other eligible page in
the table instead of waiting, thus making it less likely that we'll fall
behind on freezing at the level of the whole table (or whole database),
thus making it even safer to punt even more aggressively when heap pages
cannot be cleanup locked.  Page-level freezing helps VACUUM sustain this
virtuous cycle; only one VACUUM operation has to get lucky _once_ in
order for us to freeze _all_ of the tuples from a troublesome heap page.

There are still antiwraparound autovacuums, but they're now little more
than another way that autovacuum.c can launch an autovacuum worker to
run VACUUM.

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make all this safe.  See also, related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/access/heapam.h                   |   2 +-
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/heapam.c              |  18 +-
 src/backend/access/heap/vacuumlazy.c          |  90 +--
 src/backend/commands/vacuum.c                 |  42 +-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/config.sgml                      |  68 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |   2 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 doc/src/sgml/xact.sgml                        |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  30 +-
 16 files changed, 495 insertions(+), 529 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0782fed14..5a959d711 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -254,7 +254,7 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
 extern bool heap_tuple_should_freeze(HeapTupleHeader tuple,
-									 const struct VacuumCutoffs *cutoffs,
+									 const struct VacuumCutoffs *cutoffs, bool MinCutoffs,
 									 TransactionId *NoFreezePageRelfrozenXid,
 									 MultiXactId *NoFreezePageRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4dcef3e67..78d6507c5 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -276,6 +276,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) for triggering eager/all-visible freezing strategy
@@ -335,7 +342,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs,
 							   double *tableagefrac);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0baebe432..b2e86e21b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6305,7 +6305,7 @@ FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
 	 * We only reach this far when replacing xmax is absolutely mandatory.
 	 * heap_tuple_should_freeze will indicate that the tuple should be frozen.
 	 */
-	Assert(heap_tuple_should_freeze(tuple, cutoffs, &axid, &amxid));
+	Assert(heap_tuple_should_freeze(tuple, cutoffs, false, &axid, &amxid));
 
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
@@ -6765,7 +6765,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									   xmax_already_frozen))
 	{
 		pagefrz->freeze_required =
-			heap_tuple_should_freeze(tuple, cutoffs,
+			heap_tuple_should_freeze(tuple, cutoffs, false,
 									 &pagefrz->NoFreezePageRelfrozenXid,
 									 &pagefrz->NoFreezePageRelminMxid);
 	}
@@ -7331,6 +7331,11 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * force freezing of the page containing tuple.  This happens whenever the
  * tuple contains XID/MXID fields with values < FreezeLimit/MultiXactCutoff.
  *
+ * Callers that pass 'MinCutoffs=true' have us apply earlier cutoffs instead:
+ * the MinXid and MinMulti cutoffs.  VACUUM never sets relfrozenxid/relminmxid
+ * to values < MinXid/MinMulti, even when following that rule forces VACUUM to
+ * wait for a heap page cleanup lock indefinitely.
+ *
  * The *NoFreezePageRelfrozenXid and *NoFreezePageRelminMxid input/output
  * arguments help VACUUM track the oldest extant XID/MXID remaining in rel.
  * Our working assumption is that caller won't decide to freeze this tuple.
@@ -7339,7 +7344,7 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  */
 bool
 heap_tuple_should_freeze(HeapTupleHeader tuple,
-						 const struct VacuumCutoffs *cutoffs,
+						 const struct VacuumCutoffs *cutoffs, bool MinCutoffs,
 						 TransactionId *NoFreezePageRelfrozenXid,
 						 MultiXactId *NoFreezePageRelminMxid)
 {
@@ -7349,6 +7354,13 @@ heap_tuple_should_freeze(HeapTupleHeader tuple,
 	MultiXactId multi;
 	bool		freeze = false;
 
+	if (MinCutoffs)
+	{
+		/* Use earlier cleanup lock cutoffs */
+		MustFreezeLimit = cutoffs->MinXid;
+		MustFreezeMultiLimit = cutoffs->MinMulti;
+	}
+
 	/* First deal with xmin */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (TransactionIdIsNormal(xid))
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 60c1e2cec..85c399f95 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -156,8 +156,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -460,8 +458,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs,
-											&tableagefrac);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs, &tableagefrac);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -539,17 +536,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SKIP_ALL_VISIBLE)
 	{
@@ -557,7 +551,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid when lazy_scan_strategy call
 		 * decided to skip all-visible pages
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -633,23 +626,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
-			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
 			else
 			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
+				Assert(IsAutoVacuumWorkerProcess());
+				if (params->is_wraparound)
+					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -989,7 +970,6 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
 			 */
-			Assert(vacrel->aggressive);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
 		}
@@ -1419,14 +1399,11 @@ lazy_scan_strategy(LVRelState *vacrel, const VacuumParams *params,
 
 	/*
 	 * Override choice of skipping strategy (force vmsnap to scan every page
-	 * in the range of rel_pages) in DISABLE_PAGE_SKIPPING case.  Also
-	 * defensively force all-frozen in aggressive VACUUMs.
+	 * in the range of rel_pages) in DISABLE_PAGE_SKIPPING case
 	 */
 	Assert(vacrel->vmstrat != VMSNAP_SKIP_NONE);
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		vacrel->vmstrat = VMSNAP_SKIP_NONE;
-	else if (vacrel->aggressive)
-		vacrel->vmstrat = VMSNAP_SKIP_ALL_FROZEN;
 
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
@@ -2002,11 +1979,13 @@ retry:
  * operation left LP_DEAD items behind.  We'll at least collect any such items
  * in the dead_items array for removal from indexes.
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We may return false to indicate that a full cleanup lock is required for
+ * processing by lazy_scan_prune.  This is only necessary when VACUUM needs to
+ * freeze some tuple XIDs from one or more tuples on the page.  This should
+ * only happen when multiple successive VACUUM operations all fail to get a
+ * cleanup lock on the same heap page (assuming default or at least typical
+ * freeze settings).  Waiting for a cleanup lock should be avoided unless it's
+ * the only way to advance relfrozenxid by enough to satisfy autovacuum.c.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
@@ -2073,36 +2052,25 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
+		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs, true,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
 		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
 			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
+			 * Tuple with XID < MinXid (or MXID < MinMulti)
 			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
+			 * VACUUM must always be able to advance rel's relfrozenxid and
+			 * relminmxid to minimum values.  The ongoing VACUUM won't be able
+			 * to do that unless it can freeze an XID (or MXID) from this
+			 * tuple now.
+			 *
+			 * The only safe option is to have caller perform processing of
+			 * this page using lazy_scan_prune.  Caller might have to wait a
+			 * long time for a cleanup lock, which can be very disruptive, but
+			 * it can't be helped.
 			 */
+			vacrel->offnum = InvalidOffsetNumber;
+			return false;
 		}
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e2f586687..bc1337bfe 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -930,13 +930,8 @@ get_all_vacuum_rels(int options)
  * to advance relfrozenxid before the point that it is strictly necessary.
  * VACUUM can (and often does) opt to advance relfrozenxid proactively.  It is
  * especially likely with tables where the _added_ costs happen to be low.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs, double *tableagefrac)
 {
@@ -1106,6 +1101,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/XMID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.5);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.5);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1123,8 +1151,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		*tableagefrac = 1.0;
-
-	return (*tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f9788c30a..0c80896cc 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table, however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 02186ce36..ae2f3fdea 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8199,7 +8199,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8388,7 +8388,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9104,6 +9104,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9137,21 +9152,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
-      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
-      <indexterm>
-       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Specifies the cutoff size (in pages) that <command>VACUUM</command>
-        should use to decide whether to its eager freezing strategy.
-        The default is 4 gigabytes (<literal>4GB</literal>).
-       </para>
-      </listitem>
-     </varlistentry>
-
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9171,7 +9171,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9217,19 +9217,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
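As an aside, a quick way to spot slots that might be holding back VACUUM (and,
eventually, relfrozenxid/datfrozenxid advancement) is something along these
lines -- just a monitoring sketch using the standard pg_replication_slots view:

SELECT slot_name, active,
       age(xmin) AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
ORDER BY age(xmin) DESC NULLS LAST;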
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 554b3a75d..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing Tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly comprised of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID Address Space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32-bits, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
-    transactions older than this cutoff XID are guaranteed to have been frozen.
-    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
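To make the age() output a bit more actionable, it can be compared against
autovacuum_freeze_max_age directly -- a hedged monitoring sketch (the LIMIT
is arbitrary):

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::bigint
         - age(c.relfrozenxid) AS xids_until_antiwraparound
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;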
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId Address Space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
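The same style of headroom calculation works for MultiXactIds (again just a
sketch; it assumes the default autovacuum_multixact_freeze_max_age of 400
million):

SELECT c.oid::regclass AS table_name,
       current_setting('autovacuum_multixact_freeze_max_age')::bigint
         - mxid_age(c.relminmxid) AS mxids_until_antiwraparound
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
ORDER BY 2
LIMIT 10;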
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and Eager Freezing Strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
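For example (the table name is just a placeholder), the per-table details can
be seen directly with:

VACUUM (VERBOSE) pgbench_accounts;

The output reports how far relfrozenxid and relminmxid advanced, along with
the number of pages newly frozen, which gives a rough sense of which freezing
strategy was applied.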
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well as
+     any pages that <command>VACUUM</command> considers all-visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing pages eagerly is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     <command>VACUUM</command> must be run by autovacuum specifically
+     for the purpose of advancing <structfield>relfrozenxid</structfield>,
+     when no <command>VACUUM</command> has been triggered for some
+     time.  In practice most individual tables will consistently have
+     somewhat recent values through routine vacuuming to clean up old
+     row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering Thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
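As a worked example (using the default autovacuum_vacuum_threshold of 50 and
autovacuum_vacuum_scale_factor of 0.2), a table with 1,000,000 reltuples gets
a vacuum threshold of 50 + 0.2 * 1,000,000 = 200,050 obsoleted tuples.  The
same arithmetic can be run against pg_class directly; the table name here is
only illustrative:

SELECT 50 + 0.2 * reltuples AS vacuum_threshold
FROM pg_class
WHERE oid = 'pgbench_accounts'::regclass;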
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
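A sketch of how inserts-since-last-vacuum can be compared against this
threshold (using the defaults of 1000 and 0.2, and the n_ins_since_vacuum
counter from the cumulative statistics views):

SELECT s.relname,
       s.n_ins_since_vacuum,
       1000 + 0.2 * c.reltuples AS vacuum_insert_threshold
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_ins_since_vacuum DESC;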
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound Autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when earlier
+     <command>VACUUM</command> operations on a smaller table lazily
+     opted not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any antiwraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when a
+     partitioned table is first populated, and again whenever the distribution
+     of data in its partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
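Such workers are easy to spot from SQL -- a monitoring sketch:

SELECT pid, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
  AND query LIKE '%(to prevent wraparound)';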
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9cae899d5..f1d2a8cc2 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -210,7 +210,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/doc/src/sgml/xact.sgml b/doc/src/sgml/xact.sgml
index b467660ee..c4146539f 100644
--- a/doc/src/sgml/xact.sgml
+++ b/doc/src/sgml/xact.sgml
@@ -49,8 +49,8 @@
 
   <para>
    The internal transaction ID type <type>xid</type> is 32 bits wide
-   and <link linkend="vacuum-for-wraparound">wraps around</link> every
-   4 billion transactions. A 32-bit epoch is incremented during each
+   and <link linkend="vacuum-xid-space">wraps around</link> every
+   2 billion transactions. A 32-bit epoch is incremented during each
    wraparound. There is also a 64-bit type <type>xid8</type> which
    includes this epoch and therefore does not wrap around during the
    life of an installation;  it can be converted to xid by casting.
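To illustrate the relationship between the two types (a quick sketch using the
standard xid8 function):

SELECT pg_current_xact_id() AS full_xid8,
       pg_current_xact_id()::xid AS xid_without_epoch;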
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..927410258 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,18 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), without ever being
+# prepared to wait for a cleanup lock (we'll never wait on a cleanup
+# lock because the separate MinXid cutoff for waiting will still be
+# well before FreezeLimit, given our default autovacuum_freeze_max_age).
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +78,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +94,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +105,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +118,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +130,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +138,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.38.1

Attachment: v9-0001-Refactor-how-VACUUM-passes-around-its-XID-cutoffs.patch (application/x-patch)
From 52f1f787563b858ca00665346b49cf8579b77534 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 19 Nov 2022 16:37:53 -0800
Subject: [PATCH v9 1/5] Refactor how VACUUM passes around its XID cutoffs.

Use a dedicated struct for the XID/MXID cutoffs used by VACUUM, such as
FreezeLimit and OldestXmin.  This state is initialized in vacuum.c, and
then passed around (via const pointers) by code from vacuumlazy.c to
external freezing-related routines like heap_prepare_freeze_tuple.

Also simplify some of the logic for dealing with frozen xmin in
heap_prepare_freeze_tuple: add dedicated "xmin_already_frozen" state to
clearly distinguish xmin XIDs that we're going to freeze from those that
were already frozen.  This makes its xmin handling code
symmetrical with its xmax handling code.  This is preparation for an
upcoming commit that adds page level freezing.

Also refactor the control flow within FreezeMultiXactId(), while adding
stricter sanity checks.  We now test OldestXmin directly (instead of
using FreezeLimit as an inexact proxy for OldestXmin).  This is further
preparation for the page level freezing work, which will have the
function's caller cede control of page level freezing when needed
(whenever heap_prepare_freeze_tuple encounters a tuple/page that
contains one or more MultiXactIds).

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznS9TxXmz2_=SY+SyJyDFbiOftKofM9=aDo68BbXNBUMA@mail.gmail.com
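
To make the shape of the new interface easier to see at a glance, here
is a condensed sketch of the calling convention after this patch,
paraphrased from the vacuumlazy.c hunks below (heap_vacuum_rel on the
one hand, lazy_scan_prune on the other).  Declarations and the
surrounding loop structure are elided, so this isn't meant to compile
on its own -- it only illustrates how the struct replaces the loose
XID/MXID arguments that used to be passed around:

    /* One call now establishes every XID/MXID cutoff for this VACUUM */
    vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);

    /* Oldest-extant XID/MXID tracking starts from OldestXmin/OldestMxact */
    vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
    vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;

    /*
     * Per-tuple freeze preparation takes the whole struct via a const
     * pointer, rather than receiving relfrozenxid, relminmxid, FreezeLimit,
     * and MultiXactCutoff as four separate arguments
     */
    if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
                                  &frozen[tuples_frozen], &totally_frozen,
                                  &NewRelfrozenXid, &NewRelminMxid))
    {
        /* Save prepared freeze plan for later */
        frozen[tuples_frozen++].offset = offnum;
    }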
---
 src/include/access/heapam.h            |  10 +-
 src/include/access/tableam.h           |   2 +-
 src/include/commands/vacuum.h          |  49 ++-
 src/backend/access/heap/heapam.c       | 466 ++++++++++++-------------
 src/backend/access/heap/vacuumlazy.c   | 197 +++++------
 src/backend/access/transam/multixact.c |   9 +-
 src/backend/commands/cluster.c         |  25 +-
 src/backend/commands/vacuum.c          | 120 +++----
 8 files changed, 424 insertions(+), 454 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 810baaf9d..53eb01176 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -38,6 +38,7 @@
 
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
+struct VacuumCutoffs;
 
 #define MaxLockTupleMode	LockTupleExclusive
 
@@ -178,8 +179,7 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
-									  TransactionId relfrozenxid, TransactionId relminmxid,
-									  TransactionId cutoff_xid, TransactionId cutoff_multi,
+									  const struct VacuumCutoffs *cutoffs,
 									  HeapTupleFreeze *frz, bool *totally_frozen,
 									  TransactionId *relfrozenxid_out,
 									  MultiXactId *relminmxid_out);
@@ -188,9 +188,9 @@ extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
-							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									const struct VacuumCutoffs *cutoffs,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4d1ef405c..1320ee6db 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1634,7 +1634,7 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
  *   in that index's order; if false and OldIndex is InvalidOid, no sorting is
  *   performed
  * - OldIndex - see use_sort
- * - OldestXmin - computed by vacuum_set_xid_limits(), even when
+ * - OldestXmin - computed by vacuum_get_cutoffs(), even when
  *   not needed for the relation's AM
  * - *xid_cutoff - ditto
  * - *multi_cutoff - ditto
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8..896d1b1ac 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,45 @@ typedef struct VacuumParams
 	int			nworkers;
 } VacuumParams;
 
+/*
+ * VacuumCutoffs is immutable state that describes the cutoffs used by VACUUM.
+ * Established at the beginning of each VACUUM operation.
+ */
+struct VacuumCutoffs
+{
+	/*
+	 * Existing pg_class fields at start of VACUUM (used for sanity checks)
+	 */
+	TransactionId relfrozenxid;
+	MultiXactId relminmxid;
+
+	/*
+	 * OldestXmin is the Xid below which tuples deleted by any xact (that
+	 * committed) should be considered DEAD, not just RECENTLY_DEAD.
+	 *
+	 * OldestMxact is the Mxid below which MultiXacts are definitely not seen
+	 * as visible by any running transaction.
+	 *
+	 * OldestXmin and OldestMxact are also the most recent values that can
+	 * ever be passed to vac_update_relstats() as frozenxid and minmulti
+	 * arguments at the end of VACUUM.  These same values should be passed
+	 * when it turns out that VACUUM will leave no unfrozen XIDs/MXIDs behind
+	 * in the table.
+	 */
+	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
+
+	/*
+	 * FreezeLimit is the Xid below which all Xids are definitely replaced by
+	 * FrozenTransactionId in heap pages that VACUUM can cleanup lock.
+	 *
+	 * MultiXactCutoff is the value below which all MultiXactIds are
+	 * definitely removed from Xmax in heap pages VACUUM can cleanup lock.
+	 */
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
+};
+
 /*
  * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
  */
@@ -286,13 +325,9 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-								  TransactionId *OldestXmin,
-								  MultiXactId *OldestMxact,
-								  TransactionId *FreezeLimit,
-								  MultiXactId *MultiXactCutoff);
-extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
-									  MultiXactId relminmxid);
+extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+							   struct VacuumCutoffs *cutoffs);
+extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
 extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 747db5037..6c0634b38 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -52,6 +52,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -6125,12 +6126,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
-				  TransactionId relfrozenxid, TransactionId relminmxid,
-				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
-				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
+				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
+				  TransactionId *mxid_oldest_xid_out)
 {
-	TransactionId xid = InvalidTransactionId;
-	int			i;
+	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
 	int			nmembers;
 	bool		need_replace;
@@ -6153,12 +6152,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		*flags |= FRM_INVALIDATE_XMAX;
 		return InvalidTransactionId;
 	}
-	else if (MultiXactIdPrecedes(multi, relminmxid))
+	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
-								 multi, relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoff_multi))
+								 multi, cutoffs->relminmxid)));
+	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6171,39 +6170,39 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoff_multi)));
+									 multi, cutoffs->MultiXactCutoff)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
 			*flags |= FRM_INVALIDATE_XMAX;
-			xid = InvalidTransactionId;
+			newxmax = InvalidTransactionId;
 		}
 		else
 		{
-			/* replace multi by update xid */
-			xid = MultiXactIdGetUpdateXid(multi, t_infomask);
+			/* replace multi with single XID for its updater */
+			newxmax = MultiXactIdGetUpdateXid(multi, t_infomask);
 
 			/* wasn't only a lock, xid needs to be valid */
-			Assert(TransactionIdIsValid(xid));
+			Assert(TransactionIdIsValid(newxmax));
 
-			if (TransactionIdPrecedes(xid, relfrozenxid))
+			if (TransactionIdPrecedes(newxmax, cutoffs->relfrozenxid))
 				ereport(ERROR,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
-										 xid, relfrozenxid)));
+										 newxmax, cutoffs->relfrozenxid)));
 
 			/*
-			 * If the xid is older than the cutoff, it has to have aborted,
-			 * otherwise the tuple would have gotten pruned away.
+			 * If the new xmax xid is older than OldestXmin, it has to have
+			 * aborted, otherwise the tuple would have been pruned away
 			 */
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(newxmax, cutoffs->OldestXmin))
 			{
-				if (TransactionIdDidCommit(xid))
+				if (TransactionIdDidCommit(newxmax))
 					ereport(ERROR,
 							(errcode(ERRCODE_DATA_CORRUPTED),
-							 errmsg_internal("cannot freeze committed update xid %u", xid)));
+							 errmsg_internal("cannot freeze committed update xid %u", newxmax)));
 				*flags |= FRM_INVALIDATE_XMAX;
-				xid = InvalidTransactionId;
+				newxmax = InvalidTransactionId;
 			}
 			else
 			{
@@ -6215,17 +6214,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
 		 * when no Xids will remain
 		 */
-		return xid;
+		return newxmax;
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below FreezeLimit xid cutoff, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
 	 */
-
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6236,12 +6232,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
 	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
-	for (i = 0; i < nmembers; i++)
+	for (int i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		TransactionId xid = members[i].xid;
+
+		Assert(!TransactionIdPrecedes(xid, cutoffs->relfrozenxid));
+
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
 			need_replace = true;
 			break;
@@ -6251,7 +6250,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
+	 * In the simplest case, there is no member older than FreezeLimit; we can
 	 * keep the existing MultiXactId as-is, avoiding a more expensive second
 	 * pass over the multi
 	 */
@@ -6279,110 +6278,97 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	update_committed = false;
 	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
 
-	for (i = 0; i < nmembers; i++)
+	/*
+	 * Determine whether to keep each member xid, or to ignore it instead
+	 */
+	for (int i = 0; i < nmembers; i++)
 	{
-		/*
-		 * Determine whether to keep this member or ignore it.
-		 */
-		if (ISUPDATE_from_mxstatus(members[i].status))
+		TransactionId xid = members[i].xid;
+		MultiXactStatus mstatus = members[i].status;
+
+		Assert(!TransactionIdPrecedes(xid, cutoffs->relfrozenxid));
+
+		if (!ISUPDATE_from_mxstatus(mstatus))
 		{
-			TransactionId txid = members[i].xid;
-
-			Assert(TransactionIdIsValid(txid));
-			if (TransactionIdPrecedes(txid, relfrozenxid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("found update xid %u from before relfrozenxid %u",
-										 txid, relfrozenxid)));
-
 			/*
-			 * It's an update; should we keep it?  If the transaction is known
-			 * aborted or crashed then it's okay to ignore it, otherwise not.
-			 * Note that an updater older than cutoff_xid cannot possibly be
-			 * committed, because HeapTupleSatisfiesVacuum would have returned
-			 * HEAPTUPLE_DEAD and we would not be trying to freeze the tuple.
-			 *
-			 * As with all tuple visibility routines, it's critical to test
-			 * TransactionIdIsInProgress before TransactionIdDidCommit,
-			 * because of race conditions explained in detail in
-			 * heapam_visibility.c.
+			 * Locker XID (not updater XID).  We only keep lockers that are
+			 * still running.
 			 */
-			if (TransactionIdIsCurrentTransactionId(txid) ||
-				TransactionIdIsInProgress(txid))
-			{
-				Assert(!TransactionIdIsValid(update_xid));
-				update_xid = txid;
-			}
-			else if (TransactionIdDidCommit(txid))
-			{
-				/*
-				 * The transaction committed, so we can tell caller to set
-				 * HEAP_XMAX_COMMITTED.  (We can only do this because we know
-				 * the transaction is not running.)
-				 */
-				Assert(!TransactionIdIsValid(update_xid));
-				update_committed = true;
-				update_xid = txid;
-			}
-			else
-			{
-				/*
-				 * Not in progress, not committed -- must be aborted or
-				 * crashed; we can ignore it.
-				 */
-			}
-
-			/*
-			 * Since the tuple wasn't totally removed when vacuum pruned, the
-			 * update Xid cannot possibly be older than the xid cutoff. The
-			 * presence of such a tuple would cause corruption, so be paranoid
-			 * and check.
-			 */
-			if (TransactionIdIsValid(update_xid) &&
-				TransactionIdPrecedes(update_xid, cutoff_xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("found update xid %u from before xid cutoff %u",
-										 update_xid, cutoff_xid)));
-
-			/*
-			 * We determined that this is an Xid corresponding to an update
-			 * that must be retained -- add it to new members list for later.
-			 *
-			 * Also consider pushing back temp_xid_out, which is needed when
-			 * we later conclude that a new multi is required (i.e. when we go
-			 * on to set FRM_RETURN_IS_MULTI for our caller because we also
-			 * need to retain a locker that's still running).
-			 */
-			if (TransactionIdIsValid(update_xid))
+			if (TransactionIdIsCurrentTransactionId(xid) ||
+				TransactionIdIsInProgress(xid))
 			{
 				newmembers[nnewmembers++] = members[i];
-				if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-					temp_xid_out = members[i].xid;
+				has_lockers = true;
+
+				/*
+				 * Cannot possibly be older than VACUUM's OldestXmin, so we
+				 * don't need a NewRelfrozenXid step here
+				 */
+				Assert(TransactionIdPrecedesOrEquals(cutoffs->OldestXmin, xid));
 			}
+
+			continue;
+		}
+
+		/*
+		 * Updater XID (not locker XID).  Should we keep it?
+		 *
+		 * Since the tuple wasn't totally removed when vacuum pruned, the
+		 * update Xid cannot possibly be older than OldestXmin cutoff. The
+		 * presence of such a tuple would cause corruption, so be paranoid and
+		 * check.
+		 */
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("found update xid %u from before removable cutoff %u",
+									 xid, cutoffs->OldestXmin)));
+		if (TransactionIdIsValid(update_xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("multixact %u has two or more updating members",
+									 multi),
+					 errdetail_internal("First updater XID=%u second updater XID=%u.",
+										update_xid, xid)));
+
+		/*
+		 * If the transaction is known aborted or crashed then it's okay to
+		 * ignore it, otherwise not.
+		 *
+		 * As with all tuple visibility routines, it's critical to test
+		 * TransactionIdIsInProgress before TransactionIdDidCommit, because of
+		 * race conditions explained in detail in heapam_visibility.c.
+		 */
+		if (TransactionIdIsCurrentTransactionId(xid) ||
+			TransactionIdIsInProgress(xid))
+			update_xid = xid;
+		else if (TransactionIdDidCommit(xid))
+		{
+			/*
+			 * The transaction committed, so we can tell caller to set
+			 * HEAP_XMAX_COMMITTED.  (We can only do this because we know the
+			 * transaction is not running.)
+			 */
+			update_committed = true;
+			update_xid = xid;
 		}
 		else
 		{
-			/* We only keep lockers if they are still running */
-			if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
-				TransactionIdIsInProgress(members[i].xid))
-			{
-				/*
-				 * Running locker cannot possibly be older than the cutoff.
-				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
-				 * initial value used for top-level relfrozenxid_out tracking
-				 * state.  A running locker cannot be older than VACUUM's
-				 * OldestXmin, either, so we don't need a temp_xid_out step.
-				 */
-				Assert(TransactionIdIsNormal(members[i].xid));
-				Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
-				Assert(!TransactionIdPrecedes(members[i].xid,
-											  *mxid_oldest_xid_out));
-				newmembers[nnewmembers++] = members[i];
-				has_lockers = true;
-			}
+			/*
+			 * Not in progress, not committed -- must be aborted or crashed;
+			 * we can ignore it.
+			 */
+			continue;
 		}
+
+		/*
+		 * We determined that this is an Xid corresponding to an update that
+		 * must be retained -- add it to new members list for later.  Also
+		 * consider pushing back mxid_oldest_xid_out.
+		 */
+		newmembers[nnewmembers++] = members[i];
+		if (TransactionIdPrecedes(xid, temp_xid_out))
+			temp_xid_out = xid;
 	}
 
 	pfree(members);
@@ -6395,7 +6381,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* nothing worth keeping!? Tell caller to remove the whole thing */
 		*flags |= FRM_INVALIDATE_XMAX;
-		xid = InvalidTransactionId;
+		newxmax = InvalidTransactionId;
 		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
@@ -6411,7 +6397,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		*flags |= FRM_RETURN_IS_XID;
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
-		xid = update_xid;
+		newxmax = update_xid;
 		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
 	}
 	else
@@ -6421,26 +6407,29 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * one, to set as new Xmax in the tuple.  The oldest surviving member
 		 * might push back mxid_oldest_xid_out.
 		 */
-		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
 		*mxid_oldest_xid_out = temp_xid_out;
 	}
 
 	pfree(newmembers);
 
-	return xid;
+	return newxmax;
 }
 
 /*
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID and cutoff MultiXactId.  If so,
+ * are older than the FreezeLimit and/or MultiXactCutoff freeze cutoffs.  If so,
  * setup enough state (in the *frz output argument) to later execute and
- * WAL-log what we would need to do, and return true.  Return false if nothing
- * is to be changed.  In addition, set *totally_frozen to true if the tuple
- * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * WAL-log what caller needs to do for the tuple, and return true.  Return
+ * false if nothing can be changed about the tuple right now.
+ *
+ * Also sets *totally_frozen to true if the tuple will be totally frozen once
+ * caller executes returned freeze plan (or if the tuple was already totally
+ * frozen by an earlier VACUUM).  This indicates that there are no remaining
+ * XIDs or MultiXactIds that will need to be processed by a future VACUUM.
  *
  * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
  * returned true for when called.  A later heap_freeze_execute_prepared call
@@ -6458,12 +6447,6 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
  * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
  *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6471,16 +6454,17 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  */
 bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
-						  TransactionId relfrozenxid, TransactionId relminmxid,
-						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  const struct VacuumCutoffs *cutoffs,
 						  HeapTupleFreeze *frz, bool *totally_frozen,
 						  TransactionId *relfrozenxid_out,
 						  MultiXactId *relminmxid_out)
 {
-	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin = false,
+				replace_xvac = false,
+				replace_xmax = false,
+				freeze_xmax = false;
 	TransactionId xid;
 
 	frz->frzflags = 0;
@@ -6489,37 +6473,29 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen when our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+		xmin_already_frozen = true;
 	else
 	{
-		if (TransactionIdPrecedes(xid, relfrozenxid))
+		if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
-									 xid, relfrozenxid)));
+									 xid, cutoffs->relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoff_xid)));
-
-			frz->t_infomask |= HEAP_XMIN_FROZEN;
-			changed = true;
+										 xid, cutoffs->FreezeLimit)));
 		}
 		else
 		{
@@ -6529,17 +6505,31 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		}
 	}
 
+	/*
+	 * Old-style VACUUM FULL is gone, but we have to process xvac for as long
+	 * as we support having MOVED_OFF/MOVED_IN tuples in the database
+	 */
+	xid = HeapTupleHeaderGetXvac(tuple);
+	if (TransactionIdIsNormal(xid))
+	{
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+		Assert(TransactionIdPrecedes(xid, cutoffs->OldestXmin));
+
+		/*
+		 * For Xvac, we always freeze proactively.  This allows totally_frozen
+		 * tracking to ignore xvac.
+		 */
+		replace_xvac = true;
+	}
+
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given FreezeLimit.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
-	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
 	 */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
-
 	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
 		/* Raw xmax is a MultiXactId */
@@ -6547,13 +6537,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		uint16		flags;
 		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
 
-		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
-									relfrozenxid, relminmxid,
-									cutoff_xid, cutoff_multi,
+		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
 									&flags, &mxid_oldest_xid_out);
 
-		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
-
 		if (flags & FRM_RETURN_IS_XID)
 		{
 			/*
@@ -6562,8 +6548,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
 			 */
-			Assert(!freeze_xmax);
-			Assert(TransactionIdIsValid(newxmax));
+			Assert(!TransactionIdPrecedes(newxmax, cutoffs->OldestXmin));
 			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
 				*relfrozenxid_out = newxmax;
 
@@ -6578,7 +6563,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			frz->xmax = newxmax;
 			if (flags & FRM_MARK_COMMITTED)
 				frz->t_infomask |= HEAP_XMAX_COMMITTED;
-			changed = true;
+			replace_xmax = true;
 		}
 		else if (flags & FRM_RETURN_IS_MULTI)
 		{
@@ -6591,9 +6576,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
 			 */
-			Assert(!freeze_xmax);
-			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, cutoffs->OldestMxact));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 *relfrozenxid_out));
 			*relfrozenxid_out = mxid_oldest_xid_out;
@@ -6609,10 +6592,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			GetMultiXactIdHintBits(newxmax, &newbits, &newbits2);
 			frz->t_infomask |= newbits;
 			frz->t_infomask2 |= newbits2;
-
 			frz->xmax = newxmax;
-
-			changed = true;
+			replace_xmax = true;
 		}
 		else if (flags & FRM_NOOP)
 		{
@@ -6621,7 +6602,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
 			 * both together.
 			 */
-			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 *relfrozenxid_out));
@@ -6632,23 +6612,25 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/*
-			 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
-			 * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+			 * Freeze plan for tuple "freezes xmax" in the strictest sense:
+			 * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
 			 */
-			Assert(freeze_xmax);
+			Assert(flags & FRM_INVALIDATE_XMAX);
+			Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
 			Assert(!TransactionIdIsValid(newxmax));
+			freeze_xmax = true;
 		}
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
 		/* Raw xmax is normal XID */
-		if (TransactionIdPrecedes(xid, relfrozenxid))
+		if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
-									 xid, relfrozenxid)));
+									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
 			/*
 			 * If we freeze xmax, make absolutely sure that it's not an XID
@@ -6667,7 +6649,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		}
 		else
 		{
-			freeze_xmax = false;
+			/* Might have to ratchet back relfrozenxid_out */
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
 		}
@@ -6676,19 +6658,41 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		/* Raw xmax is InvalidTransactionId XID */
 		Assert((tuple->t_infomask & HEAP_XMAX_IS_MULTI) == 0);
-		freeze_xmax = false;
 		xmax_already_frozen = true;
-		/* No need for relfrozenxid_out handling for already-frozen xmax */
 	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
+				 errmsg_internal("found raw xmax %u (infomask 0x%04x) not invalid and not multi",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+
+		frz->t_infomask |= HEAP_XMIN_FROZEN;
+	}
+	if (replace_xvac)
+	{
+		/*
+		 * If a MOVED_OFF tuple is not dead, the xvac transaction must have
+		 * failed; whereas a non-dead MOVED_IN tuple must mean the xvac
+		 * transaction succeeded.
+		 */
+		if (tuple->t_infomask & HEAP_MOVED_OFF)
+			frz->frzflags |= XLH_INVALID_XVAC;
+		else
+			frz->frzflags |= XLH_FREEZE_XVAC;
+	}
+	if (replace_xmax)
+	{
+		Assert(!xmax_already_frozen && !freeze_xmax);
+
+		/* Already set t_infomask/t_infomask2 flags in freeze plan */
+	}
 	if (freeze_xmax)
 	{
-		Assert(!xmax_already_frozen);
+		Assert(!xmax_already_frozen && !replace_xmax);
 
 		frz->xmax = InvalidTransactionId;
 
@@ -6701,52 +6705,20 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		frz->t_infomask |= HEAP_XMAX_INVALID;
 		frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
 		frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
-		changed = true;
 	}
 
 	/*
-	 * Old-style VACUUM FULL is gone, but we have to keep this code as long as
-	 * we support having MOVED_OFF/MOVED_IN tuples in the database.
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen
 	 */
-	if (tuple->t_infomask & HEAP_MOVED)
-	{
-		xid = HeapTupleHeaderGetXvac(tuple);
-
-		/*
-		 * For Xvac, we ignore the cutoff_xid and just always perform the
-		 * freeze operation.  The oldest release in which such a value can
-		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
-		 */
-		if (TransactionIdIsNormal(xid))
-		{
-			/*
-			 * If a MOVED_OFF tuple is not dead, the xvac transaction must
-			 * have failed; whereas a non-dead MOVED_IN tuple must mean the
-			 * xvac transaction succeeded.
-			 */
-			if (tuple->t_infomask & HEAP_MOVED_OFF)
-				frz->frzflags |= XLH_INVALID_XVAC;
-			else
-				frz->frzflags |= XLH_FREEZE_XVAC;
-
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
-			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
-			frz->t_infomask |= HEAP_XMIN_COMMITTED;
-			changed = true;
-		}
-	}
-
-	*totally_frozen = (xmin_frozen &&
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
-	return changed;
+
+	/* A "totally_frozen" tuple must not leave anything behind in xmax */
+	Assert(!*totally_frozen || !replace_xmax);
+
+	/* Tell caller if this tuple has a usable freeze plan set in *frz */
+	return freeze_xmin || replace_xvac || replace_xmax || freeze_xmax;
 }
 
 /*
@@ -6865,19 +6837,25 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 bool
 heap_freeze_tuple(HeapTupleHeader tuple,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
-				  TransactionId cutoff_xid, TransactionId cutoff_multi)
+				  TransactionId FreezeLimit, TransactionId MultiXactCutoff)
 {
 	HeapTupleFreeze frz;
 	bool		do_freeze;
-	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	bool		totally_frozen;
+	struct VacuumCutoffs cutoffs;
+	TransactionId NewRelfrozenXid = FreezeLimit;
+	MultiXactId NewRelminMxid = MultiXactCutoff;
 
-	do_freeze = heap_prepare_freeze_tuple(tuple,
-										  relfrozenxid, relminmxid,
-										  cutoff_xid, cutoff_multi,
-										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+	cutoffs.relfrozenxid = relfrozenxid;
+	cutoffs.relminmxid = relminmxid;
+	cutoffs.OldestXmin = FreezeLimit;
+	cutoffs.OldestMxact = MultiXactCutoff;
+	cutoffs.FreezeLimit = FreezeLimit;
+	cutoffs.MultiXactCutoff = MultiXactCutoff;
+
+	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
+										  &frz, &totally_frozen,
+										  &NewRelfrozenXid, &NewRelminMxid);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7312,11 +7290,13 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						const struct VacuumCutoffs *cutoffs,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
+	TransactionId cutoff_xid = cutoffs->FreezeLimit;
+	MultiXactId cutoff_multi = cutoffs->MultiXactCutoff;
 	TransactionId xid;
 	MultiXactId multi;
 	bool		would_freeze = false;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7e..b234072e8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,6 +144,10 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
+	/* Buffer access strategy and parallel vacuum state */
+	BufferAccessStrategy bstrategy;
+	ParallelVacuumState *pvs;
+
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -158,21 +162,9 @@ typedef struct LVRelState
 	bool		do_index_cleanup;
 	bool		do_rel_truncate;
 
-	/* Buffer access strategy and parallel vacuum state */
-	BufferAccessStrategy bstrategy;
-	ParallelVacuumState *pvs;
-
-	/* rel's initial relfrozenxid and relminmxid */
-	TransactionId relfrozenxid;
-	MultiXactId relminmxid;
-	double		old_live_tuples;	/* previous value of pg_class.reltuples */
-
 	/* VACUUM operation's cutoffs for freezing and pruning */
-	TransactionId OldestXmin;
+	struct VacuumCutoffs cutoffs;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
-	TransactionId FreezeLimit;
-	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
@@ -314,14 +306,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
 				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
-	TransactionId OldestXmin,
-				FreezeLimit;
-	MultiXactId OldestMxact,
-				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
 				new_rel_pages,
 				new_rel_allvisible;
@@ -353,27 +340,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
 								  RelationGetRelid(rel));
 
-	/*
-	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
-	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
-	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
-	 */
-	aggressive = vacuum_set_xid_limits(rel, params, &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
-
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		aggressive = true;
-		skipwithvm = false;
-	}
-
 	/*
 	 * Setup error traceback support for ereport() first.  The idea is to set
 	 * up an error context callback to display additional information on any
@@ -396,25 +362,12 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
 	vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
 					 &vacrel->indrels);
+	vacrel->bstrategy = bstrategy;
 	if (instrument && vacrel->nindexes > 0)
 	{
 		/* Copy index names used by instrumentation (not error reporting) */
@@ -435,8 +388,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -459,11 +410,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		Assert(params->index_cleanup == VACOPTVALUE_AUTO);
 	}
 
-	vacrel->bstrategy = bstrategy;
-	vacrel->relfrozenxid = rel->rd_rel->relfrozenxid;
-	vacrel->relminmxid = rel->rd_rel->relminmxid;
-	vacrel->old_live_tuples = rel->rd_rel->reltuples;
-
 	/* Initialize page counters explicitly (be tidy) */
 	vacrel->scanned_pages = 0;
 	vacrel->removed_pages = 0;
@@ -489,32 +435,53 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->missed_dead_tuples = 0;
 
 	/*
-	 * Determine the extent of the blocks that we'll scan in lazy_scan_heap,
-	 * and finalize cutoffs used for freezing and pruning in lazy_scan_prune.
+	 * Get cutoffs that determine which deleted tuples are considered DEAD,
+	 * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze.  Then determine
+	 * the extent of the blocks that we'll scan in lazy_scan_heap.  It has to
+	 * happen in this order to ensure that the OldestXmin cutoff field works
+	 * as an upper bound on the XIDs stored in the pages we'll actually scan
+	 * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
 	 *
+	 * Next acquire vistest, a related cutoff that's used in heap_page_prune.
 	 * We expect vistest will always make heap_page_prune remove any deleted
 	 * tuple whose xmax is < OldestXmin.  lazy_scan_prune must never become
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
+	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
-	vacrel->OldestXmin = OldestXmin;
 	vacrel->vistest = GlobalVisTestFor(rel);
-	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
-	vacrel->FreezeLimit = FreezeLimit;
-	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
-	vacrel->MultiXactCutoff = MultiXactCutoff;
 	/* Initialize state used to track oldest extant XID/MXID */
-	vacrel->NewRelfrozenXid = OldestXmin;
-	vacrel->NewRelminMxid = OldestMxact;
+	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
 	vacrel->skippedallvis = false;
+	skipwithvm = true;
+	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+	{
+		/*
+		 * Force aggressive mode, and disable skipping blocks using the
+		 * visibility map (even those set all-frozen)
+		 */
+		vacrel->aggressive = true;
+		skipwithvm = false;
+	}
+
+	vacrel->skipwithvm = skipwithvm;
+
+	if (verbose)
+	{
+		if (vacrel->aggressive)
+			ereport(INFO,
+					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
+							get_database_name(MyDatabaseId),
+							vacrel->relnamespace, vacrel->relname)));
+		else
+			ereport(INFO,
+					(errmsg("vacuuming \"%s.%s.%s\"",
+							get_database_name(MyDatabaseId),
+							vacrel->relnamespace, vacrel->relname)));
+	}
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -569,13 +536,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
 	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
 	 */
-	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
-										 vacrel->relfrozenxid,
+	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
+										 vacrel->cutoffs.relfrozenxid,
 										 vacrel->NewRelfrozenXid));
-	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
-									   vacrel->relminmxid,
+	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
+									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
 	if (vacrel->skippedallvis)
 	{
@@ -584,7 +551,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * chose to skip an all-visible page range.  The state that tracks new
 		 * values will have missed unfrozen XIDs from the pages we skipped.
 		 */
-		Assert(!aggressive);
+		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -669,14 +636,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				 * implies aggressive.  Produce distinct output for the corner
 				 * case all the same, just in case.
 				 */
-				if (aggressive)
+				if (vacrel->aggressive)
 					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			}
 			else
 			{
-				if (aggressive)
+				if (vacrel->aggressive)
 					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
@@ -702,20 +669,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 								 _("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
 								 (long long) vacrel->missed_dead_tuples,
 								 vacrel->missed_dead_pages);
-			diff = (int32) (ReadNextTransactionId() - OldestXmin);
+			diff = (int32) (ReadNextTransactionId() -
+							vacrel->cutoffs.OldestXmin);
 			appendStringInfo(&buf,
 							 _("removable cutoff: %u, which was %d XIDs old when operation ended\n"),
-							 OldestXmin, diff);
+							 vacrel->cutoffs.OldestXmin, diff);
 			if (frozenxid_updated)
 			{
-				diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+				diff = (int32) (vacrel->NewRelfrozenXid -
+								vacrel->cutoffs.relfrozenxid);
 				appendStringInfo(&buf,
 								 _("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
 								 vacrel->NewRelfrozenXid, diff);
 			}
 			if (minmulti_updated)
 			{
-				diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+				diff = (int32) (vacrel->NewRelminMxid -
+								vacrel->cutoffs.relminmxid);
 				appendStringInfo(&buf,
 								 _("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
 								 vacrel->NewRelminMxid, diff);
@@ -1610,7 +1580,7 @@ retry:
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
-		bool		tuple_totally_frozen;
+		bool		totally_frozen;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1666,7 +1636,8 @@ retry:
 		 * since heap_page_prune() looked.  Handle that here by restarting.
 		 * (See comments at the top of function for a full explanation.)
 		 */
-		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+									   buf);
 
 		if (unlikely(res == HEAPTUPLE_DEAD))
 			goto retry;
@@ -1723,7 +1694,8 @@ retry:
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					if (!TransactionIdPrecedes(xmin,
+											   vacrel->cutoffs.OldestXmin))
 					{
 						prunestate->all_visible = false;
 						break;
@@ -1774,13 +1746,8 @@ retry:
 		prunestate->hastup = true;	/* page makes rel truncation unsafe */
 
 		/* Tuple with storage -- consider need to freeze */
-		if (heap_prepare_freeze_tuple(tuple.t_data,
-									  vacrel->relfrozenxid,
-									  vacrel->relminmxid,
-									  vacrel->FreezeLimit,
-									  vacrel->MultiXactCutoff,
-									  &frozen[tuples_frozen],
-									  &tuple_totally_frozen,
+		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
+									  &frozen[tuples_frozen], &totally_frozen,
 									  &NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Save prepared freeze plan for later */
@@ -1791,7 +1758,7 @@ retry:
 		 * If tuple is not frozen (and not about to become frozen) then caller
 		 * had better not go on to set this page's VM bit
 		 */
-		if (!tuple_totally_frozen)
+		if (!totally_frozen)
 			prunestate->all_frozen = false;
 	}
 
@@ -1817,7 +1784,8 @@ retry:
 		vacrel->frozen_pages++;
 
 		/* Execute all freeze plans for page as a single atomic action */
-		heap_freeze_execute_prepared(vacrel->rel, buf, vacrel->FreezeLimit,
+		heap_freeze_execute_prepared(vacrel->rel, buf,
+									 vacrel->cutoffs.FreezeLimit,
 									 frozen, tuples_frozen);
 	}
 
@@ -1972,9 +1940,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_would_freeze(tupleheader,
-									vacrel->FreezeLimit,
-									vacrel->MultiXactCutoff,
+		if (heap_tuple_would_freeze(tupleheader, &vacrel->cutoffs,
 									&NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
@@ -2010,7 +1976,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+										 buf))
 		{
 			case HEAPTUPLE_DELETE_IN_PROGRESS:
 			case HEAPTUPLE_LIVE:
@@ -2274,6 +2241,7 @@ static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
 	bool		allindexes = true;
+	double		old_live_tuples = vacrel->rel->rd_rel->reltuples;
 
 	Assert(vacrel->nindexes > 0);
 	Assert(vacrel->do_index_vacuuming);
@@ -2297,9 +2265,9 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			Relation	indrel = vacrel->indrels[idx];
 			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-			vacrel->indstats[idx] =
-				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
-									  vacrel);
+			vacrel->indstats[idx] = lazy_vacuum_one_index(indrel, istat,
+														  old_live_tuples,
+														  vacrel);
 
 			if (lazy_check_wraparound_failsafe(vacrel))
 			{
@@ -2312,7 +2280,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	else
 	{
 		/* Outsource everything to parallel variant */
-		parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, vacrel->old_live_tuples,
+		parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, old_live_tuples,
 											vacrel->num_index_scans);
 
 		/*
@@ -2581,15 +2549,14 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 static bool
 lazy_check_wraparound_failsafe(LVRelState *vacrel)
 {
-	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
-	Assert(MultiXactIdIsValid(vacrel->relminmxid));
+	Assert(TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid));
+	Assert(MultiXactIdIsValid(vacrel->cutoffs.relminmxid));
 
 	/* Don't warn more than once per VACUUM */
 	if (vacrel->failsafe_active)
 		return true;
 
-	if (unlikely(vacuum_xid_failsafe_check(vacrel->relfrozenxid,
-										   vacrel->relminmxid)))
+	if (unlikely(vacuum_xid_failsafe_check(&vacrel->cutoffs)))
 	{
 		vacrel->failsafe_active = true;
 
@@ -3246,7 +3213,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+										 buf))
 		{
 			case HEAPTUPLE_LIVE:
 				{
@@ -3265,7 +3233,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					if (!TransactionIdPrecedes(xmin,
+											   vacrel->cutoffs.OldestXmin))
 					{
 						all_visible = false;
 						*all_frozen = false;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e1191a756..28514a1c5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2813,14 +2813,11 @@ ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
  * As the fraction of the member space currently in use grows, we become
  * more aggressive in clamping this value.  That not only causes autovacuum
  * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_set_xid_limits() clamps the
- * freeze table and the minimum freeze age based on the effective
+ * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
+ * freeze table and the minimum freeze age cutoffs based on the effective
  * autovacuum_multixact_freeze_max_age this function returns.  In the worst
  * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will try to freeze every multixact.
- *
- * It's possible that these thresholds should be user-tunable, but for now
- * we keep it simple.
+ * table will freeze every multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 07e091bb8..b0e310604 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -824,10 +824,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	VacuumParams params;
-	TransactionId OldestXmin,
-				FreezeXid;
-	MultiXactId OldestMxact,
-				MultiXactCutoff;
+	struct VacuumCutoffs cutoffs;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -916,23 +913,24 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_set_xid_limits(OldHeap, &params, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+	vacuum_get_cutoffs(OldHeap, &params, &cutoffs);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
 	 * backwards, so take the max.
 	 */
 	if (TransactionIdIsValid(OldHeap->rd_rel->relfrozenxid) &&
-		TransactionIdPrecedes(FreezeXid, OldHeap->rd_rel->relfrozenxid))
-		FreezeXid = OldHeap->rd_rel->relfrozenxid;
+		TransactionIdPrecedes(cutoffs.FreezeLimit,
+							  OldHeap->rd_rel->relfrozenxid))
+		cutoffs.FreezeLimit = OldHeap->rd_rel->relfrozenxid;
 
 	/*
 	 * MultiXactCutoff, similarly, shouldn't go backwards either.
 	 */
 	if (MultiXactIdIsValid(OldHeap->rd_rel->relminmxid) &&
-		MultiXactIdPrecedes(MultiXactCutoff, OldHeap->rd_rel->relminmxid))
-		MultiXactCutoff = OldHeap->rd_rel->relminmxid;
+		MultiXactIdPrecedes(cutoffs.MultiXactCutoff,
+							OldHeap->rd_rel->relminmxid))
+		cutoffs.MultiXactCutoff = OldHeap->rd_rel->relminmxid;
 
 	/*
 	 * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
@@ -971,13 +969,14 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * values (e.g. because the AM doesn't use freezing).
 	 */
 	table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
-									OldestXmin, &FreezeXid, &MultiXactCutoff,
+									cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+									&cutoffs.MultiXactCutoff,
 									&num_tuples, &tups_vacuumed,
 									&tups_recently_dead);
 
 	/* return selected values to caller, get set as relfrozenxid/minmxid */
-	*pFreezeXid = FreezeXid;
-	*pCutoffMulti = MultiXactCutoff;
+	*pFreezeXid = cutoffs.FreezeLimit;
+	*pCutoffMulti = cutoffs.MultiXactCutoff;
 
 	/* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
 	NewHeap->rd_toastoid = InvalidOid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a6d5ed1f6..cdc39d17d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -912,34 +912,20 @@ get_all_vacuum_rels(int options)
 }
 
 /*
- * vacuum_set_xid_limits() -- compute OldestXmin and freeze cutoff points
+ * vacuum_get_cutoffs() -- compute OldestXmin and freeze cutoff points
  *
  * The target relation and VACUUM parameters are our inputs.
  *
- * Our output parameters are:
- * - OldestXmin is the Xid below which tuples deleted by any xact (that
- *   committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - OldestMxact is the Mxid below which MultiXacts are definitely not
- *   seen as visible by any running transaction.
- * - FreezeLimit is the Xid below which all Xids are definitely frozen or
- *   removed during aggressive vacuums.
- * - MultiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
+ * Output parameters are the cutoffs that VACUUM caller should use.
  *
  * Return value indicates if vacuumlazy.c caller should make its VACUUM
  * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
  * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
  * minimum).
- *
- * OldestXmin and OldestMxact are the most recent values that can ever be
- * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
- * vacuumlazy.c caller later on.  These values should be passed when it turns
- * out that VACUUM will leave no unfrozen XIDs/MXIDs behind in the table.
  */
 bool
-vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-					  TransactionId *OldestXmin, MultiXactId *OldestMxact,
-					  TransactionId *FreezeLimit, MultiXactId *MultiXactCutoff)
+vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+				   struct VacuumCutoffs *cutoffs)
 {
 	int			freeze_min_age,
 				multixact_freeze_min_age,
@@ -959,6 +945,10 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
 
+	/* Set pg_class fields in cutoffs */
+	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
+	cutoffs->relminmxid = rel->rd_rel->relminmxid;
+
 	/*
 	 * Acquire OldestXmin.
 	 *
@@ -970,14 +960,14 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	 * that only one vacuum process can be working on a particular table at
 	 * any time, and that each vacuum is always an independent transaction.
 	 */
-	*OldestXmin = GetOldestNonRemovableTransactionId(rel);
+	cutoffs->OldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 	if (OldSnapshotThresholdActive())
 	{
 		TransactionId limit_xmin;
 		TimestampTz limit_ts;
 
-		if (TransactionIdLimitedForOldSnapshots(*OldestXmin, rel,
+		if (TransactionIdLimitedForOldSnapshots(cutoffs->OldestXmin, rel,
 												&limit_xmin, &limit_ts))
 		{
 			/*
@@ -987,20 +977,48 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 			 * frequency), but would still be a significant improvement.
 			 */
 			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
-			*OldestXmin = limit_xmin;
+			cutoffs->OldestXmin = limit_xmin;
 		}
 	}
 
-	Assert(TransactionIdIsNormal(*OldestXmin));
+	Assert(TransactionIdIsNormal(cutoffs->OldestXmin));
 
 	/* Acquire OldestMxact */
-	*OldestMxact = GetOldestMultiXactId();
-	Assert(MultiXactIdIsValid(*OldestMxact));
+	cutoffs->OldestMxact = GetOldestMultiXactId();
+	Assert(MultiXactIdIsValid(cutoffs->OldestMxact));
 
 	/* Acquire next XID/next MXID values used to apply age-based settings */
 	nextXID = ReadNextTransactionId();
 	nextMXID = ReadNextMultiXactId();
 
+	/*
+	 * Also compute the multixact age for which freezing is urgent.  This is
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
+	 * short of multixact member space.
+	 */
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+	/*
+	 * Almost ready to set freeze output parameters; check if OldestXmin or
+	 * OldestMxact are held back to an unsafe degree before we start on that
+	 */
+	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
+	if (!TransactionIdIsNormal(safeOldestXmin))
+		safeOldestXmin = FirstNormalTransactionId;
+	safeOldestMxact = nextMXID - effective_multixact_freeze_max_age;
+	if (safeOldestMxact < FirstMultiXactId)
+		safeOldestMxact = FirstMultiXactId;
+	if (TransactionIdPrecedes(cutoffs->OldestXmin, safeOldestXmin))
+		ereport(WARNING,
+				(errmsg("cutoff for removing and freezing tuples is far in the past"),
+				 errhint("Close open transactions soon to avoid wraparound problems.\n"
+						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+	if (MultiXactIdPrecedes(cutoffs->OldestMxact, safeOldestMxact))
+		ereport(WARNING,
+				(errmsg("cutoff for freezing multixacts is far in the past"),
+				 errhint("Close open transactions soon to avoid wraparound problems.\n"
+						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+
 	/*
 	 * Determine the minimum freeze age to use: as specified by the caller, or
 	 * vacuum_freeze_min_age, but in any case not more than half
@@ -1013,19 +1031,12 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	Assert(freeze_min_age >= 0);
 
 	/* Compute FreezeLimit, being careful to generate a normal XID */
-	*FreezeLimit = nextXID - freeze_min_age;
-	if (!TransactionIdIsNormal(*FreezeLimit))
-		*FreezeLimit = FirstNormalTransactionId;
+	cutoffs->FreezeLimit = nextXID - freeze_min_age;
+	if (!TransactionIdIsNormal(cutoffs->FreezeLimit))
+		cutoffs->FreezeLimit = FirstNormalTransactionId;
 	/* FreezeLimit must always be <= OldestXmin */
-	if (TransactionIdPrecedes(*OldestXmin, *FreezeLimit))
-		*FreezeLimit = *OldestXmin;
-
-	/*
-	 * Compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
-	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	if (TransactionIdPrecedes(cutoffs->OldestXmin, cutoffs->FreezeLimit))
+		cutoffs->FreezeLimit = cutoffs->OldestXmin;
 
 	/*
 	 * Determine the minimum multixact freeze age to use: as specified by
@@ -1040,33 +1051,12 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	Assert(multixact_freeze_min_age >= 0);
 
 	/* Compute MultiXactCutoff, being careful to generate a valid value */
-	*MultiXactCutoff = nextMXID - multixact_freeze_min_age;
-	if (*MultiXactCutoff < FirstMultiXactId)
-		*MultiXactCutoff = FirstMultiXactId;
+	cutoffs->MultiXactCutoff = nextMXID - multixact_freeze_min_age;
+	if (cutoffs->MultiXactCutoff < FirstMultiXactId)
+		cutoffs->MultiXactCutoff = FirstMultiXactId;
 	/* MultiXactCutoff must always be <= OldestMxact */
-	if (MultiXactIdPrecedes(*OldestMxact, *MultiXactCutoff))
-		*MultiXactCutoff = *OldestMxact;
-
-	/*
-	 * Done setting output parameters; check if OldestXmin or OldestMxact are
-	 * held back to an unsafe degree in passing
-	 */
-	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
-	if (!TransactionIdIsNormal(safeOldestXmin))
-		safeOldestXmin = FirstNormalTransactionId;
-	safeOldestMxact = nextMXID - effective_multixact_freeze_max_age;
-	if (safeOldestMxact < FirstMultiXactId)
-		safeOldestMxact = FirstMultiXactId;
-	if (TransactionIdPrecedes(*OldestXmin, safeOldestXmin))
-		ereport(WARNING,
-				(errmsg("cutoff for removing and freezing tuples is far in the past"),
-				 errhint("Close open transactions soon to avoid wraparound problems.\n"
-						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
-	if (MultiXactIdPrecedes(*OldestMxact, safeOldestMxact))
-		ereport(WARNING,
-				(errmsg("cutoff for freezing multixacts is far in the past"),
-				 errhint("Close open transactions soon to avoid wraparound problems.\n"
-						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
+		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
@@ -1118,13 +1108,13 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
  * mechanism to determine if its table's relfrozenxid and relminmxid are now
  * dangerously far in the past.
  *
- * Input parameters are the target relation's relfrozenxid and relminmxid.
- *
  * When we return true, VACUUM caller triggers the failsafe.
  */
 bool
-vacuum_xid_failsafe_check(TransactionId relfrozenxid, MultiXactId relminmxid)
+vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs)
 {
+	TransactionId relfrozenxid = cutoffs->relfrozenxid;
+	MultiXactId relminmxid = cutoffs->relminmxid;
 	TransactionId xid_skip_limit;
 	MultiXactId multi_skip_limit;
 	int			skip_index_vacuum;
-- 
2.38.1
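
To make the interface change above a little more concrete, here is a
minimal standalone C sketch of the "one cutoffs struct instead of four
separate output parameters" pattern that vacuum_get_cutoffs() now uses.
This is not PostgreSQL code: the types, the get_cutoffs_demo helper, and
the numbers are simplified stand-ins, used purely for illustration (the
real code uses wraparound-aware comparisons like TransactionIdPrecedes).

#include <stdio.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

/* Simplified stand-in for struct VacuumCutoffs */
struct VacuumCutoffs
{
	TransactionId relfrozenxid;		/* copied from pg_class by the real code */
	MultiXactId relminmxid;			/* copied from pg_class by the real code */
	TransactionId OldestXmin;
	MultiXactId OldestMxact;
	TransactionId FreezeLimit;		/* must never exceed OldestXmin */
	MultiXactId MultiXactCutoff;	/* must never exceed OldestMxact */
};

/* Hypothetical stand-in for vacuum_get_cutoffs(): fills the struct in place */
static void
get_cutoffs_demo(struct VacuumCutoffs *cutoffs, TransactionId nextXID,
				 uint32_t freeze_min_age)
{
	cutoffs->OldestXmin = nextXID - 10;	/* pretend 10 XIDs are still running */
	cutoffs->FreezeLimit = nextXID - freeze_min_age;
	if (cutoffs->FreezeLimit > cutoffs->OldestXmin)
		cutoffs->FreezeLimit = cutoffs->OldestXmin;	/* clamp to OldestXmin */
}

int
main(void)
{
	struct VacuumCutoffs cutoffs = {0};

	get_cutoffs_demo(&cutoffs, 1000000, 50);
	printf("OldestXmin=%u FreezeLimit=%u\n",
		   (unsigned) cutoffs.OldestXmin, (unsigned) cutoffs.FreezeLimit);
	return 0;
}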

Attachment: v9-0003-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/x-patch)
From 4ca55a929e49d194d533df8f68114ef815502af2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v9 3/5] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  VACUUM determines its freezing
strategy based on the value of the new vacuum_freeze_strategy_threshold
GUC (or reloption) in most cases: tables that exceed the size threshold
use the eager freezing strategy.  Otherwise VACUUM uses the lazy
freezing strategy, which is essentially the same approach that VACUUM
has always taken to freezing (though not quite, due to the influence of
page-level freezing following recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add skipping strategies, which are designed to work
in tandem with the freezing strategy work from this commit.  Note that
the vacuum_freeze_strategy_threshold GUC will also influence VACUUM's
choice of skipping strategy.  There will be lazy and eager skipping
strategies to complement VACUUM's lazy and eager freezing strategies.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 | 10 +++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 ++++++
 src/backend/access/heap/vacuumlazy.c          | 37 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 17 ++++++++-
 src/backend/postmaster/autovacuum.c           | 10 +++++
 src/backend/utils/misc/guc_tables.c           | 11 ++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 +++++++++-
 doc/src/sgml/maintenance.sgml                 |  6 +--
 doc/src/sgml/ref/create_table.sgml            | 14 +++++++
 doc/src/sgml/ref/vacuum.sgml                  | 16 ++++----
 12 files changed, 139 insertions(+), 14 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 896d1b1ac..de28d581a 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -272,6 +275,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) for triggering eager/all-visible freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -295,6 +304,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index fe64bd6ed..307842582 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -152,6 +152,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -241,6 +243,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -469,6 +472,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Determine freezing strategy used by VACUUM
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1252,6 +1259,25 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our traditional/lazy freezing strategy is useful when putting off the work
+ * of freezing totally avoids work that turns out to have been unnecessary.
+ * On the other hand we eagerly freeze pages when that strategy spreads out
+ * the burden of freezing over time.
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	Assert(vacrel->scanned_pages == 0);
+
+	vacrel->eager_freeze_strategy =
+		rel_pages >= vacrel->cutoffs.freeze_strategy_threshold;
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1773,9 +1799,18 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (this could happen during second heap pass).
+	 *
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will become all-visible, making it all-frozen instead.
+	 * (Actually, the all-visible/eager freezing strategy doesn't quite work
+	 * that way.  It triggers freezing for pages that it sees will thereby be
+	 * set all-frozen in the VM immediately afterwards -- a stricter test.
+	 * Some pages that can be set all-visible cannot also be set all-frozen,
+	 * even after freezing, due to the presence of lock-only MultiXactIds.)
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
-		(prunestate->all_visible && prunestate->all_frozen && prune_fpi))
+		(prunestate->all_visible && prunestate->all_frozen &&
+		 (vacrel->eager_freeze_strategy || prune_fpi)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index cdc39d17d..420b85be6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -931,7 +935,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -944,6 +949,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1058,6 +1064,15 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 0746d8022..23e316e59 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1bf14eec6..549a2e969 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2503,6 +2503,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 043864597..4763cb6bb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -692,6 +692,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9830a0309..094f9a35d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9129,6 +9129,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-min-age" xreflabel="vacuum_freeze_min_age">
       <term><varname>vacuum_freeze_min_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9139,7 +9154,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..554b3a75d 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -588,9 +588,9 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
+    advanced <structfield>relfrozenxid</structfield>.  All rows inserted by
+    transactions older than this cutoff XID are guaranteed to have been frozen.
+    Similarly, the <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
     per-table <structfield>relfrozenxid</structfield> values within the database.
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 9cd880ea3..f61433c7d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
-- 
2.38.1
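
As a rough illustration of how the strategy decision from the patch above
is intended to fit together with page-level freezing, here is a minimal
standalone C sketch.  The two helpers loosely mirror lazy_scan_strategy
and the freeze-trigger condition applied in lazy_scan_prune, but the
surrounding scaffolding (types, BLCKSZ, the example table size) is
hypothetical and simplified, not PostgreSQL code.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

#define BLCKSZ 8192				/* assume 8KB heap pages */

/* Default vacuum_freeze_strategy_threshold: 4GB, expressed in heap blocks */
static const BlockNumber freeze_strategy_threshold =
	(BlockNumber) ((UINT64_C(4) * 1024 * 1024 * 1024) / BLCKSZ);

/* One decision per VACUUM, made up front (cf. lazy_scan_strategy) */
static bool
use_eager_freeze_strategy(BlockNumber rel_pages)
{
	return rel_pages >= freeze_strategy_threshold;
}

/* Per-page trigger (cf. the condition lazy_scan_prune applies) */
static bool
should_freeze_page(bool freeze_required, int tuples_frozen,
				   bool all_visible, bool all_frozen,
				   bool eager_freeze_strategy, bool prune_fpi)
{
	/* freeze when mandatory, when trivially free, or when worthwhile */
	return freeze_required || tuples_frozen == 0 ||
		(all_visible && all_frozen &&
		 (eager_freeze_strategy || prune_fpi));
}

int
main(void)
{
	BlockNumber rel_pages = 1000000;	/* ~7.6GB table, over the 4GB default */
	bool		eager = use_eager_freeze_strategy(rel_pages);

	/* A page with no old XIDs that is about to become all-visible/all-frozen */
	printf("eager=%d freeze_page=%d\n", (int) eager,
		   (int) should_freeze_page(false, 5, true, true, eager, false));
	return 0;
}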

Attachment: v9-0002-Add-page-level-freezing-to-VACUUM.patch (application/x-patch)
From 0264c7f056f9712a5ed351ee2ce7af732a5a09aa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v9 2/5] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI as torn page protection.  We'll have
already written a large amount of WAL to do that much, so it's very
likely a good idea to get freezing out of the way for the page early.
This only happens in cases where it will directly lead to marking the
page all-frozen in the visibility map.

FreezeMultiXactId() now uses both FreezeLimit and OldestXmin to decide
how to process MultiXacts (not just FreezeLimit).  We always prefer to
avoid allocating new MultiXacts during VACUUM on general principle.
Page-level freezing can be triggered and use a maximally aggressive XID
cutoff to freeze XIDs (OldestXmin), while using a less aggressive XID
cutoff (FreezeLimit) to determine whether or not members from a Multi
need to be frozen expensively.  VACUUM will process Multis very eagerly
when it's cheap to do so, and very lazily when it's expensive to do so.

We can choose when and how to freeze Multixacts provided we never leave
behind a Multi that's < MultiXactCutoff, or a Multi with one or more XID
members < FreezeLimit.  Provided VACUUM's NewRelfrozenXid/NewRelminMxid
tracking accounts for all this, we are free to choose what to do about
each Multi based on the costs and the benefits.  VACUUM should be just
as capable of avoiding an expensive second pass over each Multi (which
must check the commit status of each member XID) as it was before, even
when page-level freezing is triggered on many pages with recently
allocated MultiXactIds.

Later work will teach VACUUM to explicitly apply distinct lazy and eager
freezing strategies, which are policies around how each VACUUM operation
should go about determining if it must freeze any given heap page.  This
commit just adds the basic concept of page-level freezing, as well as
the heap prune FPI trigger criteria, which gets applied in every VACUUM
(on systems with full page writes enabled).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/heapam.h          |  82 +++++-
 src/backend/access/heap/heapam.c     | 420 +++++++++++++++------------
 src/backend/access/heap/pruneheap.c  |  16 +-
 src/backend/access/heap/vacuumlazy.c | 128 +++++---
 doc/src/sgml/config.sgml             |  11 +-
 5 files changed, 412 insertions(+), 245 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 53eb01176..0782fed14 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -113,6 +113,71 @@ typedef struct HeapTupleFreeze
 	OffsetNumber offset;
 } HeapTupleFreeze;
 
+/*
+ * State used by VACUUM to track the details of freezing all eligible tuples
+ * on a given heap page.
+ *
+ * VACUUM prepares freeze plans for each page via heap_prepare_freeze_tuple
+ * calls (every tuple with storage gets its own call).  This page-level freeze
+ * state is updated across each call, which ultimately determines whether or
+ * not freezing the page is required. (VACUUM freezes the page via a call to
+ * heap_freeze_execute_prepared, which freezes using prepared freeze plans.)
+ *
+ * Aside from the basic question of whether or not freezing will go ahead, the
+ * state also tracks the oldest extant XID/MXID in the table as a whole, for
+ * the purposes of advancing relfrozenxid/relminmxid values in pg_class later
+ * on.  Each heap_prepare_freeze_tuple call pushes NewRelfrozenXid and/or
+ * NewRelminMxid back as required to avoid unsafe final pg_class values.  Any
+ * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
+ * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
+ * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ *
+ * When 'freeze_required' flag isn't set after all tuples are examined, the
+ * final choice on freezing is made by vacuumlazy.c.  It can decide to trigger
+ * freezing based on whatever criteria it deems appropriate.  However, it is
+ * highly recommended that vacuumlazy.c avoid freezing any page that cannot be
+ * marked all-frozen in the visibility map afterwards.
+ *
+ * Freezing is typically optional for most individual pages scanned during any
+ * given VACUUM operation.  This allows vacuumlazy.c to manage the cost of
+ * freezing at the level of the entire VACUUM operation/entire heap relation.
+ */
+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze_required;
+
+	/*
+	 * "No freeze" NewRelfrozenXid/NewRelminMxid trackers.
+	 *
+	 * These trackers are maintained in the same way as the trackers used when
+	 * VACUUM scans a page that isn't cleanup locked.  Both code paths are
+	 * based on the same general idea (do less work for this page during the
+	 * ongoing VACUUM, at the cost of having to accept older final values).
+	 */
+	TransactionId NoFreezePageRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid;
+
+	/*
+	 * Trackers used when heap_freeze_execute_prepared freezes the page.
+	 *
+	 * When we freeze a page, we generally freeze all XIDs < OldestXmin, only
+	 * leaving behind XIDs that are ineligible for freezing, if any.  And so
+	 * you might wonder why these trackers are necessary at all; why should
+	 * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
+	 * ratchet back the rel-level NewRelfrozenXid/NewRelminMxid trackers?
+	 *
+	 * It is useful to use a definition of "freeze the page" that does not
+	 * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
+	 * generally prefers to remove Multis eagerly, but lazy processing is used
+	 * in cases where laziness allows VACUUM to avoid allocating a new Multi.
+	 * The "freeze the page" trackers enable this flexibility.
+	 */
+	TransactionId FreezePageRelfrozenXid;
+	MultiXactId FreezePageRelminMxid;
+
+} HeapPageFreeze;
+
 /* ----------------
  *		function prototypes for heap access method
  *
@@ -180,19 +245,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  const struct VacuumCutoffs *cutoffs,
-									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  HeapPageFreeze *pagefrz,
+									  HeapTupleFreeze *frz, bool *totally_frozen);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-										 TransactionId FreezeLimit,
+										 TransactionId snapshotConflictHorizon,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
-									const struct VacuumCutoffs *cutoffs,
-									TransactionId *relfrozenxid_out,
-									MultiXactId *relminmxid_out);
+extern bool heap_tuple_should_freeze(HeapTupleHeader tuple,
+									 const struct VacuumCutoffs *cutoffs,
+									 TransactionId *NoFreezePageRelfrozenXid,
+									 MultiXactId *NoFreezePageRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
 
 extern void simple_heap_insert(Relation relation, HeapTuple tup);
@@ -210,7 +274,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts,
-							int *nnewlpdead,
+							int *nnewlpdead, bool *prune_fpi,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6c0634b38..0baebe432 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6102,9 +6102,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		MultiXactId.
  *
  * "flags" is an output value; it's used to tell caller what to do on return.
- *
- * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
- * extant Xid within any Multixact that will remain after freezing executes.
+ * "pagefrz" is an input/output value, used to manage page level freezing.
  *
  * Possible values that we can set in "flags":
  * FRM_NOOP
@@ -6119,16 +6117,34 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
  *
- * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
- * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ * Caller delegates control of page freezing to us.  In practice we always
+ * force freezing of caller's page unless FRM_NOOP processing is indicated.
+ * We help caller ensure that XIDs < FreezeLimit and MXIDs < MultiXactCutoff
+ * can never be left behind.  We freely choose when and how to process each
+ * Multi, without ever violating the cutoff invariants for freezing.
  *
- * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ * It's useful to remove Multis on a proactive timeline (relative to freezing
+ * XIDs) to keep MultiXact member SLRU buffer misses to a minimum.  It can also
+ * be cheaper in the short run, for us, since we too can avoid SLRU buffer
+ * misses through eager processing.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set, though only
+ * when FreezeLimit and/or MultiXactCutoff cutoffs leave us with no choice.
+ * This can usually be put off, which is usually enough to avoid it altogether.
+ *
+ * NB: Caller must maintain "no freeze" NewRelfrozenXid/NewRelminMxid trackers
+ * using heap_tuple_should_freeze when we haven't forced page-level freezing.
+ *
+ * NB: Caller should avoid needlessly calling heap_tuple_should_freeze when we
+ * have already forced page-level freezing, since that might incur the same
+ * SLRU buffer misses that we specifically intended to avoid by freezing.
  */
 static TransactionId
-FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
+FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
 				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
-				  TransactionId *mxid_oldest_xid_out)
+				  HeapPageFreeze *pagefrz)
 {
+	uint16		t_infomask = tuple->t_infomask;
 	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
 	int			nmembers;
@@ -6138,7 +6154,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	bool		has_lockers;
 	TransactionId update_xid;
 	bool		update_committed;
-	TransactionId temp_xid_out;
+	TransactionId FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;
+	TransactionId axid PG_USED_FOR_ASSERTS_ONLY = cutoffs->OldestXmin;
+	MultiXactId amxid PG_USED_FOR_ASSERTS_ONLY = cutoffs->OldestMxact;
 
 	*flags = 0;
 
@@ -6150,14 +6168,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Ensure infomask bits are appropriately set/reset */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, cutoffs->relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
+	else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6170,7 +6190,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoffs->MultiXactCutoff)));
+									 multi, cutoffs->OldestMxact)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
@@ -6206,14 +6226,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			}
 			else
 			{
+				if (TransactionIdPrecedes(newxmax, FreezePageRelfrozenXid))
+					FreezePageRelfrozenXid = newxmax;
 				*flags |= FRM_RETURN_IS_XID;
 			}
 		}
 
-		/*
-		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
-		 * when no Xids will remain
-		 */
+		pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+		pagefrz->freeze_required = true;
 		return newxmax;
 	}
 
@@ -6229,11 +6249,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Nothing worth keeping */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;	/* for FRM_NOOP */
 	for (int i = 0; i < nmembers; i++)
 	{
 		TransactionId xid = members[i].xid;
@@ -6242,26 +6264,35 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
+			/* Can't violate the FreezeLimit invariant */
 			need_replace = true;
 			break;
 		}
-		if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-			temp_xid_out = members[i].xid;
+		if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
+			FreezePageRelfrozenXid = xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than FreezeLimit; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* Can't violate the MultiXactCutoff invariant, either */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
+
 	if (!need_replace)
 	{
 		/*
-		 * When mxid_oldest_xid_out gets pushed back here it's likely that the
-		 * update Xid was the oldest member, but we don't rely on that
+		 * FRM_NOOP case is the only one where we don't force page-level
+		 * freezing (see header comments)
 		 */
 		*flags |= FRM_NOOP;
-		*mxid_oldest_xid_out = temp_xid_out;
+
+		/*
+		 * Might have to ratchet back NewRelminMxid, NewRelfrozenXid, or both
+		 * together to make it safe to skip this particular multi/tuple xmax
+		 * if the page is frozen (similar handling will also be required if
+		 * the page isn't frozen, but caller deals with that directly).
+		 */
+		pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+		if (MultiXactIdPrecedes(multi, pagefrz->FreezePageRelminMxid))
+			pagefrz->FreezePageRelminMxid = multi;
 		pfree(members);
 		return multi;
 	}
@@ -6270,13 +6301,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_should_freeze will indicate that the tuple should be frozen.
 	 */
+	Assert(heap_tuple_should_freeze(tuple, cutoffs, &axid, &amxid));
+
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
 	has_lockers = false;
 	update_xid = InvalidTransactionId;
 	update_committed = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
+	FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;	/* re-init */
 
 	/*
 	 * Determine whether to keep each member xid, or to ignore it instead
@@ -6364,11 +6400,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		/*
 		 * We determined that this is an Xid corresponding to an update that
 		 * must be retained -- add it to new members list for later.  Also
-		 * consider pushing back mxid_oldest_xid_out.
+		 * consider pushing back NewRelfrozenXid tracker.
 		 */
 		newmembers[nnewmembers++] = members[i];
-		if (TransactionIdPrecedes(xid, temp_xid_out))
-			temp_xid_out = xid;
+		if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
+			FreezePageRelfrozenXid = xid;
 	}
 
 	pfree(members);
@@ -6379,10 +6415,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 */
 	if (nnewmembers == 0)
 	{
-		/* nothing worth keeping!? Tell caller to remove the whole thing */
+		/*
+		 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.  Won't
+		 * have to ratchet back NewRelfrozenXid or NewRelminMxid.
+		 */
 		*flags |= FRM_INVALIDATE_XMAX;
 		newxmax = InvalidTransactionId;
-		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
+
+		Assert(pagefrz->FreezePageRelfrozenXid == FreezePageRelfrozenXid);
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
 	{
@@ -6398,22 +6438,29 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
 		newxmax = update_xid;
-		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
+
+		/* Might have to push back FreezePageRelfrozenXid/NewRelfrozenXid */
+		Assert(TransactionIdPrecedesOrEquals(FreezePageRelfrozenXid,
+											 update_xid));
 	}
 	else
 	{
 		/*
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.  The oldest surviving member
-		 * might push back mxid_oldest_xid_out.
+		 * might have already pushed back NewRelfrozenXid.
 		 */
 		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
-		*mxid_oldest_xid_out = temp_xid_out;
+
+		/* Never need to push back FreezePageRelminMxid/NewRelminMxid */
+		Assert(MultiXactIdPrecedesOrEquals(cutoffs->OldestMxact, newxmax));
 	}
 
 	pfree(newmembers);
 
+	pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+	pagefrz->freeze_required = true;
 	return newxmax;
 }
 
@@ -6421,9 +6468,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the FreezeLimit and/or MultiXactCutoff freeze cutoffs.  If so,
- * setup enough state (in the *frz output argument) to later execute and
- * WAL-log what caller needs to do for the tuple, and return true.  Return
+ * are older than the OldestXmin and/or OldestMxact freeze cutoffs.  If so,
+ * setup enough state (in the *frz output argument) to enable caller to
+ * process this tuple as part of freezing its page, and return true.  Return
  * false if nothing can be changed about the tuple right now.
  *
  * Also sets *totally_frozen to true if the tuple will be totally frozen once
@@ -6431,22 +6478,30 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * frozen by an earlier VACUUM).  This indicates that there are no remaining
  * XIDs or MultiXactIds that will need to be processed by a future VACUUM.
  *
- * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
- * returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * VACUUM caller must assemble HeapTupleFreeze freeze plan entries for every
+ * tuple that we returned true for, and call heap_freeze_execute_prepared to
+ * execute freezing.  Caller must initialize pagefrz fields for page as a
+ * whole before first call here for each heap page.
+ *
+ * VACUUM caller decides on whether or not to freeze the page as a whole.
+ * We'll often prepare freeze plans for a page that caller just discards.
+ * However, VACUUM doesn't always get to make a choice; it must freeze when
+ * pagefrz.freeze_required is set, to ensure that any XIDs < FreezeLimit (and
+ * MXIDs < MultiXactCutoff) can never be left behind.  We make sure that
+ * VACUUM always follows that rule.
+ *
+ * We sometimes force freezing of xmax MultiXactId values long before it is
+ * strictly necessary to do so just to ensure the FreezeLimit postcondition.
+ * It's worth processing MultiXactIds proactively when it is cheap to do so,
+ * and it's convenient to make that happen by piggy-backing it on the "force
+ * freezing" mechanism.  Conversely, we sometimes delay freezing MultiXactIds
+ * because it is expensive right now (though only when it's still possible to
+ * do so without violating the FreezeLimit/MultiXactCutoff postcondition).
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6455,9 +6510,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  const struct VacuumCutoffs *cutoffs,
-						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  HeapPageFreeze *pagefrz,
+						  HeapTupleFreeze *frz, bool *totally_frozen)
 {
 	bool		xmin_already_frozen = false,
 				xmax_already_frozen = false;
@@ -6474,7 +6528,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Process xmin, while keeping track of whether it's already frozen, or
-	 * will become frozen when our freeze plan is executed by caller (could be
+	 * will become frozen iff our freeze plan is executed by caller (could be
 	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
@@ -6488,21 +6542,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
-		if (freeze_xmin)
-		{
-			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoffs->FreezeLimit)));
-		}
-		else
-		{
-			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
+		if (freeze_xmin && !TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
+									 xid, cutoffs->OldestXmin)));
 	}
 
 	/*
@@ -6519,38 +6564,55 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we always freeze proactively.  This allows totally_frozen
 		 * tracking to ignore xvac.
 		 */
-		replace_xvac = true;
+		replace_xvac = pagefrz->freeze_required = true;
 	}
 
-	/*
-	 * Process xmax.  To thoroughly examine the current Xmax value we need to
-	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given FreezeLimit.  In that case, those values might need
-	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
-	 * it out --- if there's a live updater Xid, it needs to be kept.
-	 */
+	/* Now process xmax */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
 	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
 		/* Raw xmax is a MultiXactId */
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
 
-		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
-									&flags, &mxid_oldest_xid_out);
+		/*
+		 * We will either remove xmax completely (in the "freeze_xmax" path),
+		 * process xmax by replacing it (in the "replace_xmax" path), or
+		 * perform no-op xmax processing.  The only constraint is that the
+		 * FreezeLimit/MultiXactCutoff invariant must never be violated.
+		 */
+		newxmax = FreezeMultiXactId(xid, tuple, cutoffs, &flags, pagefrz);
 
-		if (flags & FRM_RETURN_IS_XID)
+		if (flags & FRM_NOOP)
+		{
+			/*
+			 * xmax is a MultiXactId, and nothing about it changes for now.
+			 * This is the only case where 'freeze_required' won't have been
+			 * set for us by FreezeMultiXactId, as well as the only case where
+			 * neither freeze_xmax nor replace_xmax are set (given a multi).
+			 *
+			 * This is a no-op, but the call to FreezeMultiXactId might have
+			 * ratcheted back NewRelfrozenXid and/or NewRelminMxid for us.
+			 * That makes it safe to freeze the page while leaving this
+			 * particular xmax undisturbed.
+			 *
+			 * FreezeMultiXactId is _not_ responsible for the "no freeze"
+			 * NewRelfrozenXid/NewRelminMxid trackers, though -- that's our
+			 * job.  A call to heap_tuple_should_freeze for this same tuple
+			 * will take place below if 'freeze_required' isn't set already.
+			 * (This approach repeats some of the work from FreezeMultiXactId,
+			 * which is not ideal but makes things simpler.)
+			 */
+			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+			Assert(!MultiXactIdPrecedes(newxmax, pagefrz->FreezePageRelminMxid));
+		}
+		else if (flags & FRM_RETURN_IS_XID)
 		{
 			/*
 			 * xmax will become an updater Xid (original MultiXact's updater
 			 * member Xid will be carried forward as a simple Xid in Xmax).
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!TransactionIdPrecedes(newxmax, cutoffs->OldestXmin));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6573,13 +6635,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax is an old MultiXactId that we have to replace with a new
 			 * MultiXactId, to carry forward two or more original member XIDs.
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!MultiXactIdPrecedes(newxmax, cutoffs->OldestMxact));
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6595,20 +6652,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			frz->xmax = newxmax;
 			replace_xmax = true;
 		}
-		else if (flags & FRM_NOOP)
-		{
-			/*
-			 * xmax is a MultiXactId, and nothing about it changes for now.
-			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
-			 * both together.
-			 */
-			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
-		}
 		else
 		{
 			/*
@@ -6620,6 +6663,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!TransactionIdIsValid(newxmax));
 			freeze_xmax = true;
 		}
+
+		/* Only FRM_NOOP doesn't force caller to freeze page */
+		Assert(pagefrz->freeze_required || (!freeze_xmax && !replace_xmax));
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
@@ -6630,29 +6676,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
-		{
-			/*
-			 * If we freeze xmax, make absolutely sure that it's not an XID
-			 * that is important.  (Note, a lock-only xmax can be removed
-			 * independent of committedness, since a committed lock holder has
-			 * released the lock).
-			 */
-			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
-				TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("cannot freeze committed xmax %u",
-										 xid)));
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
 			freeze_xmax = true;
-			/* No need for relfrozenxid_out handling, since we'll freeze xmax */
-		}
-		else
-		{
-			/* Might have to ratchet back relfrozenxid_out */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+
+		/*
+		 * If we freeze xmax, make absolutely sure that it's not an XID that
+		 * is important.  (Note, a lock-only xmax can be removed independent
+		 * of committedness, since a committed lock holder has released the
+		 * lock).
+		 */
+		if (freeze_xmax && !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+			TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("cannot freeze committed xmax %u",
+									 xid)));
 	}
 	else if (!TransactionIdIsValid(xid))
 	{
@@ -6679,6 +6717,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * failed; whereas a non-dead MOVED_IN tuple must mean the xvac
 		 * transaction succeeded.
 		 */
+		Assert(pagefrz->freeze_required);
 		if (tuple->t_infomask & HEAP_MOVED_OFF)
 			frz->frzflags |= XLH_INVALID_XVAC;
 		else
@@ -6687,6 +6726,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	if (replace_xmax)
 	{
 		Assert(!xmax_already_frozen && !freeze_xmax);
+		Assert(pagefrz->freeze_required);
 
 		/* Already set t_infomask/t_infomask2 flags in freeze plan */
 	}
@@ -6709,7 +6749,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Determine if this tuple is already totally frozen, or will become
-	 * totally frozen
+	 * totally frozen (provided caller executes freeze plan for the page)
 	 */
 	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
@@ -6717,6 +6757,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/* A "totally_frozen" tuple must not leave anything behind in xmax */
 	Assert(!*totally_frozen || !replace_xmax);
 
+	/*
+	 * Check if the option of _not_ freezing caller's page is still in play,
+	 * though don't bother when we already forced freezing earlier on
+	 */
+	if (!pagefrz->freeze_required && !(xmin_already_frozen &&
+									   xmax_already_frozen))
+	{
+		pagefrz->freeze_required =
+			heap_tuple_should_freeze(tuple, cutoffs,
+									 &pagefrz->NoFreezePageRelfrozenXid,
+									 &pagefrz->NoFreezePageRelminMxid);
+	}
+
 	/* Tell caller if this tuple has a usable freeze plan set in *frz */
 	return freeze_xmin || replace_xvac || replace_xmax || freeze_xmax;
 }
@@ -6761,13 +6814,12 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
  */
 void
 heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-							 TransactionId FreezeLimit,
+							 TransactionId snapshotConflictHorizon,
 							 HeapTupleFreeze *tuples, int ntuples)
 {
 	Page		page = BufferGetPage(buffer);
 
 	Assert(ntuples > 0);
-	Assert(TransactionIdIsNormal(FreezeLimit));
 
 	START_CRIT_SECTION();
 
@@ -6790,19 +6842,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 		int			nplans;
 		xl_heap_freeze_page xlrec;
 		XLogRecPtr	recptr;
-		TransactionId snapshotConflictHorizon;
 
 		/* Prepare deduplicated representation for use in WAL record */
 		nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
 
-		/*
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
-		 */
-		snapshotConflictHorizon = FreezeLimit;
-		TransactionIdRetreat(snapshotConflictHorizon);
-
 		xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
 		xlrec.nplans = nplans;
 
@@ -6843,8 +6886,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	bool		do_freeze;
 	bool		totally_frozen;
 	struct VacuumCutoffs cutoffs;
-	TransactionId NewRelfrozenXid = FreezeLimit;
-	MultiXactId NewRelminMxid = MultiXactCutoff;
+	HeapPageFreeze pagefrz;
 
 	cutoffs.relfrozenxid = relfrozenxid;
 	cutoffs.relminmxid = relminmxid;
@@ -6853,9 +6895,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 
-	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
-										  &frz, &totally_frozen,
-										  &NewRelfrozenXid, &NewRelminMxid);
+	pagefrz.freeze_required = true;
+	pagefrz.NoFreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.NoFreezePageRelminMxid = MultiXactCutoff;
+	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.FreezePageRelminMxid = MultiXactCutoff;
+
+	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs, &pagefrz,
+										  &frz, &totally_frozen);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7278,37 +7325,39 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
 }
 
 /*
- * heap_tuple_would_freeze
+ * heap_tuple_should_freeze
  *
- * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * Return value indicates if heap_prepare_freeze_tuple sibling function should
+ * force freezing of the page containing tuple.  This happens whenever the
+ * tuple contains XID/MXID fields with values < FreezeLimit/MultiXactCutoff.
  *
- * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
- * like the heap_prepare_freeze_tuple arguments that they're based on.  We
- * never freeze here, which makes tracking the oldest extant XID/MXID simple.
+ * The *NoFreezePageRelfrozenXid and *NoFreezePageRelminMxid input/output
+ * arguments help VACUUM track the oldest extant XID/MXID remaining in rel.
+ * Our working assumption is that caller won't decide to freeze this tuple.
+ * It's up to caller to only ratchet back its own top-level trackers after the
+ * point that it commits to not freezing the tuple/page in question.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple,
-						const struct VacuumCutoffs *cutoffs,
-						TransactionId *relfrozenxid_out,
-						MultiXactId *relminmxid_out)
+heap_tuple_should_freeze(HeapTupleHeader tuple,
+						 const struct VacuumCutoffs *cutoffs,
+						 TransactionId *NoFreezePageRelfrozenXid,
+						 MultiXactId *NoFreezePageRelminMxid)
 {
-	TransactionId cutoff_xid = cutoffs->FreezeLimit;
-	MultiXactId cutoff_multi = cutoffs->MultiXactCutoff;
+	TransactionId MustFreezeLimit = cutoffs->FreezeLimit;
+	MultiXactId MustFreezeMultiLimit = cutoffs->MultiXactCutoff;
 	TransactionId xid;
 	MultiXactId multi;
-	bool		would_freeze = false;
+	bool		freeze = false;
 
 	/* First deal with xmin */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (TransactionIdIsNormal(xid))
 	{
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
-			would_freeze = true;
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
+		if (TransactionIdPrecedes(xid, MustFreezeLimit))
+			freeze = true;
 	}
 
 	/* Now deal with xmax */
@@ -7321,11 +7370,12 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 
 	if (TransactionIdIsNormal(xid))
 	{
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 		/* xmax is a non-permanent XID */
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
-			would_freeze = true;
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
+		if (TransactionIdPrecedes(xid, MustFreezeLimit))
+			freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
 	{
@@ -7334,10 +7384,10 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
 	{
 		/* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
 		/* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
-		would_freeze = true;
+		freeze = true;
 	}
 	else
 	{
@@ -7345,10 +7395,11 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		MultiXactMember *members;
 		int			nmembers;
 
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
-			would_freeze = true;
+		Assert(MultiXactIdPrecedesOrEquals(cutoffs->relminmxid, multi));
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
+		if (MultiXactIdPrecedes(multi, MustFreezeMultiLimit))
+			freeze = true;
 
 		/* need to check whether any member of the mxact is old */
 		nmembers = GetMultiXactIdMembers(multi, &members, false,
@@ -7357,11 +7408,11 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		for (int i = 0; i < nmembers; i++)
 		{
 			xid = members[i].xid;
-			Assert(TransactionIdIsNormal(xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
-				would_freeze = true;
+			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
+			if (TransactionIdPrecedes(xid, MustFreezeLimit))
+				freeze = true;
 		}
 		if (nmembers > 0)
 			pfree(members);
@@ -7372,14 +7423,15 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		xid = HeapTupleHeaderGetXvac(tuple);
 		if (TransactionIdIsNormal(xid))
 		{
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
-			would_freeze = true;
+			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
+			/* heap_prepare_freeze_tuple forces xvac freezing */
+			freeze = true;
 		}
 	}
 
-	return would_freeze;
+	return freeze;
 }
 
 /*
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 91c5f5e9e..e334ee8dc 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -21,6 +21,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -205,9 +206,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		{
 			int			ndeleted,
 						nnewlpdead;
+			bool		fpi;
 
 			ndeleted = heap_page_prune(relation, buffer, vistest, limited_xmin,
-									   limited_ts, &nnewlpdead, NULL);
+									   limited_ts, &nnewlpdead, &fpi, NULL);
 
 			/*
 			 * Report the number of tuples reclaimed to pgstats.  This is
@@ -255,7 +257,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * InvalidTransactionId/0 respectively.
  *
  * Sets *nnewlpdead for caller, indicating the number of items that were
- * newly set LP_DEAD during prune operation.
+ * newly set LP_DEAD during prune operation.  Also sets *prune_fpi for
+ * caller, indicating if pruning generated a full-page image as torn page
+ * protection.
  *
  * off_loc is the offset location required by the caller to use in error
  * callback.
@@ -267,7 +271,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				int *nnewlpdead,
+				int *nnewlpdead, bool *prune_fpi,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -380,6 +384,8 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	*prune_fpi = false;			/* for now */
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -417,6 +423,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
+			int64		wal_fpi_before = pgWalUsage.wal_fpi;
 
 			xlrec.snapshotConflictHorizon = prstate.snapshotConflictHorizon;
 			xlrec.nredirected = prstate.nredirected;
@@ -448,6 +455,9 @@ heap_page_prune(Relation relation, Buffer buffer,
 			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
+
+			if (wal_fpi_before != pgWalUsage.wal_fpi)
+				*prune_fpi = true;
 		}
 	}
 	else
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b234072e8..fe64bd6ed 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1528,8 +1528,9 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	bool		prune_fpi;
+	HeapPageFreeze pagefrz;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
@@ -1545,8 +1546,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.freeze_required = false;
+	pagefrz.NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.FreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.FreezePageRelminMxid = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1564,7 +1568,7 @@ retry:
 	 */
 	tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
 									 InvalidTransactionId, 0, &nnewlpdead,
-									 &vacrel->offnum);
+									 &prune_fpi, &vacrel->offnum);
 
 	/*
 	 * Now scan the page to collect LP_DEAD items and check for tuples
@@ -1599,27 +1603,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Our opinion of whether the page 'hastup' is inherently
+			 * race-prone anyway.  Caller must treat it as unreliable, so we
+			 * might as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1746,9 +1746,8 @@ retry:
 		prunestate->hastup = true;	/* page makes rel truncation unsafe */
 
 		/* Tuple with storage -- consider need to freeze */
-		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
-									  &frozen[tuples_frozen], &totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs, &pagefrz,
+									  &frozen[tuples_frozen], &totally_frozen))
 		{
 			/* Save prepared freeze plan for later */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1769,23 +1768,65 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
+	 * freeze when pruning generated an FPI, if doing so means that we set the
+	 * page all-frozen afterwards (this could happen during second heap pass).
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (pagefrz.freeze_required || tuples_frozen == 0 ||
+		(prunestate->all_visible && prunestate->all_frozen && prune_fpi))
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (pruning might be all we need).
+		 */
+		vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* NewRelfrozenXid <= all XIDs in tuples that weren't pruned away */
+		vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
-	 * first (arbitrary)
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		TransactionId snapshotConflictHorizon;
+
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
+		/*
+		 * We can use the latest xmin cutoff (which is generally used for 'VM
+		 * set' conflicts) as our cutoff for freeze conflicts when the whole
+		 * page is eligible to become all-frozen in the VM once frozen by us.
+		 * Otherwise use a conservative cutoff (just back up from OldestXmin).
+		 */
+		if (prunestate->all_visible && prunestate->all_frozen)
+			snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+		else
+		{
+			snapshotConflictHorizon = vacrel->cutoffs.OldestXmin;
+			TransactionIdRetreat(snapshotConflictHorizon);
+		}
+
 		/* Execute all freeze plans for page as a single atomic action */
 		heap_freeze_execute_prepared(vacrel->rel, buf,
-									 vacrel->cutoffs.FreezeLimit,
+									 snapshotConflictHorizon,
 									 frozen, tuples_frozen);
 	}
 
@@ -1804,7 +1845,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1812,8 +1853,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1834,9 +1874,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1850,6 +1887,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
@@ -1894,8 +1935,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				recently_dead_tuples,
 				missed_dead_tuples;
 	HeapTupleHeader tupleheader;
-	TransactionId NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	MultiXactId NewRelminMxid = vacrel->NewRelminMxid;
+	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 
 	Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1940,8 +1981,9 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_would_freeze(tupleheader, &vacrel->cutoffs,
-									&NewRelfrozenXid, &NewRelminMxid))
+		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
+									 &NoFreezePageRelfrozenXid,
+									 &NoFreezePageRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
 			if (vacrel->aggressive)
@@ -2022,8 +2064,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 	 * this particular page until the next VACUUM.  Remember its details now.
 	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
 	/* Save any LP_DEAD items found on the page in dead_items array */
 	if (vacrel->nindexes == 0)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ff6fcd902..9830a0309 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9137,9 +9137,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9217,9 +9217,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        should use to decide whether to trigger freezing of pages with
+        an older multixact ID.  The default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
-- 
2.38.1

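To make the new page-level logic in lazy_scan_prune (0003 above) easier to
review, here's a minimal standalone sketch of the decision it makes once
pruning and per-tuple freeze planning are done.  The helper name and its bool
arguments are hypothetical (the real code works directly off HeapPageFreeze,
LVPagePruneState, and prune_fpi), but the rule is the same:

/*
 * Hypothetical helper, for illustration only: should lazy_scan_prune execute
 * every freeze plan that heap_prepare_freeze_tuple found eligible on this page?
 */
static bool
page_should_freeze(bool freeze_required,	/* set by heap_prepare_freeze_tuple */
				   int	tuples_frozen,		/* number of prepared freeze plans */
				   bool all_visible,
				   bool all_frozen,
				   bool prune_fpi)			/* pruning emitted a full-page image? */
{
	if (freeze_required)
		return true;	/* XID < FreezeLimit (or MXID < MultiXactCutoff) forces it */
	if (tuples_frozen == 0)
		return true;	/* "freezing" costs nothing -- no freeze plans to execute */
	if (all_visible && all_frozen && prune_fpi)
		return true;	/* page becomes all-frozen, and an FPI was written anyway */
	return false;		/* stay lazy */
}

When the answer is "no", VACUUM discards the freeze plans it prepared
(tuples_frozen is reset to 0), carries the NoFreezePage* trackers forward into
NewRelfrozenXid/NewRelminMxid, and the page can still be set all-visible but
never all-frozen.
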
Attachment: v9-0004-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/x-patch)
From 3a7dbf780a81a6ec36905259d456b46c50473827 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v9 4/5] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will skip pages.  The data structure we use is a local
copy of the visibility map at the start of VACUUM.  It spills to disk as
required.  In practice VACUUM only uses a temp file for relations that
are large enough to have more than a single visibility map page.

Non-aggressive VACUUMs now make an up-front choice about VM snapshot
strategy: they decide whether or not to prioritize early advancement of
relfrozenxid (eager strategy) over avoiding work by skipping all-visible
pages (lazy strategy).  VACUUM decides on its skipping and freezing
strategies together, shortly before the first pass over the heap begins,
since the concepts are closely related, and work in tandem.  Note that
the eager VM strategy often has a significant impact on the total number
of pages frozen by VACUUM, even when lazy freezing is also used.  (In
general VACUUM tends to use either lazy or eager strategies across the
board, though notable exceptions exist.)

Also make the VACUUM command's DISABLE_PAGE_SKIPPING option stop forcing
aggressive mode.  As a consequence, the option will no longer have any
impact on when or how VACUUM waits for a cleanup lock the hard way.  The
option now makes VACUUM distrust the visibility map, and nothing more.
DISABLE_PAGE_SKIPPING now works by making VACUUM opt to use a dedicated
"no skipping" VM snapshot strategy.

This lays the groundwork for completely removing aggressive mode VACUUMs
in a later commit; vmsnap strategies supersede the "early aggressive
VACUUM" concept previously implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  VACUUM makes a choice about
which VM skip strategy to use by considering how close table age is to
autovacuum_freeze_max_age (actually vacuum_freeze_table_age) directly,
in a way that is roughly comparable to our previous approach.  But table
age is now just one factor weighed among several.

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.  VACUUM's
final scanned_pages is "locked in" when it decides on skipping strategy
(so scanned_pages is finalized before the first heap pass even begins).

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Also teach VACUUM to use scanned_pages (not rel_pages) to cap the size
of the dead_items array.  This approach is strictly better, since there
is no question of scanning any pages other than the precise set of pages
already locked in by vmsnap by the time dead_items is allocated.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |   5 +-
 src/backend/access/heap/vacuumlazy.c          | 450 ++++++++-------
 src/backend/access/heap/visibilitymap.c       | 541 ++++++++++++++++++
 src/backend/commands/cluster.c                |   3 +-
 src/backend/commands/vacuum.c                 |  81 +--
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  34 +-
 doc/src/sgml/ref/vacuum.sgml                  |   9 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 12 files changed, 906 insertions(+), 267 deletions(-)
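
Before the diff itself, a quick illustration of the skipping-strategy
threshold that the new lazy_scan_strategy uses.  This is a simplified sketch,
not the patch code: the function name is made up, and it leaves out the cases
that force eager skipping outright (rel_pages crossing twice
freeze_strategy_threshold, tableagefrac reaching the high point, aggressive
VACUUMs, and DISABLE_PAGE_SKIPPING).  The constants follow the patch:

#define SKIPALLVIS_MIN_PAGES		0.05	/* 5% of rel_pages */
#define SKIPALLVIS_MAX_PAGES		0.70
#define TABLEAGEFRAC_MIDPOINT		0.5
#define TABLEAGEFRAC_HIGHPOINT		0.9

/*
 * Sketch only: given the number of "extra" pages that eager skipping
 * (VMSNAP_SKIP_ALL_FROZEN) would scan compared with lazy skipping
 * (VMSNAP_SKIP_ALL_VISIBLE), decide whether to be eager.
 */
static bool
prefer_eager_skipping(double rel_pages, double nextra, double tableagefrac)
{
	double		threshold;

	if (tableagefrac < TABLEAGEFRAC_MIDPOINT)
		threshold = rel_pages * SKIPALLVIS_MIN_PAGES;
	else
	{
		/* scale linearly from the 5% threshold up to the 70% threshold */
		double		scale_up = 1.0 - (TABLEAGEFRAC_HIGHPOINT - tableagefrac) /
			(TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT);

		threshold = rel_pages * SKIPALLVIS_MIN_PAGES * (1.0 - scale_up) +
			rel_pages * SKIPALLVIS_MAX_PAGES * scale_up;
	}

	if (threshold < 32)
		threshold = 32;

	/* be eager unless the extra scan cost crosses the threshold */
	return nextra < threshold;
}

For example, with rel_pages = 100000: at tableagefrac = 0.3 the threshold is
5000 extra pages; at 0.7 it works out to 0.5 * 5% + 0.5 * 70% = 37.5% of
rel_pages, i.e. 37500 extra pages; approaching 0.9 it tends towards 70000.
The closer table age gets to forcing the issue, the more extra scanning
VACUUM is willing to do in order to advance relfrozenxid now.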

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..358b6f0fa 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM skipping strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SKIP_NONE = 0,
+	VMSNAP_SKIP_ALL_VISIBLE,
+	VMSNAP_SKIP_ALL_FROZEN
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_skipallvis,
+											  BlockNumber *scanned_pages_skipallfrozen);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index de28d581a..4dcef3e67 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 
 /*
  * Values used by index_cleanup and truncate params.
@@ -336,7 +336,8 @@ extern void vac_update_relstats(Relation relation,
 								bool *minmulti_updated,
 								bool in_outer_xact);
 extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
-							   struct VacuumCutoffs *cutoffs);
+							   struct VacuumCutoffs *cutoffs,
+							   double *tableagefrac);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 307842582..60c1e2cec 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -109,10 +109,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * Thresholds (expressed as a proportion of rel_pages) that influence VACUUM's
+ * choice of skipping strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define SKIPALLVIS_MIN_PAGES		0.05	/* 5% of rel_pages */
+#define SKIPALLVIS_MAX_PAGES		0.70
+
+/*
+ * tableagefrac-wise cutoffs that control when VACUUM decides on skipping
+ * using SKIPALLVIS_MIN_PAGES and SKIPALLVIS_MAX_PAGES cutoffs respectively
+ */
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -150,8 +158,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -170,7 +176,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -243,11 +251,9 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  const VacuumParams *params,
+									  double tableagefrac);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -277,7 +283,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -309,10 +316,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
+	double		tableagefrac;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -452,43 +460,34 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs,
+											&tableagefrac);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Determine freezing strategy used by VACUUM
+	 * Now determine skipping and freezing strategies used by this VACUUM.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to skip.
+	 * Using an immutable structure (instead of the live visibility map) helps
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel, params, tableagefrac);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -498,13 +497,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -551,12 +551,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SKIP_ALL_VISIBLE)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid when lazy_scan_strategy call
+		 * decided to skip all-visible pages
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -601,6 +600,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -628,10 +630,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -827,13 +825,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -847,42 +844,29 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+												 &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
+		if (blkno < next_block_to_scan)
 		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
-
-			Assert(next_unskippable_block >= blkno + 1);
+			Assert(blkno != rel_pages - 1);
+			continue;
 		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
 
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+													 &next_all_visible);
+		Assert(next_block_to_scan > blkno);
 
 		vacrel->scanned_pages++;
 
@@ -1092,10 +1076,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1123,12 +1106,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1167,7 +1148,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1260,128 +1241,204 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/skipping strategy.
  *
  * Our traditional/lazy freezing strategy is useful when putting off the work
  * of freezing totally avoids work that turns out to have been unnecessary.
  * On the other hand we eagerly freeze pages when that strategy spreads out
  * the burden of freezing over time.
+ *
+ * Also determines if the ongoing VACUUM operation should skip all-visible
+ * pages to save work in the near term, or if we should prefer to advance
+ * relfrozenxid/relminmxid in the near term instead.
+ *
+ * Freezing and skipping strategies are structured as two independent choices,
+ * but in practice they are not independent (the split is purely mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by the same information, and similar considerations about the
+ * needs of the table.  Moreover, choosing eager skipping behavior is often
+ * expected to directly result in freezing many more pages, since VACUUM can
+ * only _consider_ freezing pages that it actually scans in the first place.
+ * All-visible pages are only eligible for freezing when not skipped over.
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on the skipping strategy
+ * decided here.
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, const VacuumParams *params,
+				   double tableagefrac)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				force_eager_skip_threshold,
+				scanned_pages_skipallvis,
+				scanned_pages_skipallfrozen;
 
 	Assert(vacrel->scanned_pages == 0);
 
+	/* Acquire a VM snapshot for VACUUM operation */
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_skipallvis,
+												&scanned_pages_skipallfrozen);
+	vacrel->vmstrat = VMSNAP_SKIP_NONE;
+
+	/*
+	 * The eager freezing strategy is used when a physical table size
+	 * threshold controlled by the freeze_strategy_threshold GUC/reloption is
+	 * crossed.  Also freeze eagerly whenever table age is close to requiring
+	 * (or is actually undergoing) an antiwraparound autovacuum.
+	 */
 	vacrel->eager_freeze_strategy =
-		rel_pages >= vacrel->cutoffs.freeze_strategy_threshold;
-}
+		(tableagefrac >= TABLEAGEFRAC_HIGHPOINT ||
+		 rel_pages >= vacrel->cutoffs.freeze_strategy_threshold);
 
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
+	/*
+	 * Force the use of VMSNAP_SKIP_ALL_FROZEN when rel_pages is now at least
+	 * twice freeze_strategy_threshold.
+	 *
+	 * "Staggering" the freezing and skipping behaviors like this is intended
+	 * to give VACUUM the benefit of the lazy strategies where they are useful
+	 * (when vacuuming smaller tables), while avoiding sharp discontinuities
+	 * in the overhead of freezing when transitioning to eager behaviors.  It
+	 * is useful to make a gradual transition for tables that start out small,
+	 * but continue to grow.  We can mostly avoid any large once-off freezing
+	 * spikes this way.  (Recall that use of the VMSNAP_SKIP_ALL_FROZEN vmsnap
+	 * strategy is often enough to significantly increase the number of pages
+	 * frozen, even when VACUUM also opts to use the lazy freezing strategy.)
+	 *
+	 * force_eager_skip_threshold is useful because it is an _absolute_ cutoff
+	 * that doesn't depend on short-term costs, nor on tableagefrac.  VACUUM
+	 * thereby avoids concentrated build-ups of unfrozen pages in any table.
+	 * This is important during bulk loading, where very few transactions will
+	 * leave behind very many heap pages that we should freeze proactively.
+	 *
+	 * Laziness is only valuable when it totally avoids unnecessary freezing,
+	 * which is much less likely to work out (and much more likely to lead to
+	 * disruptive "catch-up" freezing) with a larger table.
+	 */
+	force_eager_skip_threshold = vacrel->cutoffs.freeze_strategy_threshold;
+	if (force_eager_skip_threshold < MaxBlockNumber / 2)
+		force_eager_skip_threshold *= 2;
+	if (tableagefrac >= TABLEAGEFRAC_HIGHPOINT ||
+		rel_pages >= force_eager_skip_threshold)
 	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
+		vacrel->vmstrat = VMSNAP_SKIP_ALL_FROZEN;
+	}
+	else
+	{
+		BlockNumber nextra,
+					nextra_min_threshold,
+					nextra_max_threshold,
+					prefer_laziness_threshold;
 
 		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
+		 * Neither tableagefrac nor rel_pages crossed the thresholds that
+		 * automatically force use of the VMSNAP_SKIP_ALL_FROZEN strategy.
+		 * Advancing relfrozenxid/relminmxid eagerly may still make sense, but
+		 * we now need to apply more information to decide what to do.
 		 *
-		 * Implement this by always treating the last block as unsafe to skip.
+		 * Determine the number of "extra" scanned_pages incurred by using
+		 * VMSNAP_SKIP_ALL_FROZEN instead of VMSNAP_SKIP_ALL_VISIBLE, which is
+		 * the "extra" cost that our eager VMSNAP_SKIP_ALL_FROZEN strategy
+		 * incurs, if we actually opt to use it.
+		 *
+		 * Also determine guideline "extra" scanned_pages thresholds.  These
+		 * represent minimum and maximum sensible thresholds for rel.
 		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
+		nextra = scanned_pages_skipallfrozen - scanned_pages_skipallvis;
+		Assert(rel_pages >= nextra);
+		nextra_min_threshold = (double) rel_pages * SKIPALLVIS_MIN_PAGES;
+		nextra_max_threshold = (double) rel_pages * SKIPALLVIS_MAX_PAGES;
+		Assert(nextra_max_threshold >= nextra_min_threshold);
 
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-			break;
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+		if (tableagefrac < TABLEAGEFRAC_MIDPOINT)
 		{
-			if (vacrel->aggressive)
-				break;
+			/*
+			 * The table's age is still below table age mid point, so table
+			 * age is still of only minimal concern.  We're still willing to
+			 * act eagerly when it's _very_ cheap to do so.  Specifically,
+			 * when VMSNAP_SKIP_ALL_FROZEN requires VACUUM to scan a number of
+			 * extra pages not exceeding 5% of rel_pages.
+			 */
+			prefer_laziness_threshold = nextra_min_threshold;
+		}
+		else
+		{
+			double		tableagefrac_high_delta,
+						min_scale_up;
 
 			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
+			 * Our tableagefrac is some point between TABLEAGEFRAC_MIDPOINT
+			 * and TABLEAGEFRAC_HIGHPOINT.  This means that table age is
+			 * starting to become a concern, but not to the extent that we're
+			 * forced to use VMSNAP_SKIP_ALL_FROZEN strategy (not yet).  We'll
+			 * need to weigh both costs and benefits to decide on a strategy.
+			 *
+			 * If tableagefrac is only barely over the midway point, then
+			 * we'll choose an "extra blocks" threshold of ~5% of rel_pages.
+			 * The opposite extreme occurs when tableagefrac is very near to
+			 * the high point.  That will make our "extra blocks" threshold
+			 * very aggressive: we'll go with VMSNAP_SKIP_ALL_FROZEN when
+			 * doing so requires we scan a number of extra blocks as high as
+			 * ~70% of rel_pages.  Our final "extra blocks" threshold is most
+			 * likely to fall between the two extremes (when we end up here).
+			 *
+			 * Note that the "extra blocks" threshold we'll use increases at an
+			 * accelerating rate as tableagefrac itself increases (assuming
+			 * a fixed rel_pages, though if rel_pages actually grows then it's
+			 * probably even more likely that VMSNAP_SKIP_ALL_FROZEN will get
+			 * used before long).
+			 *
+			 * Note also that it is unlikely that tables that require regular
+			 * vacuuming will ever have a VACUUM whose tableagefrac actually
+			 * reaches TABLEAGEFRAC_HIGHPOINT, barring cases where table age
+			 * based settings like autovacuum_freeze_max_age are set to very
+			 * low values (which includes VACUUM FREEZE).
 			 */
-			skipsallvis = true;
+			Assert(tableagefrac < TABLEAGEFRAC_HIGHPOINT);
+			tableagefrac_high_delta = TABLEAGEFRAC_HIGHPOINT - tableagefrac;
+			min_scale_up = 1.0 - (tableagefrac_high_delta /
+								  (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+			prefer_laziness_threshold =
+				(nextra_min_threshold * (1.0 - min_scale_up)) +
+				(nextra_max_threshold * min_scale_up);
 		}
 
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
+		prefer_laziness_threshold = Max(32, prefer_laziness_threshold);
+		if (nextra >= prefer_laziness_threshold)
+			vacrel->vmstrat = VMSNAP_SKIP_ALL_VISIBLE;
+		else
+			vacrel->vmstrat = VMSNAP_SKIP_ALL_FROZEN;
 	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
-	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * Override choice of skipping strategy (force vmsnap to scan every page
+	 * in the range of rel_pages) in the DISABLE_PAGE_SKIPPING case.  Also
+	 * defensively force the all-frozen skipping strategy in aggressive VACUUMs.
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
-	else
-	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
-	}
+	Assert(vacrel->vmstrat != VMSNAP_SKIP_NONE);
+	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+		vacrel->vmstrat = VMSNAP_SKIP_NONE;
+	else if (vacrel->aggressive)
+		vacrel->vmstrat = VMSNAP_SKIP_ALL_FROZEN;
 
-	return next_unskippable_block;
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SKIP_ALL_VISIBLE)
+		return scanned_pages_skipallvis;
+	if (vacrel->vmstrat == VMSNAP_SKIP_ALL_FROZEN)
+		return scanned_pages_skipallfrozen;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SKIP_NONE case */
+	return rel_pages;
 }
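
As a rough illustration of the interpolation described in the comments above (this is not part of the patch; the SKETCH_* constants are placeholder values assumed for the example, and the ~5%/~70% bounds are taken from the comments rather than from any definition shown here), the threshold computation amounts to:

/*
 * Illustrative sketch only.  The two placeholder constants are assumptions
 * made for this example; the patch defines its own midpoint/highpoint values.
 */
#define SKETCH_TABLEAGEFRAC_MIDPOINT	0.5		/* assumed placeholder */
#define SKETCH_TABLEAGEFRAC_HIGHPOINT	0.9		/* assumed placeholder */

static double
sketch_prefer_laziness_threshold(double tableagefrac, double rel_pages)
{
	double		nextra_min = rel_pages * 0.05;	/* ~5% of rel_pages */
	double		nextra_max = rel_pages * 0.70;	/* ~70% of rel_pages */
	double		scale_up;

	if (tableagefrac < SKETCH_TABLEAGEFRAC_MIDPOINT)
		return nextra_min;		/* table age of only minimal concern */

	/* linear interpolation between the ~5% and ~70% extremes */
	scale_up = 1.0 - ((SKETCH_TABLEAGEFRAC_HIGHPOINT - tableagefrac) /
					  (SKETCH_TABLEAGEFRAC_HIGHPOINT - SKETCH_TABLEAGEFRAC_MIDPOINT));

	return (nextra_min * (1.0 - scale_up)) + (nextra_max * scale_up);
}

VACUUM then compares nextra (the number of extra pages the eager VMSNAP_SKIP_ALL_FROZEN strategy would have to scan) against this threshold, preferring laziness only when the extra work reaches or exceeds it; the patch additionally floors the threshold at 32 pages.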
 
 /*
@@ -2821,6 +2878,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * for pages that it skips using the VM, so we must avoid interpreting skipped
+ * pages as empty pages when it makes little sense.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3111,14 +3176,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3127,15 +3191,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3157,12 +3219,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..27045032a 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's skipping strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,87 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+typedef struct vmsnapblock
+{
+	BlockNumber scanned_block;
+	bool		all_visible;
+} vmsnapblock;
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Skipping strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_skipallvis;
+	BlockNumber scanned_pages_skipallfrozen;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	vmsnapblock staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -373,6 +461,350 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_skipallvis and scanned_pages_skipallfrozen to help VACUUM
+ * decide on its skipping strategy.  These are VACUUM's scanned_pages when it
+ * opts to skip all eligible pages and scanned_pages when it opts to just skip
+ * all-frozen pages, respectively.
+ *
+ * Caller finalizes skipping strategy by calling visibilitymap_snap_strategy.
+ * This determines the kind of blocks visibilitymap_snap_next should indicate
+ * need to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_skipallvis,
+						   BlockNumber *scanned_pages_skipallfrozen)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SKIP_NONE;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SKIP_NONE */
+	vmsnap->scanned_pages_skipallvis = 0;
+	vmsnap->scanned_pages_skipallfrozen = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is sheer paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	*scanned_pages_skipallvis = rel_pages - all_visible;
+	*scanned_pages_skipallfrozen = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each skipping strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+		(*scanned_pages_skipallvis)++;
+	if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+		(*scanned_pages_skipallfrozen)++;
+
+	vmsnap->scanned_pages_skipallvis = *scanned_pages_skipallvis;
+	vmsnap->scanned_pages_skipallfrozen = *scanned_pages_skipallfrozen;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's skipping strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final skipping strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SKIP_ALL_VISIBLE)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_skipallvis;
+	else if (vmsnap->strat == VMSNAP_SKIP_ALL_FROZEN)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_skipallfrozen;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		BlockNumber block = vmsnap->staged[i].scanned_block;
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, block);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * The all-visible status of returned block is set in *all_visible.  Block
+ * usually won't be set all-visible (else VACUUM wouldn't need to scan it),
+ * but it can be in certain corner cases.  This includes the VMSNAP_SKIP_NONE
+ * case, as well as a special case that VACUUM expects us to handle: the final
+ * block (rel_pages - 1) is always returned here (regardless of our strategy).
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible)
+{
+	BlockNumber next_block_to_scan;
+	vmsnapblock block;
+
+	*allvisible = true;
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	block = vmsnap->staged[vmsnap->next_return_idx++];
+	*allvisible = block.all_visible;
+	next_block_to_scan = block.scanned_block;
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(vmsnapblock) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		vmsnapblock prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch.scanned_block);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
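
Pieced together from the comments on these routines, the expected calling sequence on the vacuumlazy.c side looks roughly like the following sketch (not taken from the patch; simplified, with strategy selection, error handling, and the DISABLE_PAGE_SKIPPING override omitted):

/* Sketch of a vmsnap consumer, based only on the API comments above */
static void
sketch_vacuum_heap_scan(Relation rel, BlockNumber rel_pages)
{
	BlockNumber scanned_skipallvis,
				scanned_skipallfrozen,
				blkno;
	bool		all_visible;
	vmsnapshot *vmsnap;

	vmsnap = visibilitymap_snap_acquire(rel, rel_pages,
										&scanned_skipallvis,
										&scanned_skipallfrozen);

	/* caller weighs both scanned_pages figures before committing to one */
	visibilitymap_snap_strategy(vmsnap, VMSNAP_SKIP_ALL_FROZEN);

	while ((blkno = visibilitymap_snap_next(vmsnap, &all_visible)) !=
		   InvalidBlockNumber)
	{
		/* scan heap block blkno; all blocks before it were skipped via the VM */
	}

	visibilitymap_snap_release(vmsnap);
}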
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -677,3 +1109,112 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		bool		all_visible = true;
+		vmsnapblock stage;
+
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				all_visible = false;
+				break;
+			}
+
+			/*
+			 * Stop staging blocks just before final page, which must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SKIP_NONE forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SKIP_NONE)
+				break;
+
+			/*
+			 * Check if it would be unsafe to skip this page because it's
+			 * merely all-visible (not all-frozen), and we're using the
+			 * VMSNAP_SKIP_ALL_FROZEN strategy.
+			 */
+			if (vmsnap->strat == VMSNAP_SKIP_ALL_FROZEN &&
+				(mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+				break;
+
+			/* VACUUM will skip this page -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		stage.scanned_block = vmsnap->next_block++;
+		stage.all_visible = all_visible;
+		vmsnap->staged[vmsnap->first_invalid_idx++] = stage;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired, we
+	 * defensively assume that heapBlk is not all-visible or all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (mapBlock >= vmsnap->nvmpages)
+		return 0;
+
+	/* Read from temp file when required */
+	if (mapBlock != vmsnap->curvmpage)
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b0e310604..5b20d5618 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -825,6 +825,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	VacuumParams params;
 	struct VacuumCutoffs cutoffs;
+	double		tableagefrac;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -913,7 +914,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_get_cutoffs(OldHeap, &params, &cutoffs);
+	vacuum_get_cutoffs(OldHeap, &params, &cutoffs, &tableagefrac);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 420b85be6..e2f586687 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -920,7 +920,16 @@ get_all_vacuum_rels(int options)
  *
  * The target relation and VACUUM parameters are our inputs.
  *
- * Output parameters are the cutoffs that VACUUM caller should use.
+ * Output parameters are the cutoffs that VACUUM caller should use, and
+ * tableagefrac, which indicates how close rel is to requiring that VACUUM
+ * advance relfrozenxid and/or relminmxid.
+ *
+ * The tableagefrac value 1.0 represents the point that autovacuum.c scheduling
+ * (and VACUUM itself) considers relfrozenxid advancement strictly necessary.
+ * Lower values provide useful context, and influence whether VACUUM will opt
+ * to advance relfrozenxid before the point that it is strictly necessary.
+ * VACUUM can (and often does) opt to advance relfrozenxid proactively.  It is
+ * especially likely with tables where the _added_ costs happen to be low.
  *
  * Return value indicates if vacuumlazy.c caller should make its VACUUM
  * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
@@ -929,7 +938,7 @@ get_all_vacuum_rels(int options)
  */
 bool
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
-				   struct VacuumCutoffs *cutoffs)
+				   struct VacuumCutoffs *cutoffs, double *tableagefrac)
 {
 	int			freeze_min_age,
 				multixact_freeze_min_age,
@@ -938,11 +947,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				effective_multixact_freeze_max_age,
 				freeze_strategy_threshold;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1074,48 +1083,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * XMID table age (whichever is greater currently).
+	 * MXID table age (whichever is greater currently).
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	*tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		*tableagefrac = 1.0;
+
+	return (*tableagefrac >= 1.0);
 }
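
As a worked example of the tableagefrac arithmetic above: with freeze_table_age falling back to the stock autovacuum_freeze_max_age default of 200 million, a table whose relfrozenxid trails nextXID by 100 million XIDs gets XIDFrac = 100000000 / (200000000 + 0.5), or roughly 0.5; once the table's XID age reaches 200 million the fraction hits 1.0 and the function reports that relfrozenxid advancement is required.  (The 200 million figure is just the default; the actual divisor depends on configuration, and the MXID fraction is computed the same way.)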
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 549a2e969..554e2bd0c 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2476,10 +2476,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2496,10 +2496,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4763cb6bb..bb50a5486 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -658,6 +658,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -691,11 +698,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 094f9a35d..02186ce36 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9112,20 +9112,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index f61433c7d..9cae899d5 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -154,13 +154,8 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>DISABLE_PAGE_SKIPPING</literal></term>
     <listitem>
      <para>
-      Normally, <command>VACUUM</command> will skip pages based on the <link
-      linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      Normally, <command>VACUUM</command> will skip pages based on the
+      <link linkend="vacuum-for-visibility-map">visibility map</link>.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1

#38 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#37)
Re: New strategies for freezing, advancing relfrozenxid early

On Sat, 2022-12-10 at 18:11 -0800, Peter Geoghegan wrote:

On Tue, Dec 6, 2022 at 1:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

v9 will also address some of the concerns you raised in your review
that weren't covered by v8, especially about the VM snapshotting
infrastructure. But also your concerns about the transition from
lazy
strategies to eager strategies.

Attached is v9. Highlights:

Comments:

* The documentation shouldn't have a heading like "Managing the 32-bit
Transaction ID address space". We already have a concept of "age"
documented, and I think that's all that's needed in the relevant
section. Freezing is driven by a need to keep the age of the oldest
transaction ID in a table to less than ~2B; and also the need to
truncate the clog (and reduce lookups of really old xids). It's fine to
give a brief explanation about why we can't track very old xids, but
it's more of an internal detail and not the main point.

* I'm still having a hard time with vacuum_freeze_strategy_threshold.
Part of it is the name, which doesn't seem to convey the meaning. But
the heuristic also seems off to me. What if you have lots of partitions
in an append-only range-partitioned table? That would tend to use the
lazy freezing strategy (because each partition is small), but that's
not what you want. I understand heuristics aren't perfect, but it feels
like we could do something better. Also, another purpose of this seems
to be to achieve v15 behavior (if v16 behavior causes a problem for
some workload), which seems like a good idea, but perhaps we should
have a more direct setting for that?

* The comment above lazy_scan_strategy() is phrased in terms of the
"traditional approach". It would be more clear if you described the
current strategies and how they're chosen. The pre-16 behavior was as
lazy as possible, so that's easy enough to describe without referring
to history.

* "eager skipping behavior" seems like a weird phrasing because it's
not immediately clear if that means "skip more pages" (eager to skip
pages and lazy to process them) or "skip fewer pages" (lazy to skip the
pages and eager to process the pages).

* The skipping behavior for all-visible pages is binary: skip them
all, or skip none. That makes sense in the context of relfrozenxid
advancement. But how does that avoid IO spikes? It would seem perfectly
reasonable to me, if relfrozenxid advancement is not a pressing
problem, to process some fraction of the all-visible pages (or perhaps
process enough of them to freeze some fraction). That would ensure that
each VACUUM makes a payment on the deferred costs of freezing. I think
this has already been discussed but it keeps reappearing in my mind, so
maybe we can settle this with a comment (and/or docs)?

* I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
doesn't take effect unless the page is not skipped, which is confusing
from a usability standpoint, and we have better heuristics to decide if
the whole page should be frozen or not anyway (i.e. if an FPI was
already taken then freezing is cheaper).

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In reply to: Jeff Davis (#38)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis <pgsql@j-davis.com> wrote:

Freezing is driven by a need to keep the age of the oldest
transaction ID in a table to less than ~2B; and also the need to
truncate the clog (and reduce lookups of really old xids). It's fine to
give a brief explanation about why we can't track very old xids, but
it's more of an internal detail and not the main point.

I agree that that's the conventional definition. What I am proposing
is that we revise that definition a little. We should start the
discussion of freezing in the user level docs by pointing out that
freezing also plays a role at the level of individual pages. An
all-frozen page is self-contained, now and forever (or until it gets
dirtied again, at least). Even on a standby we will reliably avoid
having to do clog lookups for a page that happens to have all of its
tuples frozen.

I don't want to push back too much here. I just don't think that it
makes terribly much sense for the docs to start the conversation about
freezing by talking about the worst consequences of not freezing for
an extended period of time. That's relevant, and it's probably going
to end up as the aspect of freezing that we spend most time on, but it
still doesn't seem like a useful starting point to me.

To me this seems related to the fallacy that relfrozenxid age is any
kind of indicator about how far behind we are on freezing. I think
that there is value in talking about freezing as a maintenance task
for physical heap pages, and only then talking about relfrozenxid and
the circular XID space. The 64-bit XID patch doesn't get rid of
freezing at all, because it is still needed to break the dependency of
tuples stored in heap pages on the pg_xact, and other SLRUs -- which
suggests that you can talk about freezing and advancing relfrozenxid
as different (though still closely related) concepts.

* I'm still having a hard time with vacuum_freeze_strategy_threshold.
Part of it is the name, which doesn't seem to convey the meaning.

I chose the name long ago, and never gave it terribly much thought.
I'm happy to go with whatever name you prefer.

But the heuristic also seems off to me. What if you have lots of partitions
in an append-only range-partitioned table? That would tend to use the
lazy freezing strategy (because each partition is small), but that's
not what you want. I understand heuristics aren't perfect, but it feels
like we could do something better.

It is at least vastly superior to vacuum_freeze_min_age in cases like
this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
trigger freezing in any autovacuum given a table like pgbench_history
(barring during aggressive mode), due to how it interacts with the
visibility map. So we're practically guaranteed to do literally all
freezing for an append-only table in an aggressive mode VACUUM.

Worst of all, that happens on a timeline that has nothing to do with
the physical characteristics of the table itself (like the number of
unfrozen heap pages or something). In fact, it doesn't even have
anything to do with how many distinct XIDs modified that particular
table -- XID age works at the system level.

By working at the heap rel level (which means the partition level if
it's a partitioned table), and by being based on physical units (table
size), vacuum_freeze_strategy_threshold at least manages to limit the
accumulation of unfrozen heap pages in each individual relation. This
is the fundamental unit at which VACUUM operates. So even if you get
very unlucky and accumulate many unfrozen heap pages that happen to be
distributed across many different tables, you can at least vacuum each
table independently, and in parallel. The really big problems all seem
to involve concentration of unfrozen pages in one particular table
(usually the events table, the largest table in the system by a couple
of orders of magnitude).

That said, I agree that the system-level picture of debt (the system
level view of the number of unfrozen heap pages) is relevant, and that
it isn't directly considered by the patch. I think that that can be
treated as work for a future release. In fact, I think that there is a
great deal that we could teach autovacuum.c about the system level
view of things -- this is only one.

Also, another purpose of this seems
to be to achieve v15 behavior (if v16 behavior causes a problem for
some workload), which seems like a good idea, but perhaps we should
have a more direct setting for that?

Why, though? I think that it happens to make sense to do both with one
setting. Not because it's better to have 2 settings than 1 (though it
is) -- just because it makes sense here, given these specifics.

* The comment above lazy_scan_strategy() is phrased in terms of the
"traditional approach". It would be more clear if you described the
current strategies and how they're chosen. The pre-16 behavior was as
lazy as possible, so that's easy enough to describe without referring
to history.

Agreed. Will fix.

* "eager skipping behavior" seems like a weird phrasing because it's
not immediately clear if that means "skip more pages" (eager to skip
pages and lazy to process them) or "skip fewer pages" (lazy to skip the
pages and eager to process the pages).

I agree that that's a problem. I'll try to come up with a terminology
that doesn't have this problem ahead of the next version.

* The skipping behavior is for all-visible pages is binary: skip them
all, or skip none. That makes sense in the context of relfrozenxid
advancement. But how does that avoid IO spikes? It would seem perfectly
reasonable to me, if relfrozenxid advancement is not a pressing
problem, to process some fraction of the all-visible pages (or perhaps
process enough of them to freeze some fraction).

That's something that v9 will do, unlike earlier versions. So I agree.

In particular, we'll now start freezing eagerly before we switch over
to preferring to advance relfrozenxid for the same table. As I said in
my summary of v9 the other day, we "stagger" the point at which these
two behaviors are first applied, with the goal of smoothing the
transition. We try to disguise the fact that there are still two
different sets of behavior. We try to get the best of both worlds
(eager and lazy behaviors), without the user ever really noticing.

Don't forget that eager behavior with the visibility map is expected
to directly lead to freezing more pages (not a guarantee, but quite
likely). So while skipping strategy and freezing strategy are two
independent things, they're independent in name only, mechanically.
They are not independent things in any practical sense. (The
underlying reason why that is true is of course the same reason why
vacuum_freeze_min_age only really works as designed in aggressive mode
VACUUMs.)

each VACUUM makes a payment on the deferred costs of freezing. I think
this has already been discussed but it keeps reappearing in my mind, so
maybe we can settle this with a comment (and/or docs)?

That said, I believe that we should always advance relfrozenxid in
tables that are already moderately sized -- a table that is already
big enough to be some small multiple of
vacuum_freeze_strategy_threshold should always take an eager approach
to advancing relfrozenxid. That is, I don't think that it makes sense
to pay the cost of freezing down incrementally given a moderately
large table.

Large tables and small tables are qualitatively different things, at
least from a VACUUM point of view. To some degree we can afford to be
wrong about small tables, because that won't cause us any serious
pain. This isn't really true with larger tables -- a VACUUM of a large
table is "too big to fail". Our working assumption for tables that are
still growing now, in the ongoing VACUUM, is that they will continue
to grow.

There is often one very large table, and by the time the next VACUUM
comes around, the table may have accumulated more unfrozen pages than
the entire rest of the database combined (I mean all of the rest of
the database, frozen and unfrozen pages alike). This may even be
common:

https://brandur.org/fragments/events

* I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
doesn't take effect unless the page is not skipped, which is confusing
from a usability standpoint, and we have better heuristics to decide if
the whole page should be frozen or not anyway (i.e. if an FPI was
already taken then freezing is cheaper).

I think that vacuum_freeze_min_age still has a role to play. The only
thing that can trigger freezing during a VACUUM that opts to use a
lazy strategy VACUUM is the FPI-from-pruning trigger mechanism (new to
v9), plus vacuum_freeze_min_age/FreezeLimit. So you cannot really have
a lazy strategy without vacuum_freeze_min_age. The original
vacuum_freeze_min_age design did make sense, at least
pre-visibility-map, because sometimes being lazy about freezing is the
best strategy. Especially with small, frequently updated tables like
most of the pgbench tables.

There is nothing inherently wrong with deciding to freeze (or even to
wait for a cleanup lock) on the basis of a given XID's age. My problem
isn't with that behavior in general. It's with the fact that we use it
even when it's clearly inappropriate -- wildly inappropriate. We have
plenty of information that strongly hints at whether or not laziness
is a good idea. It's a good idea whenever laziness has a decent chance
of avoiding completely unnecessary work altogether, provided we can
afford to be wrong about that without having to pay too high a cost
later on, when we have to course correct. What this mostly boils down
to is this: lazy freezing is generally a good idea in small tables
only.

--
Peter Geoghegan

#40 John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#39)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Dec 13, 2022 at 8:00 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis <pgsql@j-davis.com> wrote:

But the heuristic also seems off to me. What if you have lots of partitions
in an append-only range-partitioned table? That would tend to use the
lazy freezing strategy (because each partition is small), but that's
not what you want. I understand heuristics aren't perfect, but it feels
like we could do something better.

It is at least vastly superior to vacuum_freeze_min_age in cases like
this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
trigger freezing in any autovacuum given a table like pgbench_history
(barring during aggressive mode), due to how it interacts with the
visibility map. So we're practically guaranteed to do literally all
freezing for an append-only table in an aggressive mode VACUUM.

Worst of all, that happens on a timeline that has nothing to do with
the physical characteristics of the table itself (like the number of
unfrozen heap pages or something).

If the number of unfrozen heap pages is the thing we care about, perhaps
that, and not the total size of the table, should be the parameter that
drives freezing strategy?

That said, I agree that the system-level picture of debt (the system
level view of the number of unfrozen heap pages) is relevant, and that
it isn't directly considered by the patch. I think that that can be
treated as work for a future release. In fact, I think that there is a
great deal that we could teach autovacuum.c about the system level
view of things -- this is only one.

It seems an easier path to considering system-level debt (as measured by
unfrozen heap pages) would be to start with considering table-level debt
measured the same way.

--
John Naylor
EDB: http://www.enterprisedb.com

In reply to: John Naylor (#40)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Dec 13, 2022 at 12:29 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

If the number of unfrozen heap pages is the thing we care about, perhaps that, and not the total size of the table, should be the parameter that drives freezing strategy?

That's not the only thing we care about, though. And to the extent we
care about it, we mostly care about the consequences of either
freezing or not freezing eagerly. Concentration of unfrozen pages in
one particular table is a lot more of a concern than the same number
of heap pages being spread out across multiple tables. Those tables
can all be independently vacuumed, and come with their own
relfrozenxid, that can be advanced independently, and are very likely
to be frozen as part of a vacuum that needed to happen anyway.

Pages become frozen pages because VACUUM freezes those pages. Same
with all-visible pages, which could in principle have been made
all-frozen instead, had VACUUM opted to do it that way back when it
processed the page. So VACUUM is not a passive, neutral observer here.
What happens over time and across multiple VACUUM operations is very
relevant. VACUUM needs to pick up where it left off last time, at
least with larger tables, where the time between VACUUMs is naturally
very high, and where each individual VACUUM has to process a huge
number of individual pages. It's not really practical to take a "wait
and see" approach with big tables.

At the very least, a given VACUUM operation has to choose its freezing
strategy based on how it expects the table will look when it's done
vacuuming the table, and how that will impact the next VACUUM against
the same table. Without that, vacuuming an append-only table will
fall into a pattern of setting pages all-visible in one vacuum, and
then freezing those same pages all-frozen in the very next vacuum
because there are too many. Which makes little sense; we're far better
off freezing the pages at the earliest opportunity instead.

We're going to have to write a WAL record for the visibility map
anyway, so doing everything at the same time has a lot to recommend
it. Even if it turns out to be quite wrong, we may still come out
ahead in terms of absolute volume of WAL written, and especially in
terms of performance stability. To a limited extent we need to reason
about what will happen in the near future. But we also need to reason
about which kinds of mispredictions we cannot afford to make, and
which kinds are okay. Some mistakes hurt a lot more than others.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#41)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

That's not the only thing we care about, though. And to the extent we
care about it, we mostly care about the consequences of either
freezing or not freezing eagerly. Concentration of unfrozen pages in
one particular table is a lot more of a concern than the same number
of heap pages being spread out across multiple tables. Those tables
can all be independently vacuumed, and come with their own
relfrozenxid, that can be advanced independently, and are very likely
to be frozen as part of a vacuum that needed to happen anyway.

At the suggestion of Jeff, I wrote a Wiki page that shows motivating
examples for the patch series:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples

These are all cases where VACUUM currently doesn't do the right thing
around freezing, in a way that is greatly ameliorated by the patch.
Perhaps this will help other hackers to understand the motivation
behind some of these mechanisms. There are plenty of details that only
make sense in the context of a certain kind of table, with certain
performance characteristics that the design is sensitive to, and seeks
to take advantage of in one way or another.

--
Peter Geoghegan

#43Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#42)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, 14 Dec 2022 at 00:07, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

That's not the only thing we care about, though. And to the extent we
care about it, we mostly care about the consequences of either
freezing or not freezing eagerly. Concentration of unfrozen pages in
one particular table is a lot more of a concern than the same number
of heap pages being spread out across multiple tables. Those tables
can all be independently vacuumed, and come with their own
relfrozenxid, that can be advanced independently, and are very likely
to be frozen as part of a vacuum that needed to happen anyway.

At the suggestion of Jeff, I wrote a Wiki page that shows motivating
examples for the patch series:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples

These are all cases where VACUUM currently doesn't do the right thing
around freezing, in a way that is greatly ameliorated by the patch.
Perhaps this will help other hackers to understand the motivation
behind some of these mechanisms. There are plenty of details that only
make sense in the context of a certain kind of table, with certain
performance characteristics that the design is sensitive to, and seeks
to take advantage of in one way or another.

In this mentioned wiki page, section "Simple append-only", the
following is written:

Our "transition from lazy to eager strategies" concludes with an autovacuum that actually advanced relfrozenxid eagerly:

automatic vacuum of table "regression.public.pgbench_history": index scans: 0
pages: 0 removed, 1078444 remain, 561143 scanned (52.03% of total)
[...]
frozen: 560841 pages from table (52.00% of total) had 88051825 tuples frozen
[...]
WAL usage: 1121683 records, 557662 full page images, 4632208091 bytes

I think that this 'transition from lazy to eager' could benefit from a
limit on how many all_visible blocks each autovacuum iteration can
freeze. This first run of (auto)vacuum after the 8GB threshold seems
to appear as a significant IO event (both in WAL and relation
read/write traffic) with 50% of the table updated and WAL-logged. I
think this should be limited to some degree, such as only freeze
all_visible blocks up to 10% of the table's blocks in eager vacuum, so
that the load is spread across a larger time frame and more VACUUM
runs.
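
Concretely, I'm imagining something along these lines (a minimal
sketch with made-up names, not actual patch code):

/*
 * Illustrative only: cap catch-up freezing of previously all-visible
 * pages at 10% of the table per eager VACUUM, deferring the rest to
 * later VACUUM runs.
 */
static bool
catchup_freeze_budget_exhausted(BlockNumber rel_pages,
								BlockNumber frozen_from_all_visible)
{
	BlockNumber cap = rel_pages / 10;	/* 10% of the table's blocks */

	return frozen_from_all_visible >= cap;
}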

Kind regards,

Matthias van de Meent.

In reply to: Matthias van de Meent (#43)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Dec 15, 2022 at 6:50 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

This first run of (auto)vacuum after the 8GB threshold seems
to appear as a significant IO event (both in WAL and relation
read/write traffic) with 50% of the table updated and WAL-logged. I
think this should be limited to some degree, such as only freeze
all_visible blocks up to 10% of the table's blocks in eager vacuum, so
that the load is spread across a larger time frame and more VACUUM
runs.

I agree that the burden of catch-up freezing is excessive here (in
fact I already wrote something to that effect on the wiki page). The
likely solution can be simple enough.

In v9 of the patch, we switch over to eager freezing when table size
crosses 4GB (since that is the value of the
vacuum_freeze_strategy_threshold GUC). The catch up freezing that you
draw attention to here occurs when table size exceeds 8GB, which is a
separate physical table size threshold that forces eager relfrozenxid
advancement. The second threshold is hard-coded to 2x the first one.

I think that this issue can be addressed by making the second
threshold 4x or even 8x vacuum_freeze_strategy_threshold, not just 2x.
That would mean that we'd have to freeze just as many pages whenever
we did the catch-up freezing -- so no change in the added *absolute*
cost of freezing. But, the *relative* cost would be much lower, simply
because catch-up freezing would take place when the table was much
larger. So it would be a lot less noticeable.
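
To sketch the relationship (illustrative names only, not the patch's
actual code; the multiplier is hard-coded to 2 in v9, with 4 or 8
being what I'm suggesting here):

#define EAGER_SCAN_MULTIPLIER	4

static void
choose_thresholds_sketch(BlockNumber rel_pages,
						 BlockNumber freeze_strategy_threshold_pages,
						 bool *eager_freeze_strategy,
						 bool *force_eager_scan)
{
	/* first threshold: switch to the eager freezing strategy */
	*eager_freeze_strategy = rel_pages >= freeze_strategy_threshold_pages;

	/* second threshold: force eager relfrozenxid advancement */
	*force_eager_scan = rel_pages >=
		freeze_strategy_threshold_pages * EAGER_SCAN_MULTIPLIER;
}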

Note that we might never reach the second table size threshold before
we must advance relfrozenxid, in any case. The catch-up freezing might
actually take place because table age created pressure to advance
relfrozenxid. It's useful to have a purely physical/table-size
threshold like this, especially in bulk loading scenarios. But it's
not like table age doesn't have any influence at all, anymore. The
cost model weighs physical units/costs as well as table age, and in
general the most likely trigger for advancing relfrozenxid is usually
some combination of the two, not any one factor on its own [1].

[1]: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads
--
Peter Geoghegan

#45Justin Pryzby
pryzby@telsasoft.com
In reply to: Peter Geoghegan (#37)
Re: New strategies for freezing, advancing relfrozenxid early

The patches (003 and 005) are missing a word
should use to decide whether to its eager freezing strategy.

On the wiki, missing a word:
builds on related added

In reply to: Justin Pryzby (#45)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Dec 15, 2022 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

The patches (003 and 005) are missing a word
should use to decide whether to its eager freezing strategy.

I mangled this during rebasing for v9, which reordered the commits.
Will be fixed in v10.

On the wiki, missing a word:
builds on related added

Fixed.

Thanks
--
Peter Geoghegan

#47John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#42)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Dec 14, 2022 at 6:07 AM Peter Geoghegan <pg@bowt.ie> wrote:

At the suggestion of Jeff, I wrote a Wiki page that shows motivating
examples for the patch series:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples

These are all cases where VACUUM currently doesn't do the right thing
around freezing, in a way that is greatly ameliorated by the patch.
Perhaps this will help other hackers to understand the motivation
behind some of these mechanisms. There are plenty of details that only
make sense in the context of a certain kind of table, with certain
performance characteristics that the design is sensitive to, and seeks
to take advantage of in one way or another.

Thanks for this. This is the kind of concrete, data-based evidence that I
find much more convincing, or at least easy to reason about. I'd actually
recommend in the future to open discussion with this kind of analysis --
even before coding, it's possible to indicate what a design is *intended*
to achieve. And reviewers can likewise bring up cases of their own in a
concrete fashion.

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

At the very least, a given VACUUM operation has to choose its freezing
strategy based on how it expects the table will look when it's done
vacuuming the table, and how that will impact the next VACUUM against
the same table. Without that, vacuuming an append-only table will
fall into a pattern of setting pages all-visible in one vacuum, and
then freezing those same pages all-frozen in the very next vacuum
because there are too many. Which makes little sense; we're far better
off freezing the pages at the earliest opportunity instead.

That makes sense, but I wonder if we can actually be more specific: One
motivating example mentioned is the append-only table. If we detected that
case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist,
we could use that information as one way to drive eager freezing
independently of size. At least in theory -- it's very possible size will
be a necessary part of the decision, but it's less clear that it's as
useful as a user-tunable knob.

If we then ignored the append-only case when evaluating a freezing policy,
maybe other ideas will fall out. I don't have a well-thought out idea about
policy or knobs, but it's worth thinking about.

Aside from that, I've only given the patches a brief reading. Having seen
the VM snapshot in practice (under "Scanned pages, visibility map snapshot"
in the wiki page), it's neat to see fewer pages being scanned. Prefetching
not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates
asynchronous IO. Keeping only one VM snapshot page in memory makes perfect
sense.

I do have a cosmetic, but broad-reaching, nitpick about terms regarding
"skipping strategy". That's phrased as a kind of negative -- what we're
*not* doing. Many times I had to pause and compute in my head what we're
*doing*, i.e. the "scanning strategy". For example, I wonder if the VM
strategies would be easier to read as:

VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL

Notice here they're listed in order of increasing eagerness.
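
Or, as a C enum sketch (just to visualize the proposal -- the actual
declaration in the patch may differ):

typedef enum vmstrategy
{
	VMSNAP_SCAN_LAZY,			/* skip all-visible and all-frozen pages */
	VMSNAP_SCAN_EAGER,			/* skip only all-frozen pages */
	VMSNAP_SCAN_ALL				/* skip nothing */
} vmstrategy;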

--
John Naylor
EDB: http://www.enterprisedb.com

#48Nikita Malakhov
hukutoc@gmail.com
In reply to: John Naylor (#47)
Re: New strategies for freezing, advancing relfrozenxid early

Hi!

I've found this discussion very interesting, since vacuuming
TOAST tables is always a problem: these tables tend to bloat
very quickly with dead data. Just as a reminder, all TOAST-able
columns of a relation share that relation's single TOAST table,
and TOASTed data is never updated in place - there are only
insert and delete operations.

Have you tested it with large and constantly used TOAST tables?
How would it work with the current TOAST implementation?

We propose a different approach to the TOAST mechanics [1],
and a new vacuum would be very promising.

Thank you!

[1]: https://commitfest.postgresql.org/41/3490/

--
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/

In reply to: John Naylor (#47)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Dec 15, 2022 at 11:48 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Thanks for this. This is the kind of concrete, data-based evidence that I find much more convincing, or at least easy to reason about.

I'm glad to hear that it helped. It's always difficult to judge where
other people are coming from, especially when it's not clear how much
context is shared. Face time would have helped here, too.

One motivating example mentioned is the append-only table. If we detected that case, which I assume we can because autovacuum_vacuum_insert_* GUCs exist, we could use that information as one way to drive eager freezing independently of size. At least in theory -- it's very possible size will be a necessary part of the decision, but it's less clear that it's as useful as a user-tunable knob.

I am not strongly opposed to that idea, though I have my doubts about
it. I have thought about it already, and it wouldn't be hard to get
the information to vacuumlazy.c (I plan on doing it as part of related
work on antiwraparound autovacuum, in fact [1]). I'm skeptical of the
general idea that autovacuum.c has enough reliable information to give
detailed recommendations as to how vacuumlazy.c should process the
table.

I have pointed out several major flaws with the autovacuum.c dead
tuple accounting in the past [2][3], but I also think that there are
significant problems with the tuples inserted accounting. Basically, I
think that there are effects which are arguably an example of the
inspection paradox [4]. Insert-based autovacuums occur on a timeline
determined by the "inserted since last autovacuum" statistics. These
statistics are (in part) maintained by autovacuum/VACUUM itself. Which
has no specific understanding of how it might end up chasing its own
tail.

Let me be more concrete about what I mean about autovacuum chasing its
own tail. The autovacuum_vacuum_insert_threshold mechanism works by
triggering an autovacuum whenever the number of tuples inserted since
the last autovacuum/VACUUM reaches a certain threshold -- usually some
fixed proportion of pg_class.reltuples. But the
tuples-inserted-since-last-VACUUM counter gets reset at the end of
VACUUM, not at the start. Whereas VACUUM itself processes only the
subset of pages that needed to be vacuumed at the start of the VACUUM.
There is no attempt to compensate for that disparity. This *isn't*
really a measure of "unvacuumed tuples" (you'd need to compensate to
get that).
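
To make the disparity concrete, the trigger condition is roughly the
following (a simplified paraphrase of relation_needs_vacanalyze() in
autovacuum.c, not the exact source; the threshold/scale parameters
stand in for the corresponding autovacuum_vacuum_insert_* GUCs):

static bool
insert_autovacuum_triggered(float4 reltuples, int64 instuples,
							int insert_threshold, float8 insert_scale_factor)
{
	/* threshold scales with the table's last-known tuple count */
	float4		vacinsthresh;

	vacinsthresh = insert_threshold + insert_scale_factor * reltuples;

	/*
	 * instuples is the "inserted since last vacuum" counter, which gets
	 * zeroed at the *end* of each VACUUM -- regardless of which pages
	 * that VACUUM actually ended up processing.
	 */
	return instuples > vacinsthresh;
}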

This "at the start vs at the end" difference won't matter at all with
smaller tables. And even in larger tables we might hope that the
effect would kind of average out. But what about cases where one
particular VACUUM operation takes an unusually long time, out of a
sequence of successive VACUUMs that run against the same table? For
example, the sequence that you see on the Wiki page, when Postgres
HEAD autovacuum does an aggressive VACUUM on one occasion, which takes
dramatically longer [5].

Notice that the sequence in [5] shows that the patch does one more
autovacuum operation in total, compared to HEAD/master. That's a lot
more -- we're talking about VACUUMs that each take 40+ minutes. That
can be explained by the fact that VACUUM (quite naturally) resets the
"tuples inserted since last VACUUM" at the end of that unusually long
running aggressive autovacuum -- just like any other VACUUM would.
That seems very weird to me. If (say) we happened to have a much
higher vacuum_freeze_table_age setting, then we wouldn't have had an
aggressive VACUUM until much later on (or never, because the benchmark
would just end). And the VACUUM that was aggressive would have been a
regular VACUUM instead, and would therefore have completed far sooner,
and would therefore have had a *totally* different cadence, compared
to what we actually saw -- it becomes distorted in a way that outlasts
the aggressive VACUUM.

With a far higher vacuum_freeze_table_age, we might have even managed
to do two regular autovacuums in the same period that it took a single
aggressive VACUUM to run in (that's not too far from what actually
happened with the patch). The *second* regular autovacuum would then
end up resetting the "inserted since last VACUUM" counter to 0 at the
same time as the long running aggressive VACUUM actually did so (same
wall clock time, same time since the start of the benchmark). Notice
that we'll have done much less useful work (on cleaning up bloat and
setting newer pages all-visible) with the "one long aggressive mode
VACUUM" setup/scenario -- we'll be way behind -- but the statistics
will nevertheless look about the same as they do in the "two fast
autovacuums instead of one slow autovacuum" counterfactual scenario.

In short, autovacuum.c fails to appreciate that a lot of stuff about
the table changes when VACUUM runs. Time hasn't stood still -- the
table was modified and extended throughout. So autovacuum.c hasn't
compensated for how VACUUM actually performed, and, in effect, forgets
how far it has fallen behind. It should be eager to start the nex
autovacuum very quickly, having fallen behind, but it isn't eager.
This is all the more reason to get rid of aggressive mode, but that's
not my point -- my point is that the statistics driving things seem
quite dubious, in all sorts of ways.

Aside from that, I've only given the patches a brief reading.

Thanks for taking a look.

Having seen the VM snapshot in practice (under "Scanned pages, visibility map snapshot" in the wiki page), it's neat to see fewer pages being scanned. Prefetching not only seems superior to SKIP_PAGES_THRESHOLD, but anticipates asynchronous IO.

All of that is true, but more than anything else the VM snapshot
concept appeals to me because it seems to make VACUUMs of large tables
more similar to VACUUMs of small tables. Particularly when one
individual VACUUM happens to take an unusually long amount of time,
for whatever reason (best example right now is aggressive mode, but
there are other ways in which VACUUM can take far longer than
expected). That approach seems much more logical. I also think that
it'll make it easier to teach VACUUM to "pick up where the last VACUUM
left off" in the future.

I understand why you haven't seriously investigated using the same
information for the Radix tree dead_items project. I certainly don't
object. But I still think that having one integrated data structure
(VM snapshots + dead_items) is worth exploring in the future. It's
something that I think is quite promising.

I do have a cosmetic, but broad-reaching, nitpick about terms regarding "skipping strategy". That's phrased as a kind of negative -- what we're *not* doing. Many times I had to pause and compute in my head what we're *doing*, i.e. the "scanning strategy". For example, I wonder if the VM strategies would be easier to read as:

VMSNAP_SKIP_ALL_VISIBLE -> VMSNAP_SCAN_LAZY
VMSNAP_SKIP_ALL_FROZEN -> VMSNAP_SCAN_EAGER
VMSNAP_SKIP_NONE -> VMSNAP_SCAN_ALL

Notice here they're listed in order of increasing eagerness.

I agree that the terminology around skipping strategies is confusing,
and plan to address that in the next version. I'll consider using this
scheme for v10.

[1]: https://commitfest.postgresql.org/41/4027/
[2]: /messages/by-id/CAH2-Wz=MGFwJEpEjVzXwEjY5yx=UuNPzA6Bt4DSMasrGLUq9YA@mail.gmail.com
[3]: /messages/by-id/CAH2-WznrZC-oHkB+QZQS65o+8_Jtj6RXadjh+8EBqjrD1f8FQQ@mail.gmail.com
[4]: https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709
[5]: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Scanned_pages.2C_visibility_map_snapshot
--
Peter Geoghegan

In reply to: Nikita Malakhov (#48)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Dec 15, 2022 at 11:59 PM Nikita Malakhov <hukutoc@gmail.com> wrote:

I've found this discussion very interesting, since vacuuming
TOAST tables is always a problem: these tables tend to bloat
very quickly with dead data. Just as a reminder, all TOAST-able
columns of a relation share that relation's single TOAST table,
and TOASTed data is never updated in place - there are only
insert and delete operations.

I don't think that it would be any different to any other table that
happened to have lots of inserts and deletes, such as the table
described here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Mixed_inserts_and_deletes

In the real world, a table like this would probably consist of some
completely static data, combined with other data that is constantly
deleted and re-inserted -- probably only a small fraction of the table
at any one time. I would expect such a table to work quite well,
because the static pages would all become frozen (at least after a
while), leaving behind only the tuples that are deleted quickly, most
of the time. VACUUM would have a decent chance of noticing that it
will be cheap to advance relfrozenxid in earlier VACUUM operations, as
bloat is cleaned up -- even a VACUUM that happens long before the
point that autovacuum.c will launch an antiwraparound autovacuum has a
decent chance of it. That's not a new idea, really; the
pgbench_branches example from the Wiki page looks like that already,
and even works on Postgres 15.

Here is the part that's new: the pressure to advance relfrozenxid
grows gradually, as table age grows. If table age is still very young,
then we'll only do it if the number of "extra" scanned pages is < 5%
of rel_pages -- only when the added cost is very low (again, like the
pgbench_branches example, mostly). Once table age gets about halfway
towards the point that antiwraparound autovacuuming is required,
VACUUM gradually starts caring less about the costs, and more about
the need to advance relfrozenxid. Ideally, that advancement will
happen before an antiwraparound autovacuum is actually required.
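
Roughly, the shape of that behavior is as follows (illustrative
constants and names only -- this isn't the patch's actual cost model):

static bool
advance_relfrozenxid_sketch(double tableagefrac,
							BlockNumber extra_scanned_pages,
							BlockNumber rel_pages)
{
	/* tolerate ~5% "extra" scanned pages while the table is young */
	double		max_extra_frac = 0.05;

	/* past roughly the halfway point, ramp the tolerance up to 100% */
	if (tableagefrac > 0.5)
		max_extra_frac += (tableagefrac - 0.5) * 2.0 * 0.95;

	return extra_scanned_pages < max_extra_frac * rel_pages;
}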

I'm not sure how much this would help with bloat. I suspect that it
could make a big difference with the right workload. If you always
need frequent autovacuums, just to deal with bloat, then there is
never a good time to run an aggressive antiwraparound autovacuum. An
aggressive AV will probably end up taking much longer than the typical
autovacuum that deals with bloat. While the aggressive AV will remove
as much bloat as any other AV, in theory, that might not help much. If
the aggressive AV takes as long as (say) 5 regular autovacuums would
have taken, and if you really needed those 5 separate autovacuums to
run, just to deal with the bloat, then that's a real problem. The
aggressive AV effectively causes bloat with such a workload.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#44)
5 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Dec 15, 2022 at 10:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

I agree that the burden of catch-up freezing is excessive here (in
fact I already wrote something to that effect on the wiki page). The
likely solution can be simple enough.

Attached is v10, which fixes this issue, but using a different
approach to the one I sketched here.

This revision also changes the terminology around VM skipping: we now
call the strategies there "scanning strategies", per feedback from
Jeff and John. This does seem a lot clearer.

Also cleaned up the docs a little bit, which were messed up by a
rebasing issue in v9.

I ended up fixing the aforementioned "too much catch-up freezing"
issue by just getting rid of the whole concept of a second table-size
threshold that forces the eager scanning strategy. I now believe that
it's fine to just rely on the generic logic that determines scanning
strategy based on a combination of table age and the added cost of
eager scanning. It'll work in a way that doesn't result in too much of
a freezing spike during any one VACUUM operation, without waiting
until an antiwraparound autovacuum to advance relfrozenxid (it'll
happen far earlier than that, though still quite a lot later than what
you'd see with v9, so as to avoid that big spike in freezing that was
possible in pgbench_history-like tables [1]).

This means that vacuum_freeze_strategy_threshold is now strictly
concerned with freezing. A table that is always frozen eagerly will
inevitably fall into a pattern of advancing relfrozenxid in every
VACUUM operation, but that isn't something that needs to be documented
or anything. We don't need to introduce a special case here.

The other notable change for v10 is in the final patch, which removes
aggressive mode altogether. v10 now makes lazy_scan_noprune less
willing to give up on setting relfrozenxid to a relatively recent XID.
Now lazy_scan_noprune is willing to wait a short while for a cleanup
lock on a heap page (a few tens of milliseconds) when doing so might
be all it takes to preserve VACUUM's ability to advance relfrozenxid
all the way up to FreezeLimit, which is the traditional guarantee made
by aggressive mode VACUUM.

This makes lazy_scan_noprune "under promise and over deliver". It now
only promises to advance relfrozenxid up to MinXid in the very worst
case -- even if that means waiting indefinitely long for a cleanup
lock. That's not a very strong promise, because advancing relfrozenxid
up to MinXid is only barely adequate. At the same time,
lazy_scan_noprune is willing to go to extra trouble to preserve its
ability to advance relfrozenxid up to FreezeLimit -- it'll wait a few
tens of milliseconds for the cleanup lock. It's just not willing to
wait indefinitely. This seems likely to give us the best of both worlds.

This was based in part on something that Andres said about cleanup
locks a while back. He had a concern about cases where even MinXid was
before OldestXmin. To some degree that's addressed here, because I've
also changed the way that MinXid is determined, so that it'll be a
much earlier value. That doesn't have much downside now, because of the
way that lazy_scan_noprune is now "aggressive-ish" when that happens to
make sense.

Not being able to get a cleanup lock on our first attempt is relatively
rare, and when it happens it's often something completely benign. For
example, it might just be that the checkpointer was writing out the
same page at the time, which signifies nothing about it really being
hard to get a cleanup lock -- the checkpointer will have dropped its
conflicting buffer pin almost immediately. It would be a shame to
accept a significantly older final relfrozenxid during an infrequent,
long running antiwraparound autovacuum of larger tables when that
happens -- we should be willing to wait 30 milliseconds (just not 30
minutes, or 30 days).

None of this even comes up for pages whose XIDs are >= FreezeLimit,
which is actually most pages with the patch, even in larger tables.
It's relatively rare for VACUUM to need to process any heap page in
lazy_scan_noprune, but it'll be much rarer still for it to have to do
a "short wait" like this. So "short waits" have a very small downside,
and (at least occasionally) a huge upside.

By inventing a third alternative behavior (to go along with processing
pages via standard lazy_scan_noprune skipping and processing pages in
lazy_scan_prune), VACUUM has the flexibility to respond in a way
that's proportionate to the problem at hand, in one particular heap
page. The new behavior has zero chance of mattering in most individual
tables/workloads, but it's good to have every possible eventuality
covered. I really hate the idea of getting a significantly worse
outcome just because of something that happened in one single heap
page, because the wind changed directions at the wrong time.

[1]: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch

--
Peter Geoghegan

Attachments:

v10-0005-Finish-removing-aggressive-mode-VACUUM.patch (application/x-patch)
From 3dafc38a45545e39aae529cb1cf2190a5e56dcb2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v10 5/5] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient (especially after 9.6's commit fd31cd26),
its naive approach had one notable advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

VACUUM now places particular emphasis on performance stability.  The
burden of freezing physical heap pages is now more or less spread out as
much as possible.  Each table's age will now tend to follow what VACUUM
does, rather than having VACUUM's behavior driven by table age.  The
table age tail no longer wags the VACUUM dog.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of waiting for a
cleanup lock in the event of not being able to get one right away (to
make sure that older XIDs get frozen during the ongoing VACUUM).  All
that changes is the cutoffs -- the timeline.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now understands the importance of never falling too far
behind on the work of freezing physical heap pages at the level of the
whole table.  Prior to Postgres 16, VACUUM tended to do all freezing and
relfrozenxid advancement in aggressive mode, especially in large tables.
Aggressive VACUUM had to advance the table's relfrozenxid by relatively
many XIDs (up to FreezeLimit, not just up to MinXid) because table age
was more or less treated as a proxy for freeze debt.  It would therefore
have been risky for aggressive VACUUM to "squander" any opportunity at
advancing relfrozenxid (by accepting a much older final value, say).
But since we now freeze much more eagerly, opportunities to advance
relfrozenxid (at least by some small amount) are much more plentiful.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount when it really can be avoided fairly easily, without promising
anything.  Advancing up to FreezeLimit/MultiXactCutoff in all cases
(regardless of wait duration) comes with significant risks of its own.
VACUUM still promises to advance up to MinXid/MinMulti because that is
at least a proportionate response, needed only when table age really is
the issue (not falling behind on freezing physical heap pages).

There are still antiwraparound autovacuums, but they're now little more
than another way that autovacuum.c can launch an autovacuum worker to
run VACUUM.

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make all this safe.  See also, related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/vacuumlazy.c          | 221 +++---
 src/backend/commands/vacuum.c                 |  42 +-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/config.sgml                      |  10 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 doc/src/sgml/xact.sgml                        |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  30 +-
 14 files changed, 555 insertions(+), 530 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4dd31bd3f..d21b7fb28 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -276,6 +276,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
@@ -348,7 +355,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0619d54dd..df2cac24d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -156,8 +156,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -261,7 +259,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -458,7 +457,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -536,17 +535,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -554,7 +550,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid when lazy_scan_strategy call
 		 * decided to skip all-visible pages
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -623,33 +618,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
 							 vacrel->relnamespace,
@@ -949,6 +925,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -962,10 +939,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -974,21 +949,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1425,8 +1393,6 @@ lazy_scan_strategy(LVRelState *vacrel, const VacuumParams *params)
 	 */
 	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
-	else if (vacrel->aggressive)
-		vacrel->vmstrat = VMSNAP_SCAN_EAGER;
 
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
@@ -2000,17 +1966,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Else returns false, indicating
+ * that page must be processed by lazy_scan_prune in the usual way after all.
+ * Acquires a cleanup lock on buf/page for caller before returning false.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2018,7 +1999,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2026,6 +2008,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2035,6 +2018,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2076,34 +2060,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2152,10 +2109,98 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 *
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when it is
+		 * actually expected to preserve current NewRelfrozenXid/NewRelminMxid
+		 * tracker values, and when trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								   vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								 vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+		}
+
+		/* Accept reduced processing for this page after all */
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5085d9407..f4429e320 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -916,13 +916,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1092,6 +1087,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/XMID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1109,8 +1137,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f9788c30a..0c80896cc 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table, however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f36594e7c..6381cd8fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8233,7 +8233,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8422,7 +8422,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9172,7 +9172,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         billion, <command>VACUUM</command> will silently limit the
         effective value to <xref
          linkend="guc-autovacuum-freeze-max-age"/>. For more
-        information see <xref linkend="vacuum-for-wraparound"/>.
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
@@ -9205,7 +9205,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9261,7 +9261,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        billion, <command>VACUUM</command> will silently limit the
        effective value to <xref
         linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
-       information see <xref linkend="vacuum-for-wraparound"/>.
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing Tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly comprised of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID Address Space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32 bits wide, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+     <title>Managing the 32-bit MultiXactId Address Space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with Transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and Eager Freezing Strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively, it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand, <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for lazy freezing when
+     it is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well as
+     pages that <command>VACUUM</command> considers all visible pages.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing each page is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
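+    <para>
+     To get a rough sense of which freezing strategy is likely to apply
+     to a given table, the current size of its heap relation can be
+     compared against <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/>, for example
+     using a query along these lines (the actual decision is made by
+     <command>VACUUM</command> itself, based on the table's size at the
+     start of each operation):
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       pg_size_pretty(pg_relation_size(c.oid)) as heap_size
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm')
+ORDER BY pg_relation_size(c.oid) DESC;
+</programlisting>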
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     autovacuum must run <command>VACUUM</command> specifically
+     for the purpose of advancing <structfield>relfrozenxid</structfield>,
+     because no <command>VACUUM</command> has been triggered for some
+     time.  In practice, most individual tables will consistently have
+     somewhat recent values through routine vacuuming to clean up old
+     row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering Thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
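+    <para>
+     For example, assuming the default settings (a base threshold of 50
+     and a scale factor of 0.2), a table containing 10,000 rows would
+     be vacuumed once roughly 50 + 0.2 * 10,000 = 2,050 tuples had been
+     updated or deleted since the last <command>VACUUM</command>.
+    </para>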
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
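+    <para>
+     As an example, assuming the default settings (a base insert
+     threshold of 1,000 and an insert scale factor of 0.2), the same
+     10,000 row table would be vacuumed once roughly 1,000 + 0.2 *
+     10,000 = 3,000 tuples had been inserted since the last vacuum.
+    </para>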
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
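+    <para>
+     Again assuming the defaults (a base threshold of 50 and a scale
+     factor of 0.1), a table with 10,000 rows would be analyzed once
+     roughly 50 + 0.1 * 10,000 = 1,050 tuples had been inserted,
+     updated, or deleted since the last <command>ANALYZE</command>.
+    </para>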
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-Wraparound Autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when a smaller
+     table had <command>VACUUM</command> operations that lazily opted
+     not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
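+    <para>
+     The tables holding back a database's
+     <structfield>datfrozenxid</structfield> and
+     <structfield>datminmxid</structfield> can be identified with a
+     query such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       age(c.relfrozenxid) as xid_age,
+       mxid_age(c.relminmxid) as mxid_age
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm')
+ORDER BY age(c.relfrozenxid) DESC
+LIMIT 10;
+</programlisting>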
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c137debb1..d4237ec5d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -156,9 +156,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.
+      all tuples are known to be frozen are always skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -213,7 +215,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/doc/src/sgml/xact.sgml b/doc/src/sgml/xact.sgml
index b467660ee..c4146539f 100644
--- a/doc/src/sgml/xact.sgml
+++ b/doc/src/sgml/xact.sgml
@@ -49,8 +49,8 @@
 
   <para>
    The internal transaction ID type <type>xid</type> is 32 bits wide
-   and <link linkend="vacuum-for-wraparound">wraps around</link> every
-   4 billion transactions. A 32-bit epoch is incremented during each
+   and <link linkend="vacuum-xid-space">wraps around</link> every
+   2 billion transactions. A 32-bit epoch is incremented during each
    wraparound. There is also a 64-bit type <type>xid8</type> which
    includes this epoch and therefore does not wrap around during the
    life of an installation;  it can be converted to xid by casting.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..927410258 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,18 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), without ever being
+# prepared to wait for a cleanup lock (we'll never wait on a cleanup
+# lock because the separate MinXid cutoff for waiting will still be
+# well before FreezeLimit, given our default autovacuum_freeze_max_age).
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +78,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +94,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +105,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +118,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +130,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +138,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.38.1

v10-0002-Add-page-level-freezing-to-VACUUM.patch (application/x-patch)
From 76b6141dc84cb66b80eabe9ab7276abe771be8bb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v10 2/5] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).

Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI as torn page protection.  We'll have
already written a large amount of WAL to do that much, so it's very
likely a good idea to get freezing out of the way for the page early.
This only happens in cases where it will directly lead to marking the
page all-frozen in the visibility map.

FreezeMultiXactId() now uses both FreezeLimit and OldestXmin to decide
how to process MultiXacts (not just FreezeLimit).  We always prefer to
avoid allocating new MultiXacts during VACUUM on general principle.
Page-level freezing can be triggered and use a maximally aggressive XID
cutoff to freeze XIDs (OldestXmin), while using a less aggressive XID
cutoff (FreezeLimit) to determine whether or not members from a Multi
need to be frozen expensively.  VACUUM will process Multis very eagerly
when it's cheap to do so, and very lazily when it's expensive to do so.

We can choose when and how to freeze Multixacts provided we never leave
behind a Multi that's < MultiXactCutoff, or a Multi with one or more XID
members < FreezeLimit.  Provided VACUUM's NewRelfrozenXid/NewRelminMxid
tracking accounts for all this, we are free to choose what to do about
each Multi based on the costs and the benefits.  VACUUM should be just
as capable of avoiding an expensive second pass over each Multi (which
must check the commit status of each member XID) as it was before, even
when page-level freezing is triggered on many pages with recently
allocated MultiXactIds.

Later work will teach VACUUM to explicitly apply distinct lazy and eager
freezing strategies, which are policies governing how each VACUUM
operation decides whether to freeze any given heap page.  This
commit just adds the basic concept of page-level freezing, as well as
the heap prune FPI trigger criteria, which gets applied in every VACUUM
(on systems with full page writes enabled).
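
The overall per-page decision can be summarized with the following
condensed sketch of the logic that the vacuumlazy.c changes below add
to lazy_scan_prune (simplified; see the full hunk for the details):

    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen && prune_fpi))
    {
        /* Freeze the page: the "freeze" trackers become authoritative */
        vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
        vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
        /* ... then execute any prepared freeze plans for the page */
    }
    else
    {
        /* Don't freeze: fall back on the "no freeze" trackers instead */
        vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
        vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
        /* page can still be set all-visible, but never all-frozen */
    }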

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/heapam.h          |  82 +++++-
 src/backend/access/heap/heapam.c     | 388 +++++++++++++++------------
 src/backend/access/heap/pruneheap.c  |  16 +-
 src/backend/access/heap/vacuumlazy.c | 128 ++++++---
 doc/src/sgml/config.sgml             |  11 +-
 5 files changed, 393 insertions(+), 232 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 53eb01176..0782fed14 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -113,6 +113,71 @@ typedef struct HeapTupleFreeze
 	OffsetNumber offset;
 } HeapTupleFreeze;
 
+/*
+ * State used by VACUUM to track the details of freezing all eligible tuples
+ * on a given heap page.
+ *
+ * VACUUM prepares freeze plans for each page via heap_prepare_freeze_tuple
+ * calls (every tuple with storage gets its own call).  This page-level freeze
+ * state is updated across each call, which ultimately determines whether or
+ * not freezing the page is required. (VACUUM freezes the page via a call to
+ * heap_freeze_execute_prepared, which freezes using prepared freeze plans.)
+ *
+ * Aside from the basic question of whether or not freezing will go ahead, the
+ * state also tracks the oldest extant XID/MXID in the table as a whole, for
+ * the purposes of advancing relfrozenxid/relminmxid values in pg_class later
+ * on.  Each heap_prepare_freeze_tuple call pushes NewRelfrozenXid and/or
+ * NewRelminMxid back as required to avoid unsafe final pg_class values.  Any
+ * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
+ * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
+ * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ *
+ * When 'freeze_required' flag isn't set after all tuples are examined, the
+ * final choice on freezing is made by vacuumlazy.c.  It can decide to trigger
+ * freezing based on whatever criteria it deems appropriate.  However, it is
+ * highly recommended that vacuumlazy.c avoid freezing any page that cannot be
+ * marked all-frozen in the visibility map afterwards.
+ *
+ * Freezing is typically optional for most individual pages scanned during any
+ * given VACUUM operation.  This allows vacuumlazy.c to manage the cost of
+ * freezing at the level of the entire VACUUM operation/entire heap relation.
+ */
+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze_required;
+
+	/*
+	 * "No freeze" NewRelfrozenXid/NewRelminMxid trackers.
+	 *
+	 * These trackers are maintained in the same way as the trackers used when
+	 * VACUUM scans a page that isn't cleanup locked.  Both code paths are
+	 * based on the same general idea (do less work for this page during the
+	 * ongoing VACUUM, at the cost of having to accept older final values).
+	 */
+	TransactionId NoFreezePageRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid;
+
+	/*
+	 * Trackers used when heap_freeze_execute_prepared freezes the page.
+	 *
+	 * When we freeze a page, we generally freeze all XIDs < OldestXmin, only
+	 * leaving behind XIDs that are ineligible for freezing, if any.  And so
+	 * you might wonder why these trackers are necessary at all; why should
+	 * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
+	 * ratchet back the rel-level NewRelfrozenXid/NewRelminMxid trackers?
+	 *
+	 * It is useful to use a definition of "freeze the page" that does not
+	 * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
+	 * generally prefers to remove Multis eagerly, but lazy processing is used
+	 * in cases where laziness allows VACUUM to avoid allocating a new Multi.
+	 * The "freeze the page" trackers enable this flexibility.
+	 */
+	TransactionId FreezePageRelfrozenXid;
+	MultiXactId FreezePageRelminMxid;
+
+} HeapPageFreeze;
+
 /* ----------------
  *		function prototypes for heap access method
  *
@@ -180,19 +245,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  const struct VacuumCutoffs *cutoffs,
-									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  HeapPageFreeze *pagefrz,
+									  HeapTupleFreeze *frz, bool *totally_frozen);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-										 TransactionId FreezeLimit,
+										 TransactionId snapshotConflictHorizon,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
-									const struct VacuumCutoffs *cutoffs,
-									TransactionId *relfrozenxid_out,
-									MultiXactId *relminmxid_out);
+extern bool heap_tuple_should_freeze(HeapTupleHeader tuple,
+									 const struct VacuumCutoffs *cutoffs,
+									 TransactionId *NoFreezePageRelfrozenXid,
+									 MultiXactId *NoFreezePageRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
 
 extern void simple_heap_insert(Relation relation, HeapTuple tup);
@@ -210,7 +274,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts,
-							int *nnewlpdead,
+							int *nnewlpdead, bool *prune_fpi,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 389f529af..a9fa88bbb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6098,9 +6098,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		MultiXactId.
  *
  * "flags" is an output value; it's used to tell caller what to do on return.
- *
- * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
- * extant Xid within any Multixact that will remain after freezing executes.
+ * "pagefrz" is an input/output value, used to manage page level freezing.
  *
  * Possible values that we can set in "flags":
  * FRM_NOOP
@@ -6115,16 +6113,34 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
  *
- * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
- * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ * Caller delegates control of page freezing to us.  In practice we always
+ * force freezing of caller's page unless FRM_NOOP processing is indicated.
+ * We help caller ensure that XIDs < FreezeLimit and MXIDs < MultiXactCutoff
+ * can never be left behind.  We freely choose when and how to process each
+ * Multi, without ever violating the cutoff invariants for freezing.
  *
- * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ * It's useful to remove Multis on a proactive timeline (relative to freezing
+ * XIDs) to keep MultiXact member SLRU buffer misses to a minimum.  It can also
+ * be cheaper in the short run, for us, since we too can avoid SLRU buffer
+ * misses through eager processing.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set, though only
+ * when FreezeLimit and/or MultiXactCutoff cutoffs leave us with no choice.
+ * This can usually be put off, which is usually enough to avoid it altogether.
+ *
+ * NB: Caller must maintain "no freeze" NewRelfrozenXid/NewRelminMxid trackers
+ * using heap_tuple_should_freeze when we haven't forced page-level freezing.
+ *
+ * NB: Caller should avoid needlessly calling heap_tuple_should_freeze when we
+ * have already forced page-level freezing, since that might incur the same
+ * SLRU buffer misses that we specifically intended to avoid by freezing.
  */
 static TransactionId
-FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
+FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
 				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
-				  TransactionId *mxid_oldest_xid_out)
+				  HeapPageFreeze *pagefrz)
 {
+	uint16		t_infomask = tuple->t_infomask;
 	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
 	int			nmembers;
@@ -6134,7 +6150,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	bool		has_lockers;
 	TransactionId update_xid;
 	bool		update_committed;
-	TransactionId temp_xid_out;
+	TransactionId FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;
+	TransactionId axid PG_USED_FOR_ASSERTS_ONLY = cutoffs->OldestXmin;
+	MultiXactId amxid PG_USED_FOR_ASSERTS_ONLY = cutoffs->OldestMxact;
 
 	*flags = 0;
 
@@ -6146,14 +6164,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Ensure infomask bits are appropriately set/reset */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, cutoffs->relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
+	else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6166,7 +6186,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoffs->MultiXactCutoff)));
+									 multi, cutoffs->OldestMxact)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
@@ -6202,14 +6222,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			}
 			else
 			{
+				if (TransactionIdPrecedes(newxmax, FreezePageRelfrozenXid))
+					FreezePageRelfrozenXid = newxmax;
 				*flags |= FRM_RETURN_IS_XID;
 			}
 		}
 
-		/*
-		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
-		 * when no Xids will remain
-		 */
+		pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+		pagefrz->freeze_required = true;
 		return newxmax;
 	}
 
@@ -6225,11 +6245,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Nothing worth keeping */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;	/* for FRM_NOOP */
 	for (int i = 0; i < nmembers; i++)
 	{
 		TransactionId xid = members[i].xid;
@@ -6238,26 +6260,35 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
+			/* Can't violate the FreezeLimit invariant */
 			need_replace = true;
 			break;
 		}
-		if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-			temp_xid_out = members[i].xid;
+		if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
+			FreezePageRelfrozenXid = xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than FreezeLimit; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* Can't violate the MultiXactCutoff invariant, either */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
+
 	if (!need_replace)
 	{
 		/*
-		 * When mxid_oldest_xid_out gets pushed back here it's likely that the
-		 * update Xid was the oldest member, but we don't rely on that
+		 * FRM_NOOP case is the only one where we don't force page-level
+		 * freezing (see header comments)
 		 */
 		*flags |= FRM_NOOP;
-		*mxid_oldest_xid_out = temp_xid_out;
+
+		/*
+		 * Might have to ratchet back NewRelminMxid, NewRelfrozenXid, or both
+		 * together to make it safe to skip this particular multi/tuple xmax
+		 * if the page is frozen (similar handling will also be required if
+		 * the page isn't frozen, but caller deals with that directly).
+		 */
+		pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+		if (MultiXactIdPrecedes(multi, pagefrz->FreezePageRelminMxid))
+			pagefrz->FreezePageRelminMxid = multi;
 		pfree(members);
 		return multi;
 	}
@@ -6266,13 +6297,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_should_freeze will indicate that the tuple should be frozen.
 	 */
+	Assert(heap_tuple_should_freeze(tuple, cutoffs, &axid, &amxid));
+
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
 	has_lockers = false;
 	update_xid = InvalidTransactionId;
 	update_committed = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
+	FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;	/* re-init */
 
 	/*
 	 * Determine whether to keep each member xid, or to ignore it instead
@@ -6360,11 +6396,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		/*
 		 * We determined that this is an Xid corresponding to an update that
 		 * must be retained -- add it to new members list for later.  Also
-		 * consider pushing back mxid_oldest_xid_out.
+		 * consider pushing back NewRelfrozenXid tracker.
 		 */
 		newmembers[nnewmembers++] = members[i];
-		if (TransactionIdPrecedes(xid, temp_xid_out))
-			temp_xid_out = xid;
+		if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
+			FreezePageRelfrozenXid = xid;
 	}
 
 	pfree(members);
@@ -6375,10 +6411,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 */
 	if (nnewmembers == 0)
 	{
-		/* nothing worth keeping!? Tell caller to remove the whole thing */
+		/*
+		 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.  Won't
+		 * have to ratchet back NewRelfrozenXid or NewRelminMxid.
+		 */
 		*flags |= FRM_INVALIDATE_XMAX;
 		newxmax = InvalidTransactionId;
-		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
+
+		Assert(pagefrz->FreezePageRelfrozenXid == FreezePageRelfrozenXid);
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
 	{
@@ -6394,22 +6434,29 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
 		newxmax = update_xid;
-		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
+
+		/* Might have to push back FreezePageRelfrozenXid/NewRelfrozenXid */
+		Assert(TransactionIdPrecedesOrEquals(FreezePageRelfrozenXid,
+											 update_xid));
 	}
 	else
 	{
 		/*
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.  The oldest surviving member
-		 * might push back mxid_oldest_xid_out.
+		 * might have already pushed back NewRelfrozenXid.
 		 */
 		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
-		*mxid_oldest_xid_out = temp_xid_out;
+
+		/* Never need to push back FreezePageRelminMxid/NewRelminMxid */
+		Assert(MultiXactIdPrecedesOrEquals(cutoffs->OldestMxact, newxmax));
 	}
 
 	pfree(newmembers);
 
+	pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+	pagefrz->freeze_required = true;
 	return newxmax;
 }
 
@@ -6417,9 +6464,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the FreezeLimit and/or MultiXactCutoff freeze cutoffs.  If so,
- * setup enough state (in the *frz output argument) to later execute and
- * WAL-log what caller needs to do for the tuple, and return true.  Return
+ * are older than the OldestXmin and/or OldestMxact freeze cutoffs.  If so,
+ * setup enough state (in the *frz output argument) to enable caller to
+ * process this tuple as part of freezing its page, and return true.  Return
  * false if nothing can be changed about the tuple right now.
  *
  * Also sets *totally_frozen to true if the tuple will be totally frozen once
@@ -6427,22 +6474,30 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * frozen by an earlier VACUUM).  This indicates that there are no remaining
  * XIDs or MultiXactIds that will need to be processed by a future VACUUM.
  *
- * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
- * returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * VACUUM caller must assemble HeapTupleFreeze freeze plan entries for every
+ * tuple that we returned true for, and call heap_freeze_execute_prepared to
+ * execute freezing.  Caller must initialize pagefrz fields for page as a
+ * whole before first call here for each heap page.
+ *
+ * VACUUM caller decides on whether or not to freeze the page as a whole.
+ * We'll often prepare freeze plans for a page that caller just discards.
+ * However, VACUUM doesn't always get to make a choice; it must freeze when
+ * pagefrz.freeze_required is set, to ensure that any XIDs < FreezeLimit (and
+ * MXIDs < MultiXactCutoff) can never be left behind.  We make sure that
+ * VACUUM always follows that rule.
+ *
+ * We sometimes force freezing of xmax MultiXactId values long before it is
+ * strictly necessary to do so just to ensure the FreezeLimit postcondition.
+ * It's worth processing MultiXactIds proactively when it is cheap to do so,
+ * and it's convenient to make that happen by piggy-backing it on the "force
+ * freezing" mechanism.  Conversely, we sometimes delay freezing MultiXactIds
+ * because it is expensive right now (though only when it's still possible to
+ * do so without violating the FreezeLimit/MultiXactCutoff postcondition).
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6451,9 +6506,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  const struct VacuumCutoffs *cutoffs,
-						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  HeapPageFreeze *pagefrz,
+						  HeapTupleFreeze *frz, bool *totally_frozen)
 {
 	bool		xmin_already_frozen = false,
 				xmax_already_frozen = false;
@@ -6470,7 +6524,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Process xmin, while keeping track of whether it's already frozen, or
-	 * will become frozen when our freeze plan is executed by caller (could be
+	 * will become frozen iff our freeze plan is executed by caller (could be
 	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
@@ -6484,21 +6538,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
-		if (freeze_xmin)
-		{
-			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoffs->FreezeLimit)));
-		}
-		else
-		{
-			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
+		if (freeze_xmin && !TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
+									 xid, cutoffs->OldestXmin)));
 	}
 
 	/*
@@ -6515,41 +6560,55 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we always freeze proactively.  This allows totally_frozen
 		 * tracking to ignore xvac.
 		 */
-		replace_xvac = true;
+		replace_xvac = pagefrz->freeze_required = true;
 	}
 
-	/*
-	 * Process xmax.  To thoroughly examine the current Xmax value we need to
-	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given FreezeLimit.  In that case, those values might need
-	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
-	 * it out --- if there's a live updater Xid, it needs to be kept.
-	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
-	 */
+	/* Now process xmax */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
-
 	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
 		/* Raw xmax is a MultiXactId */
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
 
-		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
-									&flags, &mxid_oldest_xid_out);
+		/*
+		 * We will either remove xmax completely (in the "freeze_xmax" path),
+		 * process xmax by replacing it (in the "replace_xmax" path), or
+		 * perform no-op xmax processing.  The only constraint is that the
+		 * FreezeLimit/MultiXactCutoff invariant must never be violated.
+		 */
+		newxmax = FreezeMultiXactId(xid, tuple, cutoffs, &flags, pagefrz);
 
-		if (flags & FRM_RETURN_IS_XID)
+		if (flags & FRM_NOOP)
+		{
+			/*
+			 * xmax is a MultiXactId, and nothing about it changes for now.
+			 * This is the only case where 'freeze_required' won't have been
+			 * set for us by FreezeMultiXactId, as well as the only case where
+			 * neither freeze_xmax nor replace_xmax are set (given a multi).
+			 *
+			 * This is a no-op, but the call to FreezeMultiXactId might have
+			 * ratcheted back NewRelfrozenXid and/or NewRelminMxid for us.
+			 * That makes it safe to freeze the page while leaving this
+			 * particular xmax undisturbed.
+			 *
+			 * FreezeMultiXactId is _not_ responsible for the "no freeze"
+			 * NewRelfrozenXid/NewRelminMxid trackers, though -- that's our
+			 * job.  A call to heap_tuple_should_freeze for this same tuple
+			 * will take place below if 'freeze_required' isn't set already.
+			 * (This approach repeats some of the work from FreezeMultiXactId,
+			 * which is not ideal but makes things simpler.)
+			 */
+			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+			Assert(!MultiXactIdPrecedes(newxmax, pagefrz->FreezePageRelminMxid));
+		}
+		else if (flags & FRM_RETURN_IS_XID)
 		{
 			/*
 			 * xmax will become an updater Xid (original MultiXact's updater
 			 * member Xid will be carried forward as a simple Xid in Xmax).
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!TransactionIdPrecedes(newxmax, cutoffs->OldestXmin));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6572,13 +6631,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax is an old MultiXactId that we have to replace with a new
 			 * MultiXactId, to carry forward two or more original member XIDs.
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!MultiXactIdPrecedes(newxmax, cutoffs->OldestMxact));
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6594,20 +6648,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			frz->xmax = newxmax;
 			replace_xmax = true;
 		}
-		else if (flags & FRM_NOOP)
-		{
-			/*
-			 * xmax is a MultiXactId, and nothing about it changes for now.
-			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
-			 * both together.
-			 */
-			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
-		}
 		else
 		{
 			/*
@@ -6619,6 +6659,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(!TransactionIdIsValid(newxmax));
 			freeze_xmax = true;
 		}
+
+		/* Only FRM_NOOP doesn't force caller to freeze page */
+		Assert(pagefrz->freeze_required || (!freeze_xmax && !replace_xmax));
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
@@ -6629,28 +6672,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
-		{
-			/*
-			 * If we freeze xmax, make absolutely sure that it's not an XID
-			 * that is important.  (Note, a lock-only xmax can be removed
-			 * independent of committedness, since a committed lock holder has
-			 * released the lock).
-			 */
-			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
-				TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("cannot freeze committed xmax %u",
-										 xid)));
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
 			freeze_xmax = true;
-			/* No need for relfrozenxid_out handling, since we'll freeze xmax */
-		}
-		else
-		{
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+
+		/*
+		 * If we freeze xmax, make absolutely sure that it's not an XID that
+		 * is important.  (Note, a lock-only xmax can be removed independent
+		 * of committedness, since a committed lock holder has released the
+		 * lock).
+		 */
+		if (freeze_xmax && !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+			TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("cannot freeze committed xmax %u",
+									 xid)));
 	}
 	else if (!TransactionIdIsValid(xid))
 	{
@@ -6677,6 +6713,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * failed; whereas a non-dead MOVED_IN tuple must mean the xvac
 		 * transaction succeeded.
 		 */
+		Assert(pagefrz->freeze_required);
 		if (tuple->t_infomask & HEAP_MOVED_OFF)
 			frz->frzflags |= XLH_INVALID_XVAC;
 		else
@@ -6685,6 +6722,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	if (replace_xmax)
 	{
 		Assert(!xmax_already_frozen && !freeze_xmax);
+		Assert(pagefrz->freeze_required);
 
 		/* Already set t_infomask/t_infomask2 flags in freeze plan */
 	}
@@ -6707,7 +6745,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Determine if this tuple is already totally frozen, or will become
-	 * totally frozen
+	 * totally frozen (provided caller executes freeze plan for the page)
 	 */
 	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
@@ -6715,6 +6753,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/* A "totally_frozen" tuple must not leave anything behind in xmax */
 	Assert(!*totally_frozen || !replace_xmax);
 
+	/*
+	 * Check if the option of _not_ freezing caller's page is still in play,
+	 * though don't bother when we already forced freezing earlier on
+	 */
+	if (!pagefrz->freeze_required && !(xmin_already_frozen &&
+									   xmax_already_frozen))
+	{
+		pagefrz->freeze_required =
+			heap_tuple_should_freeze(tuple, cutoffs,
+									 &pagefrz->NoFreezePageRelfrozenXid,
+									 &pagefrz->NoFreezePageRelminMxid);
+	}
+
 	/* Tell caller if this tuple has a usable freeze plan set in *frz */
 	return freeze_xmin || replace_xvac || replace_xmax || freeze_xmax;
 }
@@ -6759,13 +6810,12 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
  */
 void
 heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-							 TransactionId FreezeLimit,
+							 TransactionId snapshotConflictHorizon,
 							 HeapTupleFreeze *tuples, int ntuples)
 {
 	Page		page = BufferGetPage(buffer);
 
 	Assert(ntuples > 0);
-	Assert(TransactionIdIsNormal(FreezeLimit));
 
 	START_CRIT_SECTION();
 
@@ -6788,19 +6838,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 		int			nplans;
 		xl_heap_freeze_page xlrec;
 		XLogRecPtr	recptr;
-		TransactionId snapshotConflictHorizon;
 
 		/* Prepare deduplicated representation for use in WAL record */
 		nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
 
-		/*
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
-		 */
-		snapshotConflictHorizon = FreezeLimit;
-		TransactionIdRetreat(snapshotConflictHorizon);
-
 		xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
 		xlrec.nplans = nplans;
 
@@ -6841,8 +6882,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	bool		do_freeze;
 	bool		totally_frozen;
 	struct VacuumCutoffs cutoffs;
-	TransactionId NewRelfrozenXid = FreezeLimit;
-	MultiXactId NewRelminMxid = MultiXactCutoff;
+	HeapPageFreeze pagefrz;
 
 	cutoffs.relfrozenxid = relfrozenxid;
 	cutoffs.relminmxid = relminmxid;
@@ -6851,9 +6891,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 
-	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
-										  &frz, &totally_frozen,
-										  &NewRelfrozenXid, &NewRelminMxid);
+	pagefrz.freeze_required = true;
+	pagefrz.NoFreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.NoFreezePageRelminMxid = MultiXactCutoff;
+	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.FreezePageRelminMxid = MultiXactCutoff;
+
+	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs, &pagefrz,
+										  &frz, &totally_frozen);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7276,22 +7321,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
 }
 
 /*
- * heap_tuple_would_freeze
+ * heap_tuple_should_freeze
  *
- * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * Return value indicates if heap_prepare_freeze_tuple sibling function should
+ * force freezing of the page containing tuple.  This happens whenever the
+ * tuple contains XID/MXID fields with values < FreezeLimit/MultiXactCutoff.
  *
- * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
- * like the heap_prepare_freeze_tuple arguments that they're based on.  We
- * never freeze here, which makes tracking the oldest extant XID/MXID simple.
+ * The *NoFreezePageRelfrozenXid and *NoFreezePageRelminMxid input/output
+ * arguments help VACUUM track the oldest extant XID/MXID remaining in rel.
+ * Our working assumption is that caller won't decide to freeze this tuple.
+ * It's up to caller to only ratchet back its own top-level trackers after the
+ * point that it commits to not freezing the tuple/page in question.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple,
-						const struct VacuumCutoffs *cutoffs,
-						TransactionId *relfrozenxid_out,
-						MultiXactId *relminmxid_out)
+heap_tuple_should_freeze(HeapTupleHeader tuple,
+						 const struct VacuumCutoffs *cutoffs,
+						 TransactionId *NoFreezePageRelfrozenXid,
+						 MultiXactId *NoFreezePageRelminMxid)
 {
 	TransactionId xid;
 	MultiXactId multi;
@@ -7302,8 +7348,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	if (TransactionIdIsNormal(xid))
 	{
 		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 			freeze = true;
 	}
@@ -7320,8 +7366,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	{
 		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 		/* xmax is a non-permanent XID */
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 			freeze = true;
 	}
@@ -7332,8 +7378,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
 	{
 		/* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
 		/* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
 		freeze = true;
 	}
@@ -7344,8 +7390,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		int			nmembers;
 
 		Assert(MultiXactIdPrecedesOrEquals(cutoffs->relminmxid, multi));
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
 		if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
 			freeze = true;
 
@@ -7357,8 +7403,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		{
 			xid = members[i].xid;
 			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
 			if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 				freeze = true;
 		}
@@ -7372,9 +7418,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		if (TransactionIdIsNormal(xid))
 		{
 			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 91c5f5e9e..e334ee8dc 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -21,6 +21,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -205,9 +206,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		{
 			int			ndeleted,
 						nnewlpdead;
+			bool		fpi;
 
 			ndeleted = heap_page_prune(relation, buffer, vistest, limited_xmin,
-									   limited_ts, &nnewlpdead, NULL);
+									   limited_ts, &nnewlpdead, &fpi, NULL);
 
 			/*
 			 * Report the number of tuples reclaimed to pgstats.  This is
@@ -255,7 +257,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * InvalidTransactionId/0 respectively.
  *
  * Sets *nnewlpdead for caller, indicating the number of items that were
- * newly set LP_DEAD during prune operation.
+ * newly set LP_DEAD during prune operation.  Also sets *prune_fpi for
+ * caller, indicating if pruning generated a full-page image as torn page
+ * protection.
  *
  * off_loc is the offset location required by the caller to use in error
  * callback.
@@ -267,7 +271,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				int *nnewlpdead,
+				int *nnewlpdead, bool *prune_fpi,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -380,6 +384,8 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	*prune_fpi = false;			/* for now */
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -417,6 +423,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
+			int64		wal_fpi_before = pgWalUsage.wal_fpi;
 
 			xlrec.snapshotConflictHorizon = prstate.snapshotConflictHorizon;
 			xlrec.nredirected = prstate.nredirected;
@@ -448,6 +455,9 @@ heap_page_prune(Relation relation, Buffer buffer,
 			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
+
+			if (wal_fpi_before != pgWalUsage.wal_fpi)
+				*prune_fpi = true;
 		}
 	}
 	else
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b234072e8..fe64bd6ed 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1528,8 +1528,9 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	bool		prune_fpi;
+	HeapPageFreeze pagefrz;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
@@ -1545,8 +1546,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.freeze_required = false;
+	pagefrz.NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.FreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.FreezePageRelminMxid = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1564,7 +1568,7 @@ retry:
 	 */
 	tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
 									 InvalidTransactionId, 0, &nnewlpdead,
-									 &vacrel->offnum);
+									 &prune_fpi, &vacrel->offnum);
 
 	/*
 	 * Now scan the page to collect LP_DEAD items and check for tuples
@@ -1599,27 +1603,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether the page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1746,9 +1746,8 @@ retry:
 		prunestate->hastup = true;	/* page makes rel truncation unsafe */
 
 		/* Tuple with storage -- consider need to freeze */
-		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
-									  &frozen[tuples_frozen], &totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs, &pagefrz,
+									  &frozen[tuples_frozen], &totally_frozen))
 		{
 			/* Save prepared freeze plan for later */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1769,23 +1768,65 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
+	 * freeze when pruning generated an FPI, if doing so means that we set the
+	 * page all-frozen afterwards (this could happen during second heap pass).
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (pagefrz.freeze_required || tuples_frozen == 0 ||
+		(prunestate->all_visible && prunestate->all_frozen && prune_fpi))
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need to freeze anything (pruning might be all we need).
+		 */
+		vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* NewRelfrozenXid <= all XIDs in tuples that weren't pruned away */
+		vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
-	 * first (arbitrary)
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		TransactionId snapshotConflictHorizon;
+
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
+		/*
+		 * We can use the latest xmin cutoff (which is generally used for 'VM
+		 * set' conflicts) as our cutoff for freeze conflicts when the whole
+		 * page is eligible to become all-frozen in the VM once frozen by us.
+		 * Otherwise use a conservative cutoff (just back up from OldestXmin).
+		 */
+		if (prunestate->all_visible && prunestate->all_frozen)
+			snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+		else
+		{
+			snapshotConflictHorizon = vacrel->cutoffs.OldestXmin;
+			TransactionIdRetreat(snapshotConflictHorizon);
+		}
+
 		/* Execute all freeze plans for page as a single atomic action */
 		heap_freeze_execute_prepared(vacrel->rel, buf,
-									 vacrel->cutoffs.FreezeLimit,
+									 snapshotConflictHorizon,
 									 frozen, tuples_frozen);
 	}
 
@@ -1804,7 +1845,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1812,8 +1853,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1834,9 +1874,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1850,6 +1887,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
@@ -1894,8 +1935,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				recently_dead_tuples,
 				missed_dead_tuples;
 	HeapTupleHeader tupleheader;
-	TransactionId NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	MultiXactId NewRelminMxid = vacrel->NewRelminMxid;
+	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 
 	Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1940,8 +1981,9 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_would_freeze(tupleheader, &vacrel->cutoffs,
-									&NewRelfrozenXid, &NewRelminMxid))
+		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
+									 &NoFreezePageRelfrozenXid,
+									 &NoFreezePageRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
 			if (vacrel->aggressive)
@@ -2022,8 +2064,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 	 * this particular page until the next VACUUM.  Remember its details now.
 	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
 	/* Save any LP_DEAD items found on the page in dead_items array */
 	if (vacrel->nindexes == 0)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8e4145979..cbcca561d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9171,9 +9171,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9251,9 +9251,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        should use to decide whether to trigger freezing of pages with
+        an older multixact ID.  The default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
-- 
2.38.1

v10-0003-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/x-patch)
From 7e75a28e36f2ef6c508acd38b62e465fe727ca07 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v10 3/5] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively whenever eager freezing is deemed
appropriate for the table.  VACUUM determines its freezing
strategy based on the value of the new vacuum_freeze_strategy_threshold
GUC (or reloption) in most cases: tables that exceed the size threshold
use the eager freezing strategy.  Otherwise VACUUM uses the lazy
freezing strategy, which is essentially the same approach that VACUUM
has always taken to freezing (though not exactly the same, due to the
influence of page-level freezing following recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
of a page's tuples at the point that it notices that the page will at
least become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add vmsnap scanning strategies, which are designed
to work in tandem with the freezing strategies from this commit.
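
As a rough illustration only (the strategy decision is condensed from
the lazy_scan_strategy hunk below, while the per-page effect is an
approximation that uses hypothetical helper names), the two pieces are
expected to fit together like this:

    /* Once per VACUUM, before any heap pages are scanned: */
    vacrel->eager_freeze_strategy =
        vacrel->rel_pages >= vacrel->cutoffs.freeze_strategy_threshold;

    /* Later, for each heap page processed by lazy_scan_prune: */
    if (pagefrz.freeze_required ||
        (vacrel->eager_freeze_strategy &&
         prunestate->all_visible && prunestate->all_frozen))
        freeze_page();      /* hypothetical: page set all-frozen in the VM */
    else
        skip_freezing();    /* hypothetical: page set all-visible at most */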

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 | 10 +++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 ++++++
 src/backend/access/heap/vacuumlazy.c          | 39 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 17 +++++++-
 src/backend/postmaster/autovacuum.c           | 10 +++++
 src/backend/utils/misc/guc_tables.c           | 11 ++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 ++++++++-
 doc/src/sgml/ref/create_table.sgml            | 14 +++++++
 doc/src/sgml/ref/vacuum.sgml                  | 16 ++++----
 11 files changed, 138 insertions(+), 11 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 896d1b1ac..194849cff 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -272,6 +275,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -295,6 +304,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index fe64bd6ed..a53bc6b0f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -152,6 +152,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -241,6 +243,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -469,6 +472,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1252,6 +1259,27 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * totally avoids freezing that turns out to have been wasted effort later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	Assert(vacrel->scanned_pages == 0);
+
+	vacrel->eager_freeze_strategy =
+		rel_pages >= vacrel->cutoffs.freeze_strategy_threshold;
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1773,9 +1801,18 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (this could happen during second heap pass).
+	 *
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will become all-visible, making it all-frozen instead.
+	 * (Actually, the all-visible/eager freezing strategy doesn't quite work
+	 * that way.  It triggers freezing for pages that it sees will thereby be
+	 * set all-frozen in the VM immediately afterwards -- a stricter test.
+	 * Some pages that can be set all-visible cannot also be set all-frozen,
+	 * even after freezing, due to the presence of lock-only MultiXactIds.)
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
-		(prunestate->all_visible && prunestate->all_frozen && prune_fpi))
+		(prunestate->all_visible && prunestate->all_frozen &&
+		 (vacrel->eager_freeze_strategy || prune_fpi)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ba965b8c7..7c68bd8ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -926,7 +930,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -939,6 +944,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1053,6 +1059,15 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 0746d8022..23e316e59 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1bf14eec6..549a2e969 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2503,6 +2503,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 043864597..4763cb6bb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -692,6 +692,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cbcca561d..66e947612 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9138,6 +9138,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9173,7 +9188,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index e14ead882..79595b1cb 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
-- 
2.38.1

Attachment: v10-0004-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/x-patch)
From 63c3e4e46463dcc76174eedb2351e5be979c9657 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v10 4/5] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will scan (or skip) heap pages.  The data structure is a
local copy of the visibility map, taken at the start of VACUUM; it spills
to disk as required, though only for larger tables.

Non-aggressive VACUUMs now make an up-front choice about VM snapshot
scanning strategy: they decide whether or not to prioritize early
advancement of relfrozenxid (eager strategy) over avoiding work by
skipping all-visible pages (lazy strategy).  VACUUM decides on its
scanning and freezing strategies together, shortly before the first pass
over the heap begins, since the concepts are closely related and work in
tandem.  Note that the scanning strategy often has a significant
impact on the total number of pages frozen by VACUUM, even when lazy
freezing is in use.

Also make the VACUUM command's DISABLE_PAGE_SKIPPING option stop forcing
aggressive mode.  As a consequence, the option will no longer have any
impact on when or how VACUUM waits for a cleanup lock the hard way.  The
option now makes VACUUM distrust the visibility map, and nothing more.
DISABLE_PAGE_SKIPPING now works by making VACUUM opt to use a dedicated
"scan all pages" scanning strategy.

This lays the groundwork for completely removing aggressive mode VACUUMs
in a later commit; vmsnap scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  VACUUM makes a choice about
which VM scanning strategy to use by considering how close table age is
to autovacuum_freeze_max_age (actually vacuum_freeze_table_age)
directly, in a way that is roughly comparable to our previous approach.
But table age is now just one factor, considered alongside several others.
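
To make the weighting concrete, here is a stand-alone sketch (my own
illustration, not code from the patch) of the interpolation that
lazy_scan_strategy performs between its two tableagefrac cutoffs.  The
constants are copied from the patch, the function name is invented, and
the patch additionally clamps the result to at least 32 pages.  At
tableagefrac = 0.7, for example, eager scanning is accepted until it would
add roughly 37.5% of rel_pages in extra scanned pages:

static double
eager_extra_pages_threshold(double rel_pages, double tableagefrac)
{
	const double midpoint = 0.5;	/* TABLEAGEFRAC_MIDPOINT */
	const double highpoint = 0.9;	/* TABLEAGEFRAC_HIGHPOINT */
	const double young_frac = 0.05; /* MAX_PAGES_YOUNG_TABLEAGE */
	const double old_frac = 0.70;	/* MAX_PAGES_OLD_TABLEAGE */
	double		scale;

	if (tableagefrac < midpoint)
		return rel_pages * young_frac;	/* only very cheap eagerness */
	if (tableagefrac > highpoint)
		return rel_pages + 1;	/* unreachable threshold: eager forced */

	/* interpolate between 5% and 70% of rel_pages */
	scale = 1.0 - ((highpoint - tableagefrac) / (highpoint - midpoint));
	return rel_pages * (young_frac * (1.0 - scale) + old_frac * scale);
}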

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.  VACUUM's
final scanned_pages is "locked in" when it decides on scanning strategy
(so scanned_pages is finalized before the first heap pass even begins).
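
The staging and prefetch loop itself is in the visibilitymap.c changes; as
a rough sketch of the idea (my own code, with invented names loosely based
on the vmsnapshot struct fields), the snapshot would stay at most
prefetch_distance blocks ahead of the next block it hands back:

/* Sketch only, not the patch's code */
static void
vmsnap_prefetch_ahead(Relation rel, const BlockNumber *staged, int nstaged,
					  int next_return_idx, int *next_prefetch_idx,
					  int prefetch_distance)
{
#ifdef USE_PREFETCH
	while (*next_prefetch_idx < nstaged &&
		   *next_prefetch_idx - next_return_idx < prefetch_distance)
		PrefetchBuffer(rel, MAIN_FORKNUM, staged[(*next_prefetch_idx)++]);
#endif
}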

Prefetching should fully make up for any loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and to encourage
relfrozenxid advancement.  For full details, see commit bf136cf6, from
around the time the visibility map first went in.

Also teach VACUUM to use scanned_pages (not rel_pages) to cap the size
of the dead_items array.  This approach is strictly better: by the time
dead_items is allocated, vmsnap has already locked in the precise set of
pages that will be scanned, so there is no question of scanning anything
beyond that.
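
To give a rough sense of the cap (my own worked figures, not from the
patch): with 8kB blocks, MaxHeapTuplesPerPage is about 291, so a VACUUM
whose vmsnap has locked in 10,000 scanned pages never allocates room for
more than roughly 2.9 million TIDs, regardless of maintenance_work_mem.
In sketch form:

/* Sketch only: the scanned_pages-based cap applied to dead_items */
static int64
dead_items_cap(int64 budgeted_items, BlockNumber scanned_pages)
{
	int64		max_items = budgeted_items;

	if (max_items / MaxHeapTuplesPerPage > scanned_pages)
		max_items = (int64) scanned_pages * MaxHeapTuplesPerPage;

	return max_items;
}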

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  15 +-
 src/backend/access/heap/vacuumlazy.c          | 464 ++++++++-------
 src/backend/access/heap/visibilitymap.c       | 547 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |  68 +--
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 ++-
 doc/src/sgml/ref/vacuum.sgml                  |   4 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 11 files changed, 935 insertions(+), 279 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..4a1f47ac6 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 194849cff..4dd31bd3f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 
 /*
  * Values used by index_cleanup and truncate params.
@@ -281,6 +281,19 @@ struct VacuumCutoffs
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid advancement
+	 * strictly necessary.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * proactively.  It is especially likely with tables where the _added_
+	 * costs happen to be low.
+	 */
+	double		tableagefrac;
 };
 
 /*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a53bc6b0f..0619d54dd 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -109,10 +109,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -150,8 +158,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -170,7 +176,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -243,11 +251,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  const VacuumParams *params);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -277,7 +282,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -309,10 +315,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -458,37 +464,27 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's vmsnap freezing and scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel, params);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -498,13 +494,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -551,12 +548,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid when lazy_scan_strategy call
+		 * decided to skip all-visible pages
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -601,6 +597,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -628,10 +627,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -827,13 +822,12 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_failsafe_block = 0,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -847,42 +841,29 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+												 &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
+		if (blkno < next_block_to_scan)
 		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
-
-			Assert(next_unskippable_block >= blkno + 1);
+			Assert(blkno != rel_pages - 1);
+			continue;
 		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
 
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+													 &next_all_visible);
+		Assert(next_block_to_scan > blkno);
 
 		vacrel->scanned_pages++;
 
@@ -1092,10 +1073,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1123,12 +1103,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1167,7 +1145,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1260,7 +1238,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1268,122 +1246,199 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
- */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
-{
-	BlockNumber rel_pages = vacrel->rel_pages;
-
-	Assert(vacrel->scanned_pages == 0);
-
-	vacrel->eager_freeze_strategy =
-		rel_pages >= vacrel->cutoffs.freeze_strategy_threshold;
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0.  The value 1.0 is the point that autovacuum.c starts
+ * launching antiwraparound autovacuums to advance relfrozenxid/relminmxid,
+ * which makes eager scanning strategy mandatory (though we always use eager
+ * scanning whenever tableagefrac reaches 0.9 or more, to try to stay ahead).
  *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but they are not independent in any practical sense (it's just mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
  *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
 static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
+lazy_scan_strategy(LVRelState *vacrel, const VacuumParams *params)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-			break;
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
+	double		tableagefrac = vacrel->cutoffs.tableagefrac;
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide freezing strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * The eager freezing strategy is used when rel_pages reaches the
+	 * threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->eager_freeze_strategy =
+		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 tableagefrac >= TABLEAGEFRAC_HIGHPOINT);
+
+	/*
+	 * Decide vmsnap scanning strategy.
+	 *
+	 * First acquire a visibility map snapshot, which determines the number of
+	 * pages that each vmsnap scanning strategy is required to scan for us in
+	 * passing.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
+	 */
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied based in part on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These thresholds also represent the minimum and
+	 * maximum thresholds that can ever make sense for a table of this size
+	 * (when the table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Smaller tables (where lazy freezing is generally used) shouldn't ever
+	 * need to do dramatically more work than usual to advance relfrozenxid.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	Assert(rel_pages >= nextra_scanned_eager && vacrel->scanned_pages == 0);
+	if (tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below table age mid point, so table age is
+		 * still of only minimal concern.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan some extra pages not exceeding 5% of rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (tableagefrac <= TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages. The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a pure percentage basis) by ~8.1%
+		 * of rel_pages for each additional increment of 5% of tableagefrac
+		 * after tableagefrac crosses the mid point (and before tableagefrac
+		 * crosses the high point, which will always force eager scanning).
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age is approaching (or surpasses) the point that an
+		 * antiwraparound autovacuum is required.  Force VMSNAP_SCAN_EAGER, no
+		 * matter how many extra pages we'll be required to scan as a result.
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses the
+		 * high point: the threshold set here jumps from 70% of rel_pages to
+		 * 100% of rel_pages.  It's useful to only weigh table age at some
+		 * point before an antiwraparound autovacuum is required.  That way
+		 * even extreme cases (including cases where freeze_strategy_threshold
+		 * is set to a very high value) have at least some chance of using the
+		 * eager scanning strategy outside of antiwraparound autovacuums.
+		 */
+		nextra_toomany_threshold = MaxBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Make final choice on scanning strategy using final threshold */
+	nextra_toomany_threshold = Max(32, nextra_toomany_threshold);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * Override choice of scanning strategy (force vmsnap to scan every page
+	 * in the range of rel_pages) in DISABLE_PAGE_SKIPPING case
+	 */
+	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+	else if (vacrel->aggressive)
+		vacrel->vmstrat = VMSNAP_SCAN_EAGER;
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
 
 /*
@@ -2823,6 +2878,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * in pages that it skips using the VM, so we must avoid interpreting skipped
+ * pages as empty pages when it makes little sense.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3113,14 +3176,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3129,15 +3191,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3159,12 +3219,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..816576dca 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,87 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+typedef struct vmsnapblock
+{
+	BlockNumber scanned_block;
+	bool		all_visible;
+} vmsnapblock;
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	vmsnapblock staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -373,6 +461,350 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines the kind of blocks visibilitymap_snap_next should indicate
+ * need to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is sheer paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	*scanned_pages_lazy = rel_pages - all_visible;
+	*scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+		(*scanned_pages_lazy)++;
+	if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+		(*scanned_pages_eager)++;
+
+	vmsnap->scanned_pages_lazy = *scanned_pages_lazy;
+	vmsnap->scanned_pages_eager = *scanned_pages_eager;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		BlockNumber block = vmsnap->staged[i].scanned_block;
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, block);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * The all-visible status of returned block is set in *allvisible.  Block
+ * usually won't be set all-visible (else VACUUM wouldn't need to scan it),
+ * but it can be in certain corner cases.  This includes the VMSNAP_SCAN_ALL
+ * case, as well as a special case that VACUUM expects us to handle: the final
+ * block (rel_pages - 1) is always returned here (regardless of our strategy).
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible)
+{
+	BlockNumber next_block_to_scan;
+	vmsnapblock block;
+
+	*allvisible = true;
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	block = vmsnap->staged[vmsnap->next_return_idx++];
+	*allvisible = block.all_visible;
+	next_block_to_scan = block.scanned_block;
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(vmsnapblock) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		vmsnapblock prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch.scanned_block);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -677,3 +1109,118 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		bool		all_visible = true;
+		vmsnapblock stage;
+
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				all_visible = false;
+				break;
+			}
+
+			/*
+			 * Stop staging blocks just before final page, which must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		stage.scanned_block = vmsnap->next_block++;
+		stage.all_visible = all_visible;
+		vmsnap->staged[vmsnap->first_invalid_idx++] = stage;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired we
+	 * defensively assume heapBlk not all-visible or all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
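
For reference, a minimal sketch of the call protocol these new routines set
up, roughly as a vacuumlazy.c caller would drive it.  This is illustrative
only -- the real integration is in the later patches in the series, and the
relation setup and loop body are placeholders:

	BlockNumber rel_pages = RelationGetNumberOfBlocks(rel);
	BlockNumber scanned_pages_lazy,
				scanned_pages_eager,
				blkno;
	bool		all_visible;
	vmsnapshot *vmsnap;

	/* Materialize the VM once, up front, getting both scanned_pages totals */
	vmsnap = visibilitymap_snap_acquire(rel, rel_pages,
										&scanned_pages_lazy,
										&scanned_pages_eager);

	/*
	 * Caller weighs scanned_pages_lazy against scanned_pages_eager (and
	 * table age) and locks in one scanning strategy before reading any
	 * heap pages
	 */
	visibilitymap_snap_strategy(vmsnap, VMSNAP_SCAN_EAGER);

	/* Consume blocks in order; skipped pages are simply never returned */
	while ((blkno = visibilitymap_snap_next(vmsnap, &all_visible)) !=
		   InvalidBlockNumber)
	{
		/* prune and possibly freeze heap block blkno here */
	}

	visibilitymap_snap_release(vmsnap);
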
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c68bd8ff..5085d9407 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -933,11 +933,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				effective_multixact_freeze_max_age,
 				freeze_strategy_threshold;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1069,48 +1069,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * MXID table age (whichever is currently greater).
+	 */
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
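
To make the tableagefrac arithmetic above concrete, a worked example using
made-up numbers (not taken from the patch or its tests): with
vacuum_freeze_table_age left at its new default of -1 and the stock
autovacuum_freeze_max_age of 200 million, freeze_table_age becomes
200000000.  If the table's XID age (nextXID - relfrozenxid) is 50 million
and its MXID age is negligible:

	XIDFrac      = 50000000 / (200000000 + 0.5)  ~= 0.25
	tableagefrac = Max(XIDFrac, MXIDFrac)        ~= 0.25

So this table is about a quarter of the way to tableagefrac >= 1.0, the
point at which vacuum_get_cutoffs() returns true -- the same point that
antiwraparound autovacuums force unconditionally via params->is_wraparound.
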
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 549a2e969..554e2bd0c 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2476,10 +2476,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2496,10 +2496,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4763cb6bb..bb50a5486 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -658,6 +658,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -691,11 +698,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 66e947612..f36594e7c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9161,20 +9161,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9243,19 +9251,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 79595b1cb..c137debb1 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,9 +158,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
       all tuples are known to be frozen can always be skipped, and those
       where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      skipped except when performing an aggressive vacuum.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1

Attachment: v10-0001-Refactor-how-VACUUM-passes-around-its-XID-cutoff.patch (application/x-patch)
From b2d882a7ef30ead6cfc781294330bd56aecc1606 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 19 Nov 2022 16:37:53 -0800
Subject: [PATCH v10 1/5] Refactor how VACUUM passes around its XID cutoffs.

Use a dedicated struct for the XID/MXID cutoffs used by VACUUM, such as
FreezeLimit and OldestXmin.  This state is initialized in vacuum.c, and
then passed around (via const pointers) by code from vacuumlazy.c to
external freezing related routines like heap_prepare_freeze_tuple.

Also simplify some of the logic for dealing with frozen xmin in
heap_prepare_freeze_tuple: add dedicated "xmin_already_frozen" state to
clearly distinguish xmin XIDs that we're going to freeze from those that
were already frozen from before.  This makes its xmin handling code
symmetrical with its xmax handling code.  This is preparation for an
upcoming commit that adds page level freezing.

Also refactor the control flow within FreezeMultiXactId(), while adding
stricter sanity checks.  We now test OldestXmin directly (instead of
using FreezeLimit as an inexact proxy for OldestXmin).  This is further
preparation for the page level freezing work, which will make the
function's caller give over control of page level freezing when needed
(whenever heap_prepare_freeze_tuple encounters a tuple/page that happens
to contain one or more MultiXactIds).

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznS9TxXmz2_=SY+SyJyDFbiOftKofM9=aDo68BbXNBUMA@mail.gmail.com
---
 src/include/access/heapam.h            |  10 +-
 src/include/access/tableam.h           |   2 +-
 src/include/commands/vacuum.h          |  49 ++-
 src/backend/access/heap/heapam.c       | 490 ++++++++++++-------------
 src/backend/access/heap/vacuumlazy.c   | 197 +++++-----
 src/backend/access/transam/multixact.c |   9 +-
 src/backend/commands/cluster.c         |  25 +-
 src/backend/commands/vacuum.c          | 120 +++---
 8 files changed, 438 insertions(+), 464 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 810baaf9d..53eb01176 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -38,6 +38,7 @@
 
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
+struct VacuumCutoffs;
 
 #define MaxLockTupleMode	LockTupleExclusive
 
@@ -178,8 +179,7 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
-									  TransactionId relfrozenxid, TransactionId relminmxid,
-									  TransactionId cutoff_xid, TransactionId cutoff_multi,
+									  const struct VacuumCutoffs *cutoffs,
 									  HeapTupleFreeze *frz, bool *totally_frozen,
 									  TransactionId *relfrozenxid_out,
 									  MultiXactId *relminmxid_out);
@@ -188,9 +188,9 @@ extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
-							  TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-									MultiXactId cutoff_multi,
+							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
+									const struct VacuumCutoffs *cutoffs,
 									TransactionId *relfrozenxid_out,
 									MultiXactId *relminmxid_out);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4d1ef405c..1320ee6db 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1634,7 +1634,7 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
  *   in that index's order; if false and OldIndex is InvalidOid, no sorting is
  *   performed
  * - OldIndex - see use_sort
- * - OldestXmin - computed by vacuum_set_xid_limits(), even when
+ * - OldestXmin - computed by vacuum_get_cutoffs(), even when
  *   not needed for the relation's AM
  * - *xid_cutoff - ditto
  * - *multi_cutoff - ditto
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8..896d1b1ac 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -235,6 +235,45 @@ typedef struct VacuumParams
 	int			nworkers;
 } VacuumParams;
 
+/*
+ * VacuumCutoffs is immutable state that describes the cutoffs used by VACUUM.
+ * Established at the beginning of each VACUUM operation.
+ */
+struct VacuumCutoffs
+{
+	/*
+	 * Existing pg_class fields at start of VACUUM (used for sanity checks)
+	 */
+	TransactionId relfrozenxid;
+	MultiXactId relminmxid;
+
+	/*
+	 * OldestXmin is the Xid below which tuples deleted by any xact (that
+	 * committed) should be considered DEAD, not just RECENTLY_DEAD.
+	 *
+	 * OldestMxact is the Mxid below which MultiXacts are definitely not seen
+	 * as visible by any running transaction.
+	 *
+	 * OldestXmin and OldestMxact are also the most recent values that can
+	 * ever be passed to vac_update_relstats() as frozenxid and minmulti
+	 * arguments at the end of VACUUM.  These same values should be passed
+	 * when it turns out that VACUUM will leave no unfrozen XIDs/MXIDs behind
+	 * in the table.
+	 */
+	TransactionId OldestXmin;
+	MultiXactId OldestMxact;
+
+	/*
+	 * FreezeLimit is the Xid below which all Xids are definitely replaced by
+	 * FrozenTransactionId in heap pages that VACUUM can acquire a cleanup
+	 * lock on.
+	 *
+	 * MultiXactCutoff is the value below which all MultiXactIds are
+	 * definitely removed from Xmax in heap pages that VACUUM can acquire a
+	 * cleanup lock on.
+	 */
+	TransactionId FreezeLimit;
+	MultiXactId MultiXactCutoff;
+};
+
 /*
  * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
  */
@@ -286,13 +325,9 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-								  TransactionId *OldestXmin,
-								  MultiXactId *OldestMxact,
-								  TransactionId *FreezeLimit,
-								  MultiXactId *MultiXactCutoff);
-extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
-									  MultiXactId relminmxid);
+extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+							   struct VacuumCutoffs *cutoffs);
+extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);
 extern bool vacuum_is_permitted_for_relation(Oid relid, Form_pg_class reltuple,
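
To illustrate how the new struct is meant to travel, a rough sketch of the
resulting call pattern (names like frz, nfrozen, and the tracker
initialization are placeholders here, not code taken from this patch):

	struct VacuumCutoffs cutoffs;
	TransactionId NewRelfrozenXid;
	MultiXactId NewRelminMxid;
	bool		aggressive;

	/* vacuum.c establishes every XID/MXID cutoff once, up front */
	aggressive = vacuum_get_cutoffs(rel, params, &cutoffs);

	/*
	 * Trackers for the final relfrozenxid/relminmxid values can start from
	 * OldestXmin/OldestMxact, the most recent values the struct allows
	 */
	NewRelfrozenXid = cutoffs.OldestXmin;
	NewRelminMxid = cutoffs.OldestMxact;

	/* Tuple-level freezing code now takes the whole struct by const pointer */
	if (heap_prepare_freeze_tuple(tuple->t_data, &cutoffs,
								  &frz[nfrozen], &totally_frozen,
								  &NewRelfrozenXid, &NewRelminMxid))
		nfrozen++;
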
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 42756a9e6..389f529af 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -52,6 +52,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -6121,12 +6122,10 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
-				  TransactionId relfrozenxid, TransactionId relminmxid,
-				  TransactionId cutoff_xid, MultiXactId cutoff_multi,
-				  uint16 *flags, TransactionId *mxid_oldest_xid_out)
+				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
+				  TransactionId *mxid_oldest_xid_out)
 {
-	TransactionId xid = InvalidTransactionId;
-	int			i;
+	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
 	int			nmembers;
 	bool		need_replace;
@@ -6149,12 +6148,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		*flags |= FRM_INVALIDATE_XMAX;
 		return InvalidTransactionId;
 	}
-	else if (MultiXactIdPrecedes(multi, relminmxid))
+	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
-								 multi, relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoff_multi))
+								 multi, cutoffs->relminmxid)));
+	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6167,39 +6166,39 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoff_multi)));
+									 multi, cutoffs->MultiXactCutoff)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
 			*flags |= FRM_INVALIDATE_XMAX;
-			xid = InvalidTransactionId;
+			newxmax = InvalidTransactionId;
 		}
 		else
 		{
-			/* replace multi by update xid */
-			xid = MultiXactIdGetUpdateXid(multi, t_infomask);
+			/* replace multi with single XID for its updater */
+			newxmax = MultiXactIdGetUpdateXid(multi, t_infomask);
 
 			/* wasn't only a lock, xid needs to be valid */
-			Assert(TransactionIdIsValid(xid));
+			Assert(TransactionIdIsValid(newxmax));
 
-			if (TransactionIdPrecedes(xid, relfrozenxid))
+			if (TransactionIdPrecedes(newxmax, cutoffs->relfrozenxid))
 				ereport(ERROR,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("found update xid %u from before relfrozenxid %u",
-										 xid, relfrozenxid)));
+										 newxmax, cutoffs->relfrozenxid)));
 
 			/*
-			 * If the xid is older than the cutoff, it has to have aborted,
-			 * otherwise the tuple would have gotten pruned away.
+			 * If the new xmax xid is older than OldestXmin, it has to have
+			 * aborted, otherwise the tuple would have been pruned away
 			 */
-			if (TransactionIdPrecedes(xid, cutoff_xid))
+			if (TransactionIdPrecedes(newxmax, cutoffs->OldestXmin))
 			{
-				if (TransactionIdDidCommit(xid))
+				if (TransactionIdDidCommit(newxmax))
 					ereport(ERROR,
 							(errcode(ERRCODE_DATA_CORRUPTED),
-							 errmsg_internal("cannot freeze committed update xid %u", xid)));
+							 errmsg_internal("cannot freeze committed update xid %u", newxmax)));
 				*flags |= FRM_INVALIDATE_XMAX;
-				xid = InvalidTransactionId;
+				newxmax = InvalidTransactionId;
 			}
 			else
 			{
@@ -6211,17 +6210,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
 		 * when no Xids will remain
 		 */
-		return xid;
+		return newxmax;
 	}
 
 	/*
-	 * This multixact might have or might not have members still running, but
-	 * we know it's valid and is newer than the cutoff point for multis.
-	 * However, some member(s) of it may be below the cutoff for Xids, so we
+	 * Some member(s) of this Multi may be below FreezeLimit xid cutoff, so we
 	 * need to walk the whole members array to figure out what to do, if
 	 * anything.
 	 */
-
 	nmembers =
 		GetMultiXactIdMembers(multi, &members, false,
 							  HEAP_XMAX_IS_LOCKED_ONLY(t_infomask));
@@ -6232,12 +6228,15 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		return InvalidTransactionId;
 	}
 
-	/* is there anything older than the cutoff? */
 	need_replace = false;
 	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
-	for (i = 0; i < nmembers; i++)
+	for (int i = 0; i < nmembers; i++)
 	{
-		if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+		TransactionId xid = members[i].xid;
+
+		Assert(!TransactionIdPrecedes(xid, cutoffs->relfrozenxid));
+
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
 			need_replace = true;
 			break;
@@ -6247,7 +6246,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	}
 
 	/*
-	 * In the simplest case, there is no member older than the cutoff; we can
+	 * In the simplest case, there is no member older than FreezeLimit; we can
 	 * keep the existing MultiXactId as-is, avoiding a more expensive second
 	 * pass over the multi
 	 */
@@ -6275,110 +6274,97 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	update_committed = false;
 	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
 
-	for (i = 0; i < nmembers; i++)
+	/*
+	 * Determine whether to keep each member xid, or to ignore it instead
+	 */
+	for (int i = 0; i < nmembers; i++)
 	{
-		/*
-		 * Determine whether to keep this member or ignore it.
-		 */
-		if (ISUPDATE_from_mxstatus(members[i].status))
+		TransactionId xid = members[i].xid;
+		MultiXactStatus mstatus = members[i].status;
+
+		Assert(!TransactionIdPrecedes(xid, cutoffs->relfrozenxid));
+
+		if (!ISUPDATE_from_mxstatus(mstatus))
 		{
-			TransactionId txid = members[i].xid;
-
-			Assert(TransactionIdIsValid(txid));
-			if (TransactionIdPrecedes(txid, relfrozenxid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("found update xid %u from before relfrozenxid %u",
-										 txid, relfrozenxid)));
-
 			/*
-			 * It's an update; should we keep it?  If the transaction is known
-			 * aborted or crashed then it's okay to ignore it, otherwise not.
-			 * Note that an updater older than cutoff_xid cannot possibly be
-			 * committed, because HeapTupleSatisfiesVacuum would have returned
-			 * HEAPTUPLE_DEAD and we would not be trying to freeze the tuple.
-			 *
-			 * As with all tuple visibility routines, it's critical to test
-			 * TransactionIdIsInProgress before TransactionIdDidCommit,
-			 * because of race conditions explained in detail in
-			 * heapam_visibility.c.
+			 * Locker XID (not updater XID).  We only keep lockers that are
+			 * still running.
 			 */
-			if (TransactionIdIsCurrentTransactionId(txid) ||
-				TransactionIdIsInProgress(txid))
-			{
-				Assert(!TransactionIdIsValid(update_xid));
-				update_xid = txid;
-			}
-			else if (TransactionIdDidCommit(txid))
-			{
-				/*
-				 * The transaction committed, so we can tell caller to set
-				 * HEAP_XMAX_COMMITTED.  (We can only do this because we know
-				 * the transaction is not running.)
-				 */
-				Assert(!TransactionIdIsValid(update_xid));
-				update_committed = true;
-				update_xid = txid;
-			}
-			else
-			{
-				/*
-				 * Not in progress, not committed -- must be aborted or
-				 * crashed; we can ignore it.
-				 */
-			}
-
-			/*
-			 * Since the tuple wasn't totally removed when vacuum pruned, the
-			 * update Xid cannot possibly be older than the xid cutoff. The
-			 * presence of such a tuple would cause corruption, so be paranoid
-			 * and check.
-			 */
-			if (TransactionIdIsValid(update_xid) &&
-				TransactionIdPrecedes(update_xid, cutoff_xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("found update xid %u from before xid cutoff %u",
-										 update_xid, cutoff_xid)));
-
-			/*
-			 * We determined that this is an Xid corresponding to an update
-			 * that must be retained -- add it to new members list for later.
-			 *
-			 * Also consider pushing back temp_xid_out, which is needed when
-			 * we later conclude that a new multi is required (i.e. when we go
-			 * on to set FRM_RETURN_IS_MULTI for our caller because we also
-			 * need to retain a locker that's still running).
-			 */
-			if (TransactionIdIsValid(update_xid))
+			if (TransactionIdIsCurrentTransactionId(xid) ||
+				TransactionIdIsInProgress(xid))
 			{
 				newmembers[nnewmembers++] = members[i];
-				if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-					temp_xid_out = members[i].xid;
+				has_lockers = true;
+
+				/*
+				 * Cannot possibly be older than VACUUM's OldestXmin, so we
+				 * don't need a NewRelfrozenXid step here
+				 */
+				Assert(TransactionIdPrecedesOrEquals(cutoffs->OldestXmin, xid));
 			}
+
+			continue;
+		}
+
+		/*
+		 * Updater XID (not locker XID).  Should we keep it?
+		 *
+		 * Since the tuple wasn't totally removed when vacuum pruned, the
+		 * update Xid cannot possibly be older than OldestXmin cutoff. The
+		 * presence of such a tuple would cause corruption, so be paranoid and
+		 * check.
+		 */
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("found update xid %u from before removable cutoff %u",
+									 xid, cutoffs->OldestXmin)));
+		if (TransactionIdIsValid(update_xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("multixact %u has two or more updating members",
+									 multi),
+					 errdetail_internal("First updater XID=%u second updater XID=%u.",
+										update_xid, xid)));
+
+		/*
+		 * If the transaction is known aborted or crashed then it's okay to
+		 * ignore it, otherwise not.
+		 *
+		 * As with all tuple visibility routines, it's critical to test
+		 * TransactionIdIsInProgress before TransactionIdDidCommit, because of
+		 * race conditions explained in detail in heapam_visibility.c.
+		 */
+		if (TransactionIdIsCurrentTransactionId(xid) ||
+			TransactionIdIsInProgress(xid))
+			update_xid = xid;
+		else if (TransactionIdDidCommit(xid))
+		{
+			/*
+			 * The transaction committed, so we can tell caller to set
+			 * HEAP_XMAX_COMMITTED.  (We can only do this because we know the
+			 * transaction is not running.)
+			 */
+			update_committed = true;
+			update_xid = xid;
 		}
 		else
 		{
-			/* We only keep lockers if they are still running */
-			if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
-				TransactionIdIsInProgress(members[i].xid))
-			{
-				/*
-				 * Running locker cannot possibly be older than the cutoff.
-				 *
-				 * The cutoff is <= VACUUM's OldestXmin, which is also the
-				 * initial value used for top-level relfrozenxid_out tracking
-				 * state.  A running locker cannot be older than VACUUM's
-				 * OldestXmin, either, so we don't need a temp_xid_out step.
-				 */
-				Assert(TransactionIdIsNormal(members[i].xid));
-				Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
-				Assert(!TransactionIdPrecedes(members[i].xid,
-											  *mxid_oldest_xid_out));
-				newmembers[nnewmembers++] = members[i];
-				has_lockers = true;
-			}
+			/*
+			 * Not in progress, not committed -- must be aborted or crashed;
+			 * we can ignore it.
+			 */
+			continue;
 		}
+
+		/*
+		 * We determined that this is an Xid corresponding to an update that
+		 * must be retained -- add it to new members list for later.  Also
+		 * consider pushing back mxid_oldest_xid_out.
+		 */
+		newmembers[nnewmembers++] = members[i];
+		if (TransactionIdPrecedes(xid, temp_xid_out))
+			temp_xid_out = xid;
 	}
 
 	pfree(members);
@@ -6391,7 +6377,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* nothing worth keeping!? Tell caller to remove the whole thing */
 		*flags |= FRM_INVALIDATE_XMAX;
-		xid = InvalidTransactionId;
+		newxmax = InvalidTransactionId;
 		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
@@ -6407,7 +6393,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		*flags |= FRM_RETURN_IS_XID;
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
-		xid = update_xid;
+		newxmax = update_xid;
 		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
 	}
 	else
@@ -6417,26 +6403,29 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * one, to set as new Xmax in the tuple.  The oldest surviving member
 		 * might push back mxid_oldest_xid_out.
 		 */
-		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
 		*mxid_oldest_xid_out = temp_xid_out;
 	}
 
 	pfree(newmembers);
 
-	return xid;
+	return newxmax;
 }
 
 /*
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID and cutoff MultiXactId.  If so,
+ * are older than the FreezeLimit and/or MultiXactCutoff freeze cutoffs.  If so,
  * setup enough state (in the *frz output argument) to later execute and
- * WAL-log what we would need to do, and return true.  Return false if nothing
- * is to be changed.  In addition, set *totally_frozen to true if the tuple
- * will be totally frozen after these operations are performed and false if
- * more freezing will eventually be required.
+ * WAL-log what caller needs to do for the tuple, and return true.  Return
+ * false if nothing can be changed about the tuple right now.
+ *
+ * Also sets *totally_frozen to true if the tuple will be totally frozen once
+ * caller executes returned freeze plan (or if the tuple was already totally
+ * frozen by an earlier VACUUM).  This indicates that there are no remaining
+ * XIDs or MultiXactIds that will need to be processed by a future VACUUM.
  *
  * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
  * returned true for when called.  A later heap_freeze_execute_prepared call
@@ -6454,12 +6443,6 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
  * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
  *
- * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction.  This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6467,16 +6450,17 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  */
 bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
-						  TransactionId relfrozenxid, TransactionId relminmxid,
-						  TransactionId cutoff_xid, TransactionId cutoff_multi,
+						  const struct VacuumCutoffs *cutoffs,
 						  HeapTupleFreeze *frz, bool *totally_frozen,
 						  TransactionId *relfrozenxid_out,
 						  MultiXactId *relminmxid_out)
 {
-	bool		changed = false;
-	bool		xmax_already_frozen = false;
-	bool		xmin_frozen;
-	bool		freeze_xmax;
+	bool		xmin_already_frozen = false,
+				xmax_already_frozen = false;
+	bool		freeze_xmin = false,
+				replace_xvac = false,
+				replace_xmax = false,
+				freeze_xmax = false;
 	TransactionId xid;
 
 	frz->frzflags = 0;
@@ -6485,37 +6469,29 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
 
 	/*
-	 * Process xmin.  xmin_frozen has two slightly different meanings: in the
-	 * !XidIsNormal case, it means "the xmin doesn't need any freezing" (it's
-	 * already a permanent value), while in the block below it is set true to
-	 * mean "xmin won't need freezing after what we do to it here" (false
-	 * otherwise).  In both cases we're allowed to set totally_frozen, as far
-	 * as xmin is concerned.  Both cases also don't require relfrozenxid_out
-	 * handling, since either way the tuple's xmin will be a permanent value
-	 * once we're done with it.
+	 * Process xmin, while keeping track of whether it's already frozen, or
+	 * will become frozen when our freeze plan is executed by caller (could be
+	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (!TransactionIdIsNormal(xid))
-		xmin_frozen = true;
+		xmin_already_frozen = true;
 	else
 	{
-		if (TransactionIdPrecedes(xid, relfrozenxid))
+		if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
-									 xid, relfrozenxid)));
+									 xid, cutoffs->relfrozenxid)));
 
-		xmin_frozen = TransactionIdPrecedes(xid, cutoff_xid);
-		if (xmin_frozen)
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
+		if (freeze_xmin)
 		{
 			if (!TransactionIdDidCommit(xid))
 				ereport(ERROR,
 						(errcode(ERRCODE_DATA_CORRUPTED),
 						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoff_xid)));
-
-			frz->t_infomask |= HEAP_XMIN_FROZEN;
-			changed = true;
+										 xid, cutoffs->FreezeLimit)));
 		}
 		else
 		{
@@ -6525,10 +6501,27 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		}
 	}
 
+	/*
+	 * Old-style VACUUM FULL is gone, but we have to process xvac for as long
+	 * as we support having MOVED_OFF/MOVED_IN tuples in the database
+	 */
+	xid = HeapTupleHeaderGetXvac(tuple);
+	if (TransactionIdIsNormal(xid))
+	{
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
+		Assert(TransactionIdPrecedes(xid, cutoffs->OldestXmin));
+
+		/*
+		 * For Xvac, we always freeze proactively.  This allows totally_frozen
+		 * tracking to ignore xvac.
+		 */
+		replace_xvac = true;
+	}
+
 	/*
 	 * Process xmax.  To thoroughly examine the current Xmax value we need to
 	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given cutoff for Xids.  In that case, those values might need
+	 * below the given FreezeLimit.  In that case, those values might need
 	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
 	 * it out --- if there's a live updater Xid, it needs to be kept.
 	 *
@@ -6543,13 +6536,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		uint16		flags;
 		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
 
-		newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
-									relfrozenxid, relminmxid,
-									cutoff_xid, cutoff_multi,
+		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
 									&flags, &mxid_oldest_xid_out);
 
-		freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
-
 		if (flags & FRM_RETURN_IS_XID)
 		{
 			/*
@@ -6558,8 +6547,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
 			 */
-			Assert(!freeze_xmax);
-			Assert(TransactionIdIsValid(newxmax));
+			Assert(!TransactionIdPrecedes(newxmax, cutoffs->OldestXmin));
 			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
 				*relfrozenxid_out = newxmax;
 
@@ -6574,7 +6562,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			frz->xmax = newxmax;
 			if (flags & FRM_MARK_COMMITTED)
 				frz->t_infomask |= HEAP_XMAX_COMMITTED;
-			changed = true;
+			replace_xmax = true;
 		}
 		else if (flags & FRM_RETURN_IS_MULTI)
 		{
@@ -6587,9 +6575,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * Might have to ratchet back relfrozenxid_out here, though never
 			 * relminmxid_out.
 			 */
-			Assert(!freeze_xmax);
-			Assert(MultiXactIdIsValid(newxmax));
-			Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+			Assert(!MultiXactIdPrecedes(newxmax, cutoffs->OldestMxact));
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 *relfrozenxid_out));
 			*relfrozenxid_out = mxid_oldest_xid_out;
@@ -6605,10 +6591,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			GetMultiXactIdHintBits(newxmax, &newbits, &newbits2);
 			frz->t_infomask |= newbits;
 			frz->t_infomask2 |= newbits2;
-
 			frz->xmax = newxmax;
-
-			changed = true;
+			replace_xmax = true;
 		}
 		else if (flags & FRM_NOOP)
 		{
@@ -6617,7 +6601,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
 			 * both together.
 			 */
-			Assert(!freeze_xmax);
 			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
 			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
 												 *relfrozenxid_out));
@@ -6628,23 +6611,25 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		else
 		{
 			/*
-			 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
-			 * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+			 * Freeze plan for tuple "freezes xmax" in the strictest sense:
+			 * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
 			 */
-			Assert(freeze_xmax);
+			Assert(flags & FRM_INVALIDATE_XMAX);
+			Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
 			Assert(!TransactionIdIsValid(newxmax));
+			freeze_xmax = true;
 		}
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
 		/* Raw xmax is normal XID */
-		if (TransactionIdPrecedes(xid, relfrozenxid))
+		if (TransactionIdPrecedes(xid, cutoffs->relfrozenxid))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
-									 xid, relfrozenxid)));
+									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoff_xid))
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
 			/*
 			 * If we freeze xmax, make absolutely sure that it's not an XID
@@ -6663,7 +6648,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		}
 		else
 		{
-			freeze_xmax = false;
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
 		}
@@ -6672,19 +6656,41 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	{
 		/* Raw xmax is InvalidTransactionId XID */
 		Assert((tuple->t_infomask & HEAP_XMAX_IS_MULTI) == 0);
-		freeze_xmax = false;
 		xmax_already_frozen = true;
-		/* No need for relfrozenxid_out handling for already-frozen xmax */
 	}
 	else
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
-				 errmsg_internal("found xmax %u (infomask 0x%04x) not frozen, not multi, not normal",
+				 errmsg_internal("found raw xmax %u (infomask 0x%04x) not invalid and not multi",
 								 xid, tuple->t_infomask)));
 
+	if (freeze_xmin)
+	{
+		Assert(!xmin_already_frozen);
+
+		frz->t_infomask |= HEAP_XMIN_FROZEN;
+	}
+	if (replace_xvac)
+	{
+		/*
+		 * If a MOVED_OFF tuple is not dead, the xvac transaction must have
+		 * failed; whereas a non-dead MOVED_IN tuple must mean the xvac
+		 * transaction succeeded.
+		 */
+		if (tuple->t_infomask & HEAP_MOVED_OFF)
+			frz->frzflags |= XLH_INVALID_XVAC;
+		else
+			frz->frzflags |= XLH_FREEZE_XVAC;
+	}
+	if (replace_xmax)
+	{
+		Assert(!xmax_already_frozen && !freeze_xmax);
+
+		/* Already set t_infomask/t_infomask2 flags in freeze plan */
+	}
 	if (freeze_xmax)
 	{
-		Assert(!xmax_already_frozen);
+		Assert(!xmax_already_frozen && !replace_xmax);
 
 		frz->xmax = InvalidTransactionId;
 
@@ -6697,52 +6703,20 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		frz->t_infomask |= HEAP_XMAX_INVALID;
 		frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
 		frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
-		changed = true;
 	}
 
 	/*
-	 * Old-style VACUUM FULL is gone, but we have to keep this code as long as
-	 * we support having MOVED_OFF/MOVED_IN tuples in the database.
+	 * Determine if this tuple is already totally frozen, or will become
+	 * totally frozen
 	 */
-	if (tuple->t_infomask & HEAP_MOVED)
-	{
-		xid = HeapTupleHeaderGetXvac(tuple);
-
-		/*
-		 * For Xvac, we ignore the cutoff_xid and just always perform the
-		 * freeze operation.  The oldest release in which such a value can
-		 * actually be set is PostgreSQL 8.4, because old-style VACUUM FULL
-		 * was removed in PostgreSQL 9.0.  Note that if we were to respect
-		 * cutoff_xid here, we'd need to make surely to clear totally_frozen
-		 * when we skipped freezing on that basis.
-		 *
-		 * No need for relfrozenxid_out handling, since we always freeze xvac.
-		 */
-		if (TransactionIdIsNormal(xid))
-		{
-			/*
-			 * If a MOVED_OFF tuple is not dead, the xvac transaction must
-			 * have failed; whereas a non-dead MOVED_IN tuple must mean the
-			 * xvac transaction succeeded.
-			 */
-			if (tuple->t_infomask & HEAP_MOVED_OFF)
-				frz->frzflags |= XLH_INVALID_XVAC;
-			else
-				frz->frzflags |= XLH_FREEZE_XVAC;
-
-			/*
-			 * Might as well fix the hint bits too; usually XMIN_COMMITTED
-			 * will already be set here, but there's a small chance not.
-			 */
-			Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
-			frz->t_infomask |= HEAP_XMIN_COMMITTED;
-			changed = true;
-		}
-	}
-
-	*totally_frozen = (xmin_frozen &&
+	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
-	return changed;
+
+	/* A "totally_frozen" tuple must not leave anything behind in xmax */
+	Assert(!*totally_frozen || !replace_xmax);
+
+	/* Tell caller if this tuple has a usable freeze plan set in *frz */
+	return freeze_xmin || replace_xvac || replace_xmax || freeze_xmax;
 }
 
 /*
@@ -6861,19 +6835,25 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 bool
 heap_freeze_tuple(HeapTupleHeader tuple,
 				  TransactionId relfrozenxid, TransactionId relminmxid,
-				  TransactionId cutoff_xid, TransactionId cutoff_multi)
+				  TransactionId FreezeLimit, TransactionId MultiXactCutoff)
 {
 	HeapTupleFreeze frz;
 	bool		do_freeze;
-	bool		tuple_totally_frozen;
-	TransactionId relfrozenxid_out = cutoff_xid;
-	MultiXactId relminmxid_out = cutoff_multi;
+	bool		totally_frozen;
+	struct VacuumCutoffs cutoffs;
+	TransactionId NewRelfrozenXid = FreezeLimit;
+	MultiXactId NewRelminMxid = MultiXactCutoff;
 
-	do_freeze = heap_prepare_freeze_tuple(tuple,
-										  relfrozenxid, relminmxid,
-										  cutoff_xid, cutoff_multi,
-										  &frz, &tuple_totally_frozen,
-										  &relfrozenxid_out, &relminmxid_out);
+	cutoffs.relfrozenxid = relfrozenxid;
+	cutoffs.relminmxid = relminmxid;
+	cutoffs.OldestXmin = FreezeLimit;
+	cutoffs.OldestMxact = MultiXactCutoff;
+	cutoffs.FreezeLimit = FreezeLimit;
+	cutoffs.MultiXactCutoff = MultiXactCutoff;
+
+	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
+										  &frz, &totally_frozen,
+										  &NewRelfrozenXid, &NewRelminMxid);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7308,23 +7288,24 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
  * never freeze here, which makes tracking the oldest extant XID/MXID simple.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
-						MultiXactId cutoff_multi,
+heap_tuple_would_freeze(HeapTupleHeader tuple,
+						const struct VacuumCutoffs *cutoffs,
 						TransactionId *relfrozenxid_out,
 						MultiXactId *relminmxid_out)
 {
 	TransactionId xid;
 	MultiXactId multi;
-	bool		would_freeze = false;
+	bool		freeze = false;
 
 	/* First deal with xmin */
 	xid = HeapTupleHeaderGetXmin(tuple);
 	if (TransactionIdIsNormal(xid))
 	{
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
-			would_freeze = true;
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
+			freeze = true;
 	}
 
 	/* Now deal with xmax */
@@ -7337,11 +7318,12 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 
 	if (TransactionIdIsNormal(xid))
 	{
+		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 		/* xmax is a non-permanent XID */
 		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 			*relfrozenxid_out = xid;
-		if (TransactionIdPrecedes(xid, cutoff_xid))
-			would_freeze = true;
+		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
+			freeze = true;
 	}
 	else if (!MultiXactIdIsValid(multi))
 	{
@@ -7353,7 +7335,7 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
 		/* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
-		would_freeze = true;
+		freeze = true;
 	}
 	else
 	{
@@ -7361,10 +7343,11 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		MultiXactMember *members;
 		int			nmembers;
 
+		Assert(MultiXactIdPrecedesOrEquals(cutoffs->relminmxid, multi));
 		if (MultiXactIdPrecedes(multi, *relminmxid_out))
 			*relminmxid_out = multi;
-		if (MultiXactIdPrecedes(multi, cutoff_multi))
-			would_freeze = true;
+		if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
+			freeze = true;
 
 		/* need to check whether any member of the mxact is old */
 		nmembers = GetMultiXactIdMembers(multi, &members, false,
@@ -7373,11 +7356,11 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		for (int i = 0; i < nmembers; i++)
 		{
 			xid = members[i].xid;
-			Assert(TransactionIdIsNormal(xid));
+			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
-			if (TransactionIdPrecedes(xid, cutoff_xid))
-				would_freeze = true;
+			if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
+				freeze = true;
 		}
 		if (nmembers > 0)
 			pfree(members);
@@ -7388,14 +7371,15 @@ heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
 		xid = HeapTupleHeaderGetXvac(tuple);
 		if (TransactionIdIsNormal(xid))
 		{
+			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
 				*relfrozenxid_out = xid;
 			/* heap_prepare_freeze_tuple always freezes xvac */
-			would_freeze = true;
+			freeze = true;
 		}
 	}
 
-	return would_freeze;
+	return freeze;
 }
 
 /*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7e..b234072e8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,6 +144,10 @@ typedef struct LVRelState
 	Relation   *indrels;
 	int			nindexes;
 
+	/* Buffer access strategy and parallel vacuum state */
+	BufferAccessStrategy bstrategy;
+	ParallelVacuumState *pvs;
+
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -158,21 +162,9 @@ typedef struct LVRelState
 	bool		do_index_cleanup;
 	bool		do_rel_truncate;
 
-	/* Buffer access strategy and parallel vacuum state */
-	BufferAccessStrategy bstrategy;
-	ParallelVacuumState *pvs;
-
-	/* rel's initial relfrozenxid and relminmxid */
-	TransactionId relfrozenxid;
-	MultiXactId relminmxid;
-	double		old_live_tuples;	/* previous value of pg_class.reltuples */
-
 	/* VACUUM operation's cutoffs for freezing and pruning */
-	TransactionId OldestXmin;
+	struct VacuumCutoffs cutoffs;
 	GlobalVisState *vistest;
-	/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
-	TransactionId FreezeLimit;
-	MultiXactId MultiXactCutoff;
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
@@ -314,14 +306,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				aggressive,
 				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
-	TransactionId OldestXmin,
-				FreezeLimit;
-	MultiXactId OldestMxact,
-				MultiXactCutoff;
 	BlockNumber orig_rel_pages,
 				new_rel_pages,
 				new_rel_allvisible;
@@ -353,27 +340,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
 								  RelationGetRelid(rel));
 
-	/*
-	 * Get OldestXmin cutoff, which is used to determine which deleted tuples
-	 * are considered DEAD, not just RECENTLY_DEAD.  Also get related cutoffs
-	 * used to determine which XIDs/MultiXactIds will be frozen.  If this is
-	 * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
-	 * XIDs < FreezeLimit (all MXIDs < MultiXactCutoff also need to go away).
-	 */
-	aggressive = vacuum_set_xid_limits(rel, params, &OldestXmin, &OldestMxact,
-									   &FreezeLimit, &MultiXactCutoff);
-
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		aggressive = true;
-		skipwithvm = false;
-	}
-
 	/*
 	 * Setup error traceback support for ereport() first.  The idea is to set
 	 * up an error context callback to display additional information on any
@@ -396,25 +362,12 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	errcallback.arg = vacrel;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
-	if (verbose)
-	{
-		Assert(!IsAutoVacuumWorkerProcess());
-		if (aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
 
 	/* Set up high level stuff about rel and its indexes */
 	vacrel->rel = rel;
 	vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
 					 &vacrel->indrels);
+	vacrel->bstrategy = bstrategy;
 	if (instrument && vacrel->nindexes > 0)
 	{
 		/* Copy index names used by instrumentation (not error reporting) */
@@ -435,8 +388,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
 	Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
 		   params->truncate != VACOPTVALUE_AUTO);
-	vacrel->aggressive = aggressive;
-	vacrel->skipwithvm = skipwithvm;
 	vacrel->failsafe_active = false;
 	vacrel->consider_bypass_optimization = true;
 	vacrel->do_index_vacuuming = true;
@@ -459,11 +410,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		Assert(params->index_cleanup == VACOPTVALUE_AUTO);
 	}
 
-	vacrel->bstrategy = bstrategy;
-	vacrel->relfrozenxid = rel->rd_rel->relfrozenxid;
-	vacrel->relminmxid = rel->rd_rel->relminmxid;
-	vacrel->old_live_tuples = rel->rd_rel->reltuples;
-
 	/* Initialize page counters explicitly (be tidy) */
 	vacrel->scanned_pages = 0;
 	vacrel->removed_pages = 0;
@@ -489,32 +435,53 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->missed_dead_tuples = 0;
 
 	/*
-	 * Determine the extent of the blocks that we'll scan in lazy_scan_heap,
-	 * and finalize cutoffs used for freezing and pruning in lazy_scan_prune.
+	 * Get cutoffs that determine which deleted tuples are considered DEAD,
+	 * not just RECENTLY_DEAD, and which XIDs/MXIDs to freeze.  Then determine
+	 * the extent of the blocks that we'll scan in lazy_scan_heap.  It has to
+	 * happen in this order to ensure that the OldestXmin cutoff field works
+	 * as an upper bound on the XIDs stored in the pages we'll actually scan
+	 * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
 	 *
+	 * Next acquire vistest, a related cutoff that's used in heap_page_prune.
 	 * We expect vistest will always make heap_page_prune remove any deleted
 	 * tuple whose xmax is < OldestXmin.  lazy_scan_prune must never become
 	 * confused about whether a tuple should be frozen or removed.  (In the
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
-	 *
-	 * We must determine rel_pages _after_ OldestXmin has been established.
-	 * lazy_scan_heap's physical heap scan (scan of pages < rel_pages) is
-	 * thereby guaranteed to not miss any tuples with XIDs < OldestXmin. These
-	 * XIDs must at least be considered for freezing (though not necessarily
-	 * frozen) during its scan.
 	 */
+	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
-	vacrel->OldestXmin = OldestXmin;
 	vacrel->vistest = GlobalVisTestFor(rel);
-	/* FreezeLimit controls XID freezing (always <= OldestXmin) */
-	vacrel->FreezeLimit = FreezeLimit;
-	/* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
-	vacrel->MultiXactCutoff = MultiXactCutoff;
 	/* Initialize state used to track oldest extant XID/MXID */
-	vacrel->NewRelfrozenXid = OldestXmin;
-	vacrel->NewRelminMxid = OldestMxact;
+	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
+	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
 	vacrel->skippedallvis = false;
+	skipwithvm = true;
+	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+	{
+		/*
+		 * Force aggressive mode, and disable skipping blocks using the
+		 * visibility map (even those set all-frozen)
+		 */
+		vacrel->aggressive = true;
+		skipwithvm = false;
+	}
+
+	vacrel->skipwithvm = skipwithvm;
+
+	if (verbose)
+	{
+		if (vacrel->aggressive)
+			ereport(INFO,
+					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
+							get_database_name(MyDatabaseId),
+							vacrel->relnamespace, vacrel->relname)));
+		else
+			ereport(INFO,
+					(errmsg("vacuuming \"%s.%s.%s\"",
+							get_database_name(MyDatabaseId),
+							vacrel->relnamespace, vacrel->relname)));
+	}
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -569,13 +536,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
 	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
 	 */
-	Assert(vacrel->NewRelfrozenXid == OldestXmin ||
-		   TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
-										 vacrel->relfrozenxid,
+	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
+		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
+										 vacrel->cutoffs.relfrozenxid,
 										 vacrel->NewRelfrozenXid));
-	Assert(vacrel->NewRelminMxid == OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
-									   vacrel->relminmxid,
+	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
+		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
+									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
 	if (vacrel->skippedallvis)
 	{
@@ -584,7 +551,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * chose to skip an all-visible page range.  The state that tracks new
 		 * values will have missed unfrozen XIDs from the pages we skipped.
 		 */
-		Assert(!aggressive);
+		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -669,14 +636,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 				 * implies aggressive.  Produce distinct output for the corner
 				 * case all the same, just in case.
 				 */
-				if (aggressive)
+				if (vacrel->aggressive)
 					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			}
 			else
 			{
-				if (aggressive)
+				if (vacrel->aggressive)
 					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
 				else
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
@@ -702,20 +669,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 								 _("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
 								 (long long) vacrel->missed_dead_tuples,
 								 vacrel->missed_dead_pages);
-			diff = (int32) (ReadNextTransactionId() - OldestXmin);
+			diff = (int32) (ReadNextTransactionId() -
+							vacrel->cutoffs.OldestXmin);
 			appendStringInfo(&buf,
 							 _("removable cutoff: %u, which was %d XIDs old when operation ended\n"),
-							 OldestXmin, diff);
+							 vacrel->cutoffs.OldestXmin, diff);
 			if (frozenxid_updated)
 			{
-				diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+				diff = (int32) (vacrel->NewRelfrozenXid -
+								vacrel->cutoffs.relfrozenxid);
 				appendStringInfo(&buf,
 								 _("new relfrozenxid: %u, which is %d XIDs ahead of previous value\n"),
 								 vacrel->NewRelfrozenXid, diff);
 			}
 			if (minmulti_updated)
 			{
-				diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+				diff = (int32) (vacrel->NewRelminMxid -
+								vacrel->cutoffs.relminmxid);
 				appendStringInfo(&buf,
 								 _("new relminmxid: %u, which is %d MXIDs ahead of previous value\n"),
 								 vacrel->NewRelminMxid, diff);
@@ -1610,7 +1580,7 @@ retry:
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
 	{
-		bool		tuple_totally_frozen;
+		bool		totally_frozen;
 
 		/*
 		 * Set the offset number so that we can display it along with any
@@ -1666,7 +1636,8 @@ retry:
 		 * since heap_page_prune() looked.  Handle that here by restarting.
 		 * (See comments at the top of function for a full explanation.)
 		 */
-		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
+		res = HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+									   buf);
 
 		if (unlikely(res == HEAPTUPLE_DEAD))
 			goto retry;
@@ -1723,7 +1694,8 @@ retry:
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					if (!TransactionIdPrecedes(xmin,
+											   vacrel->cutoffs.OldestXmin))
 					{
 						prunestate->all_visible = false;
 						break;
@@ -1774,13 +1746,8 @@ retry:
 		prunestate->hastup = true;	/* page makes rel truncation unsafe */
 
 		/* Tuple with storage -- consider need to freeze */
-		if (heap_prepare_freeze_tuple(tuple.t_data,
-									  vacrel->relfrozenxid,
-									  vacrel->relminmxid,
-									  vacrel->FreezeLimit,
-									  vacrel->MultiXactCutoff,
-									  &frozen[tuples_frozen],
-									  &tuple_totally_frozen,
+		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
+									  &frozen[tuples_frozen], &totally_frozen,
 									  &NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Save prepared freeze plan for later */
@@ -1791,7 +1758,7 @@ retry:
 		 * If tuple is not frozen (and not about to become frozen) then caller
 		 * had better not go on to set this page's VM bit
 		 */
-		if (!tuple_totally_frozen)
+		if (!totally_frozen)
 			prunestate->all_frozen = false;
 	}
 
@@ -1817,7 +1784,8 @@ retry:
 		vacrel->frozen_pages++;
 
 		/* Execute all freeze plans for page as a single atomic action */
-		heap_freeze_execute_prepared(vacrel->rel, buf, vacrel->FreezeLimit,
+		heap_freeze_execute_prepared(vacrel->rel, buf,
+									 vacrel->cutoffs.FreezeLimit,
 									 frozen, tuples_frozen);
 	}
 
@@ -1972,9 +1940,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_would_freeze(tupleheader,
-									vacrel->FreezeLimit,
-									vacrel->MultiXactCutoff,
+		if (heap_tuple_would_freeze(tupleheader, &vacrel->cutoffs,
 									&NewRelfrozenXid, &NewRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
@@ -2010,7 +1976,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+										 buf))
 		{
 			case HEAPTUPLE_DELETE_IN_PROGRESS:
 			case HEAPTUPLE_LIVE:
@@ -2274,6 +2241,7 @@ static bool
 lazy_vacuum_all_indexes(LVRelState *vacrel)
 {
 	bool		allindexes = true;
+	double		old_live_tuples = vacrel->rel->rd_rel->reltuples;
 
 	Assert(vacrel->nindexes > 0);
 	Assert(vacrel->do_index_vacuuming);
@@ -2297,9 +2265,9 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 			Relation	indrel = vacrel->indrels[idx];
 			IndexBulkDeleteResult *istat = vacrel->indstats[idx];
 
-			vacrel->indstats[idx] =
-				lazy_vacuum_one_index(indrel, istat, vacrel->old_live_tuples,
-									  vacrel);
+			vacrel->indstats[idx] = lazy_vacuum_one_index(indrel, istat,
+														  old_live_tuples,
+														  vacrel);
 
 			if (lazy_check_wraparound_failsafe(vacrel))
 			{
@@ -2312,7 +2280,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	else
 	{
 		/* Outsource everything to parallel variant */
-		parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, vacrel->old_live_tuples,
+		parallel_vacuum_bulkdel_all_indexes(vacrel->pvs, old_live_tuples,
 											vacrel->num_index_scans);
 
 		/*
@@ -2581,15 +2549,14 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 static bool
 lazy_check_wraparound_failsafe(LVRelState *vacrel)
 {
-	Assert(TransactionIdIsNormal(vacrel->relfrozenxid));
-	Assert(MultiXactIdIsValid(vacrel->relminmxid));
+	Assert(TransactionIdIsNormal(vacrel->cutoffs.relfrozenxid));
+	Assert(MultiXactIdIsValid(vacrel->cutoffs.relminmxid));
 
 	/* Don't warn more than once per VACUUM */
 	if (vacrel->failsafe_active)
 		return true;
 
-	if (unlikely(vacuum_xid_failsafe_check(vacrel->relfrozenxid,
-										   vacrel->relminmxid)))
+	if (unlikely(vacuum_xid_failsafe_check(&vacrel->cutoffs)))
 	{
 		vacrel->failsafe_active = true;
 
@@ -3246,7 +3213,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 		tuple.t_len = ItemIdGetLength(itemid);
 		tuple.t_tableOid = RelationGetRelid(vacrel->rel);
 
-		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+		switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
+										 buf))
 		{
 			case HEAPTUPLE_LIVE:
 				{
@@ -3265,7 +3233,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 					 * that everyone sees it as committed?
 					 */
 					xmin = HeapTupleHeaderGetXmin(tuple.t_data);
-					if (!TransactionIdPrecedes(xmin, vacrel->OldestXmin))
+					if (!TransactionIdPrecedes(xmin,
+											   vacrel->cutoffs.OldestXmin))
 					{
 						all_visible = false;
 						*all_frozen = false;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e1191a756..28514a1c5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2813,14 +2813,11 @@ ReadMultiXactCounts(uint32 *multixacts, MultiXactOffset *members)
  * As the fraction of the member space currently in use grows, we become
  * more aggressive in clamping this value.  That not only causes autovacuum
  * to ramp up, but also makes any manual vacuums the user issues more
- * aggressive.  This happens because vacuum_set_xid_limits() clamps the
- * freeze table and the minimum freeze age based on the effective
+ * aggressive.  This happens because vacuum_get_cutoffs() will clamp the
+ * freeze table and the minimum freeze age cutoffs based on the effective
  * autovacuum_multixact_freeze_max_age this function returns.  In the worst
  * case, we'll claim the freeze_max_age to zero, and every vacuum of any
- * table will try to freeze every multixact.
- *
- * It's possible that these thresholds should be user-tunable, but for now
- * we keep it simple.
+ * table will freeze every multixact.
  */
 int
 MultiXactMemberFreezeThreshold(void)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 8966b75bd..2a5fc2c28 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -826,10 +826,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	TupleDesc	oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	TupleDesc	newTupDesc PG_USED_FOR_ASSERTS_ONLY;
 	VacuumParams params;
-	TransactionId OldestXmin,
-				FreezeXid;
-	MultiXactId OldestMxact,
-				MultiXactCutoff;
+	struct VacuumCutoffs cutoffs;
 	bool		use_sort;
 	double		num_tuples = 0,
 				tups_vacuumed = 0,
@@ -918,23 +915,24 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_set_xid_limits(OldHeap, &params, &OldestXmin, &OldestMxact,
-						  &FreezeXid, &MultiXactCutoff);
+	vacuum_get_cutoffs(OldHeap, &params, &cutoffs);
 
 	/*
 	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
 	 * backwards, so take the max.
 	 */
 	if (TransactionIdIsValid(OldHeap->rd_rel->relfrozenxid) &&
-		TransactionIdPrecedes(FreezeXid, OldHeap->rd_rel->relfrozenxid))
-		FreezeXid = OldHeap->rd_rel->relfrozenxid;
+		TransactionIdPrecedes(cutoffs.FreezeLimit,
+							  OldHeap->rd_rel->relfrozenxid))
+		cutoffs.FreezeLimit = OldHeap->rd_rel->relfrozenxid;
 
 	/*
 	 * MultiXactCutoff, similarly, shouldn't go backwards either.
 	 */
 	if (MultiXactIdIsValid(OldHeap->rd_rel->relminmxid) &&
-		MultiXactIdPrecedes(MultiXactCutoff, OldHeap->rd_rel->relminmxid))
-		MultiXactCutoff = OldHeap->rd_rel->relminmxid;
+		MultiXactIdPrecedes(cutoffs.MultiXactCutoff,
+							OldHeap->rd_rel->relminmxid))
+		cutoffs.MultiXactCutoff = OldHeap->rd_rel->relminmxid;
 
 	/*
 	 * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
@@ -973,13 +971,14 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	 * values (e.g. because the AM doesn't use freezing).
 	 */
 	table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
-									OldestXmin, &FreezeXid, &MultiXactCutoff,
+									cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+									&cutoffs.MultiXactCutoff,
 									&num_tuples, &tups_vacuumed,
 									&tups_recently_dead);
 
 	/* return selected values to caller, get set as relfrozenxid/minmxid */
-	*pFreezeXid = FreezeXid;
-	*pCutoffMulti = MultiXactCutoff;
+	*pFreezeXid = cutoffs.FreezeLimit;
+	*pCutoffMulti = cutoffs.MultiXactCutoff;
 
 	/* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
 	NewHeap->rd_toastoid = InvalidOid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 293b84bbc..ba965b8c7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -907,34 +907,20 @@ get_all_vacuum_rels(int options)
 }
 
 /*
- * vacuum_set_xid_limits() -- compute OldestXmin and freeze cutoff points
+ * vacuum_get_cutoffs() -- compute OldestXmin and freeze cutoff points
  *
  * The target relation and VACUUM parameters are our inputs.
  *
- * Our output parameters are:
- * - OldestXmin is the Xid below which tuples deleted by any xact (that
- *   committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - OldestMxact is the Mxid below which MultiXacts are definitely not
- *   seen as visible by any running transaction.
- * - FreezeLimit is the Xid below which all Xids are definitely frozen or
- *   removed during aggressive vacuums.
- * - MultiXactCutoff is the value below which all MultiXactIds are definitely
- *   removed from Xmax during aggressive vacuums.
+ * Output parameters are the cutoffs that VACUUM caller should use.
  *
  * Return value indicates if vacuumlazy.c caller should make its VACUUM
  * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
  * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
  * minimum).
- *
- * OldestXmin and OldestMxact are the most recent values that can ever be
- * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
- * vacuumlazy.c caller later on.  These values should be passed when it turns
- * out that VACUUM will leave no unfrozen XIDs/MXIDs behind in the table.
  */
 bool
-vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
-					  TransactionId *OldestXmin, MultiXactId *OldestMxact,
-					  TransactionId *FreezeLimit, MultiXactId *MultiXactCutoff)
+vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+				   struct VacuumCutoffs *cutoffs)
 {
 	int			freeze_min_age,
 				multixact_freeze_min_age,
@@ -954,6 +940,10 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
 
+	/* Set pg_class fields in cutoffs */
+	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
+	cutoffs->relminmxid = rel->rd_rel->relminmxid;
+
 	/*
 	 * Acquire OldestXmin.
 	 *
@@ -965,14 +955,14 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	 * that only one vacuum process can be working on a particular table at
 	 * any time, and that each vacuum is always an independent transaction.
 	 */
-	*OldestXmin = GetOldestNonRemovableTransactionId(rel);
+	cutoffs->OldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 	if (OldSnapshotThresholdActive())
 	{
 		TransactionId limit_xmin;
 		TimestampTz limit_ts;
 
-		if (TransactionIdLimitedForOldSnapshots(*OldestXmin, rel,
+		if (TransactionIdLimitedForOldSnapshots(cutoffs->OldestXmin, rel,
 												&limit_xmin, &limit_ts))
 		{
 			/*
@@ -982,20 +972,48 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 			 * frequency), but would still be a significant improvement.
 			 */
 			SetOldSnapshotThresholdTimestamp(limit_ts, limit_xmin);
-			*OldestXmin = limit_xmin;
+			cutoffs->OldestXmin = limit_xmin;
 		}
 	}
 
-	Assert(TransactionIdIsNormal(*OldestXmin));
+	Assert(TransactionIdIsNormal(cutoffs->OldestXmin));
 
 	/* Acquire OldestMxact */
-	*OldestMxact = GetOldestMultiXactId();
-	Assert(MultiXactIdIsValid(*OldestMxact));
+	cutoffs->OldestMxact = GetOldestMultiXactId();
+	Assert(MultiXactIdIsValid(cutoffs->OldestMxact));
 
 	/* Acquire next XID/next MXID values used to apply age-based settings */
 	nextXID = ReadNextTransactionId();
 	nextMXID = ReadNextMultiXactId();
 
+	/*
+	 * Also compute the multixact age for which freezing is urgent.  This is
+	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
+	 * short of multixact member space.
+	 */
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+	/*
+	 * Almost ready to set freeze output parameters; check if OldestXmin or
+	 * OldestMxact are held back to an unsafe degree before we start on that
+	 */
+	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
+	if (!TransactionIdIsNormal(safeOldestXmin))
+		safeOldestXmin = FirstNormalTransactionId;
+	safeOldestMxact = nextMXID - effective_multixact_freeze_max_age;
+	if (safeOldestMxact < FirstMultiXactId)
+		safeOldestMxact = FirstMultiXactId;
+	if (TransactionIdPrecedes(cutoffs->OldestXmin, safeOldestXmin))
+		ereport(WARNING,
+				(errmsg("cutoff for removing and freezing tuples is far in the past"),
+				 errhint("Close open transactions soon to avoid wraparound problems.\n"
+						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+	if (MultiXactIdPrecedes(cutoffs->OldestMxact, safeOldestMxact))
+		ereport(WARNING,
+				(errmsg("cutoff for freezing multixacts is far in the past"),
+				 errhint("Close open transactions soon to avoid wraparound problems.\n"
+						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+
 	/*
 	 * Determine the minimum freeze age to use: as specified by the caller, or
 	 * vacuum_freeze_min_age, but in any case not more than half
@@ -1008,19 +1026,12 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	Assert(freeze_min_age >= 0);
 
 	/* Compute FreezeLimit, being careful to generate a normal XID */
-	*FreezeLimit = nextXID - freeze_min_age;
-	if (!TransactionIdIsNormal(*FreezeLimit))
-		*FreezeLimit = FirstNormalTransactionId;
+	cutoffs->FreezeLimit = nextXID - freeze_min_age;
+	if (!TransactionIdIsNormal(cutoffs->FreezeLimit))
+		cutoffs->FreezeLimit = FirstNormalTransactionId;
 	/* FreezeLimit must always be <= OldestXmin */
-	if (TransactionIdPrecedes(*OldestXmin, *FreezeLimit))
-		*FreezeLimit = *OldestXmin;
-
-	/*
-	 * Compute the multixact age for which freezing is urgent.  This is
-	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
-	 * short of multixact member space.
-	 */
-	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+	if (TransactionIdPrecedes(cutoffs->OldestXmin, cutoffs->FreezeLimit))
+		cutoffs->FreezeLimit = cutoffs->OldestXmin;
 
 	/*
 	 * Determine the minimum multixact freeze age to use: as specified by
@@ -1035,33 +1046,12 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
 	Assert(multixact_freeze_min_age >= 0);
 
 	/* Compute MultiXactCutoff, being careful to generate a valid value */
-	*MultiXactCutoff = nextMXID - multixact_freeze_min_age;
-	if (*MultiXactCutoff < FirstMultiXactId)
-		*MultiXactCutoff = FirstMultiXactId;
+	cutoffs->MultiXactCutoff = nextMXID - multixact_freeze_min_age;
+	if (cutoffs->MultiXactCutoff < FirstMultiXactId)
+		cutoffs->MultiXactCutoff = FirstMultiXactId;
 	/* MultiXactCutoff must always be <= OldestMxact */
-	if (MultiXactIdPrecedes(*OldestMxact, *MultiXactCutoff))
-		*MultiXactCutoff = *OldestMxact;
-
-	/*
-	 * Done setting output parameters; check if OldestXmin or OldestMxact are
-	 * held back to an unsafe degree in passing
-	 */
-	safeOldestXmin = nextXID - autovacuum_freeze_max_age;
-	if (!TransactionIdIsNormal(safeOldestXmin))
-		safeOldestXmin = FirstNormalTransactionId;
-	safeOldestMxact = nextMXID - effective_multixact_freeze_max_age;
-	if (safeOldestMxact < FirstMultiXactId)
-		safeOldestMxact = FirstMultiXactId;
-	if (TransactionIdPrecedes(*OldestXmin, safeOldestXmin))
-		ereport(WARNING,
-				(errmsg("cutoff for removing and freezing tuples is far in the past"),
-				 errhint("Close open transactions soon to avoid wraparound problems.\n"
-						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
-	if (MultiXactIdPrecedes(*OldestMxact, safeOldestMxact))
-		ereport(WARNING,
-				(errmsg("cutoff for freezing multixacts is far in the past"),
-				 errhint("Close open transactions soon to avoid wraparound problems.\n"
-						 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
+		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
@@ -1113,13 +1103,13 @@ vacuum_set_xid_limits(Relation rel, const VacuumParams *params,
  * mechanism to determine if its table's relfrozenxid and relminmxid are now
  * dangerously far in the past.
  *
- * Input parameters are the target relation's relfrozenxid and relminmxid.
- *
  * When we return true, VACUUM caller triggers the failsafe.
  */
 bool
-vacuum_xid_failsafe_check(TransactionId relfrozenxid, MultiXactId relminmxid)
+vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs)
 {
+	TransactionId relfrozenxid = cutoffs->relfrozenxid;
+	MultiXactId relminmxid = cutoffs->relminmxid;
 	TransactionId xid_skip_limit;
 	MultiXactId multi_skip_limit;
 	int			skip_index_vacuum;
-- 
2.38.1

#52 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#51)
Re: New strategies for freezing, advancing relfrozenxid early

On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:

Attached is v10, which fixes this issue, but using a different
approach to the one I sketched here.

In 0001, it's a fairly straightforward rearrangement and looks like an
improvement to me. I have a few complaints, but they are about pre-
existing code that you moved around, and I like that you didn't
editorialize too much while just moving code around. +1 from me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#53 Nikita Malakhov
hukutoc@gmail.com
In reply to: Jeff Davis (#52)
Re: New strategies for freezing, advancing relfrozenxid early

Hi!

I'll try to apply this patch onto my branch with Pluggable TOAST to test
these mechanics with the new TOAST, and will reply with the result. It
could be difficult, though, because both have a lot of changes that
affect the same code.

I'm not sure how much this would help with bloat. I suspect that it
could make a big difference with the right workload. If you always
need frequent autovacuums, just to deal with bloat, then there is
never a good time to run an aggressive antiwraparound autovacuum. An
aggressive AV will probably end up taking much longer than the typical
autovacuum that deals with bloat. While the aggressive AV will remove
as much bloat as any other AV, in theory, that might not help much. If
the aggressive AV takes as long as (say) 5 regular autovacuums would
have taken, and if you really needed those 5 separate autovacuums to
run, just to deal with the bloat, then that's a real problem. The
aggressive AV effectively causes bloat with such a workload.

On Tue, Dec 20, 2022 at 12:01 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:

Attached is v10, which fixes this issue, but using a different
approach to the one I sketched here.

In 0001, it's fairly straightforward rearrangement and looks like an
improvement to me. I have a few complaints, but they are about pre-
existing code that you moved around, and I like that you didn't
editorialize too much while just moving code around. +1 from me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

--
Regards,

--
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/

#54 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#51)
Re: New strategies for freezing, advancing relfrozenxid early

On Sun, 2022-12-18 at 14:20 -0800, Peter Geoghegan wrote:

On Thu, Dec 15, 2022 at 10:53 AM Peter Geoghegan <pg@bowt.ie> wrote:

I agree that the burden of catch-up freezing is excessive here (in
fact I already wrote something to that effect on the wiki page). The
likely solution can be simple enough.

Attached is v10, which fixes this issue, but using a different
approach to the one I sketched here.

Comments on 0002:

Can you explain the following portion of the diff:

  - else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
  + else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))

...

  + /* Can't violate the MultiXactCutoff invariant, either */
  + if (!need_replace)
  +     need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);

Regarding correctness, it seems like the basic structure and invariants
are the same, and it builds on the changes already in 9e5405993c. Patch
0002 seems *mostly* about making choices within the existing framework.
That gives me more confidence.

That being said, it does push harder against the limits on both sides.
If I understand correctly, that means pages with wider distributions of
xids are going to persist longer, which could expose pre-existing bugs
in new and interesting ways.

Next, the 'freeze_required' field suggests that it's more involved in
the control flow that causes freezing than it actually is. All it does
is communicate how the trackers need to be adjusted. The return value
of heap_prepare_freeze_tuple() (and underneath, the flags set by
FreezeMultiXactId()) are what actually control what happens. It would
be nice to make this more clear somehow.

The comment:

/*
* If we freeze xmax, make absolutely sure that it's not an XID that
* is important. (Note, a lock-only xmax can be removed independent
* of committedness, since a committed lock holder has released the
* lock).
*/

caused me to go down a rabbit hole looking for edge cases where we
might want to freeze an xmax but not an xmin; e.g. tup.xmax <
OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin <
tup.xmin. I didn't find a problem, so that's good news.

I also tried some pgbench activity along with concurrent vacuums (and
vacuum freezes) along with periodic verify_heapam(). No problems there.
 
Did you already describe the testing you've done for 0001+0002
specifically? It's not radically new logic, but it would be good to try
to catch minor state-handling errors.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In reply to: Jeff Davis (#54)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis <pgsql@j-davis.com> wrote:

Comments on 0002:

Can you explain the following portion of the diff:

- else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
+ else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))

...

+ /* Can't violate the MultiXactCutoff invariant, either */
+ if (!need_replace)
+     need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);

Don't forget the historic context: before Postgres 15's commit
0b018fab, VACUUM's final relfrozenxid always came from FreezeLimit.
Almost all of this code predates that work. So the general idea that
you can make a "should I freeze or should I ratchet back my
relfrozenxid tracker instead?" trade-off at the level of individual
tuples and pages is still a very new one. Right now it's only applied
within lazy_scan_noprune(), but 0002 leverages the same principles
here.

Before now, these heapam.c freezing routines had cutoffs called
cutoff_xid and cutoff_multi. These had values that actually came from
vacuumlazy.c's FreezeLimit and MultiXactCutoff cutoffs (which was
rather unclear). But cutoff_xid and cutoff_multi were *also* used as
inexact proxies for OldestXmin and OldestMxact (also kind of unclear,
but true). For example, there are some sanity checks in heapam.c that
kind of pretend that cutoff_xid is OldestXmin, even though it usually
isn't the same value (it can be, but only during VACUUM FREEZE, or
when the min freeze age is 0 in some other way).
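
For reference, the cutoffs that now travel together in one struct look
roughly like this (just a sketch of its contents; the one-line comments
are my glosses, not the tree's actual comments):

    struct VacuumCutoffs
    {
        /* target rel's pg_class.relfrozenxid/relminmxid when VACUUM began */
        TransactionId relfrozenxid;
        MultiXactId relminmxid;

        /* XIDs/MXIDs < these are not needed by anybody (removal cutoffs) */
        TransactionId OldestXmin;
        MultiXactId OldestMxact;

        /* XIDs/MXIDs < these must be frozen (always <= the cutoffs above) */
        TransactionId FreezeLimit;
        MultiXactId MultiXactCutoff;
    };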

So 0002 teaches the same heapam.c code about everything -- about all
of the different cutoffs, and about the true requirements of VACUUM
around relfrozenxid advancement. In fact, 0002 makes vacuumlazy.c cede
a lot of control of "XID stuff" to the same heapam.c code, freeing it
up to think about freezing as something that works at the level of
physical pages. This is key to allowing vacuumlazy.c to reason about
freezing at the level of the whole table. It thinks about physical
blocks, leaving logical XIDs up to heapam.c code.

This business that you asked about in FreezeMultiXactId() is needed so
that we can allow vacuumlazy.c to "think in terms of physical pages",
while at the same time avoiding allocating new Multis in VACUUM --
which requires "thinking about individual xmax fields" instead -- a
somewhat conflicting goal. We're really trying to have it both ways
(we get page-level freezing, with a little tuple level freezing on the
side, sufficient to avoid allocating new Multis during VACUUMs in
roughly the same way as we do right now).

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact. It doesn't quite work that way in certain rare
cases involving MultiXacts, though. It is convenient to define "freeze
the page" in a way that gives heapam.c's FreezeMultiXactId() the
leeway to put off the work of processing an individual tuple's xmax,
whenever it happens to be a MultiXactId that would require an
expensive second pass to process aggressively (allocating a new Multi
during VACUUM is especially worth avoiding here).

Our definition of "freeze the page" is a bit creative, at least if
you're used to thinking about it in terms of strict XID-wise cutoffs
like OldestXmin/FreezeLimit. But even if you do think of it in terms
of XIDs, the difference is extremely small in practice.

FreezeMultiXactId() effectively makes a decision on how to proceed
with processing at the level of each individual xmax field. Its no-op
multi processing "freezes" an xmax in the event of a costly-to-process
xmax on a page when (for whatever reason) page-level freezing is
triggered. If, on the other hand, page-level freezing isn't triggered
for the page, then page-level no-op processing takes care of the multi
for us instead. Either way, the remaining Multi will ratchet back
VACUUM's relfrozenxid and/or relminmxid trackers as required, and we
won't need an expensive second pass over the multi (unless we really
have no choice, for example during a VACUUM FREEZE, where
OldestXmin==FreezeLimit).

Regarding correctness, it seems like the basic structure and invariants
are the same, and it builds on the changes already in 9e5405993c. Patch
0002 seems *mostly* about making choices within the existing framework.
That gives me more confidence.

You're right that it's the same basic invariants as before, of course.
Turns out that those invariants can be pushed quite far.

Though note that I kind of invented a new invariant (not really, sort
of). Well, it's a postcondition, which is a sort of invariant: any
scanned heap page that can be cleanup locked must never have any
remaining XIDs < FreezeLimit, nor can any MXIDs < MultiXactCutoff
remain. But a cleanup-locked page does *not* need to get rid of all
XIDs < OldestXmin, nor MXIDs < OldestMxact.
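
Spelled as assertions over a scanned page that we could cleanup lock,
the postcondition is roughly this (illustrative only -- the page_has_*
helpers are hypothetical, not functions from the patch):

    /* Must hold once lazy_scan_prune is done with the page */
    Assert(!page_has_xid_older_than(page, cutoffs->FreezeLimit));
    Assert(!page_has_mxid_older_than(page, cutoffs->MultiXactCutoff));

    /*
     * ...but XIDs in [FreezeLimit, OldestXmin) and MXIDs in
     * [MultiXactCutoff, OldestMxact) may legitimately remain unfrozen
     */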

This flexibility is mostly useful because it allows lazy_scan_prune to
just decide to not freeze. But, to a much lesser degree, it's useful
because of the edge case with multis -- in general we might just need
the same leeway when lazy_scan_prune "freezes the page".

That being said, it does push harder against the limits on both sides.
If I understand correctly, that means pages with wider distributions of
xids are going to persist longer, which could expose pre-existing bugs
in new and interesting ways.

I don't think it's fundamentally different to what we're already doing
in lazy_scan_noprune. It's just more complicated, because you have to
tease apart slightly different definitions of freezing to understand
code around FreezeMultiXactId(). This is more or less needed to
provide maximum flexibility, where we delay decisions about what to do
until the very last moment.

Next, the 'freeze_required' field suggests that it's more involved in
the control flow that causes freezing than it actually is. All it does
is communicate how the trackers need to be adjusted. The return value
of heap_prepare_freeze_tuple() (and underneath, the flags set by
FreezeMultiXactId()) are what actually control what happens. It would
be nice to make this more clear somehow.

I'm not sure what you mean. Page-level freezing *doesn't* have to go
ahead when freeze_required is not ever set to true for any tuple on
the page (which is most of the time, in practice). lazy_scan_prune
gets to make a choice about freezing the page, when the choice is
available.

Note also that the FRM_NOOP case happens when a call to
FreezeMultiXactId() takes place that won't leave behind a freeze plan
for the tuple (unless its xmin happens to necessitate a freeze plan
for the same tuple). And yet, it will do useful work, needed iff the
"freeze the page" path is ultimately taken by lazy_scan_prune --
FreezeMultiXactId() itself will ratchet back
FreezePageRelfrozenXid/NewRelfrozenXid as needed to make everything
safe.

The comment:

/*
* If we freeze xmax, make absolutely sure that it's not an XID that
* is important. (Note, a lock-only xmax can be removed independent
* of committedness, since a committed lock holder has released the
* lock).
*/

caused me to go down a rabbit hole looking for edge cases where we
might want to freeze an xmax but not an xmin; e.g. tup.xmax <
OldestXmin < tup.xmin or the related case where tup.xmax < RecentXmin <
tup.xmin. I didn't find a problem, so that's good news.

This is an example of what I meant about the heapam.c code using a
cutoff that actually comes from FreezeLimit, when it would be more
sensible to use OldestXmin instead.

I also tried some pgbench activity along with concurrent vacuums (and
vacuum freezes) along with periodic verify_heapam(). No problems there.

Did you already describe the testing you've done for 0001+0002
specifically? It's not radically new logic, but it would be good to try
to catch minor state-handling errors.

Lots of stuff with contrib/amcheck, which, as you must already know,
will notice when an XID/MXID is contained in a table whose
relfrozenxid and/or relminmxid indicates that it shouldn't be there.
(Though VACUUM itself does the same thing, albeit not as effectively.)

Obviously the invariants haven't changed here. In many ways it's a
very small set of changes. But in one or two ways it's a significant
shift. It depends on how you think about it.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#55)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Dec 20, 2022 at 7:15 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Dec 20, 2022 at 5:44 PM Jeff Davis <pgsql@j-davis.com> wrote:

Next, the 'freeze_required' field suggests that it's more involved in
the control flow that causes freezing than it actually is. All it does
is communicate how the trackers need to be adjusted. The return value
of heap_prepare_freeze_tuple() (and underneath, the flags set by
FreezeMultiXactId()) are what actually control what happens. It would
be nice to make this more clear somehow.

I'm not sure what you mean. Page-level freezing *doesn't* have to go
ahead when freeze_required is not ever set to true for any tuple on
the page (which is most of the time, in practice). lazy_scan_prune
gets to make a choice about freezing the page, when the choice is
available.

Oh wait, I think I see the point of confusion now.

When freeze_required is set to true, that means that lazy_scan_prune
literally has no choice -- it simply must freeze the page as
instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not
just a strong suggestion -- it's crucial that lazy_scan_prune freezes
the page as instructed.

The "no freeze" trackers (HeapPageFreeze.NoFreezePageRelfrozenXid and
HeapPageFreeze.NoFreezePageRelminMxid) won't have been maintained
properly when freeze_required was set, so lazy_scan_prune can't expect
to use them -- doing so would lead to VACUUM setting incorrect values
in pg_class later on.
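
Roughly speaking, the caller-side contract is this (a paraphrased
sketch, not lazy_scan_prune's actual code -- "pagefrz" is just a name
for the local HeapPageFreeze state, "decided_to_freeze_page" stands in
for lazy_scan_prune's own choice to freeze, and I'm assuming a
FreezePageRelminMxid counterpart to FreezePageRelfrozenXid):

    if (pagefrz.freeze_required || decided_to_freeze_page)
    {
        /* Execute all of the page's prepared freeze plans */
        heap_freeze_execute_prepared(vacrel->rel, buf,
                                     vacrel->cutoffs.FreezeLimit,
                                     frozen, tuples_frozen);
        NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
        NewRelminMxid = pagefrz.FreezePageRelminMxid;
    }
    else
    {
        /* Page left unfrozen -- use the "no freeze" trackers instead */
        NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
        NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
    }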

Avoiding the work of maintaining those "no freeze" trackers isn't just
a nice-to-have microoptimization -- it is sometimes very important. We
kind of rely on this to be able to avoid getting too many MultiXact
member SLRU buffer misses inside FreezeMultiXactId. There is a comment
above FreezeMultiXactId that advises its caller that it had better not
call heap_tuple_should_freeze when freeze_required is set to true,
because that could easily lead to multixact member SLRU buffer misses
-- misses that FreezeMultiXactId set out to avoid itself.

It could actually be cheaper to freeze than to not freeze, in the case
of a Multi -- member space misses can sometimes be really expensive.
And so FreezeMultiXactId sometimes freezes a Multi even though it's
not strictly required to do so.

Note also that this isn't a new behavior -- it's actually an old one,
for the most part. It kinda doesn't look that way, because we haven't
passed down separate FreezeLimit/OldestXmin cutoffs (and separate
OldestMxact/MultiXactCutoff cutoffs) until now. But we often don't
need that granular information to be able to process Multis before the
multi value is < MultiXactCutoff.

If you look at how FreezeMultiXactId works, in detail, you'll see that
even on Postgres HEAD it can (say) set a tuple's xmax to
InvalidTransactionId long before the multi value is < MultiXactCutoff.
It just needs to detect that the multi is not still running, and
notice that it's HEAP_XMAX_IS_LOCKED_ONLY(). Stuff like that happens
quite a bit. So for the most part "eager processing of Multis as a
special case" is an old behavior, that has only been enhanced a little
bit (the really important, new change in FreezeMultiXactId is how the
FRM_NOOP case works with FreezeLimit, even though OldestXmin is used
nearby -- this is extra confusing because 0002 doesn't change how we
use FreezeLimit -- it actually changes every other use of FreezeLimit
nearby, making it OldestXmin).
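
The HEAP_XMAX_IS_LOCKED_ONLY() case mentioned above amounts to
something like this (a loose sketch of the idea only, not
FreezeMultiXactId's actual control flow):

    /* xmax is a lock-only multi whose lockers have all finished */
    if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask) &&
        !MultiXactIdIsRunning(multi, true))
        *flags |= FRM_INVALIDATE_XMAX;  /* xmax can simply be cleared */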

--
Peter Geoghegan

#57 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#56)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 2022-12-20 at 21:26 -0800, Peter Geoghegan wrote:

When freeze_required is set to true, that means that lazy_scan_prune
literally has no choice -- it simply must freeze the page as
instructed by heap_prepare_freeze_tuple/FreezeMultiXactId. It's not
just a strong suggestion -- it's crucial that lazy_scan_prune freezes
the page as instructed.

The confusing thing to me is perhaps just the name -- to me,
"freeze_required" suggests that if it were set to true, it would cause
freezing to happen. But as far as I can tell, it does not cause
freezing to happen, it causes some other things to happen that are
necessary when freezing happens (updating and using the right
trackers).

A minor point, no need to take action here. Perhaps rename the
variable.

I think 0001+0002 are about ready.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In reply to: Jeff Davis (#57)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Dec 21, 2022 at 4:30 PM Jeff Davis <pgsql@j-davis.com> wrote:

The confusing thing to me is perhaps just the name -- to me,
"freeze_required" suggests that if it were set to true, it would cause
freezing to happen. But as far as I can tell, it does not cause
freezing to happen, it causes some other things to happen that are
necessary when freezing happens (updating and using the right
trackers).

freeze_required is about what's required, which tells us nothing about
what will happen when it's not required (could go either way,
depending on how lazy_scan_prune feels about it).

Setting freeze_required=true implies that heap_prepare_freeze_tuple
has stopped doing maintenance of the "no freeze" trackers. When it
sets freeze_required=true, it really *does* force freezing to happen,
in every practical sense. This happens because lazy_scan_prune does
what it's told to do when it's told that freezing is required. Because
of course it does, why wouldn't it?

So...I still don't get what you mean. Why would lazy_scan_prune ever
break its contract with heap_prepare_freeze_tuple? And in what sense
would you say that heap_prepare_freeze_tuple's setting
freeze_required=true doesn't quite amount to "forcing freezing"? Are
you worried about the possibility that lazy_scan_prune will decide to
rebel at some point, and fail to honor its contract with
heap_prepare_freeze_tuple? :-)
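
For what it's worth, the way lazy_scan_prune honors the contract in the
patch looks like this (simplified a bit):

    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen && prune_fpi))
    {
        /* Freeze the page, and use the "freeze" trackers for pg_class */
        vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
        vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
    }
    else
    {
        /* Opt not to freeze; fall back on the "no freeze" trackers */
        vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
        vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;

        /* Discard the prepared freeze plans for this page */
        tuples_frozen = 0;
    }

When freeze_required is set, the first branch is always taken, so the
page really does get frozen as instructed.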

A minor point, no need to take action here. Perhaps rename the
variable.

Andres was the one that suggested this name, actually. I initially
just called it "freeze", but I think that Andres had it right.

I think 0001+0002 are about ready.

Great. I plan on committing 0001 in the next few days. Committing 0002
might take a bit longer.

Thanks
--
Peter Geoghegan

In reply to: Peter Geoghegan (#58)
4 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Dec 21, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

Great. I plan on committing 0001 in the next few days. Committing 0002
might take a bit longer.

I pushed the VACUUM cutoffs patch (previously 0001) this morning -
thanks for your help with that one.

Attached is v11, which is mostly just intended to fix the bitrot
caused by today's commits, though I did adjust some of the commit
messages a bit. There is also one minor functional change in v11: we
now always use the eager freezing strategy in unlogged and temp
tables, since it's virtually guaranteed to be a win there.

With an unlogged or temp table, most of the cost of freezing is just
the cycles spent preparing to freeze, since there isn't any WAL
overhead to worry about (WAL being the dominant concern with freezing
costs in general). Deciding *not* to freeze pages from unlogged/temp
tables that we can freeze and mark all-frozen in the VM amounts to
wasting the cycles already spent preparing freeze plans. Why not just
do the tiny amount of additional work needed to execute those freeze
plans at that point?
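
The strategy decision itself is made up front, in lazy_scan_strategy.
Just to illustrate the shape of it (a hypothetical sketch, not the
exact code from the patch -- the eager_freeze field name and rel_pages
variable are stand-ins), the unlogged/temp special case amounts to
something like:

    /*
     * Hypothetical sketch: pick the freezing strategy once per VACUUM.
     * Unlogged/temp tables always freeze eagerly, since there's no WAL
     * overhead to speak of; permanent tables compare their size in heap
     * blocks against the vacuum_freeze_strategy_threshold-based cutoff.
     */
    if (!RelationIsPermanent(vacrel->rel))
        vacrel->eager_freeze = true;
    else
        vacrel->eager_freeze =
            (rel_pages >= vacrel->cutoffs.freeze_strategy_threshold);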

It's not like the eager freezing strategy comes with an added risk
that VACUUM will allocate new multis that it wouldn't otherwise have
to allocate. Nor does it change cleanup-lock-wait behavior. Clearly
this optimization isn't equivalent to interpreting
vacuum_freeze_min_age as 0 in unlogged/temp tables. The whole design
of freezing strategies is supposed to abstract away details like that,
freeing up high level code like lazy_scan_strategy to think about
freezing at the level of the whole table -- the cost model benefits a
great deal from being able to measure debt at the table level, in
units like total all-frozen pages, rel_pages, etc.

--
Peter Geoghegan

Attachments:

v11-0001-Add-page-level-freezing-to-VACUUM.patch (application/x-patch)
From ec184e3b5e44c93b0938e8f8b27d73642c8fb479 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v11 1/4] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).  Making the choice to freeze work at the page level tends to
result in VACUUM writing less WAL in the long term.  This is especially
likely to work due to complementary effects with the freeze plan WAL
deduplication optimization added by commit 9e540599.

Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI as torn page protection.  We'll have
already written a large amount of WAL just to do that much, so it's very
likely a good idea to get freezing out of the way for the page early.
This only happens in cases where it will directly lead to marking the
page all-frozen in the visibility map.

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact.  It doesn't quite work that way in certain rare
cases involving MultiXacts, though.  It is convenient to define "freeze
the page" in a way that gives FreezeMultiXactId the leeway to put off
the work of processing an individual tuple's xmax whenever it happens to
be a MultiXactId that would require an expensive second pass to process
aggressively (allocating a new Multi is especially worth avoiding here).

FreezeMultiXactId effectively makes a decision on how to proceed with
processing at the level of each individual xmax field.  Its no-op multi
processing "freezes" an xmax in the event of an expensive-to-process
xmax on a page when (for whatever reason) page-level freezing triggers.
If, on the other hand, freezing is not triggered for the page, then
page-level no-op processing takes care of the multi for us instead.
Either way, the remaining Multi will ratchet back VACUUM's relfrozenxid
and/or relminmxid trackers as required, and we won't need an expensive
second pass over the multi (unless we really have no choice, for example
during a VACUUM FREEZE, where FreezeLimit always matches OldestXmin).

This allows vacuumlazy.c to think of freezing as something that happens
at the page level, or not at all -- without concerning itself with any
of these details.  It largely cedes control of decisions about freezing
and relfrozenxid/relminmxid to the heapam.c freezing routines (routines
like heap_prepare_freeze_tuple and FreezeMultiXactId), which now have
all of the context needed to make decisions about freezing and how it
may affect relfrozenxid and relminmxid advancement.  vacuumlazy.c is now
free to focus on the big picture around freezing physical heap pages.

Later work will add eager freezing strategy to VACUUM (and recast the
behavior established in this commit as lazy freezing, though it isn't
quite as lazy as the historic tuple-oriented approach to freezing).
Making freezing work at the page level is not just an optimization; it's
also a useful basis for modelling costs at the whole table level, since
it makes the visibility map a more reliable indicator of just how far
behind we are on freezing at the level of the whole table.  Later work
that adds explicit eager and lazy scanning strategies will build on this
in order to teach VACUUM to advance relfrozenxid earlier and much more
frequently than before.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/heapam.h          |  82 +++++-
 src/backend/access/heap/heapam.c     | 388 +++++++++++++++------------
 src/backend/access/heap/pruneheap.c  |  16 +-
 src/backend/access/heap/vacuumlazy.c | 132 ++++++---
 doc/src/sgml/config.sgml             |  11 +-
 5 files changed, 397 insertions(+), 232 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 53eb01176..0782fed14 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -113,6 +113,71 @@ typedef struct HeapTupleFreeze
 	OffsetNumber offset;
 } HeapTupleFreeze;
 
+/*
+ * State used by VACUUM to track the details of freezing all eligible tuples
+ * on a given heap page.
+ *
+ * VACUUM prepares freeze plans for each page via heap_prepare_freeze_tuple
+ * calls (every tuple with storage gets its own call).  This page-level freeze
+ * state is updated across each call, which ultimately determines whether or
+ * not freezing the page is required. (VACUUM freezes the page via a call to
+ * heap_freeze_execute_prepared, which freezes using prepared freeze plans.)
+ *
+ * Aside from the basic question of whether or not freezing will go ahead, the
+ * state also tracks the oldest extant XID/MXID in the table as a whole, for
+ * the purposes of advancing relfrozenxid/relminmxid values in pg_class later
+ * on.  Each heap_prepare_freeze_tuple call pushes NewRelfrozenXid and/or
+ * NewRelminMxid back as required to avoid unsafe final pg_class values.  Any
+ * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
+ * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
+ * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ *
+ * When 'freeze_required' flag isn't set after all tuples are examined, the
+ * final choice on freezing is made by vacuumlazy.c.  It can decide to trigger
+ * freezing based on whatever criteria it deems appropriate.  However, it is
+ * highly recommended that vacuumlazy.c avoid freezing any page that cannot be
+ * marked all-frozen in the visibility map afterwards.
+ *
+ * Freezing is typically optional for most individual pages scanned during any
+ * given VACUUM operation.  This allows vacuumlazy.c to manage the cost of
+ * freezing at the level of the entire VACUUM operation/entire heap relation.
+ */
+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze_required;
+
+	/*
+	 * "No freeze" NewRelfrozenXid/NewRelminMxid trackers.
+	 *
+	 * These trackers are maintained in the same way as the trackers used when
+	 * VACUUM scans a page that isn't cleanup locked.  Both code paths are
+	 * based on the same general idea (do less work for this page during the
+	 * ongoing VACUUM, at the cost of having to accept older final values).
+	 */
+	TransactionId NoFreezePageRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid;
+
+	/*
+	 * Trackers used when heap_freeze_execute_prepared freezes the page.
+	 *
+	 * When we freeze a page, we generally freeze all XIDs < OldestXmin, only
+	 * leaving behind XIDs that are ineligible for freezing, if any.  And so
+	 * you might wonder why these trackers are necessary at all; why should
+	 * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
+	 * ratchet back the rel-level NewRelfrozenXid/NewRelminMxid trackers?
+	 *
+	 * It is useful to use a definition of "freeze the page" that does not
+	 * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
+	 * generally prefers to remove Multis eagerly, but lazy processing is used
+	 * in cases where laziness allows VACUUM to avoid allocating a new Multi.
+	 * The "freeze the page" trackers enable this flexibility.
+	 */
+	TransactionId FreezePageRelfrozenXid;
+	MultiXactId FreezePageRelminMxid;
+
+} HeapPageFreeze;
+
 /* ----------------
  *		function prototypes for heap access method
  *
@@ -180,19 +245,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  const struct VacuumCutoffs *cutoffs,
-									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  HeapPageFreeze *pagefrz,
+									  HeapTupleFreeze *frz, bool *totally_frozen);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-										 TransactionId FreezeLimit,
+										 TransactionId snapshotConflictHorizon,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
-									const struct VacuumCutoffs *cutoffs,
-									TransactionId *relfrozenxid_out,
-									MultiXactId *relminmxid_out);
+extern bool heap_tuple_should_freeze(HeapTupleHeader tuple,
+									 const struct VacuumCutoffs *cutoffs,
+									 TransactionId *NoFreezePageRelfrozenXid,
+									 MultiXactId *NoFreezePageRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
 
 extern void simple_heap_insert(Relation relation, HeapTuple tup);
@@ -210,7 +274,7 @@ extern int	heap_page_prune(Relation relation, Buffer buffer,
 							struct GlobalVisState *vistest,
 							TransactionId old_snap_xmin,
 							TimestampTz old_snap_ts,
-							int *nnewlpdead,
+							int *nnewlpdead, bool *prune_fpi,
 							OffsetNumber *off_loc);
 extern void heap_page_prune_execute(Buffer buffer,
 									OffsetNumber *redirected, int nredirected,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a88de85..dae3f26ce 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6098,9 +6098,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		MultiXactId.
  *
  * "flags" is an output value; it's used to tell caller what to do on return.
- *
- * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
- * extant Xid within any Multixact that will remain after freezing executes.
+ * "pagefrz" is an input/output value, used to manage page level freezing.
  *
  * Possible values that we can set in "flags":
  * FRM_NOOP
@@ -6115,16 +6113,34 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
  *
- * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
- * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ * Caller delegates control of page freezing to us.  In practice we always
+ * force freezing of caller's page unless FRM_NOOP processing is indicated.
+ * We help caller ensure that XIDs < FreezeLimit and MXIDs < MultiXactCutoff
+ * can never be left behind.  We freely choose when and how to process each
+ * Multi, without ever violating the cutoff postconditions for freezing.
  *
- * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ * It's useful to remove Multis on a proactive timeline (relative to freezing
+ * XIDs) to keep MultiXact member SLRU buffer misses to a minimum.  It can also
+ * be cheaper in the short run, for us, since we too can avoid SLRU buffer
+ * misses through eager processing.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set, though only
+ * when FreezeLimit and/or MultiXactCutoff cutoffs leave us with no choice.
+ * This can usually be put off, which is usually enough to avoid it altogether.
+ *
+ * NB: Caller must maintain "no freeze" NewRelfrozenXid/NewRelminMxid trackers
+ * using heap_tuple_should_freeze when we haven't forced page-level freezing.
+ *
+ * NB: Caller should avoid needlessly calling heap_tuple_should_freeze when we
+ * have already forced page-level freezing, since that might incur the same
+ * SLRU buffer misses that we specifically intended to avoid by freezing.
  */
 static TransactionId
-FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
+FreezeMultiXactId(MultiXactId multi, HeapTupleHeader tuple,
 				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
-				  TransactionId *mxid_oldest_xid_out)
+				  HeapPageFreeze *pagefrz)
 {
+	uint16		t_infomask = tuple->t_infomask;
 	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
 	int			nmembers;
@@ -6134,7 +6150,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	bool		has_lockers;
 	TransactionId update_xid;
 	bool		update_committed;
-	TransactionId temp_xid_out;
+	TransactionId FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;
+	TransactionId axid PG_USED_FOR_ASSERTS_ONLY = cutoffs->OldestXmin;
+	MultiXactId amxid PG_USED_FOR_ASSERTS_ONLY = cutoffs->OldestMxact;
 
 	*flags = 0;
 
@@ -6146,14 +6164,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Ensure infomask bits are appropriately set/reset */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
 		ereport(ERROR,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, cutoffs->relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
+	else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6166,7 +6186,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoffs->MultiXactCutoff)));
+									 multi, cutoffs->OldestMxact)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
@@ -6202,14 +6222,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			}
 			else
 			{
+				if (TransactionIdPrecedes(newxmax, FreezePageRelfrozenXid))
+					FreezePageRelfrozenXid = newxmax;
 				*flags |= FRM_RETURN_IS_XID;
 			}
 		}
 
-		/*
-		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
-		 * when no Xids will remain
-		 */
+		pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+		pagefrz->freeze_required = true;
 		return newxmax;
 	}
 
@@ -6225,11 +6245,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Nothing worth keeping */
 		*flags |= FRM_INVALIDATE_XMAX;
-		return InvalidTransactionId;
+		pagefrz->freeze_required = true;
+		Assert(!TransactionIdIsValid(newxmax));
+		return newxmax;
 	}
 
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;	/* for FRM_NOOP */
 	for (int i = 0; i < nmembers; i++)
 	{
 		TransactionId xid = members[i].xid;
@@ -6238,26 +6260,35 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
+			/* Can't violate the FreezeLimit postcondition */
 			need_replace = true;
 			break;
 		}
-		if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-			temp_xid_out = members[i].xid;
+		if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
+			FreezePageRelfrozenXid = xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than FreezeLimit; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* Can't violate the MultiXactCutoff postcondition, either */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
+
 	if (!need_replace)
 	{
 		/*
-		 * When mxid_oldest_xid_out gets pushed back here it's likely that the
-		 * update Xid was the oldest member, but we don't rely on that
+		 * FRM_NOOP case is the only one where we don't force page-level
+		 * freezing (see header comments)
 		 */
 		*flags |= FRM_NOOP;
-		*mxid_oldest_xid_out = temp_xid_out;
+
+		/*
+		 * Might have to ratchet back NewRelminMxid, NewRelfrozenXid, or both
+		 * together to make it safe to skip this particular multi/tuple xmax
+		 * if the page is frozen (similar handling will also be required if
+		 * the page isn't frozen, but caller deals with that directly).
+		 */
+		pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+		if (MultiXactIdPrecedes(multi, pagefrz->FreezePageRelminMxid))
+			pagefrz->FreezePageRelminMxid = multi;
 		pfree(members);
 		return multi;
 	}
@@ -6266,13 +6297,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_should_freeze will indicate that the tuple should be frozen.
 	 */
+	Assert(heap_tuple_should_freeze(tuple, cutoffs, &axid, &amxid));
+
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
 	has_lockers = false;
 	update_xid = InvalidTransactionId;
 	update_committed = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
+	FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;	/* re-init */
 
 	/*
 	 * Determine whether to keep each member xid, or to ignore it instead
@@ -6360,11 +6396,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		/*
 		 * We determined that this is an Xid corresponding to an update that
 		 * must be retained -- add it to new members list for later.  Also
-		 * consider pushing back mxid_oldest_xid_out.
+		 * consider pushing back NewRelfrozenXid tracker.
 		 */
 		newmembers[nnewmembers++] = members[i];
-		if (TransactionIdPrecedes(xid, temp_xid_out))
-			temp_xid_out = xid;
+		if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
+			FreezePageRelfrozenXid = xid;
 	}
 
 	pfree(members);
@@ -6375,10 +6411,14 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 */
 	if (nnewmembers == 0)
 	{
-		/* nothing worth keeping!? Tell caller to remove the whole thing */
+		/*
+		 * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.  Won't
+		 * have to ratchet back NewRelfrozenXid or NewRelminMxid.
+		 */
 		*flags |= FRM_INVALIDATE_XMAX;
 		newxmax = InvalidTransactionId;
-		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
+
+		Assert(pagefrz->FreezePageRelfrozenXid == FreezePageRelfrozenXid);
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
 	{
@@ -6394,22 +6434,29 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
 		newxmax = update_xid;
-		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
+
+		/* Might have to push back FreezePageRelfrozenXid/NewRelfrozenXid */
+		Assert(TransactionIdPrecedesOrEquals(FreezePageRelfrozenXid,
+											 update_xid));
 	}
 	else
 	{
 		/*
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.  The oldest surviving member
-		 * might push back mxid_oldest_xid_out.
+		 * might have already pushed back NewRelfrozenXid.
 		 */
 		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
-		*mxid_oldest_xid_out = temp_xid_out;
+
+		/* Never need to push back FreezePageRelminMxid/NewRelminMxid */
+		Assert(MultiXactIdPrecedesOrEquals(cutoffs->OldestMxact, newxmax));
 	}
 
 	pfree(newmembers);
 
+	pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+	pagefrz->freeze_required = true;
 	return newxmax;
 }
 
@@ -6417,9 +6464,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the FreezeLimit and/or MultiXactCutoff freeze cutoffs.  If so,
- * setup enough state (in the *frz output argument) to later execute and
- * WAL-log what caller needs to do for the tuple, and return true.  Return
+ * are older than the OldestXmin and/or OldestMxact freeze cutoffs.  If so,
+ * setup enough state (in the *frz output argument) to enable caller to
+ * process this tuple as part of freezing its page, and return true.  Return
  * false if nothing can be changed about the tuple right now.
  *
  * Also sets *totally_frozen to true if the tuple will be totally frozen once
@@ -6427,22 +6474,30 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * frozen by an earlier VACUUM).  This indicates that there are no remaining
  * XIDs or MultiXactIds that will need to be processed by a future VACUUM.
  *
- * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
- * returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * VACUUM caller must assemble HeapTupleFreeze freeze plan entries for every
+ * tuple that we returned true for, and call heap_freeze_execute_prepared to
+ * execute freezing.  Caller must initialize pagefrz fields for page as a
+ * whole before first call here for each heap page.
+ *
+ * VACUUM caller decides on whether or not to freeze the page as a whole.
+ * We'll often prepare freeze plans for a page that caller just discards.
+ * However, VACUUM doesn't always get to make a choice; it must freeze when
+ * pagefrz.freeze_required is set, to ensure that any XIDs < FreezeLimit (and
+ * MXIDs < MultiXactCutoff) can never be left behind.  We make sure that
+ * VACUUM always follows that rule.
+ *
+ * We sometimes force freezing of xmax MultiXactId values long before it is
+ * strictly necessary to do so just to ensure the FreezeLimit postcondition.
+ * It's worth processing MultiXactIds proactively when it is cheap to do so,
+ * and it's convenient to make that happen by piggy-backing it on the "force
+ * freezing" mechanism.  Conversely, we sometimes delay freezing MultiXactIds
+ * because it is expensive right now (though only when it's still possible to
+ * do so without violating the FreezeLimit/MultiXactCutoff postcondition).
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6451,9 +6506,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  const struct VacuumCutoffs *cutoffs,
-						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  HeapPageFreeze *pagefrz,
+						  HeapTupleFreeze *frz, bool *totally_frozen)
 {
 	bool		xmin_already_frozen = false,
 				xmax_already_frozen = false;
@@ -6470,7 +6524,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Process xmin, while keeping track of whether it's already frozen, or
-	 * will become frozen when our freeze plan is executed by caller (could be
+	 * will become frozen iff our freeze plan is executed by caller (could be
 	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
@@ -6484,21 +6538,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
-		if (freeze_xmin)
-		{
-			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoffs->FreezeLimit)));
-		}
-		else
-		{
-			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
+		if (freeze_xmin && !TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
+									 xid, cutoffs->OldestXmin)));
 	}
 
 	/*
@@ -6515,41 +6560,55 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we always freeze proactively.  This allows totally_frozen
 		 * tracking to ignore xvac.
 		 */
-		replace_xvac = true;
+		replace_xvac = pagefrz->freeze_required = true;
 	}
 
-	/*
-	 * Process xmax.  To thoroughly examine the current Xmax value we need to
-	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given FreezeLimit.  In that case, those values might need
-	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
-	 * it out --- if there's a live updater Xid, it needs to be kept.
-	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
-	 */
+	/* Now process xmax */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
-
 	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
 		/* Raw xmax is a MultiXactId */
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
 
-		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
-									&flags, &mxid_oldest_xid_out);
+		/*
+		 * We will either remove xmax completely (in the "freeze_xmax" path),
+		 * process xmax by replacing it (in the "replace_xmax" path), or
+		 * perform no-op xmax processing.  The only constraint is that the
+		 * FreezeLimit/MultiXactCutoff postcondition must never be violated.
+		 */
+		newxmax = FreezeMultiXactId(xid, tuple, cutoffs, &flags, pagefrz);
 
-		if (flags & FRM_RETURN_IS_XID)
+		if (flags & FRM_NOOP)
+		{
+			/*
+			 * xmax is a MultiXactId, and nothing about it changes for now.
+			 * This is the only case where 'freeze_required' won't have been
+			 * set for us by FreezeMultiXactId, as well as the only case where
+			 * neither freeze_xmax nor replace_xmax are set (given a multi).
+			 *
+			 * This is a no-op, but the call to FreezeMultiXactId might have
+			 * ratcheted back NewRelfrozenXid and/or NewRelminMxid for us.
+			 * That makes it safe to freeze the page while leaving this
+			 * particular xmax undisturbed.
+			 *
+			 * FreezeMultiXactId is _not_ responsible for the "no freeze"
+			 * NewRelfrozenXid/NewRelminMxid trackers, though -- that's our
+			 * job.  A call to heap_tuple_should_freeze for this same tuple
+			 * will take place below if 'freeze_required' isn't set already.
+			 * (This approach repeats some of the work from FreezeMultiXactId,
+			 * which is not ideal but makes things simpler.)
+			 */
+			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+			Assert(!MultiXactIdPrecedes(newxmax, pagefrz->FreezePageRelminMxid));
+		}
+		else if (flags & FRM_RETURN_IS_XID)
 		{
 			/*
 			 * xmax will become an updater Xid (original MultiXact's updater
 			 * member Xid will be carried forward as a simple Xid in Xmax).
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!TransactionIdPrecedes(newxmax, cutoffs->OldestXmin));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6572,13 +6631,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax is an old MultiXactId that we have to replace with a new
 			 * MultiXactId, to carry forward two or more original member XIDs.
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!MultiXactIdPrecedes(newxmax, cutoffs->OldestMxact));
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6594,20 +6648,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			frz->xmax = newxmax;
 			replace_xmax = true;
 		}
-		else if (flags & FRM_NOOP)
-		{
-			/*
-			 * xmax is a MultiXactId, and nothing about it changes for now.
-			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
-			 * both together.
-			 */
-			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
-		}
 		else
 		{
 			/*
@@ -6621,6 +6661,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/* Will set t_infomask/t_infomask2 flags in freeze plan below */
 			freeze_xmax = true;
 		}
+
+		/* Only FRM_NOOP doesn't force caller to freeze page */
+		Assert(pagefrz->freeze_required || (!freeze_xmax && !replace_xmax));
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
@@ -6631,28 +6674,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
-		{
-			/*
-			 * If we freeze xmax, make absolutely sure that it's not an XID
-			 * that is important.  (Note, a lock-only xmax can be removed
-			 * independent of committedness, since a committed lock holder has
-			 * released the lock).
-			 */
-			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
-				TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("cannot freeze committed xmax %u",
-										 xid)));
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
 			freeze_xmax = true;
-			/* No need for relfrozenxid_out handling, since we'll freeze xmax */
-		}
-		else
-		{
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+
+		/*
+		 * If we freeze xmax, make absolutely sure that it's not an XID that
+		 * is important.  (Note, a lock-only xmax can be removed independent
+		 * of committedness, since a committed lock holder has released the
+		 * lock).
+		 */
+		if (freeze_xmax && !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+			TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("cannot freeze committed xmax %u",
+									 xid)));
 	}
 	else if (!TransactionIdIsValid(xid))
 	{
@@ -6679,6 +6715,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * failed; whereas a non-dead MOVED_IN tuple must mean the xvac
 		 * transaction succeeded.
 		 */
+		Assert(pagefrz->freeze_required);
 		if (tuple->t_infomask & HEAP_MOVED_OFF)
 			frz->frzflags |= XLH_INVALID_XVAC;
 		else
@@ -6687,6 +6724,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	if (replace_xmax)
 	{
 		Assert(!xmax_already_frozen && !freeze_xmax);
+		Assert(pagefrz->freeze_required);
 
 		/* Already set t_infomask/t_infomask2 flags in freeze plan */
 	}
@@ -6709,7 +6747,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Determine if this tuple is already totally frozen, or will become
-	 * totally frozen
+	 * totally frozen (provided caller executes freeze plan for the page)
 	 */
 	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
@@ -6717,6 +6755,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	/* A "totally_frozen" tuple must not leave anything behind in xmax */
 	Assert(!*totally_frozen || !replace_xmax);
 
+	/*
+	 * Check if the option of _not_ freezing caller's page is still in play,
+	 * though don't bother when we already forced freezing earlier on
+	 */
+	if (!pagefrz->freeze_required && !(xmin_already_frozen &&
+									   xmax_already_frozen))
+	{
+		pagefrz->freeze_required =
+			heap_tuple_should_freeze(tuple, cutoffs,
+									 &pagefrz->NoFreezePageRelfrozenXid,
+									 &pagefrz->NoFreezePageRelminMxid);
+	}
+
 	/* Tell caller if this tuple has a usable freeze plan set in *frz */
 	return freeze_xmin || replace_xvac || replace_xmax || freeze_xmax;
 }
@@ -6761,13 +6812,12 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
  */
 void
 heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-							 TransactionId FreezeLimit,
+							 TransactionId snapshotConflictHorizon,
 							 HeapTupleFreeze *tuples, int ntuples)
 {
 	Page		page = BufferGetPage(buffer);
 
 	Assert(ntuples > 0);
-	Assert(TransactionIdIsNormal(FreezeLimit));
 
 	START_CRIT_SECTION();
 
@@ -6790,19 +6840,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 		int			nplans;
 		xl_heap_freeze_page xlrec;
 		XLogRecPtr	recptr;
-		TransactionId snapshotConflictHorizon;
 
 		/* Prepare deduplicated representation for use in WAL record */
 		nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
 
-		/*
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
-		 */
-		snapshotConflictHorizon = FreezeLimit;
-		TransactionIdRetreat(snapshotConflictHorizon);
-
 		xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
 		xlrec.nplans = nplans;
 
@@ -6843,8 +6884,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	bool		do_freeze;
 	bool		totally_frozen;
 	struct VacuumCutoffs cutoffs;
-	TransactionId NewRelfrozenXid = FreezeLimit;
-	MultiXactId NewRelminMxid = MultiXactCutoff;
+	HeapPageFreeze pagefrz;
 
 	cutoffs.relfrozenxid = relfrozenxid;
 	cutoffs.relminmxid = relminmxid;
@@ -6853,9 +6893,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 
-	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
-										  &frz, &totally_frozen,
-										  &NewRelfrozenXid, &NewRelminMxid);
+	pagefrz.freeze_required = true;
+	pagefrz.NoFreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.NoFreezePageRelminMxid = MultiXactCutoff;
+	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.FreezePageRelminMxid = MultiXactCutoff;
+
+	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs, &pagefrz,
+										  &frz, &totally_frozen);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7278,22 +7323,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
 }
 
 /*
- * heap_tuple_would_freeze
+ * heap_tuple_should_freeze
  *
- * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * Return value indicates if heap_prepare_freeze_tuple sibling function should
+ * force freezing of the page containing tuple.  This happens whenever the
+ * tuple contains XID/MXID fields with values < FreezeLimit/MultiXactCutoff.
  *
- * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
- * like the heap_prepare_freeze_tuple arguments that they're based on.  We
- * never freeze here, which makes tracking the oldest extant XID/MXID simple.
+ * The *NoFreezePageRelfrozenXid and *NoFreezePageRelminMxid input/output
+ * arguments help VACUUM track the oldest extant XID/MXID remaining in rel.
+ * Our working assumption is that caller won't decide to freeze this tuple.
+ * It's up to caller to only ratchet back its own top-level trackers after the
+ * point that it commits to not freezing the tuple/page in question.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple,
-						const struct VacuumCutoffs *cutoffs,
-						TransactionId *relfrozenxid_out,
-						MultiXactId *relminmxid_out)
+heap_tuple_should_freeze(HeapTupleHeader tuple,
+						 const struct VacuumCutoffs *cutoffs,
+						 TransactionId *NoFreezePageRelfrozenXid,
+						 MultiXactId *NoFreezePageRelminMxid)
 {
 	TransactionId xid;
 	MultiXactId multi;
@@ -7304,8 +7350,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	if (TransactionIdIsNormal(xid))
 	{
 		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 			freeze = true;
 	}
@@ -7322,8 +7368,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	{
 		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 		/* xmax is a non-permanent XID */
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 			freeze = true;
 	}
@@ -7334,8 +7380,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
 	{
 		/* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
 		/* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
 		freeze = true;
 	}
@@ -7346,8 +7392,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		int			nmembers;
 
 		Assert(MultiXactIdPrecedesOrEquals(cutoffs->relminmxid, multi));
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
 		if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
 			freeze = true;
 
@@ -7359,8 +7405,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		{
 			xid = members[i].xid;
 			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
 			if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 				freeze = true;
 		}
@@ -7374,9 +7420,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		if (TransactionIdIsNormal(xid))
 		{
 			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 91c5f5e9e..e334ee8dc 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -21,6 +21,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -205,9 +206,10 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		{
 			int			ndeleted,
 						nnewlpdead;
+			bool		fpi;
 
 			ndeleted = heap_page_prune(relation, buffer, vistest, limited_xmin,
-									   limited_ts, &nnewlpdead, NULL);
+									   limited_ts, &nnewlpdead, &fpi, NULL);
 
 			/*
 			 * Report the number of tuples reclaimed to pgstats.  This is
@@ -255,7 +257,9 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  * InvalidTransactionId/0 respectively.
  *
  * Sets *nnewlpdead for caller, indicating the number of items that were
- * newly set LP_DEAD during prune operation.
+ * newly set LP_DEAD during prune operation.  Also sets *prune_fpi for
+ * caller, indicating if pruning generated a full-page image as torn page
+ * protection.
  *
  * off_loc is the offset location required by the caller to use in error
  * callback.
@@ -267,7 +271,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 				GlobalVisState *vistest,
 				TransactionId old_snap_xmin,
 				TimestampTz old_snap_ts,
-				int *nnewlpdead,
+				int *nnewlpdead, bool *prune_fpi,
 				OffsetNumber *off_loc)
 {
 	int			ndeleted = 0;
@@ -380,6 +384,8 @@ heap_page_prune(Relation relation, Buffer buffer,
 	if (off_loc)
 		*off_loc = InvalidOffsetNumber;
 
+	*prune_fpi = false;			/* for now */
+
 	/* Any error while applying the changes is critical */
 	START_CRIT_SECTION();
 
@@ -417,6 +423,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		{
 			xl_heap_prune xlrec;
 			XLogRecPtr	recptr;
+			int64		wal_fpi_before = pgWalUsage.wal_fpi;
 
 			xlrec.snapshotConflictHorizon = prstate.snapshotConflictHorizon;
 			xlrec.nredirected = prstate.nredirected;
@@ -448,6 +455,9 @@ heap_page_prune(Relation relation, Buffer buffer,
 			recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE);
 
 			PageSetLSN(BufferGetPage(buffer), recptr);
+
+			if (wal_fpi_before != pgWalUsage.wal_fpi)
+				*prune_fpi = true;
 		}
 	}
 	else
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 98ccb9882..5bd35fbd4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1525,8 +1525,9 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	bool		prune_fpi;
+	HeapPageFreeze pagefrz;
+	bool		freeze_all_eligible PG_USED_FOR_ASSERTS_ONLY;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
@@ -1542,8 +1543,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.freeze_required = false;
+	pagefrz.NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.FreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.FreezePageRelminMxid = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1561,7 +1565,7 @@ retry:
 	 */
 	tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
 									 InvalidTransactionId, 0, &nnewlpdead,
-									 &vacrel->offnum);
+									 &prune_fpi, &vacrel->offnum);
 
 	/*
 	 * Now scan the page to collect LP_DEAD items and check for tuples
@@ -1596,27 +1600,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Whether the page 'hastup' is inherently race-prone.
+			 * It must be treated as unreliable by caller anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1743,9 +1743,8 @@ retry:
 		prunestate->hastup = true;	/* page makes rel truncation unsafe */
 
 		/* Tuple with storage -- consider need to freeze */
-		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
-									  &frozen[tuples_frozen], &totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs, &pagefrz,
+									  &frozen[tuples_frozen], &totally_frozen))
 		{
 			/* Save prepared freeze plan for later */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1766,23 +1765,69 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
+	 *
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
+	 * freeze when pruning generated an FPI, if doing so means that we set the
+	 * page all-frozen afterwards (this could happen during second heap pass).
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	if (pagefrz.freeze_required || tuples_frozen == 0 ||
+		(prunestate->all_visible && prunestate->all_frozen && prune_fpi))
+	{
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 *
+		 * Note: although we're freezing all eligible tuples on this page, we
+		 * might not need any freeze plans to do so (pruning might be enough).
+		 * We always assume that a call to heap_prepare_freeze_tuple that had
+		 * to ratchet back the "freeze" NewRelfrozenXid/NewRelminMxid trackers
+		 * might have taken place earlier, though; having zero freeze plans
+		 * does not indicate that it's safe to skip this step.
+		 */
+		vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
+		freeze_all_eligible = true;
+	}
+	else
+	{
+		/* NewRelfrozenXid <= all XIDs in tuples that weren't pruned away */
+		vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
+
+		/* Might still set page all-visible, but never all-frozen */
+		tuples_frozen = 0;
+		freeze_all_eligible = prunestate->all_frozen = false;
+	}
 
 	/*
 	 * Consider the need to freeze any items with tuple storage from the page
-	 * first (arbitrary)
 	 */
 	if (tuples_frozen > 0)
 	{
-		Assert(prunestate->hastup);
+		TransactionId snapshotConflictHorizon;
+
+		Assert(prunestate->hastup && freeze_all_eligible);
 
 		vacrel->frozen_pages++;
 
+		/*
+		 * We can use the latest xmin cutoff (which is generally used for 'VM
+		 * set' conflicts) as our cutoff for freeze conflicts when the whole
+		 * page is eligible to become all-frozen in the VM once frozen by us.
+		 * Otherwise use a conservative cutoff (just back up from OldestXmin).
+		 */
+		if (prunestate->all_visible && prunestate->all_frozen)
+			snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+		else
+		{
+			snapshotConflictHorizon = vacrel->cutoffs.OldestXmin;
+			TransactionIdRetreat(snapshotConflictHorizon);
+		}
+
 		/* Execute all freeze plans for page as a single atomic action */
 		heap_freeze_execute_prepared(vacrel->rel, buf,
-									 vacrel->cutoffs.FreezeLimit,
+									 snapshotConflictHorizon,
 									 frozen, tuples_frozen);
 	}
 
@@ -1801,7 +1846,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1809,8 +1854,7 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
+		Assert(prunestate->all_frozen == all_frozen || !freeze_all_eligible);
 
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
@@ -1831,9 +1875,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1847,6 +1888,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
@@ -1891,8 +1936,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				recently_dead_tuples,
 				missed_dead_tuples;
 	HeapTupleHeader tupleheader;
-	TransactionId NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	MultiXactId NewRelminMxid = vacrel->NewRelminMxid;
+	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 
 	Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1937,8 +1982,9 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_would_freeze(tupleheader, &vacrel->cutoffs,
-									&NewRelfrozenXid, &NewRelminMxid))
+		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
+									 &NoFreezePageRelfrozenXid,
+									 &NoFreezePageRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
 			if (vacrel->aggressive)
@@ -2019,8 +2065,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 	 * this particular page until the next VACUUM.  Remember its details now.
 	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
 	/* Save any LP_DEAD items found on the page in dead_items array */
 	if (vacrel->nindexes == 0)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9eedab652..44e15b5fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9194,9 +9194,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9274,9 +9274,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        should use to decide whether to trigger freezing of pages with
+        an older multixact ID.  The default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
-- 
2.38.1

v11-0002-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/x-patch)
From 5d40d003b35a7d15a3750646f74149dd1635fb59 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v11 2/4] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  VACUUM determines its freezing
strategy based on the value of the new vacuum_freeze_strategy_threshold
GUC (or reloption) in most cases: tables that exceed the size threshold
use the eager freezing strategy.  Otherwise VACUUM uses the lazy
freezing strategy, which is essentially the same approach that VACUUM
has always taken to freezing (though not quite, due to the influence of
page-level freezing following recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
of a page's tuples at the point that it notices that the page will at
least become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add vmsnap scanning strategies, which are designed
to work in tandem with the freezing strategies from this commit.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 | 10 ++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++++
 src/backend/access/heap/vacuumlazy.c          | 49 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 17 ++++++-
 src/backend/postmaster/autovacuum.c           | 10 ++++
 src/backend/utils/misc/guc_tables.c           | 11 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 ++++++-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++++
 doc/src/sgml/ref/vacuum.sgml                  | 16 +++---
 11 files changed, 148 insertions(+), 11 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 2f274f2be..b39178d5b 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -272,6 +275,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -295,6 +304,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 5bd35fbd4..3c5974a54 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,6 +153,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -242,6 +244,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -470,6 +473,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1249,6 +1256,37 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * totally avoids freezing that turns out to have been wasted effort later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	/*
+	 * Decide freezing strategy.
+	 *
+	 * The eager freezing strategy is used when rel_pages meets or exceeds
+	 * the threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly with an unlogged or temp table, where the total
+	 * cost of freezing each page is just the cycles spent on the preparation,
+	 * which has to be paid even if/when lazy_scan_prune opts not to execute.
+	 * (WAL overhead is always the main cost of interest here, in general.)
+	 */
+	vacrel->eager_freeze_strategy =
+		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 !RelationIsPermanent(vacrel->rel));
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1770,9 +1808,18 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (this could happen during second heap pass).
+	 *
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will become all-visible, making it all-frozen instead.
+	 * (Actually, the all-visible/eager freezing strategy doesn't quite work
+	 * that way.  It triggers freezing for pages that it sees will thereby be
+	 * set all-frozen in the VM immediately afterwards -- a stricter test.
+	 * Some pages that can be set all-visible cannot also be set all-frozen,
+	 * even after freezing, due to the presence of lock-only MultiXactIds.)
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
-		(prunestate->all_visible && prunestate->all_frozen && prune_fpi))
+		(prunestate->all_visible && prunestate->all_frozen &&
+		 (vacrel->eager_freeze_strategy || prune_fpi)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ba965b8c7..7c68bd8ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -926,7 +930,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -939,6 +944,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1053,6 +1059,15 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 0746d8022..23e316e59 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 436afe1d2..a009017bd 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2518,6 +2518,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5afdeb04d..447645b73 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44e15b5fb..167c6570e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9161,6 +9161,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9196,7 +9211,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index e14ead882..79595b1cb 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
-- 
2.38.1
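
(To make the scanning-strategy cutoff in 0003 below a bit more concrete,
here is the interpolation worked through at one point, using the
constants from the patch -- my arithmetic, purely illustrative.  With
tableagefrac = 0.7, i.e. between the 0.5 mid point and the 0.9 high
point:

  nextra_scale             = 1.0 - (0.9 - 0.7) / (0.9 - 0.5) = 0.5
  nextra_toomany_threshold = (0.05 * rel_pages) * (1.0 - 0.5)
                           + (0.70 * rel_pages) * 0.5
                           = 0.375 * rel_pages

So at that table age VACUUM opts for VMSNAP_SCAN_EAGER unless doing so
would add extra scanned pages amounting to 37.5% of rel_pages or more,
subject to the Max(32, ...) floor.)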

Attachment: v11-0003-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/x-patch)
From 9956ca1f28c3684a7d1785bfa1bfeecfb6c7fd98 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v11 3/4] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will scan (or skip) heap pages (scanning strategy).  The
data structure is a local copy of the visibility map, taken at the start
of VACUUM.  It spills to disk as required, though only with larger tables.

VACUUM decides on its visibility map scanning and freezing strategies
together, shortly before the first pass over the heap begins, since the
concepts are closely related, and work in tandem.  Lazy scanning allows
VACUUM to skip all-visible pages, while eager scanning allows VACUUM to
advance relfrozenxid/relminmxid at the end of the VACUUM operation.

This work, combined with recent work to add freezing strategies, results
in VACUUM advancing relfrozenxid at a cadence that is barely influenced
by autovacuum_freeze_max_age at all.  Now antiwraparound autovacuums
will be far less common in practice.  Freezing now drives relfrozenxid,
rather than relfrozenxid driving freezing.  Even tables that always use
lazy freezing will have a decent chance of relfrozenxid advancement long
before table age nears or exceeds autovacuum_freeze_max_age.

This also lays the groundwork for completely removing aggressive mode
VACUUMs in a later commit.  Scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  Later work that makes the
choice to wait for a cleanup lock depend entirely on individual page
characteristics will decouple that "aggressive behavior" from the eager
scanning strategy behavior (a behavior that's not really "aggressive" in
any general sense, since it's chosen based on both costs and benefits).

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.  VACUUM's
final scanned_pages is "locked in" when it decides on scanning strategy
(so scanned_pages is finalized before the first heap pass even begins).

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Also teach VACUUM to use scanned_pages (not rel_pages) to cap the size
of the dead_items array.  This approach is strictly better, since there
is no question of scanning any pages other than the precise set of pages
already locked in by vmsnap by the time dead_items is allocated.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  15 +-
 src/backend/access/heap/vacuumlazy.c          | 464 ++++++++-------
 src/backend/access/heap/visibilitymap.c       | 547 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |  68 +--
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 ++-
 doc/src/sgml/ref/vacuum.sgml                  |   4 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 11 files changed, 935 insertions(+), 279 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..4a1f47ac6 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index b39178d5b..43e367bcb 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 
 /*
  * Values used by index_cleanup and truncate params.
@@ -281,6 +281,19 @@ struct VacuumCutoffs
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid advancement
+	 * strictly necessary.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * proactively.  It is especially likely with tables where the _added_
+	 * costs happen to be low.
+	 */
+	double		tableagefrac;
 };
 
 /*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3c5974a54..69062d016 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -110,10 +110,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -151,8 +159,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -171,7 +177,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -244,11 +252,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  bool force_scan_all);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -278,7 +283,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -310,10 +316,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -459,37 +465,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's freezing and vmsnap scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   (params->options &
+										VACOPT_DISABLE_PAGE_SKIPPING) != 0);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -499,13 +497,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -552,12 +551,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
+		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -602,6 +600,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -629,10 +630,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -828,12 +825,11 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -847,42 +843,29 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+												 &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
+		if (blkno < next_block_to_scan)
 		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
-
-			Assert(next_unskippable_block >= blkno + 1);
+			Assert(blkno != rel_pages - 1);
+			continue;
 		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
 
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+													 &next_all_visible);
+		Assert(next_block_to_scan > blkno);
 
 		vacrel->scanned_pages++;
 
@@ -1089,10 +1072,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1120,12 +1102,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page)
 				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
@@ -1164,7 +1144,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
+		else if (all_visible_according_to_vmsnap && prunestate.all_visible &&
 				 prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
@@ -1257,7 +1237,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1265,11 +1245,46 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
+ *
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0.  The value 1.0 is the point that autovacuum.c starts
+ * launching antiwraparound autovacuums to advance relfrozenxid/relminmxid,
+ * which makes eager scanning strategy mandatory (though we always use eager
+ * scanning whenever tableagefrac reaches 0.9 or more, to try to stay ahead).
+ *
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but they are not independent in any practical sense (it's just mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
+	double		tableagefrac = vacrel->cutoffs.tableagefrac;
 
 	/*
 	 * Decide freezing strategy.
@@ -1277,120 +1292,160 @@ lazy_scan_strategy(LVRelState *vacrel)
 	 * The eager freezing strategy is used when rel_pages meets or exceeds
 	 * the threshold controlled by the freeze_strategy_threshold GUC/reloption.
 	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
 	 * Also freeze eagerly with an unlogged or temp table, where the total
 	 * cost of freezing each page is just the cycles spent on the preparation,
 	 * which has to be paid even if/when lazy_scan_prune opts not to execute.
 	 * (WAL overhead is always the main cost of interest here, in general.)
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
 	vacrel->eager_freeze_strategy =
 		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 tableagefrac > TABLEAGEFRAC_HIGHPOINT ||
 		 !RelationIsPermanent(vacrel->rel));
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-			break;
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide vmsnap scanning strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * First acquire a visibility map snapshot, which determines the number of
+	 * pages that each vmsnap scanning strategy is required to scan for us in
+	 * passing.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied based in part on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These thresholds also represent minimum and maximum
+	 * sensible thresholds that can ever make sense for a table of this size
+	 * (when the table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Even small tables (where lazy freezing is used) shouldn't have to do
+	 * dramatically more work than usual when advancing relfrozenxid, which
+	 * our policy of waiting for the right VACUUM largely avoids, in practice.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	Assert(rel_pages >= nextra_scanned_eager && vacrel->scanned_pages == 0);
+	if (tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below table age mid point, so table age is
+		 * still of only minimal concern.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan some extra pages not exceeding 5% of rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (tableagefrac <= TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages. The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a percentage basis) by ~8.1% of
+		 * rel_pages for every additional 5%-of-tableagefrac increment added
+		 * (after tableagefrac has crossed the 50%-of-tableagefrac mid point,
+		 * until the 90%-of-tableagefrac high point is reached, when we switch
+		 * over to not caring about the added cost of eager freezing at all).
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age surpasses the high point, and so is approaching (or
+		 * may even surpass) the point that an antiwraparound autovacuum is
+		 * required.  Force VMSNAP_SCAN_EAGER, no matter how many extra pages
+		 * we'll be required to scan as a result (costs no longer matter).
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses this
+		 * 90%-of-tableagefrac high point: the threshold set here jumps from
+		 * 70% of rel_pages to 100% of rel_pages (MaxBlockNumber, actually).
+		 * It's useful to only care about table age once it gets this high.
+		 * That way even extreme cases will have at least some chance of using
+		 * eager scanning before an antiwraparound autovacuum is launched.
+		 */
+		nextra_toomany_threshold = MaxBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Make final choice on scanning strategy using final threshold */
+	nextra_toomany_threshold = Max(32, nextra_toomany_threshold);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+	 * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+	 */
+	if (force_scan_all)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+
+	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
 
 /*
@@ -2831,6 +2886,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * in pages that it skips using the VM, so we must avoid interpreting skipped
+ * pages as empty pages when it makes little sense.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3121,14 +3184,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3137,15 +3199,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3167,12 +3227,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..816576dca 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,87 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches the
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+typedef struct vmsnapblock
+{
+	BlockNumber scanned_block;
+	bool		all_visible;
+} vmsnapblock;
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	vmsnapblock staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -373,6 +461,350 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines which blocks visibilitymap_snap_next will report as
+ * needing to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is sheer paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	*scanned_pages_lazy = rel_pages - all_visible;
+	*scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+		(*scanned_pages_lazy)++;
+	if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+		(*scanned_pages_eager)++;
+
+	vmsnap->scanned_pages_lazy = *scanned_pages_lazy;
+	vmsnap->scanned_pages_eager = *scanned_pages_eager;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		BlockNumber block = vmsnap->staged[i].scanned_block;
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, block);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * The all-visible status of the returned block is set in *allvisible.  Block
+ * usually won't be set all-visible (else VACUUM wouldn't need to scan it),
+ * but it can be in certain corner cases.  This includes the VMSNAP_SCAN_ALL
+ * case, as well as a special case that VACUUM expects us to handle: the final
+ * block (rel_pages - 1) is always returned here (regardless of our strategy).
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible)
+{
+	BlockNumber next_block_to_scan;
+	vmsnapblock block;
+
+	*allvisible = true;
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	block = vmsnap->staged[vmsnap->next_return_idx++];
+	*allvisible = block.all_visible;
+	next_block_to_scan = block.scanned_block;
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(vmsnapblock) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		vmsnapblock prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch.scanned_block);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -677,3 +1109,118 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		bool		all_visible = true;
+		vmsnapblock stage;
+
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				all_visible = false;
+				break;
+			}
+
+			/*
+			 * Never skip the final page here, since it must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		stage.scanned_block = vmsnap->next_block++;
+		stage.all_visible = all_visible;
+		vmsnap->staged[vmsnap->first_invalid_idx++] = stage;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired, we
+	 * defensively assume that heapBlk is neither all-visible nor all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c68bd8ff..5085d9407 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -933,11 +933,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				effective_multixact_freeze_max_age,
 				freeze_strategy_threshold;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1069,48 +1069,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * MXID table age (whichever is greater currently).
+	 */
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a009017bd..7a3972827 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2491,10 +2491,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2511,10 +2511,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 447645b73..c44c1c4e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -659,6 +659,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -692,11 +699,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 167c6570e..596c44060 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9184,20 +9184,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9266,19 +9274,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 79595b1cb..c137debb1 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,9 +158,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
       all tuples are known to be frozen can always be skipped, and those
       where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      skipped except when performing an aggressive vacuum.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1

v11-0004-Finish-removing-aggressive-mode-VACUUM.patch
From 862d5dfa65ff6dedf417db47e6f6c43e1e3dc0ce Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v11 4/4] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient than its successor (a gap that only grew
with 9.6's commit fd31cd26), its naive approach had one notable
advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of sometimes
waiting for a cleanup lock to make sure that older XIDs get frozen, so
that relfrozenxid can be advanced by a sufficient amount.  All that
really changes is the information driving VACUUM's decision on waiting.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now directly understands the importance of never falling
too far behind on the work of freezing physical heap pages at the level
of the whole table, following recent work to add VM scanning strategies.
It is generally safer for VACUUM to press on with freezing other heap
pages from the table instead.  Even if relfrozenxid can only be advanced
by relatively few XIDs as a consequence, VACUUM should have more than
ample opportunity to catch up next time, since there is bound to be no
more than a small number of problematic unfrozen pages left behind.
VACUUM now tends to consistently advance relfrozenxid (at least by some
small amount) all the time in larger tables, so all that has to happen
for relfrozenxid to fully catch up is for a few remaining unfrozen pages
to get frozen.  Since relfrozenxid is now considered to be no more than
a lagging indicator of freezing, and since relfrozenxid isn't used to
trigger freezing in the way that it once was, time is on our side.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount when it really can be avoided fairly easily, without promising to
do so (VACUUM only promises to advance up to MinXid/MinMulti).
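
In code terms the bounded wait amounts to roughly the following
(condensed from the lazy_scan_noprune() changes in the diff below):

    /* wait 10ms, then 20ms, then 30ms, then give up */
    for (int i = 1; i <= 3; i++)
    {
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000L * 10L * i);
        if (ConditionalLockBufferForCleanup(buf))
            return false;   /* process page in lazy_scan_prune after all */
    }
    /* accept reduced processing for this page after all */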

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make this truly safe.  See also, related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/vacuumlazy.c          | 221 +++---
 src/backend/commands/vacuum.c                 |  42 +-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/config.sgml                      |  10 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 doc/src/sgml/xact.sgml                        |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  30 +-
 14 files changed, 555 insertions(+), 530 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 43e367bcb..b75b813f8 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -276,6 +276,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
@@ -348,7 +355,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 69062d016..732c3d73c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -157,8 +157,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -262,7 +260,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -459,7 +458,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -539,17 +538,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -557,7 +553,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
 		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -626,33 +621,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
 							 vacrel->relnamespace,
@@ -948,6 +924,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -961,10 +938,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -973,21 +948,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1433,8 +1401,6 @@ lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 	if (force_scan_all)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
 
-	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
-
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
 
@@ -2011,17 +1977,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Else returns false, indicating
+ * that page must be processed by lazy_scan_prune in the usual way after all.
+ * Acquires a cleanup lock on buf/page for caller before returning false.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2029,7 +2010,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2037,6 +2019,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2046,6 +2029,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2087,34 +2071,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2163,10 +2120,98 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 *
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when it is
+		 * actually expected to preserve current NewRelfrozenXid/NewRelminMxid
+		 * tracker values, and when trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								   vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								 vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+		}
+
+		/* Accept reduced processing for this page after all */
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5085d9407..f4429e320 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -916,13 +916,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1092,6 +1087,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/MXID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1109,8 +1137,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
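
Aside (not part of the patch): to make the MinXid arithmetic above concrete,
if freeze_table_age resolves to the stock vacuum_freeze_table_age of 150
million, then MinXid trails the next XID by 0.95 * 150 million = 142.5
million XIDs.  A throwaway query showing the same arithmetic, with a made-up
next XID of 500 million:

-- illustration only; both numbers are hypothetical
SELECT 500000000 - (150000000 * 0.95)::int AS min_xid;  -- 357500000
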
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f9788c30a..0c80896cc 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table; however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 596c44060..fd9b2b619 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8256,7 +8256,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8445,7 +8445,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9195,7 +9195,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         billion, <command>VACUUM</command> will silently limit the
         effective value to <xref
          linkend="guc-autovacuum-freeze-max-age"/>. For more
-        information see <xref linkend="vacuum-for-wraparound"/>.
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
@@ -9228,7 +9228,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9284,7 +9284,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        billion, <command>VACUUM</command> will silently limit the
        effective value to <xref
         linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
-       information see <xref linkend="vacuum-for-wraparound"/>.
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long-term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long-term
+    storage.  Larger databases are often mostly composed of cold data
+    that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID address space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs uses only 32 bits, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with Transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well as
+     pages that <command>VACUUM</command> considers all-visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing each page is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     <command>VACUUM</command> must be run by autovacuum specifically
+     for the purpose of advancing <structfield>relfrozenxid</structfield>,
+     when no <command>VACUUM</command> has been triggered for some
+     time.  In practice most individual tables will consistently have
+     somewhat recent values through routine vacuuming to clean up old
+     row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when a smaller
+     table had <command>VACUUM</command> operations that lazily opted
+     not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
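
Aside (not part of the patch): per the freezing strategy docs above, whether a
table is frozen lazily or eagerly hinges on its heap size relative to
vacuum_freeze_strategy_threshold.  A rough way to eyeball which tables would
fall on the eager side of a given cutoff (the 4GB figure is purely
illustrative, not the setting's default):

-- illustration only; 4GB stands in for the configured threshold
SELECT c.oid::regclass AS table_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS heap_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
  AND pg_relation_size(c.oid) > 4::bigint * 1024 * 1024 * 1024;
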
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c137debb1..d4237ec5d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -156,9 +156,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.
+      all tuples are known to be frozen are always skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -213,7 +215,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/doc/src/sgml/xact.sgml b/doc/src/sgml/xact.sgml
index b467660ee..c4146539f 100644
--- a/doc/src/sgml/xact.sgml
+++ b/doc/src/sgml/xact.sgml
@@ -49,8 +49,8 @@
 
   <para>
    The internal transaction ID type <type>xid</type> is 32 bits wide
-   and <link linkend="vacuum-for-wraparound">wraps around</link> every
-   4 billion transactions. A 32-bit epoch is incremented during each
+   and <link linkend="vacuum-xid-space">wraps around</link> every
+   2 billion transactions. A 32-bit epoch is incremented during each
    wraparound. There is also a 64-bit type <type>xid8</type> which
    includes this epoch and therefore does not wrap around during the
    life of an installation;  it can be converted to xid by casting.
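
Aside (not part of the patch): the xid8/xid relationship described here can be
seen directly from SQL; the 64-bit value retains the epoch, while casting it
to xid discards it:

-- illustration only
SELECT pg_current_xact_id() AS xid8_value,
       pg_current_xact_id()::xid AS xid_value;
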
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..927410258 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,18 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), without ever being
+# prepared to wait for a cleanup lock (we'll never wait on a cleanup
+# lock because the separate MinXid cutoff for waiting will still be
+# well before FreezeLimit, given our default autovacuum_freeze_max_age).
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +78,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +94,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +105,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +118,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +130,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +138,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.38.1

In reply to: Peter Geoghegan (#59)
4 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Dec 22, 2022 at 11:39 AM Peter Geoghegan <pg@bowt.ie> wrote:

> On Wed, Dec 21, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> > Great. I plan on committing 0001 in the next few days. Committing 0002
> > might take a bit longer.
>
> I pushed the VACUUM cutoffs patch (previously 0001) this morning -
> thanks for your help with that one.

Attached is v12. I think that the page-level freezing patch is now
committable, and plan on committing it in the next 2-4 days barring any
objections.

Notable changes in v12:

* Simplified some of the logic in FreezeMultiXactId(), which no longer
has any handling of NewRelfrozenXid style cutoffs except in the one
case that still actually needs it (its no-op processing case).

We don't need most of the handling on HEAD anymore because every
possible approach to processing a Multi other than FRM_NOOP will
reliably leave behind a new xmax that is either InvalidTransactionId,
or an XID/MXID >= OldestXmin/OldestMxact. Such values cannot possibly
need to be tracked by the NewRelfrozenXid trackers, since the trackers
are initialized using OldestXmin/OldestMxact to begin with.
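
To make that concrete, here is a minimal, self-contained sketch of the
tracker invariant (not the patch code; a plain uint32 comparison stands
in for TransactionIdPrecedes(), so wraparound is ignored):

#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Sketch of the NewRelfrozenXid ratchet: the tracker starts out at
 * OldestXmin, and only an *unfrozen* XID older than the current tracker
 * value can ever pull it backwards.
 */
static void
track_unfrozen_xid(TransactionId *newRelfrozenXid, TransactionId unfrozenXid)
{
	if (unfrozenXid < *newRelfrozenXid)
		*newRelfrozenXid = unfrozenXid;
}

/*
 * Every FreezeMultiXactId outcome except FRM_NOOP leaves behind either no
 * xmax at all or an XID/MXID >= OldestXmin/OldestMxact.  Such values can
 * never satisfy the test above, so they never need to be tracked; only
 * FRM_NOOP (which keeps the old Multi and its member XIDs) still does.
 */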

* v12 merges together the code for the "freeze the page"
lazy_scan_prune path with the block that actually calls
heap_freeze_execute_prepared().

This should make it clear that pagefrz.freeze_required really does
mean that freezing is required. Hopefully that addresses Jeff's recent
concern. It's certainly an improvement, in any case.

* On a related note, comments around the same point in lazy_scan_prune
as well as comments above the HeapPageFreeze struct now explain a
concept I decided to call "nominal freezing". This is the case where
we "freeze a page" without having any freeze plans to execute.

"nominal freezing" is the new name for a concept I invented many
months ago, which helps to resolve subtle problems with the way that
heap_prepare_freeze_tuple is tasked with doing two different things
for its lazy_scan_prune caller: 1. telling lazy_scan_prune how it
would freeze each tuple (were it to freeze the page), and 2. helping
lazy_scan_prune to determine if the page should become all-frozen in
the VM. The latter is always conditioned on page-level freezing
actually going ahead, since everything else in
heap_prepare_freeze_tuple has to work that way.

We always freeze a page with zero freeze plans (or "nominally freeze"
the page) in lazy_scan_prune (which is nothing new in itself).  We
thereby avoid breaking heap_prepare_freeze_tuple's working assumption
that all it needs to focus on is what the page will look like after
freezing executes, while also avoiding senselessly throwing away the
ability to set a page all-frozen in the VM in lazy_scan_prune when
doing so costs us nothing extra.  That is, by always "freezing" in the
event of zero freeze plans, we never miss out on setting a page
all-frozen in cases where no freeze plans actually have to be executed
to make that safe, and the high level "freeze the page versus don't
freeze the page" dichotomy still works as a conceptual abstraction.
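
For reference, here is the page-level trigger as it stands with the
attached v12-0002 applied, condensed from the lazy_scan_prune hunk
further down (the tuples_frozen == 0 arm is the "nominal freezing"
case):

	/* condensed from the v12-0002 lazy_scan_prune hunk below */
	if (pagefrz.freeze_required ||	/* XID < FreezeLimit, or MXID < MultiXactCutoff */
		tuples_frozen == 0 ||		/* zero freeze plans: "nominal freezing" */
		(prunestate->all_visible && prunestate->all_frozen &&
		 (fpi_before != pgWalUsage.wal_fpi ||	/* pruning emitted an FPI */
		  vacrel->eager_freeze_strategy)))	/* eager freezing strategy in use */
	{
		/* freeze the page (execute any prepared freeze plans) */
	}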

--
Peter Geoghegan

Attachments:

v12-0002-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/octet-stream)
From c9ba6835f4446716040d2579006180940e1fa61a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v12 2/4] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for tables where eager freezing is
deemed appropriate.  VACUUM determines its freezing strategy based on
the value of the new vacuum_freeze_strategy_threshold GUC (or
reloption) in most cases: tables that exceed the size threshold use the
eager freezing strategy.  Otherwise VACUUM uses the lazy freezing
strategy, which is essentially the same approach that VACUUM has always
taken to freezing (though not quite, due to the influence of page-level
freezing following recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add vmsnap scanning strategies, which are designed
to work in tandem with the freezing strategies from this commit.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
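
Editorial aside, not part of the patch text: with the stock 8KB BLCKSZ,
the GUC's default of (4 * 1024^3) / BLCKSZ works out to 524288 heap
blocks, i.e. roughly a 4GB table.  A minimal sketch of the resulting
decision rule, mirroring lazy_scan_strategy() from the diff below
(is_permanent stands in for RelationIsPermanent()):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Default vacuum_freeze_strategy_threshold: 4GB worth of 8KB heap blocks */
static const BlockNumber default_threshold =
	(BlockNumber) ((4ULL * 1024 * 1024 * 1024) / 8192);	/* 524288 */

static bool
use_eager_freeze_strategy(BlockNumber rel_pages, bool is_permanent)
{
	return rel_pages >= default_threshold || !is_permanent;
}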
---
 src/include/commands/vacuum.h                 | 10 +++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++++
 src/backend/access/heap/heapam.c              |  1 +
 src/backend/access/heap/vacuumlazy.c          | 43 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 17 +++++++-
 src/backend/postmaster/autovacuum.c           | 10 +++++
 src/backend/utils/misc/guc_tables.c           | 11 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 +++++++-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++++
 doc/src/sgml/ref/vacuum.sgml                  | 16 +++----
 12 files changed, 143 insertions(+), 11 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 2f274f2be..b39178d5b 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -272,6 +275,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -295,6 +304,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 71dfe5933..4651895f8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6876,6 +6876,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.freeze_strategy_threshold = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 18192fed5..8021f7fd5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,6 +153,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -242,6 +244,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -470,6 +473,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1249,6 +1256,38 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * totally avoids freezing that turns out to have been wasted effort later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	/*
+	 * Decide freezing strategy.
+	 *
+	 * The eager freezing strategy is used when rel_pages equals or exceeds
+	 * the threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly with an unlogged or temp table, where the total
+	 * cost of freezing each page is just the cycles needed to prepare a set
+	 * of freeze plans.  Executing the freeze plans adds very little cost.
+	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
+	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 */
+	vacrel->eager_freeze_strategy =
+		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 !RelationIsPermanent(vacrel->rel));
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1770,10 +1809,12 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (might not happen until second heap pass).
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will thereby become all-frozen in the visibility map.
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
 		(prunestate->all_visible && prunestate->all_frozen &&
-		 fpi_before != pgWalUsage.wal_fpi))
+		 (fpi_before != pgWalUsage.wal_fpi || vacrel->eager_freeze_strategy)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ba965b8c7..7c68bd8ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -926,7 +930,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -939,6 +944,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1053,6 +1059,15 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 0746d8022..23e316e59 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 436afe1d2..a009017bd 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2518,6 +2518,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5afdeb04d..447645b73 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 44e15b5fb..167c6570e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9161,6 +9161,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9196,7 +9211,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index e14ead882..79595b1cb 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
-- 
2.38.1

v12-0004-Finish-removing-aggressive-mode-VACUUM.patch (application/octet-stream)
From 22489f72c114a0f8ac791327f2556fa2e6d59af1 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v12 4/4] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient (especially after 9.6's commit fd31cd26),
its naive approach had one notable advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of sometimes
waiting for a cleanup lock to make sure that older XIDs get frozen, so
that relfrozenxid can be advanced by a sufficient amount.  All that
really changes is the information driving VACUUM's decision on waiting.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now directly understands the importance of never falling
too far behind on the work of freezing physical heap pages at the level
of the whole table, following recent work to add VM scanning strategies.
It is generally safer for VACUUM to press on with freezing other heap
pages from the table instead.  Even if relfrozenxid can only be advanced
by relatively few XIDs as a consequence, VACUUM should have more than
ample opportunity to catch up next time, since there is bound to be no
more than a small number of problematic unfrozen pages left behind.
VACUUM now tends to consistently advance relfrozenxid (at least by some
small amount) all the time in larger tables, so all that has to happen
for relfrozenxid to fully catch up is for a few remaining unfrozen pages
to get frozen.  Since relfrozenxid is now considered to be no more than
a lagging indicator of freezing, and since relfrozenxid isn't used to
trigger freezing in the way that it once was, time is on our side.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount when it really can be avoided fairly easily, without promising to
do so (VACUUM only promises to advance up to MinXid/MinMulti).

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make this truly safe.  See also, related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
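
Editorial aside, not part of the patch text: a rough illustration of how
far apart MinXid and FreezeLimit typically sit, using the documented
defaults (vacuum_freeze_min_age = 50 million, vacuum_freeze_table_age =
150 million), a hypothetical example XID, and treating OldestXmin as
roughly equal to nextXID.  The MinXid formula is the one added to
vacuum_get_cutoffs() in the diff below:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t	nextXID = 500 * 1000 * 1000;	/* hypothetical */
	uint32_t	freeze_min_age = 50 * 1000 * 1000;
	uint32_t	freeze_table_age = 150 * 1000 * 1000;

	/* FreezeLimit derives from OldestXmin; treat OldestXmin ~= nextXID */
	uint32_t	FreezeLimit = nextXID - freeze_min_age;		/* 450,000,000 */
	uint32_t	MinXid = nextXID - (uint32_t) (freeze_table_age * 0.95);	/* 357,500,000 */

	/*
	 * MinXid trails FreezeLimit by ~92.5 million XIDs here, so promising
	 * only "relfrozenxid >= MinXid" leaves VACUUM plenty of slack.
	 */
	printf("FreezeLimit=%u MinXid=%u\n", FreezeLimit, MinXid);
	return 0;
}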
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/heapam.c              |   2 +
 src/backend/access/heap/vacuumlazy.c          | 221 +++---
 src/backend/commands/vacuum.c                 |  42 +-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/config.sgml                      |  10 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 doc/src/sgml/xact.sgml                        |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  33 +-
 15 files changed, 560 insertions(+), 530 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 43e367bcb..b75b813f8 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -276,6 +276,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
@@ -348,7 +355,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f13b3a05d..4dcc38848 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6876,6 +6876,8 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.MinXid = FreezeLimit;
+	cutoffs.MinMulti = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
 	cutoffs.tableagefrac = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 69a10b9be..24097f4e3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -157,8 +157,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -262,7 +260,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -459,7 +458,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -539,17 +538,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -557,7 +553,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
 		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -626,33 +621,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
 							 vacrel->relnamespace,
@@ -948,6 +924,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -961,10 +938,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -973,21 +948,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1430,8 +1398,6 @@ lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 	if (force_scan_all)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
 
-	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
-
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
 
@@ -2010,17 +1976,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Else returns false, indicating
+ * that page must be processed by lazy_scan_prune in the usual way after all.
+ * Acquires a cleanup lock on buf/page for caller before returning false.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2028,7 +2009,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2036,6 +2018,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2045,6 +2028,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2086,34 +2070,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2162,10 +2119,98 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 *
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when it is
+		 * actually expected to preserve current NewRelfrozenXid/NewRelminMxid
+		 * tracker values, and when trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								   vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								 vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+		}
+
+		/* Accept reduced processing for this page after all */
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5085d9407..f4429e320 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -916,13 +916,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1092,6 +1087,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/MXID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1109,8 +1137,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f9788c30a..0c80896cc 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table, however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 596c44060..fd9b2b619 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8256,7 +8256,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8445,7 +8445,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9195,7 +9195,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         billion, <command>VACUUM</command> will silently limit the
         effective value to <xref
          linkend="guc-autovacuum-freeze-max-age"/>. For more
-        information see <xref linkend="vacuum-for-wraparound"/>.
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
@@ -9228,7 +9228,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9284,7 +9284,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        billion, <command>VACUUM</command> will silently limit the
        effective value to <xref
         linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
-       information see <xref linkend="vacuum-for-wraparound"/>.
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly comprised of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID address space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32 bits wide, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively, it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
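
For reference, the per-table output described in this tip can be requested
directly ("mytable" below is just a placeholder name):

VACUUM (VERBOSE) mytable;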
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well as
+     any page that <command>VACUUM</command> considers to be all-visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing each page is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
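
As a rough illustration, a query like the following lists tables whose main
fork already exceeds a given size cutoff.  Note that
vacuum_freeze_strategy_threshold is introduced by this patch series, so its
name, units, and default may still change; the 4GB figure below is only an
assumed example:

SELECT c.oid::regclass AS table_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS main_fork_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
  AND pg_relation_size(c.oid) >= 4 * 1024^3;  -- assumed 4GB cutoff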
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only happens when
+     autovacuum must run <command>VACUUM</command> specifically to
+     advance <structfield>relfrozenxid</structfield>, because no
+     <command>VACUUM</command> has been triggered for some time.  In
+     practice most individual tables will consistently have fairly
+     recent values, as a side effect of routine vacuuming to clean up
+     old row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
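
For example, with the defaults (autovacuum_vacuum_threshold = 50,
autovacuum_vacuum_scale_factor = 0.2), a table with 100,000 rows is vacuumed
once roughly 50 + 0.2 * 100000 = 20,050 tuples have been obsoleted.  The same
arithmetic can be checked for any table ("mytable" is a placeholder):

SELECT 50 + 0.2 * reltuples AS vacuum_threshold
FROM pg_class
WHERE oid = 'mytable'::regclass;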
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
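
Similarly, with the defaults (autovacuum_vacuum_insert_threshold = 1000,
autovacuum_vacuum_insert_scale_factor = 0.2), a table with 100,000 rows is
vacuumed after about 1000 + 0.2 * 100000 = 21,000 inserts:

SELECT 1000 + 0.2 * reltuples AS vacuum_insert_threshold
FROM pg_class
WHERE oid = 'mytable'::regclass;  -- placeholder table name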
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
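
The analyze case works the same way: with the defaults
(autovacuum_analyze_threshold = 50, autovacuum_analyze_scale_factor = 0.1), a
100,000-row table is re-analyzed after about 50 + 0.1 * 100000 = 10,050 row
changes:

SELECT 50 + 0.1 * reltuples AS analyze_threshold
FROM pg_class
WHERE oid = 'mytable'::regclass;  -- placeholder table name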
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when earlier
+     <command>VACUUM</command> operations on a smaller table lazily
+     opted not to advance <structfield>relfrozenxid</structfield>.
+    </para>
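
A simple way to see which tables are closest to that point (this ignores any
per-table autovacuum_freeze_max_age storage parameter overrides):

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS max_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;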
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
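
Both failsafe settings already exist in released versions and can be
inspected directly:

SHOW vacuum_failsafe_age;
SHOW vacuum_multixact_failsafe_age;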
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
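
For example, a per-table override uses the standard storage parameter syntax
("mytable" is a placeholder):

ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.05,
                         autovacuum_freeze_max_age = 100000000);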
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c137debb1..d4237ec5d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -156,9 +156,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.
+      all tuples are known to be frozen are always skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -213,7 +215,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
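
For reference, the option documented in the first of these vacuum.sgml hunks
is invoked with the usual option-list syntax ("mytable" is a placeholder):

VACUUM (DISABLE_PAGE_SKIPPING, VERBOSE) mytable;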
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/doc/src/sgml/xact.sgml b/doc/src/sgml/xact.sgml
index b467660ee..c4146539f 100644
--- a/doc/src/sgml/xact.sgml
+++ b/doc/src/sgml/xact.sgml
@@ -49,8 +49,8 @@
 
   <para>
    The internal transaction ID type <type>xid</type> is 32 bits wide
-   and <link linkend="vacuum-for-wraparound">wraps around</link> every
-   4 billion transactions. A 32-bit epoch is incremented during each
+   and <link linkend="vacuum-xid-space">wraps around</link> every
+   2 billion transactions. A 32-bit epoch is incremented during each
    wraparound. There is also a 64-bit type <type>xid8</type> which
    includes this epoch and therefore does not wrap around during the
    life of an installation;  it can be converted to xid by casting.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..f9e4194cd 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,21 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), while still using default
+# settings for vacuum_freeze_table_age/autovacuum_freeze_max_age.
+#
+# This makes VACUUM freeze tuples just as aggressively as it would if the
+# VACUUM command's FREEZE option had been specified, for almost all heap pages.
+# However, VACUUM is still unwilling to wait indefinitely for a cleanup lock,
+# just to freeze a few XIDs/MXIDs that still aren't very old.
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +81,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +97,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +108,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +121,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +133,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +141,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.38.1

Attachment: v12-0003-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/octet-stream)
From 31c3b10f2bf4f2f1f0188baa0f1df22843792839 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v12 3/4] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will scan (or skip) heap pages (scanning strategy).  The
data structure is a local copy of the visibility map, taken at the start of
VACUUM.  It spills to disk as required, though only with larger tables.

VACUUM decides on its visibility map scanning and freezing strategies
together, shortly before the first pass over the heap begins, since the
concepts are closely related, and work in tandem.  Lazy scanning allows
VACUUM to skip all-visible pages, while eager scanning allows VACUUM to
advance relfrozenxid/relminmxid at the end of the VACUUM operation.

This work, combined with recent work to add freezing strategies, results
in VACUUM advancing relfrozenxid at a cadence that is barely influenced
by autovacuum_freeze_max_age at all.  Now antiwraparound autovacuums
will be far less common in practice.  Freezing now drives relfrozenxid,
rather than relfrozenxid driving freezing.  Even tables that always use
lazy freezing will have a decent chance of relfrozenxid advancement long
before table age nears or exceeds autovacuum_freeze_max_age.

This also lays the groundwork for completely removing aggressive mode
VACUUMs in a later commit.  Scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  Later work that makes the
choice to wait for a cleanup lock depend entirely on individual page
characteristics will decouple that "aggressive behavior" from the eager
scanning strategy behavior (a behavior that's not really "aggressive" in
any general sense, since it's chosen based on both costs and benefits).

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.  VACUUM's
final scanned_pages is "locked in" when it decides on scanning strategy
(so scanned_pages is finalized before the first heap pass even begins).

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Also teach VACUUM to use scanned_pages (not rel_pages) to cap the size
of the dead_items array.  This approach is strictly better, since there
is no question of scanning any pages other than the precise set of pages
already locked in by vmsnap by the time dead_items is allocated.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  15 +-
 src/backend/access/heap/heapam.c              |   1 +
 src/backend/access/heap/vacuumlazy.c          | 464 ++++++++-------
 src/backend/access/heap/visibilitymap.c       | 547 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |  68 +--
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 ++-
 doc/src/sgml/ref/vacuum.sgml                  |   4 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 12 files changed, 934 insertions(+), 281 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..4a1f47ac6 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index b39178d5b..43e367bcb 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 
 /*
  * Values used by index_cleanup and truncate params.
@@ -281,6 +281,19 @@ struct VacuumCutoffs
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid advancement
+	 * strictly necessary.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * proactively.  It is especially likely with tables where the _added_
+	 * costs happen to be low.
+	 */
+	double		tableagefrac;
 };
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4651895f8..f13b3a05d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6877,6 +6877,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
+	cutoffs.tableagefrac = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8021f7fd5..69a10b9be 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -110,10 +110,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -151,8 +159,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -171,7 +177,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -244,11 +252,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  bool force_scan_all);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -278,7 +283,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -310,10 +316,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -459,37 +465,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's freezing and vmsnap scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   (params->options &
+										VACOPT_DISABLE_PAGE_SKIPPING) != 0);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -499,13 +497,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -552,12 +551,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
+		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -602,6 +600,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -629,10 +630,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -828,12 +825,11 @@ lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
 				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_fsm_block_to_vacuum = 0;
+	bool		next_all_visible;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -847,42 +843,29 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
+	next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+												 &next_all_visible);
 	for (blkno = 0; blkno < rel_pages; blkno++)
 	{
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vmsnap;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
+		if (blkno < next_block_to_scan)
 		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
-
-			Assert(next_unskippable_block >= blkno + 1);
+			Assert(blkno != rel_pages - 1);
+			continue;
 		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
 
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * Determine the next page in line to be scanned according to vmsnap
+		 * before scanning this page
+		 */
+		all_visible_according_to_vmsnap = next_all_visible;
+		next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap,
+													 &next_all_visible);
+		Assert(next_block_to_scan > blkno);
 
 		vacrel->scanned_pages++;
 
@@ -1089,10 +1072,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1120,13 +1102,11 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page) &&
+				 VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1164,8 +1144,8 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
-				 prunestate.all_frozen &&
+		else if (all_visible_according_to_vmsnap &&
+				 prunestate.all_visible && prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
 			/*
@@ -1257,7 +1237,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1265,11 +1245,42 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
+ *
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0, which is the point of antiwraparound autovacuuming.
+ *
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but they are not independent in any practical sense (it's just mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
 
 	/*
 	 * Decide freezing strategy.
@@ -1277,121 +1288,161 @@ lazy_scan_strategy(LVRelState *vacrel)
 	 * The eager freezing strategy is used when the threshold controlled by
 	 * freeze_strategy_threshold GUC/reloption exceeds rel_pages.
 	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
 	 * Also freeze eagerly with an unlogged or temp table, where the total
 	 * cost of freezing each page is just the cycles needed to prepare a set
 	 * of freeze plans.  Executing the freeze plans adds very little cost.
 	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
 	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
 	vacrel->eager_freeze_strategy =
 		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 vacrel->cutoffs.tableagefrac > TABLEAGEFRAC_HIGHPOINT ||
 		 !RelationIsPermanent(vacrel->rel));
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-			break;
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide vmsnap scanning strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * First acquire a visibility map snapshot, which determines the number of
+	 * pages that each vmsnap scanning strategy is required to scan for us in
+	 * passing.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied based in part on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These thresholds also represent minimum and maximum
+	 * sensible thresholds that can ever make sense for a table of this size
+	 * (when the table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Even small tables (where lazy freezing is used) shouldn't have to do
+	 * dramatically more work than usual when advancing relfrozenxid, which
+	 * our policy of waiting for the right VACUUM largely avoids, in practice.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	Assert(rel_pages >= nextra_scanned_eager && vacrel->scanned_pages == 0);
+	if (vacrel->cutoffs.tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below table age mid point, so table age is
+		 * still of only minimal concern.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan some extra pages not exceeding 5% of rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (vacrel->cutoffs.tableagefrac <= TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages. The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a percentage basis) by ~8.1% of
+		 * rel_pages for every additional 5%-of-tableagefrac increment added
+		 * (after tableagefrac has crossed the 50%-of-tableagefrac mid point,
+		 * until the 90%-of-tableagefrac high point is reached, when we switch
+		 * over to not caring about the added cost of eager freezing at all).
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - vacrel->cutoffs.tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age surpasses the high point, and so is approaching (or
+		 * may even surpass) the point that an antiwraparound autovacuum is
+		 * required.  Force VMSNAP_SCAN_EAGER, no matter how many extra pages
+		 * we'll be required to scan as a result (costs no longer matter).
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses this
+		 * 90%-of-tableagefrac high point: the threshold set here jumps from
+		 * 70% of rel_pages to 100% of rel_pages (MaxBlockNumber, actually).
+		 * It's useful to only care about table age once it gets this high.
+		 * That way even extreme cases will have at least some chance of using
+		 * eager scanning before an antiwraparound autovacuum is launched.
+		 */
+		nextra_toomany_threshold = MaxBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Make final choice on scanning strategy using final threshold */
+	nextra_toomany_threshold = Max(32, nextra_toomany_threshold);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+	 * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+	 */
+	if (force_scan_all)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+
+	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
 
 /*
@@ -2834,6 +2885,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * in pages that it skips using the VM, so we must avoid interpreting skipped
+ * pages as empty pages when it makes little sense.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3124,14 +3183,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3140,15 +3198,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3170,12 +3226,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..816576dca 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,87 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+typedef struct vmsnapblock
+{
+	BlockNumber scanned_block;
+	bool		all_visible;
+} vmsnapblock;
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	vmsnapblock staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -373,6 +461,350 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines the kind of blocks visibilitymap_snap_next should indicate
+ * need to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is sheer paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	*scanned_pages_lazy = rel_pages - all_visible;
+	*scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+		(*scanned_pages_lazy)++;
+	if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+		(*scanned_pages_eager)++;
+
+	vmsnap->scanned_pages_lazy = *scanned_pages_lazy;
+	vmsnap->scanned_pages_eager = *scanned_pages_eager;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		BlockNumber block = vmsnap->staged[i].scanned_block;
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, block);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * The all-visible status of returned block is set in *all_visible.  Block
+ * usually won't be set all-visible (else VACUUM wouldn't need to scan it),
+ * but it can be in certain corner cases.  This includes the VMSNAP_SCAN_ALL
+ * case, as well as a special case that VACUUM expects us to handle: the final
+ * block (rel_pages - 1) is always returned here (regardless of our strategy).
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible)
+{
+	BlockNumber next_block_to_scan;
+	vmsnapblock block;
+
+	*allvisible = true;
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	block = vmsnap->staged[vmsnap->next_return_idx++];
+	*allvisible = block.all_visible;
+	next_block_to_scan = block.scanned_block;
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(vmsnapblock) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		vmsnapblock prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch.scanned_block);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -677,3 +1109,118 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		bool		all_visible = true;
+		vmsnapblock stage;
+
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				all_visible = false;
+				break;
+			}
+
+			/*
+			 * Stop staging blocks just before final page, which must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		stage.scanned_block = vmsnap->next_block++;
+		stage.all_visible = all_visible;
+		vmsnap->staged[vmsnap->first_invalid_idx++] = stage;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired we
+	 * defensively assume heapBlk not all-visible or all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c68bd8ff..5085d9407 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -933,11 +933,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				effective_multixact_freeze_max_age,
 				freeze_strategy_threshold;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1069,48 +1069,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * XMID table age (whichever is greater currently).
+	 */
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
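
As a worked example of the tableagefrac arithmetic in the hunk above, here is a
minimal standalone sketch (not part of the patch).  The 100 million table XID
age and the 200 million effective freeze_table_age are assumed values; 200
million is simply the autovacuum_freeze_max_age default that freeze_table_age
now falls back to and is clamped by.

#include <stdio.h>

int
main(void)
{
	/* Assumed example inputs, not taken from any real cluster */
	double		table_xid_age = 100000000.0;	/* nextXID - relfrozenxid */
	double		freeze_table_age = 200000000.0; /* after clamping */

	/* Same formula as vacuum_get_cutoffs: XIDFrac (MXIDFrac is analogous) */
	double		tableagefrac = table_xid_age / (freeze_table_age + 0.5);

	/*
	 * Prints ~0.500: the table is about halfway to the point (>= 1.0) where
	 * VACUUM must reliably advance relfrozenxid
	 */
	printf("tableagefrac = %.3f\n", tableagefrac);
	return 0;
}
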
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a009017bd..7a3972827 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2491,10 +2491,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2511,10 +2511,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 447645b73..c44c1c4e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -659,6 +659,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -692,11 +699,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 167c6570e..596c44060 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9184,20 +9184,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9266,19 +9274,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 79595b1cb..c137debb1 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,9 +158,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
       all tuples are known to be frozen can always be skipped, and those
       where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      skipped except when performing an aggressive vacuum.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1

Attachment: v12-0001-Add-page-level-freezing-to-VACUUM.patch (application/octet-stream)
From f6489bbdfd8af4bcab9076300291a2182abbb6aa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 12 Jun 2022 15:46:08 -0700
Subject: [PATCH v12 1/4] Add page-level freezing to VACUUM.

Teach VACUUM to decide on whether or not to trigger freezing at the
level of whole heap pages, not individual tuple fields.  OldestXmin is
now treated as the cutoff for freezing eligibility in all cases, while
FreezeLimit is used to trigger freezing at the level of each page (we
now freeze all eligible XIDs on a page when freezing is triggered for
the page).  Making the choice to freeze work at the page level tends to
result in VACUUM writing less WAL in the long term.  This is especially
likely to work out due to complementary effects with the freeze plan WAL
deduplication optimization added by commit 9e540599.

Also teach VACUUM to trigger page-level freezing whenever it detects
that heap pruning generated an FPI as torn page protection.  We'll have
already written a large amount of WAL just to do that much, so it's very
likely a good idea to get freezing out of the way for the page early.
This only happens in cases where it will directly lead to marking the
page all-frozen in the visibility map.

In most cases "freezing a page" removes all XIDs < OldestXmin, and all
MXIDs < OldestMxact.  It doesn't quite work that way in certain rare
cases involving MultiXacts, though.  It is convenient to define "freeze
the page" in a way that gives FreezeMultiXactId the leeway to put off
the work of processing an individual tuple's xmax whenever it happens to
be a MultiXactId that would require an expensive second pass to process
aggressively (allocating a new Multi is especially worth avoiding here).

FreezeMultiXactId effectively makes a decision on how to proceed with
processing at the level of each individual xmax field.  Its no-op multi
processing "freezes" an xmax in the event of an expensive-to-process
xmax on a page when (for whatever reason) page-level freezing triggers.
If, on the other hand, freezing is not triggered for the page, then
page-level no-op processing takes care of the multi for us instead.
Either way, the remaining Multi will ratchet back VACUUM's relfrozenxid
and/or relminmxid trackers as required, and we won't need an expensive
second pass over the multi (unless we really have no choice, for example
during a VACUUM FREEZE, where FreezeLimit always matches OldestXmin).

Later work will add an eager freezing strategy to VACUUM (and reframe the
behavior established by this commit as lazy freezing, even though it's
not quite as lazy as the historical tuple-based approach to freezing).
Making freezing work at the page level is not just an optimization; it's
also a useful basis for modelling costs at the whole table level, since
it makes the visibility map a more reliable indicator of how far behind
(or ahead) we are on freezing at the level of the whole table.  Later
work that adds eager and lazy scanning strategies will build on that,
ultimately allowing VACUUM to advance relfrozenxid far more frequently.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/heapam.h          |  92 +++++-
 src/backend/access/heap/heapam.c     | 455 ++++++++++++++-------------
 src/backend/access/heap/vacuumlazy.c | 169 ++++++----
 doc/src/sgml/config.sgml             |  11 +-
 4 files changed, 444 insertions(+), 283 deletions(-)
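
Before the per-file diffs, here is a minimal, self-contained sketch of the
per-page decision rule the patch adds to lazy_scan_prune.  It uses stand-in
types rather than the real HeapPageFreeze/LVRelState structures; the names
PageFreezeDecision and should_freeze_page are illustrative only, not part of
the patch.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the page-level freeze state threaded through VACUUM */
typedef struct PageFreezeDecision
{
	bool		freeze_required;	/* XID < FreezeLimit or MXID < MultiXactCutoff seen */
	bool		all_visible;		/* page all-visible once LP_DEAD items go away */
	bool		all_frozen;			/* page would be all-frozen if frozen now */
	bool		prune_emitted_fpi;	/* pruning already wrote a full-page image */
	int			nfreezeplans;		/* tuples with usable freeze plans */
} PageFreezeDecision;

/*
 * Condensed model of the choice made once per page: execute the prepared
 * freeze plans (and adopt the "freeze" relfrozenxid/relminmxid trackers), or
 * put freezing off for now (and adopt the "no freeze" trackers instead).
 */
static bool
should_freeze_page(const PageFreezeDecision *dec)
{
	if (dec->freeze_required)
		return true;			/* cutoffs leave no choice */
	if (dec->nfreezeplans == 0)
		return true;			/* "nominal freezing": nothing to execute anyway */

	/*
	 * Opportunistic case: pruning already cost us an FPI, and freezing now
	 * lets the page be marked all-frozen in the visibility map.
	 */
	return dec->all_visible && dec->all_frozen && dec->prune_emitted_fpi;
}

int
main(void)
{
	PageFreezeDecision dec = {false, true, true, true, 10};

	printf("freeze page? %s\n", should_freeze_page(&dec) ? "yes" : "no");
	return 0;
}
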

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 53eb01176..83b52e2a7 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -113,6 +113,83 @@ typedef struct HeapTupleFreeze
 	OffsetNumber offset;
 } HeapTupleFreeze;
 
+/*
+ * State used by VACUUM to track the details of freezing all eligible tuples
+ * on a given heap page.
+ *
+ * VACUUM prepares freeze plans for each page via heap_prepare_freeze_tuple
+ * calls (every tuple with storage gets its own call).  This page-level freeze
+ * state is updated across each call, which ultimately determines whether or
+ * not freezing the page is required. (VACUUM freezes the page via a call to
+ * heap_freeze_execute_prepared, which freezes using prepared freeze plans.)
+ *
+ * Aside from the basic question of whether or not freezing will go ahead, the
+ * state also tracks the oldest extant XID/MXID in the table as a whole, for
+ * the purposes of advancing relfrozenxid/relminmxid values in pg_class later
+ * on.  Each heap_prepare_freeze_tuple call pushes NewRelfrozenXid and/or
+ * NewRelminMxid back as required to avoid unsafe final pg_class values.  Any
+ * and all unfrozen XIDs or MXIDs that remain after VACUUM finishes _must_
+ * have values >= the final relfrozenxid/relminmxid values in pg_class.  This
+ * includes XIDs that remain as MultiXact members from any tuple's xmax.
+ *
+ * When the 'freeze_required' flag isn't set after all tuples are examined, the
+ * final choice on freezing is made by vacuumlazy.c.  It can decide to trigger
+ * freezing based on whatever criteria it deems appropriate.  However, it is
+ * recommended that vacuumlazy.c avoid early freezing of a page when it cannot
+ * then be marked all-frozen in the visibility map.
+ */
+typedef struct HeapPageFreeze
+{
+	/* Is heap_prepare_freeze_tuple caller required to freeze page? */
+	bool		freeze_required;
+
+	/*
+	 * "Freeze" NewRelfrozenXid/NewRelminMxid trackers.
+	 *
+	 * Trackers used when heap_freeze_execute_prepared freezes the page, and
+	 * when the page is "nominally frozen", which happens with pages where every
+	 * call to heap_prepare_freeze_tuple produced no usable freeze plan.
+	 *
+	 * "Nominal freezing" enables vacuumlazy.c's approach of setting a page
+	 * all-frozen in the visibility map when every tuple's 'totally_frozen'
+	 * result is true.  That always works in the same way, independent of the
+	 * need to freeze tuples, and without complicating the general rule around
+	 * 'totally_frozen' results (which is that 'totally_frozen' results are
+	 * only to be trusted with a page that goes on to be frozen by caller).
+	 *
+	 * When we freeze a page, we generally freeze all XIDs < OldestXmin, only
+	 * leaving behind XIDs that are ineligible for freezing, if any.  And so
+	 * you might wonder why these trackers are necessary at all; why should
+	 * _any_ page that VACUUM freezes _ever_ be left with XIDs/MXIDs that
+	 * ratchet back the top-level NewRelfrozenXid/NewRelminMxid trackers?
+	 *
+	 * It is useful to use a definition of "freeze the page" that does not
+	 * overspecify how MultiXacts are affected.  heap_prepare_freeze_tuple
+	 * generally prefers to remove Multis eagerly, but lazy processing is used
+	 * in cases where laziness allows VACUUM to avoid allocating a new Multi.
+	 * The "freeze the page" trackers enable this flexibility.
+	 */
+	TransactionId FreezePageRelfrozenXid;
+	MultiXactId FreezePageRelminMxid;
+
+	/*
+	 * "No freeze" NewRelfrozenXid/NewRelminMxid trackers.
+	 *
+	 * These trackers are maintained in the same way as the trackers used when
+	 * VACUUM scans a page that isn't cleanup locked.  Both code paths are
+	 * based on the same general idea (do less work for this page during the
+	 * ongoing VACUUM, at the cost of having to accept older final values).
+	 *
+	 * When vacuumlazy.c caller decides to do "no freeze" processing, it must
+	 * not go on to set the page all-frozen (setting the page all-visible
+	 * could still be okay).  heap_prepare_freeze_tuple's 'totally_frozen'
+	 * results can only be trusted on a page that is frozen afterwards.
+	 */
+	TransactionId NoFreezePageRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid;
+
+} HeapPageFreeze;
+
 /* ----------------
  *		function prototypes for heap access method
  *
@@ -180,19 +257,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 extern void heap_inplace_update(Relation relation, HeapTuple tuple);
 extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 									  const struct VacuumCutoffs *cutoffs,
-									  HeapTupleFreeze *frz, bool *totally_frozen,
-									  TransactionId *relfrozenxid_out,
-									  MultiXactId *relminmxid_out);
+									  HeapPageFreeze *pagefrz,
+									  HeapTupleFreeze *frz, bool *totally_frozen);
 extern void heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-										 TransactionId FreezeLimit,
+										 TransactionId snapshotConflictHorizon,
 										 HeapTupleFreeze *tuples, int ntuples);
 extern bool heap_freeze_tuple(HeapTupleHeader tuple,
 							  TransactionId relfrozenxid, TransactionId relminmxid,
 							  TransactionId FreezeLimit, TransactionId MultiXactCutoff);
-extern bool heap_tuple_would_freeze(HeapTupleHeader tuple,
-									const struct VacuumCutoffs *cutoffs,
-									TransactionId *relfrozenxid_out,
-									MultiXactId *relminmxid_out);
+extern bool heap_tuple_should_freeze(HeapTupleHeader tuple,
+									 const struct VacuumCutoffs *cutoffs,
+									 TransactionId *NoFreezePageRelfrozenXid,
+									 MultiXactId *NoFreezePageRelminMxid);
 extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
 
 extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a88de85..71dfe5933 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6098,9 +6098,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		MultiXactId.
  *
  * "flags" is an output value; it's used to tell caller what to do on return.
- *
- * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
- * extant Xid within any Multixact that will remain after freezing executes.
+ * "pagefrz" is an input/output value, used to manage page level freezing.
  *
  * Possible values that we can set in "flags":
  * FRM_NOOP
@@ -6115,15 +6113,32 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
  *		The return value is a new MultiXactId to set as new Xmax.
  *		(caller must obtain proper infomask bits using GetMultiXactIdHintBits)
  *
- * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
- * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ * Caller delegates control of page freezing to us.  In practice we always
+ * force freezing of caller's page unless FRM_NOOP processing is indicated.
+ * We help caller ensure that XIDs < FreezeLimit and MXIDs < MultiXactCutoff
+ * can never be left behind.  We freely choose when and how to process each
+ * Multi, without ever violating the cutoff postconditions for freezing.
  *
- * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
+ * It's useful to remove Multis on a proactive timeline (relative to freezing
+ * XIDs) to keep MultiXact member SLRU buffer misses to a minimum.  It can also
+ * be cheaper for us in the short run, since we also avoid SLRU buffer
+ * misses through eager processing.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set, though only
+ * when FreezeLimit and/or MultiXactCutoff cutoffs leave us with no choice.
+ * This can usually be put off, which is often enough to avoid it altogether.
+ *
+ * NB: Caller must maintain "no freeze" NewRelfrozenXid/NewRelminMxid trackers
+ * using heap_tuple_should_freeze when we haven't forced page-level freezing.
+ *
+ * NB: Caller should avoid needlessly calling heap_tuple_should_freeze when we
+ * have already forced page-level freezing, since that might incur the same
+ * SLRU buffer misses that we specifically intended to avoid by freezing.
  */
 static TransactionId
 FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				  const struct VacuumCutoffs *cutoffs, uint16 *flags,
-				  TransactionId *mxid_oldest_xid_out)
+				  HeapPageFreeze *pagefrz)
 {
 	TransactionId newxmax = InvalidTransactionId;
 	MultiXactMember *members;
@@ -6134,7 +6149,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	bool		has_lockers;
 	TransactionId update_xid;
 	bool		update_committed;
-	TransactionId temp_xid_out;
+	TransactionId FreezePageRelfrozenXid;
 
 	*flags = 0;
 
@@ -6144,8 +6159,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	if (!MultiXactIdIsValid(multi) ||
 		HEAP_LOCKED_UPGRADED(t_infomask))
 	{
-		/* Ensure infomask bits are appropriately set/reset */
 		*flags |= FRM_INVALIDATE_XMAX;
+		pagefrz->freeze_required = true;
 		return InvalidTransactionId;
 	}
 	else if (MultiXactIdPrecedes(multi, cutoffs->relminmxid))
@@ -6153,7 +6168,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 				(errcode(ERRCODE_DATA_CORRUPTED),
 				 errmsg_internal("found multixact %u from before relminmxid %u",
 								 multi, cutoffs->relminmxid)));
-	else if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
+	else if (MultiXactIdPrecedes(multi, cutoffs->OldestMxact))
 	{
 		/*
 		 * This old multi cannot possibly have members still running, but
@@ -6166,50 +6181,45 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg_internal("multixact %u from before cutoff %u found to be still running",
-									 multi, cutoffs->MultiXactCutoff)));
+									 multi, cutoffs->OldestMxact)));
 
 		if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
 		{
 			*flags |= FRM_INVALIDATE_XMAX;
+			pagefrz->freeze_required = true;
+			return InvalidTransactionId;
+		}
+
+		/* replace multi with single XID for its updater */
+		newxmax = MultiXactIdGetUpdateXid(multi, t_infomask);
+
+		if (TransactionIdPrecedes(newxmax, cutoffs->relfrozenxid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("multixact %u contains update xid %u from before relfrozenxid %u",
+									 multi, newxmax, cutoffs->relfrozenxid)));
+		else if (TransactionIdPrecedes(newxmax, cutoffs->OldestXmin))
+		{
+			/*
+			 * Updater XID has to have aborted (otherwise the tuple would have
+			 * been pruned away instead, since updater XID is < OldestXmin).
+			 * Just remove xmax.
+			 */
+			if (TransactionIdDidCommit(newxmax))
+				ereport(ERROR,
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg_internal("multixact %u contains uncommitted update xid %u",
+										 multi, newxmax)));
+			*flags |= FRM_INVALIDATE_XMAX;
 			newxmax = InvalidTransactionId;
 		}
 		else
 		{
-			/* replace multi with single XID for its updater */
-			newxmax = MultiXactIdGetUpdateXid(multi, t_infomask);
-
-			/* wasn't only a lock, xid needs to be valid */
-			Assert(TransactionIdIsValid(newxmax));
-
-			if (TransactionIdPrecedes(newxmax, cutoffs->relfrozenxid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("found update xid %u from before relfrozenxid %u",
-										 newxmax, cutoffs->relfrozenxid)));
-
-			/*
-			 * If the new xmax xid is older than OldestXmin, it has to have
-			 * aborted, otherwise the tuple would have been pruned away
-			 */
-			if (TransactionIdPrecedes(newxmax, cutoffs->OldestXmin))
-			{
-				if (TransactionIdDidCommit(newxmax))
-					ereport(ERROR,
-							(errcode(ERRCODE_DATA_CORRUPTED),
-							 errmsg_internal("cannot freeze committed update xid %u", newxmax)));
-				*flags |= FRM_INVALIDATE_XMAX;
-				newxmax = InvalidTransactionId;
-			}
-			else
-			{
-				*flags |= FRM_RETURN_IS_XID;
-			}
+			/* Have to keep updater XID as new xmax */
+			*flags |= FRM_RETURN_IS_XID;
 		}
 
-		/*
-		 * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
-		 * when no Xids will remain
-		 */
+		pagefrz->freeze_required = true;
 		return newxmax;
 	}
 
@@ -6225,11 +6235,30 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	{
 		/* Nothing worth keeping */
 		*flags |= FRM_INVALIDATE_XMAX;
+		pagefrz->freeze_required = true;
 		return InvalidTransactionId;
 	}
 
+	/*
+	 * The FRM_NOOP case is the only case where we might need to ratchet back
+	 * FreezePageRelfrozenXid or FreezePageRelminMxid.  It is also the only
+	 * case where our caller might ratchet back its NoFreezePageRelfrozenXid
+	 * or NoFreezePageRelminMxid "no freeze" trackers to deal with a multi.
+	 * FRM_NOOP handling should result in the NewRelfrozenXid/NewRelminMxid
+	 * trackers managed by VACUUM being ratcheted back by xmax to the degree
+	 * required to make it safe to leave xmax undisturbed, independent of
+	 * whether or not page freezing is triggered somewhere else.
+	 *
+	 * Our policy is to force freezing in every case other than FRM_NOOP,
+	 * which obviates the need to maintain either set of trackers, anywhere.
+	 * Every other case will reliably execute a freeze plan for xmax that
+	 * either replaces xmax with an XID/MXID >= OldestXmin/OldestMxact, or
+	 * sets xmax to an InvalidTransactionId XID, rendering xmax fully frozen.
+	 * (VACUUM's NewRelfrozenXid/NewRelminMxid trackers are initialized with
+	 * OldestXmin/OldestMxact, so later values never need to be tracked here.)
+	 */
 	need_replace = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_NOOP */
+	FreezePageRelfrozenXid = pagefrz->FreezePageRelfrozenXid;
 	for (int i = 0; i < nmembers; i++)
 	{
 		TransactionId xid = members[i].xid;
@@ -6238,26 +6267,29 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 		{
+			/* Can't violate the FreezeLimit postcondition */
 			need_replace = true;
 			break;
 		}
-		if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
-			temp_xid_out = members[i].xid;
+		if (TransactionIdPrecedes(xid, FreezePageRelfrozenXid))
+			FreezePageRelfrozenXid = xid;
 	}
 
-	/*
-	 * In the simplest case, there is no member older than FreezeLimit; we can
-	 * keep the existing MultiXactId as-is, avoiding a more expensive second
-	 * pass over the multi
-	 */
+	/* Can't violate the MultiXactCutoff postcondition, either */
+	if (!need_replace)
+		need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
+
 	if (!need_replace)
 	{
 		/*
-		 * When mxid_oldest_xid_out gets pushed back here it's likely that the
-		 * update Xid was the oldest member, but we don't rely on that
+		 * vacuumlazy.c might ratchet back NewRelminMxid, NewRelfrozenXid, or
+		 * both together to make it safe to retain this particular multi after
+		 * freezing its page
 		 */
 		*flags |= FRM_NOOP;
-		*mxid_oldest_xid_out = temp_xid_out;
+		pagefrz->FreezePageRelfrozenXid = FreezePageRelfrozenXid;
+		if (MultiXactIdPrecedes(multi, pagefrz->FreezePageRelminMxid))
+			pagefrz->FreezePageRelminMxid = multi;
 		pfree(members);
 		return multi;
 	}
@@ -6266,13 +6298,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 * Do a more thorough second pass over the multi to figure out which
 	 * member XIDs actually need to be kept.  Checking the precise status of
 	 * individual members might even show that we don't need to keep anything.
+	 *
+	 * We only reach this far when replacing xmax is absolutely mandatory.
+	 * heap_tuple_should_freeze will indicate that the tuple should be frozen.
+	 * We definitely won't leave behind an XID/MXID < OldestXmin/OldestMxact.
 	 */
 	nnewmembers = 0;
 	newmembers = palloc(sizeof(MultiXactMember) * nmembers);
 	has_lockers = false;
 	update_xid = InvalidTransactionId;
 	update_committed = false;
-	temp_xid_out = *mxid_oldest_xid_out;	/* init for FRM_RETURN_IS_MULTI */
 
 	/*
 	 * Determine whether to keep each member xid, or to ignore it instead
@@ -6293,14 +6328,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			if (TransactionIdIsCurrentTransactionId(xid) ||
 				TransactionIdIsInProgress(xid))
 			{
+				if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg_internal("multixact %u contains locker xid %u from before removable cutoff %u",
+											 multi, xid, cutoffs->OldestXmin)));
 				newmembers[nnewmembers++] = members[i];
 				has_lockers = true;
-
-				/*
-				 * Cannot possibly be older than VACUUM's OldestXmin, so we
-				 * don't need a NewRelfrozenXid step here
-				 */
-				Assert(TransactionIdPrecedesOrEquals(cutoffs->OldestXmin, xid));
 			}
 
 			continue;
@@ -6317,8 +6351,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg_internal("found update xid %u from before removable cutoff %u",
-									 xid, cutoffs->OldestXmin)));
+					 errmsg_internal("multixact %u contains update xid %u from before removable cutoff %u",
+									 multi, xid, cutoffs->OldestXmin)));
 		if (TransactionIdIsValid(update_xid))
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
@@ -6328,8 +6362,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 										update_xid, xid)));
 
 		/*
-		 * If the transaction is known aborted or crashed then it's okay to
-		 * ignore it, otherwise not.
+		 * If the updater transaction is known aborted or crashed then it's
+		 * okay to ignore it, otherwise not.
 		 *
 		 * As with all tuple visibility routines, it's critical to test
 		 * TransactionIdIsInProgress before TransactionIdDidCommit, because of
@@ -6358,13 +6392,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		}
 
 		/*
-		 * We determined that this is an Xid corresponding to an update that
-		 * must be retained -- add it to new members list for later.  Also
-		 * consider pushing back mxid_oldest_xid_out.
+		 * We determined that the updater has an Xid >= OldestXmin, which must
+		 * be retained -- add it to the pending new members list
 		 */
 		newmembers[nnewmembers++] = members[i];
-		if (TransactionIdPrecedes(xid, temp_xid_out))
-			temp_xid_out = xid;
 	}
 
 	pfree(members);
@@ -6375,10 +6406,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	 */
 	if (nnewmembers == 0)
 	{
-		/* nothing worth keeping!? Tell caller to remove the whole thing */
+		/* Keeping nothing (neither an Xid nor a MultiXactId) in xmax */
 		*flags |= FRM_INVALIDATE_XMAX;
 		newxmax = InvalidTransactionId;
-		/* Don't push back mxid_oldest_xid_out -- no Xids will remain */
 	}
 	else if (TransactionIdIsValid(update_xid) && !has_lockers)
 	{
@@ -6394,22 +6424,20 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		if (update_committed)
 			*flags |= FRM_MARK_COMMITTED;
 		newxmax = update_xid;
-		/* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
 	}
 	else
 	{
 		/*
 		 * Create a new multixact with the surviving members of the previous
-		 * one, to set as new Xmax in the tuple.  The oldest surviving member
-		 * might push back mxid_oldest_xid_out.
+		 * one (all of which are >= OldestXmin) to set as new Xmax
 		 */
 		newxmax = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
 		*flags |= FRM_RETURN_IS_MULTI;
-		*mxid_oldest_xid_out = temp_xid_out;
 	}
 
 	pfree(newmembers);
 
+	pagefrz->freeze_required = true;
 	return newxmax;
 }
 
@@ -6417,9 +6445,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * heap_prepare_freeze_tuple
  *
  * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the FreezeLimit and/or MultiXactCutoff freeze cutoffs.  If so,
- * setup enough state (in the *frz output argument) to later execute and
- * WAL-log what caller needs to do for the tuple, and return true.  Return
+ * are older than the OldestXmin and/or OldestMxact freeze cutoffs.  If so,
+ * setup enough state (in the *frz output argument) to enable caller to
+ * process this tuple as part of freezing its page, and return true.  Return
  * false if nothing can be changed about the tuple right now.
  *
  * Also sets *totally_frozen to true if the tuple will be totally frozen once
@@ -6427,22 +6455,30 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
  * frozen by an earlier VACUUM).  This indicates that there are no remaining
  * XIDs or MultiXactIds that will need to be processed by a future VACUUM.
  *
- * VACUUM caller must assemble HeapTupleFreeze entries for every tuple that we
- * returned true for when called.  A later heap_freeze_execute_prepared call
- * will execute freezing for caller's page as a whole.
+ * VACUUM caller must assemble HeapTupleFreeze freeze plan entries for every
+ * tuple that we returned true for, and call heap_freeze_execute_prepared to
+ * execute freezing.  Caller must initialize pagefrz fields for page as a
+ * whole before first call here for each heap page.
+ *
+ * VACUUM caller decides on whether or not to freeze the page as a whole.
+ * We'll often prepare freeze plans for a page that caller just discards.
+ * However, VACUUM doesn't always get to make a choice; it must freeze when
+ * pagefrz.freeze_required is set, to ensure that any XIDs < FreezeLimit (and
+ * MXIDs < MultiXactCutoff) can never be left behind.  We help to make sure
+ * that VACUUM always follows that rule.
+ *
+ * We sometimes force freezing of xmax MultiXactId values long before it is
+ * strictly necessary to do so just to ensure the FreezeLimit postcondition.
+ * It's worth processing MultiXactIds proactively when it is cheap to do so,
+ * and it's convenient to make that happen by piggy-backing it on the "force
+ * freezing" mechanism.  Conversely, we sometimes delay freezing MultiXactIds
+ * because it is expensive right now (though only when it's still possible to
+ * do so without violating the FreezeLimit/MultiXactCutoff postcondition).
  *
  * It is assumed that the caller has checked the tuple with
  * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
  * (else we should be removing the tuple, not freezing it).
  *
- * The *relfrozenxid_out and *relminmxid_out arguments are the current target
- * relfrozenxid and relminmxid for VACUUM caller's heap rel.  Any and all
- * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
- * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
- * This includes XIDs that remain as MultiXact members from any tuple's xmax.
- * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
- * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
- *
  * NB: This function has side effects: it might allocate a new MultiXactId.
  * It will be set as tuple's new xmax when our *frz output is processed within
  * heap_execute_freeze_tuple later on.  If the tuple is in a shared buffer
@@ -6451,9 +6487,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 bool
 heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 						  const struct VacuumCutoffs *cutoffs,
-						  HeapTupleFreeze *frz, bool *totally_frozen,
-						  TransactionId *relfrozenxid_out,
-						  MultiXactId *relminmxid_out)
+						  HeapPageFreeze *pagefrz,
+						  HeapTupleFreeze *frz, bool *totally_frozen)
 {
 	bool		xmin_already_frozen = false,
 				xmax_already_frozen = false;
@@ -6470,7 +6505,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Process xmin, while keeping track of whether it's already frozen, or
-	 * will become frozen when our freeze plan is executed by caller (could be
+	 * will become frozen iff our freeze plan is executed by caller (could be
 	 * neither).
 	 */
 	xid = HeapTupleHeaderGetXmin(tuple);
@@ -6484,21 +6519,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmin %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->FreezeLimit);
-		if (freeze_xmin)
-		{
-			if (!TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-										 xid, cutoffs->FreezeLimit)));
-		}
-		else
-		{
-			/* xmin to remain unfrozen.  Could push back relfrozenxid_out. */
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+		freeze_xmin = TransactionIdPrecedes(xid, cutoffs->OldestXmin);
+		if (freeze_xmin && !TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
+									 xid, cutoffs->OldestXmin)));
+
+		/* Will set freeze_xmin flags in freeze plan below */
 	}
 
 	/*
@@ -6515,41 +6543,59 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * For Xvac, we always freeze proactively.  This allows totally_frozen
 		 * tracking to ignore xvac.
 		 */
-		replace_xvac = true;
+		replace_xvac = pagefrz->freeze_required = true;
+
+		/* Will set replace_xvac flags in freeze plan below */
 	}
 
-	/*
-	 * Process xmax.  To thoroughly examine the current Xmax value we need to
-	 * resolve a MultiXactId to its member Xids, in case some of them are
-	 * below the given FreezeLimit.  In that case, those values might need
-	 * freezing, too.  Also, if a multi needs freezing, we cannot simply take
-	 * it out --- if there's a live updater Xid, it needs to be kept.
-	 *
-	 * Make sure to keep heap_tuple_would_freeze in sync with this.
-	 */
+	/* Now process xmax */
 	xid = HeapTupleHeaderGetRawXmax(tuple);
-
 	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
 		/* Raw xmax is a MultiXactId */
 		TransactionId newxmax;
 		uint16		flags;
-		TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
 
+		/*
+		 * We will either remove xmax completely (in the "freeze_xmax" path),
+		 * process xmax by replacing it (in the "replace_xmax" path), or
+		 * perform no-op xmax processing.  The only constraint is that the
+		 * FreezeLimit/MultiXactCutoff postcondition must never be violated.
+		 */
 		newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoffs,
-									&flags, &mxid_oldest_xid_out);
+									&flags, pagefrz);
 
-		if (flags & FRM_RETURN_IS_XID)
+		if (flags & FRM_NOOP)
+		{
+			/*
+			 * xmax is a MultiXactId, and nothing about it changes for now.
+			 * This is the only case where 'freeze_required' won't have been
+			 * set for us by FreezeMultiXactId, as well as the only case where
+			 * neither freeze_xmax nor replace_xmax are set (given a multi).
+			 *
+			 * This is a no-op, but the call to FreezeMultiXactId might have
+			 * ratcheted back NewRelfrozenXid and/or NewRelminMxid trackers
+			 * for us (the "freeze page" variants, specifically).  That'll
+			 * make it safe for our caller to freeze the page later on, while
+			 * leaving this particular xmax undisturbed.
+			 *
+			 * FreezeMultiXactId is _not_ responsible for the "no freeze"
+			 * NewRelfrozenXid/NewRelminMxid trackers, though -- that's our
+			 * job.  A call to heap_tuple_should_freeze for this same tuple
+			 * will take place below if 'freeze_required' isn't set already.
+			 * (This repeats work from FreezeMultiXactId, but allows "no
+			 * freeze" tracker maintenance to happen in only one place.)
+			 */
+			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+			Assert(!MultiXactIdPrecedes(newxmax, pagefrz->FreezePageRelminMxid));
+		}
+		else if (flags & FRM_RETURN_IS_XID)
 		{
 			/*
 			 * xmax will become an updater Xid (original MultiXact's updater
 			 * member Xid will be carried forward as a simple Xid in Xmax).
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!TransactionIdPrecedes(newxmax, cutoffs->OldestXmin));
-			if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
-				*relfrozenxid_out = newxmax;
 
 			/*
 			 * NB -- some of these transformations are only valid because we
@@ -6572,13 +6618,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			/*
 			 * xmax is an old MultiXactId that we have to replace with a new
 			 * MultiXactId, to carry forward two or more original member XIDs.
-			 * Might have to ratchet back relfrozenxid_out here, though never
-			 * relminmxid_out.
 			 */
 			Assert(!MultiXactIdPrecedes(newxmax, cutoffs->OldestMxact));
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			*relfrozenxid_out = mxid_oldest_xid_out;
 
 			/*
 			 * We can't use GetMultiXactIdHintBits directly on the new multi
@@ -6594,20 +6635,6 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			frz->xmax = newxmax;
 			replace_xmax = true;
 		}
-		else if (flags & FRM_NOOP)
-		{
-			/*
-			 * xmax is a MultiXactId, and nothing about it changes for now.
-			 * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
-			 * both together.
-			 */
-			Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
-			Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
-												 *relfrozenxid_out));
-			if (MultiXactIdPrecedes(xid, *relminmxid_out))
-				*relminmxid_out = xid;
-			*relfrozenxid_out = mxid_oldest_xid_out;
-		}
 		else
 		{
 			/*
@@ -6618,9 +6645,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 			Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
 			Assert(!TransactionIdIsValid(newxmax));
 
-			/* Will set t_infomask/t_infomask2 flags in freeze plan below */
+			/* Will set freeze_xmax flags in freeze plan below */
 			freeze_xmax = true;
 		}
+
+		/* Only FRM_NOOP doesn't force caller to freeze page */
+		Assert(pagefrz->freeze_required || (!freeze_xmax && !replace_xmax));
 	}
 	else if (TransactionIdIsNormal(xid))
 	{
@@ -6631,28 +6661,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 					 errmsg_internal("found xmax %u from before relfrozenxid %u",
 									 xid, cutoffs->relfrozenxid)));
 
-		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
-		{
-			/*
-			 * If we freeze xmax, make absolutely sure that it's not an XID
-			 * that is important.  (Note, a lock-only xmax can be removed
-			 * independent of committedness, since a committed lock holder has
-			 * released the lock).
-			 */
-			if (!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
-				TransactionIdDidCommit(xid))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("cannot freeze committed xmax %u",
-										 xid)));
+		if (TransactionIdPrecedes(xid, cutoffs->OldestXmin))
 			freeze_xmax = true;
-			/* No need for relfrozenxid_out handling, since we'll freeze xmax */
-		}
-		else
-		{
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-		}
+
+		/*
+		 * If we freeze xmax, make absolutely sure that it's not an XID that
+		 * is important.  (Note, a lock-only xmax can be removed independent
+		 * of committedness, since a committed lock holder has released the
+		 * lock).
+		 */
+		if (freeze_xmax && !HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+			TransactionIdDidCommit(xid))
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg_internal("cannot freeze committed xmax %u",
+									 xid)));
 	}
 	else if (!TransactionIdIsValid(xid))
 	{
@@ -6679,6 +6702,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 		 * failed; whereas a non-dead MOVED_IN tuple must mean the xvac
 		 * transaction succeeded.
 		 */
+		Assert(pagefrz->freeze_required);
 		if (tuple->t_infomask & HEAP_MOVED_OFF)
 			frz->frzflags |= XLH_INVALID_XVAC;
 		else
@@ -6687,8 +6711,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 	if (replace_xmax)
 	{
 		Assert(!xmax_already_frozen && !freeze_xmax);
+		Assert(pagefrz->freeze_required);
 
-		/* Already set t_infomask/t_infomask2 flags in freeze plan */
+		/* Already set replace_xmax flags in freeze plan earlier */
 	}
 	if (freeze_xmax)
 	{
@@ -6709,13 +6734,23 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/*
 	 * Determine if this tuple is already totally frozen, or will become
-	 * totally frozen
+	 * totally frozen (provided caller executes freeze plan for the page)
 	 */
 	*totally_frozen = ((freeze_xmin || xmin_already_frozen) &&
 					   (freeze_xmax || xmax_already_frozen));
 
-	/* A "totally_frozen" tuple must not leave anything behind in xmax */
-	Assert(!*totally_frozen || !replace_xmax);
+	if (!pagefrz->freeze_required && !(xmin_already_frozen &&
+									   xmax_already_frozen))
+	{
+		/*
+		 * So far no previous tuple from the page made freezing mandatory.
+		 * Does this tuple force caller to freeze the entire page?
+		 */
+		pagefrz->freeze_required =
+			heap_tuple_should_freeze(tuple, cutoffs,
+									 &pagefrz->NoFreezePageRelfrozenXid,
+									 &pagefrz->NoFreezePageRelminMxid);
+	}
 
 	/* Tell caller if this tuple has a usable freeze plan set in *frz */
 	return freeze_xmin || replace_xvac || replace_xmax || freeze_xmax;
@@ -6761,13 +6796,12 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapTupleFreeze *frz)
  */
 void
 heap_freeze_execute_prepared(Relation rel, Buffer buffer,
-							 TransactionId FreezeLimit,
+							 TransactionId snapshotConflictHorizon,
 							 HeapTupleFreeze *tuples, int ntuples)
 {
 	Page		page = BufferGetPage(buffer);
 
 	Assert(ntuples > 0);
-	Assert(TransactionIdIsNormal(FreezeLimit));
 
 	START_CRIT_SECTION();
 
@@ -6790,19 +6824,10 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
 		int			nplans;
 		xl_heap_freeze_page xlrec;
 		XLogRecPtr	recptr;
-		TransactionId snapshotConflictHorizon;
 
 		/* Prepare deduplicated representation for use in WAL record */
 		nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
 
-		/*
-		 * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
-		 * Back up caller's FreezeLimit to avoid false conflicts when
-		 * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
-		 */
-		snapshotConflictHorizon = FreezeLimit;
-		TransactionIdRetreat(snapshotConflictHorizon);
-
 		xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
 		xlrec.nplans = nplans;
 
@@ -6843,8 +6868,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	bool		do_freeze;
 	bool		totally_frozen;
 	struct VacuumCutoffs cutoffs;
-	TransactionId NewRelfrozenXid = FreezeLimit;
-	MultiXactId NewRelminMxid = MultiXactCutoff;
+	HeapPageFreeze pagefrz;
 
 	cutoffs.relfrozenxid = relfrozenxid;
 	cutoffs.relminmxid = relminmxid;
@@ -6853,9 +6877,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 
+	pagefrz.freeze_required = true;
+	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.FreezePageRelminMxid = MultiXactCutoff;
+	pagefrz.NoFreezePageRelfrozenXid = FreezeLimit;
+	pagefrz.NoFreezePageRelminMxid = MultiXactCutoff;
+
 	do_freeze = heap_prepare_freeze_tuple(tuple, &cutoffs,
-										  &frz, &totally_frozen,
-										  &NewRelfrozenXid, &NewRelminMxid);
+										  &pagefrz, &frz, &totally_frozen);
 
 	/*
 	 * Note that because this is not a WAL-logged operation, we don't need to
@@ -7278,22 +7307,24 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
 }
 
 /*
- * heap_tuple_would_freeze
+ * heap_tuple_should_freeze
  *
  * Return value indicates if heap_prepare_freeze_tuple sibling function would
- * freeze any of the XID/MXID fields from the tuple, given the same cutoffs.
- * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
- * could be processed by pruning away the whole tuple instead of freezing.
+ * (or should) force freezing of the heap page that contains caller's tuple.
+ * Tuple header XIDs/MXIDs < FreezeLimit/MultiXactCutoff trigger freezing.
+ * This includes (xmin, xmax, xvac) fields, as well as MultiXact member XIDs.
  *
- * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
- * like the heap_prepare_freeze_tuple arguments that they're based on.  We
- * never freeze here, which makes tracking the oldest extant XID/MXID simple.
+ * The *NoFreezePageRelfrozenXid and *NoFreezePageRelminMxid input/output
+ * arguments help VACUUM track the oldest extant XID/MXID remaining in rel.
+ * Our working assumption is that caller won't decide to freeze this tuple.
+ * It's up to caller to only ratchet back its own top-level trackers after the
+ * point that it fully commits to not freezing the tuple/page in question.
  */
 bool
-heap_tuple_would_freeze(HeapTupleHeader tuple,
-						const struct VacuumCutoffs *cutoffs,
-						TransactionId *relfrozenxid_out,
-						MultiXactId *relminmxid_out)
+heap_tuple_should_freeze(HeapTupleHeader tuple,
+						 const struct VacuumCutoffs *cutoffs,
+						 TransactionId *NoFreezePageRelfrozenXid,
+						 MultiXactId *NoFreezePageRelminMxid)
 {
 	TransactionId xid;
 	MultiXactId multi;
@@ -7304,8 +7335,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	if (TransactionIdIsNormal(xid))
 	{
 		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 			freeze = true;
 	}
@@ -7322,8 +7353,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	{
 		Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
 		/* xmax is a non-permanent XID */
-		if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-			*relfrozenxid_out = xid;
+		if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+			*NoFreezePageRelfrozenXid = xid;
 		if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 			freeze = true;
 	}
@@ -7334,8 +7365,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 	else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
 	{
 		/* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
 		/* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
 		freeze = true;
 	}
@@ -7346,8 +7377,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		int			nmembers;
 
 		Assert(MultiXactIdPrecedesOrEquals(cutoffs->relminmxid, multi));
-		if (MultiXactIdPrecedes(multi, *relminmxid_out))
-			*relminmxid_out = multi;
+		if (MultiXactIdPrecedes(multi, *NoFreezePageRelminMxid))
+			*NoFreezePageRelminMxid = multi;
 		if (MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff))
 			freeze = true;
 
@@ -7359,8 +7390,8 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		{
 			xid = members[i].xid;
 			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
 			if (TransactionIdPrecedes(xid, cutoffs->FreezeLimit))
 				freeze = true;
 		}
@@ -7374,9 +7405,9 @@ heap_tuple_would_freeze(HeapTupleHeader tuple,
 		if (TransactionIdIsNormal(xid))
 		{
 			Assert(TransactionIdPrecedesOrEquals(cutoffs->relfrozenxid, xid));
-			if (TransactionIdPrecedes(xid, *relfrozenxid_out))
-				*relfrozenxid_out = xid;
-			/* heap_prepare_freeze_tuple always freezes xvac */
+			if (TransactionIdPrecedes(xid, *NoFreezePageRelfrozenXid))
+				*NoFreezePageRelfrozenXid = xid;
+			/* heap_prepare_freeze_tuple forces xvac freezing */
 			freeze = true;
 		}
 	}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 98ccb9882..18192fed5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1525,8 +1525,8 @@ lazy_scan_prune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples;
 	int			nnewlpdead;
-	TransactionId NewRelfrozenXid;
-	MultiXactId NewRelminMxid;
+	HeapPageFreeze pagefrz;
+	int64		fpi_before = pgWalUsage.wal_fpi;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
 
@@ -1542,8 +1542,11 @@ lazy_scan_prune(LVRelState *vacrel,
 retry:
 
 	/* Initialize (or reset) page-level state */
-	NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	NewRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.freeze_required = false;
+	pagefrz.FreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.FreezePageRelminMxid = vacrel->NewRelminMxid;
+	pagefrz.NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
 	tuples_deleted = 0;
 	tuples_frozen = 0;
 	lpdead_items = 0;
@@ -1596,27 +1599,23 @@ retry:
 			continue;
 		}
 
-		/*
-		 * LP_DEAD items are processed outside of the loop.
-		 *
-		 * Note that we deliberately don't set hastup=true in the case of an
-		 * LP_DEAD item here, which is not how count_nondeletable_pages() does
-		 * it -- it only considers pages empty/truncatable when they have no
-		 * items at all (except LP_UNUSED items).
-		 *
-		 * Our assumption is that any LP_DEAD items we encounter here will
-		 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-		 * call count_nondeletable_pages().  In any case our opinion of
-		 * whether or not a page 'hastup' (which is how our caller sets its
-		 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-		 * treated as advisory/unreliable, so we might as well be slightly
-		 * optimistic.
-		 */
 		if (ItemIdIsDead(itemid))
 		{
+			/*
+			 * Delay unsetting all_visible until after we have decided on
+			 * whether this page should be frozen.  We need to test "is this
+			 * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+			 * in final heap pass?" to reach a decision.  all_visible will be
+			 * unset before we return, as required by lazy_scan_heap caller.
+			 *
+			 * Deliberately don't set hastup for LP_DEAD items.  We make the
+			 * soft assumption that any LP_DEAD items encountered here will
+			 * become LP_UNUSED later on, before count_nondeletable_pages is
+			 * reached.  Our notion of whether the page 'hastup' is inherently
+			 * race-prone.  Caller must treat it as unreliable anyway, so we might
+			 * as well be slightly optimistic about it.
+			 */
 			deadoffsets[lpdead_items++] = offnum;
-			prunestate->all_visible = false;
-			prunestate->has_lpdead_items = true;
 			continue;
 		}
 
@@ -1743,9 +1742,8 @@ retry:
 		prunestate->hastup = true;	/* page makes rel truncation unsafe */
 
 		/* Tuple with storage -- consider need to freeze */
-		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs,
-									  &frozen[tuples_frozen], &totally_frozen,
-									  &NewRelfrozenXid, &NewRelminMxid))
+		if (heap_prepare_freeze_tuple(tuple.t_data, &vacrel->cutoffs, &pagefrz,
+									  &frozen[tuples_frozen], &totally_frozen))
 		{
 			/* Save prepared freeze plan for later */
 			frozen[tuples_frozen++].offset = offnum;
@@ -1759,40 +1757,98 @@ retry:
 			prunestate->all_frozen = false;
 	}
 
-	vacrel->offnum = InvalidOffsetNumber;
-
 	/*
 	 * We have now divided every item on the page into either an LP_DEAD item
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * LP_REDIRECT items also remain, but are of no further interest to us).
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * Consider the need to freeze any items with tuple storage from the page
-	 * first (arbitrary)
+	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
+	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
+	 * freeze when pruning generated an FPI, if doing so means that we set the
+	 * page all-frozen afterwards (might not happen until second heap pass).
 	 */
-	if (tuples_frozen > 0)
+	if (pagefrz.freeze_required || tuples_frozen == 0 ||
+		(prunestate->all_visible && prunestate->all_frozen &&
+		 fpi_before != pgWalUsage.wal_fpi))
 	{
-		Assert(prunestate->hastup);
+		/*
+		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
+		 * be affected by the XIDs that are just about to be frozen anyway.
+		 */
+		vacrel->NewRelfrozenXid = pagefrz.FreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.FreezePageRelminMxid;
 
-		vacrel->frozen_pages++;
+		if (tuples_frozen == 0)
+		{
+			/*
+			 * We're freezing all eligible tuples on the page, but have no
+			 * freeze plans to execute.  This is structured as a case where
+			 * the page is nominally frozen so that we reliably ratchet back
+			 * the NewRelfrozenXid/NewRelminMxid trackers as instructed by
+			 * heap_prepare_freeze_tuple.  Note that we may still set the page
+			 * all-frozen in the visibility map (unlike the "no freeze" case).
+			 *
+			 * We end up here when pruning removed a deleted tuple which just
+			 * so happened to leave only totally frozen tuples on the page.
+			 * It's also possible that there are remaining unfrozen XIDs/MXIDs
+			 * that are ineligible for freezing, which precludes setting the
+			 * page all-frozen, but doesn't necessarily preclude setting the
+			 * page all-visible (sometimes a single lock-only MultiXactId will
+			 * have made it unsafe to set an all-visible page all-frozen).
+			 *
+			 * We deliberately don't touch the frozen_pages instrumentation
+			 * counter here, since it counts pages with newly frozen tuples
+			 * (don't confuse that with pages newly set all-frozen in VM).
+			 */
+		}
+		else
+		{
+			TransactionId snapshotConflictHorizon;
 
-		/* Execute all freeze plans for page as a single atomic action */
-		heap_freeze_execute_prepared(vacrel->rel, buf,
-									 vacrel->cutoffs.FreezeLimit,
-									 frozen, tuples_frozen);
+			Assert(prunestate->hastup);
+
+			vacrel->frozen_pages++;
+
+			/*
+			 * We can use the latest xmin cutoff (which is generally used for
+			 * 'VM set' conflicts) as our cutoff for freeze conflicts when the
+			 * whole page is eligible to become all-frozen in the VM once
+			 * frozen by us.  Otherwise use a conservative cutoff (just back
+			 * up from OldestXmin).
+			 */
+			if (prunestate->all_visible && prunestate->all_frozen)
+				snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+			else
+			{
+				snapshotConflictHorizon = vacrel->cutoffs.OldestXmin;
+				TransactionIdRetreat(snapshotConflictHorizon);
+			}
+
+			/* Execute all freeze plans for page as a single atomic action */
+			heap_freeze_execute_prepared(vacrel->rel, buf,
+										 snapshotConflictHorizon,
+										 frozen, tuples_frozen);
+		}
+	}
+	else
+	{
+		/*
+		 * Page requires "no freeze" processing.  It might be possible to set
+		 * the page all-visible, but it'll never become all-frozen in the VM.
+		 *
+		 * NewRelfrozenXid will be <= XIDs from remaining unpruned tuples.
+		 */
+		vacrel->NewRelfrozenXid = pagefrz.NoFreezePageRelfrozenXid;
+		vacrel->NewRelminMxid = pagefrz.NoFreezePageRelminMxid;
+		tuples_frozen = 0;
+		prunestate->all_frozen = false;
 	}
 
 	/*
-	 * The second pass over the heap can also set visibility map bits, using
-	 * the same approach.  This is important when the table frequently has a
-	 * few old LP_DEAD items on each page by the time we get to it (typically
-	 * because past opportunistic pruning operations freed some non-HOT
-	 * tuples).
-	 *
 	 * VACUUM will call heap_page_is_all_visible() during the second pass over
 	 * the heap to determine all_visible and all_frozen for the page -- this
 	 * is a specialized version of the logic from this function.  Now that
@@ -1801,7 +1857,7 @@ retry:
 	 */
 #ifdef USE_ASSERT_CHECKING
 	/* Note that all_frozen value does not matter when !all_visible */
-	if (prunestate->all_visible)
+	if (prunestate->all_visible && lpdead_items == 0)
 	{
 		TransactionId cutoff;
 		bool		all_frozen;
@@ -1809,9 +1865,6 @@ retry:
 		if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
 			Assert(false);
 
-		Assert(lpdead_items == 0);
-		Assert(prunestate->all_frozen == all_frozen);
-
 		/*
 		 * It's possible that we froze tuples and made the page's XID cutoff
 		 * (for recovery conflict purposes) FrozenTransactionId.  This is okay
@@ -1831,9 +1884,6 @@ retry:
 		VacDeadItems *dead_items = vacrel->dead_items;
 		ItemPointerData tmp;
 
-		Assert(!prunestate->all_visible);
-		Assert(prunestate->has_lpdead_items);
-
 		vacrel->lpdead_item_pages++;
 
 		ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1847,6 +1897,10 @@ retry:
 		Assert(dead_items->num_items <= dead_items->max_items);
 		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
 									 dead_items->num_items);
+
+		/* Our caller expects LP_DEAD item to unset all_visible */
+		prunestate->all_visible = false;
+		prunestate->has_lpdead_items = true;
 	}
 
 	/* Finally, add page-local counts to whole-VACUUM counts */
@@ -1891,8 +1945,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				recently_dead_tuples,
 				missed_dead_tuples;
 	HeapTupleHeader tupleheader;
-	TransactionId NewRelfrozenXid = vacrel->NewRelfrozenXid;
-	MultiXactId NewRelminMxid = vacrel->NewRelminMxid;
+	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
+	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 
 	Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1937,8 +1991,9 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 		*hastup = true;			/* page prevents rel truncation */
 		tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-		if (heap_tuple_would_freeze(tupleheader, &vacrel->cutoffs,
-									&NewRelfrozenXid, &NewRelminMxid))
+		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
+									 &NoFreezePageRelfrozenXid,
+									 &NoFreezePageRelminMxid))
 		{
 			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
 			if (vacrel->aggressive)
@@ -2019,8 +2074,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 	 * this particular page until the next VACUUM.  Remember its details now.
 	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
 	 */
-	vacrel->NewRelfrozenXid = NewRelfrozenXid;
-	vacrel->NewRelminMxid = NewRelminMxid;
+	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
+	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
 	/* Save any LP_DEAD items found on the page in dead_items array */
 	if (vacrel->nindexes == 0)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9eedab652..44e15b5fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9194,9 +9194,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Specifies the cutoff age (in transactions) that <command>VACUUM</command>
-        should use to decide whether to freeze row versions
-        while scanning a table.
+        Specifies the cutoff age (in transactions) that
+        <command>VACUUM</command> should use to decide whether to
+        trigger freezing of pages that have an older XID.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
@@ -9274,9 +9274,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the cutoff age (in multixacts) that <command>VACUUM</command>
-        should use to decide whether to replace multixact IDs with a newer
-        transaction ID or multixact ID while scanning a table.  The default
-        is 5 million multixacts.
+        should use to decide whether to trigger freezing of pages with
+        an older multixact ID.  The default is 5 million multixacts.
         Although users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
         the value of <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>,
-- 
2.38.1

#61 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Peter Geoghegan (#60)
RE: New strategies for freezing, advancing relfrozenxid early

Dear Peter, Jeff,

While reviewing other patches, I found that cfbot raised an ERROR during VACUUM FREEZE [1] on a FreeBSD instance.
It seems that the same error has occurred in other threads as well.

```
2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG: server process (PID 37171) was terminated by signal 6: Abort trap
2022-12-23 08:50:20.175 UTC [34653][postmaster] DETAIL: Failed process was running: VACUUM FREEZE tab_freeze;
2022-12-23 08:50:20.175 UTC [34653][postmaster] LOG: terminating any other active server processes
```

My guess is that this assertion failure was caused by commit 4ce3af [2],
since the Assert() seems to have been added by that commit.

```
[08:51:31.189] #3 0x00000000009b88d7 in ExceptionalCondition (conditionName=<optimized out>, fileName=0x2fd9df "../src/backend/access/heap/heapam.c", lineNumber=lineNumber@entry=6618) at ../src/backend/utils/error/assert.c:66
[08:51:31.189] No locals.
[08:51:31.189] #4 0x0000000000564205 in heap_prepare_freeze_tuple (tuple=0x8070f0bb0, cutoffs=cutoffs@entry=0x80222e768, frz=0x7fffffffb2d0, totally_frozen=totally_frozen@entry=0x7fffffffc478, relfrozenxid_out=<optimized out>, relfrozenxid_out@entry=0x7fffffffc4a8, relminmxid_out=<optimized out>, relminmxid_out@entry=0x7fffffffc474) at ../src/backend/access/heap/heapam.c:6618
```

Sorry for the noise if you already knew about this, or if it is unrelated to this thread.

[1]: https://cirrus-ci.com/task/4580705867399168
[2]: https://github.com/postgres/postgres/commit/4ce3afb82ecfbf64d4f6247e725004e1da30f47c

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#62 Peter Geoghegan
pg@bowt.ie
In reply to: Hayato Kuroda (Fujitsu) (#61)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Dec 26, 2022 at 10:57 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

My guess is that this assertion failure was caused by commit 4ce3af [2],
since the Assert() seems to have been added by that commit.

I agree that the problem is with this assertion, which is on the
master branch (not in recent versions of the patch series itself)
following commit 4ce3af:

    else
    {
        /*
         * Freeze plan for tuple "freezes xmax" in the strictest sense:
         * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
         */
        ....
        Assert(MultiXactIdPrecedes(xid, cutoffs->OldestMxact));
        ...
    }

The problem is that FRM_INVALIDATE_XMAX multi processing can occur
both in Multis from before OldestMxact and Multis >= OldestMxact. The
latter case (the >= case) is far less common, but still quite
possible. Not sure how I missed that.

Anyway, this assertion is wrong, and simply needs to be removed.
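To be concrete, here is roughly how that branch reads once the assertion
is dropped (just a sketch of the shape, not the exact committed code):

    else
    {
        /*
         * Freeze plan for tuple "freezes xmax" in the strictest sense:
         * it'll leave nothing in xmax (neither an Xid nor a MultiXactId).
         *
         * Note: deliberately no assert against cutoffs->OldestMxact here.
         * FRM_INVALIDATE_XMAX processing is also possible for Multis >=
         * OldestMxact, so such an assertion would be wrong.
         */
        ....
    }
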
Thanks for the report
--
Peter Geoghegan

#63 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Peter Geoghegan (#62)
RE: New strategies for freezing, advancing relfrozenxid early

Dear Peter,

Anyway, this assertion is wrong, and simply needs to be removed.
Thanks for the report

Thanks for fixing it so quickly! I found your commit in the remote repository.
I will keep watching and will report again if I find another issue.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

In reply to: Peter Geoghegan (#60)
3 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Dec 26, 2022 at 12:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v12. I think that the page-level freezing patch is now
committable, and plan on committing it in the next 2-4 days barring any
objections.

I've pushed the page-level freezing patch, so now I need to produce a
new revision, just to keep CFTester happy.

Attached is v13. No notable changes since v12.

--
Peter Geoghegan

Attachments:

v13-0002-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/octet-stream)
From 8c1e0d2ecbaa5e254637e88cd7dce6f668af7da9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v13 2/3] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will scan (or skip) heap pages (scanning strategy).  The
data structure is a local copy of the visibility map, taken at the start
of VACUUM.  It spills to disk as required, though only for larger tables.

VACUUM decides on its visibility map scanning and freezing strategies
together, shortly before the first pass over the heap begins, since the
concepts are closely related, and work in tandem.  Lazy scanning allows
VACUUM to skip all-visible pages, while eager scanning allows VACUUM to
advance relfrozenxid/relminmxid at the end of the VACUUM operation.

This work, combined with recent work to add freezing strategies, results
in VACUUM advancing relfrozenxid at a cadence that is barely influenced
by autovacuum_freeze_max_age at all.  Now antiwraparound autovacuums
will be far less common in practice.  Freezing now drives relfrozenxid,
rather than relfrozenxid driving freezing.  Even tables that always use
lazy freezing will have a decent chance of relfrozenxid advancement long
before table age nears or exceeds autovacuum_freeze_max_age.

This also lays the groundwork for completely removing aggressive mode
VACUUMs in a later commit.  Scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  Later work that makes the
choice to wait for a cleanup lock depend entirely on individual page
characteristics will decouple that "aggressive behavior" from the eager
scanning strategy behavior (a behavior that's not really "aggressive" in
any general sense, since it's chosen based on both costs and benefits).

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.  VACUUM's
final scanned_pages is "locked in" when it decides on scanning strategy
(so scanned_pages is finalized before the first heap pass even begins).

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Also teach VACUUM to use scanned_pages (not rel_pages) to cap the size
of the dead_items array.  This approach is strictly better, since there
is no question of scanning any pages other than the precise set of pages
already locked in by vmsnap by the time dead_items is allocated.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  15 +-
 src/backend/access/heap/heapam.c              |   1 +
 src/backend/access/heap/vacuumlazy.c          | 508 +++++++++-------
 src/backend/access/heap/visibilitymap.c       | 547 ++++++++++++++++++
 src/backend/commands/vacuum.c                 |  68 +--
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 ++-
 doc/src/sgml/ref/vacuum.sgml                  |   4 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 12 files changed, 957 insertions(+), 302 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 55f67edb6..4a1f47ac6 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index b39178d5b..43e367bcb 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 
 /*
  * Values used by index_cleanup and truncate params.
@@ -281,6 +281,19 @@ struct VacuumCutoffs
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid advancement
+	 * strictly necessary.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * proactively.  It is especially likely with tables where the _added_
+	 * costs happen to be low.
+	 */
+	double		tableagefrac;
 };
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3a8f50c31..54c6cb741 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6880,6 +6880,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
+	cutoffs.tableagefrac = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0444b3f12..6e230b6af 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -110,10 +110,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -151,8 +159,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -171,7 +177,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *relnamespace;
@@ -244,11 +252,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  bool force_scan_all);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -278,7 +283,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -310,10 +316,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -459,37 +465,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's freezing and vmsnap scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   (params->options &
+										VACOPT_DISABLE_PAGE_SKIPPING) != 0);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						get_database_name(MyDatabaseId),
+						vacrel->relnamespace, vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -499,13 +497,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -552,12 +551,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
+		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -602,6 +600,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -629,10 +630,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -827,13 +824,12 @@ static void
 lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				blkno,
-				next_unskippable_block,
-				next_fsm_block_to_vacuum = 0;
+				next_blk_to_scan,
+				next_blk_to_fsm_vacuum = 0;
+	bool		next_all_visible;
+	vmsnapshot *vmsnap = vacrel->vmsnap;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -847,46 +843,30 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
-	for (blkno = 0; blkno < rel_pages; blkno++)
+	/* Determine the first page to be scanned before entering scanning loop */
+	next_blk_to_scan = visibilitymap_snap_next(vmsnap, &next_all_visible);
+	while (next_blk_to_scan < rel_pages)
 	{
+		BlockNumber blkno = next_blk_to_scan;
+		bool		all_visible_according_to_vmsnap = next_all_visible;
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		/* Determine the next page to be scanned before scanning this page */
+		next_blk_to_scan = visibilitymap_snap_next(vmsnap, &next_all_visible);
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * visibilitymap_snap_next must always force us to scan the last page
+		 * in rel (in the range of rel_pages) so that VACUUM can avoid useless
+		 * attempts at rel truncation (per should_attempt_truncation comments)
+		 */
+		Assert(next_blk_to_scan > blkno);
+		Assert(next_blk_to_scan < rel_pages || blkno == rel_pages - 1);
 
 		vacrel->scanned_pages++;
 
-		/* Report as block scanned, update error traceback information */
+		/* Report all blocks < blkno as initial-heap-pass processed */
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
@@ -934,9 +914,8 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * Vacuum the Free Space Map to make newly-freed space visible on
 			 * upper-level FSM pages.  Note we have not yet processed blkno.
 			 */
-			FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
-									blkno);
-			next_fsm_block_to_vacuum = blkno;
+			FreeSpaceMapVacuumRange(vacrel->rel, next_blk_to_fsm_vacuum, blkno);
+			next_blk_to_fsm_vacuum = blkno;
 
 			/* Report that we are once again scanning the heap */
 			pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
@@ -1055,11 +1034,11 @@ lazy_scan_heap(LVRelState *vacrel)
 				 * space visible on upper FSM pages.  Note we have not yet
 				 * performed FSM processing for blkno.
 				 */
-				if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+				if (blkno - next_blk_to_fsm_vacuum >= VACUUM_FSM_EVERY_PAGES)
 				{
-					FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+					FreeSpaceMapVacuumRange(vacrel->rel, next_blk_to_fsm_vacuum,
 											blkno);
-					next_fsm_block_to_vacuum = blkno;
+					next_blk_to_fsm_vacuum = blkno;
 				}
 
 				/*
@@ -1089,10 +1068,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (!all_visible_according_to_vmsnap && prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1120,13 +1098,11 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * The authoritative visibility map bit should never be set if the
+		 * page-level bit is clear
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (all_visible_according_to_vmsnap && !PageIsAllVisible(page) &&
+				 VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1164,8 +1140,8 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * mark it as all-frozen.  Note that all_frozen is only valid if
 		 * all_visible is true, so we must check both prunestate fields.
 		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
-				 prunestate.all_frozen &&
+		else if (all_visible_according_to_vmsnap &&
+				 prunestate.all_visible && prunestate.all_frozen &&
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
 			/*
@@ -1214,12 +1190,13 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 	}
 
+	/* initial heap pass finished (final pass may still be required) */
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
 
-	/* report that everything is now scanned */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+	/* report all blocks as initial-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1236,20 +1213,25 @@ lazy_scan_heap(LVRelState *vacrel)
 
 	/*
 	 * Do index vacuuming (call each index's ambulkdelete routine), then do
-	 * related heap vacuuming
+	 * related heap vacuuming in final heap pass
 	 */
 	if (dead_items->num_items > 0)
 		lazy_vacuum(vacrel);
 
 	/*
-	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes, and whether or not we bypassed index vacuuming.
+	 * Now that both our initial heap pass and final heap pass (if any) have
+	 * ended, vacuum the Free Space Map. (Actually, similar FSM vacuuming will
+	 * have taken place earlier when VACUUM needed to call lazy_vacuum to deal
+	 * with running out of dead_items space.  Hopefully that will be rare.)
 	 */
-	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+	if (rel_pages > 0)
+	{
+		Assert(vacrel->scanned_pages > 0);
+		FreeSpaceMapVacuumRange(vacrel->rel, next_blk_to_fsm_vacuum, rel_pages);
+	}
 
-	/* report all blocks vacuumed */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+	/* report all blocks as final-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
 
 	/* Do final index cleanup (call each index's amvacuumcleanup routine) */
 	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
@@ -1257,7 +1239,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1265,11 +1247,42 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
+ *
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0, which is the point of antiwraparound autovacuuming.
+ *
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but they are not independent in any practical sense (it's just mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
 
 	/*
 	 * Decide freezing strategy.
@@ -1277,121 +1290,161 @@ lazy_scan_strategy(LVRelState *vacrel)
 	 * The eager freezing strategy is used when the threshold controlled by
 	 * freeze_strategy_threshold GUC/reloption exceeds rel_pages.
 	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
 	 * Also freeze eagerly with an unlogged or temp table, where the total
 	 * cost of freezing each page is just the cycles needed to prepare a set
 	 * of freeze plans.  Executing the freeze plans adds very little cost.
 	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
 	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
 	vacrel->eager_freeze_strategy =
 		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 vacrel->cutoffs.tableagefrac > TABLEAGEFRAC_HIGHPOINT ||
 		 !RelationIsPermanent(vacrel->rel));
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-			break;
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide vmsnap scanning strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * First acquire a visibility map snapshot, which determines the number of
+	 * pages that each vmsnap scanning strategy is required to scan for us in
+	 * passing.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied based in part on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These thresholds also represent minimum and maximum
+	 * sensible thresholds that can ever make sense for a table of this size
+	 * (when the table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Even small tables (where lazy freezing is used) shouldn't have to do
+	 * dramatically more work than usual when advancing relfrozenxid, which
+	 * our policy of waiting for the right VACUUM largely avoids, in practice.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	Assert(rel_pages >= nextra_scanned_eager && vacrel->scanned_pages == 0);
+	if (vacrel->cutoffs.tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below table age mid point, so table age is
+		 * still of only minimal concern.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan some extra pages not exceeding 5% of rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (vacrel->cutoffs.tableagefrac <= TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages.  The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a percentage basis) by ~8.1% of
+		 * rel_pages for every additional 5%-of-tableagefrac increment added
+		 * (after tableagefrac has crossed the 50%-of-tableagefrac mid point,
+		 * until the 90%-of-tableagefrac high point is reached, when we switch
+		 * over to not caring about the added cost of eager freezing at all).
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - vacrel->cutoffs.tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age surpasses the high point, and so is approaching (or
+		 * may even surpass) the point that an antiwraparound autovacuum is
+		 * required.  Force VMSNAP_SCAN_EAGER, no matter how many extra pages
+		 * we'll be required to scan as a result (costs no longer matter).
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses this
+		 * 90%-of-tableagefrac high point: the threshold set here jumps from
+		 * 70% of rel_pages to 100% of rel_pages (MaxBlockNumber, actually).
+		 * It's useful to only care about table age once it gets this high.
+		 * That way even extreme cases will have at least some chance of using
+		 * eager scanning before an antiwraparound autovacuum is launched.
+		 */
+		nextra_toomany_threshold = MaxBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Make final choice on scanning strategy using final threshold */
+	nextra_toomany_threshold = Max(32, nextra_toomany_threshold);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+	 * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+	 */
+	if (force_scan_all)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+
+	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
 
 /*
@@ -2836,6 +2889,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * in pages that it skips using the VM, so we must avoid interpreting skipped
+ * pages as empty pages when it makes little sense.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3126,14 +3187,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3142,15 +3202,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3172,12 +3230,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..816576dca 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,87 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot.
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches the
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+typedef struct vmsnapblock
+{
+	BlockNumber scanned_block;
+	bool		all_visible;
+} vmsnapblock;
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	vmsnapblock staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -373,6 +461,350 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines the kind of blocks visibilitymap_snap_next should indicate
+ * need to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is sheer paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out the cached VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	*scanned_pages_lazy = rel_pages - all_visible;
+	*scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+		(*scanned_pages_lazy)++;
+	if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+		(*scanned_pages_eager)++;
+
+	vmsnap->scanned_pages_lazy = *scanned_pages_lazy;
+	vmsnap->scanned_pages_eager = *scanned_pages_eager;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		BlockNumber block = vmsnap->staged[i].scanned_block;
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, block);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * The all-visible status of returned block is set in *all_visible.  Block
+ * usually won't be set all-visible (else VACUUM wouldn't need to scan it),
+ * but it can be in certain corner cases.  This includes the VMSNAP_SCAN_ALL
+ * case, as well as a special case that VACUUM expects us to handle: the final
+ * block (rel_pages - 1) is always returned here (regardless of our strategy).
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap, bool *allvisible)
+{
+	BlockNumber next_block_to_scan;
+	vmsnapblock block;
+
+	*allvisible = true;
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	block = vmsnap->staged[vmsnap->next_return_idx++];
+	*allvisible = block.all_visible;
+	next_block_to_scan = block.scanned_block;
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(vmsnapblock) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		vmsnapblock prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch.scanned_block);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
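To show how the pieces fit together, here is a sketch (not code from the
patch) of the call sequence that a vacuumlazy.c style caller is expected to
follow; choose_strategy() is a hypothetical placeholder for the logic that
picks among VMSNAP_SCAN_LAZY, VMSNAP_SCAN_EAGER, and VMSNAP_SCAN_ALL:

static void
vmsnap_usage_sketch(Relation rel, BlockNumber rel_pages)
{
	BlockNumber scanned_pages_lazy,
				scanned_pages_eager,
				blkno;
	bool		all_visible;
	vmsnapshot *vmsnap;

	/* 1. Materialize the VM once, getting scanned_pages for each strategy */
	vmsnap = visibilitymap_snap_acquire(rel, rel_pages,
										&scanned_pages_lazy,
										&scanned_pages_eager);

	/* 2. Commit to a scanning strategy (this also kicks off prefetching) */
	visibilitymap_snap_strategy(vmsnap,
								choose_strategy(scanned_pages_lazy,
												scanned_pages_eager));

	/* 3. Consume blocks in order; skipped pages are never returned */
	while ((blkno = visibilitymap_snap_next(vmsnap, &all_visible)) !=
		   InvalidBlockNumber)
	{
		/* ... scan heap block blkno here ... */
	}

	/* 4. Release temp file and memory */
	visibilitymap_snap_release(vmsnap);
}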
@@ -677,3 +1109,118 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		bool		all_visible = true;
+		vmsnapblock stage;
+
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				all_visible = false;
+				break;
+			}
+
+			/*
+			 * Never skip past the final page, which must always be scanned
+			 * by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		stage.scanned_block = vmsnap->next_block++;
+		stage.all_visible = all_visible;
+		vmsnap->staged[vmsnap->first_invalid_idx++] = stage;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired, we
+	 * defensively assume heapBlk is not all-visible or all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
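For readers less familiar with the visibility map's on-disk layout, here is a
tiny standalone sketch (illustration only, not from the patch) of the bit-pair
addressing that vm_snap_get_status performs on its cached page.  MAP_BYTES
stands in for MAPSIZE with the default 8kB block size, mirroring the existing
HEAPBLK_TO_MAPBYTE/HEAPBLK_TO_OFFSET macros:

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_HEAPBLOCK	2
#define HEAPBLOCKS_PER_BYTE	4
#define MAP_BYTES			8168	/* stand-in for MAPSIZE with 8kB pages */
#define HEAPBLOCKS_PER_PAGE	(MAP_BYTES * HEAPBLOCKS_PER_BYTE)

/* Return the 2 status bits for heapBlk from the raw map of one VM page */
static uint8_t
vm_bits_for_block(const uint8_t *rawmap, uint32_t heapBlk)
{
	uint32_t	mapByte = (heapBlk % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
	uint32_t	mapOffset = (heapBlk % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;

	/* bit 0 of the pair = all-visible, bit 1 = all-frozen */
	return (rawmap[mapByte] >> mapOffset) & 0x03;
}

int
main(void)
{
	uint8_t		map[MAP_BYTES] = {0};

	map[0] = 0x0c;				/* heap block 1: all-visible and all-frozen */
	printf("%u\n", (unsigned) vm_bits_for_block(map, 1));	/* prints 3 */
	return 0;
}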
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c68bd8ff..5085d9407 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -933,11 +933,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				effective_multixact_freeze_max_age,
 				freeze_strategy_threshold;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1069,48 +1069,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * XMID table age (whichever is greater currently).
+	 * MXID table age (whichever is currently greater).
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
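As a worked example (numbers purely for illustration): if nextXID -
relfrozenxid is 120 million and the effective freeze_table_age works out to
the autovacuum_freeze_max_age default of 200 million, then XIDFrac is roughly
120/200 = 0.6; assuming MXIDFrac is smaller, tableagefrac becomes 0.6, and
only once it reaches 1.0 does the function report (via the return value that
the third patch later removes) that relfrozenxid must be reliably advanced.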
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f6aae528d..ad23db432 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2497,10 +2497,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2517,10 +2517,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 447645b73..c44c1c4e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -659,6 +659,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -692,11 +699,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b1137381a..d97284ec8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9184,20 +9184,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9266,19 +9274,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 79595b1cb..c137debb1 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,9 +158,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
       all tuples are known to be frozen can always be skipped, and those
       where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      skipped except when performing an aggressive vacuum.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pages that might need pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pages that might need pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pages that might need pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pages that might need pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1

v13-0003-Finish-removing-aggressive-mode-VACUUM.patchapplication/octet-stream; name=v13-0003-Finish-removing-aggressive-mode-VACUUM.patchDownload
From c9d4773c15f82c9797ed55ce53a66c53fe8be4d5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v13 3/3] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient (especially after 9.6's commit fd31cd26),
its naive approach had one notable advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of sometimes
waiting for a cleanup lock to make sure that older XIDs get frozen, so
that relfrozenxid can be advanced by a sufficient amount.  All that
really changes is the information driving VACUUM's decision on waiting.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now directly understands the importance of never falling
too far behind on the work of freezing physical heap pages at the level
of the whole table, following recent work to add VM scanning strategies.
It is generally safer for VACUUM to press on with freezing other heap
pages from the table instead.  Even if relfrozenxid can only be advanced
by relatively few XIDs as a consequence, VACUUM should have more than
ample opportunity to catch up next time, since there is bound to be no
more than a small number of problematic unfrozen pages left behind.
VACUUM now tends to consistently advance relfrozenxid (at least by some
small amount) all the time in larger tables, so all that has to happen
for relfrozenxid to fully catch up is for a few remaining unfrozen pages
to get frozen.  Since relfrozenxid is now considered to be no more than
a lagging indicator of freezing, and since relfrozenxid isn't used to
trigger freezing in the way that it once was, time is on our side.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount when it really can be avoided fairly easily, without promising to
do so (VACUUM only promises to advance up to MinXid/MinMulti).

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make this truly safe.  See also the related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/heapam.c              |   2 +
 src/backend/access/heap/vacuumlazy.c          | 221 +++---
 src/backend/commands/vacuum.c                 |  42 +-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/config.sgml                      |  10 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 doc/src/sgml/xact.sgml                        |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  33 +-
 15 files changed, 560 insertions(+), 530 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 43e367bcb..b75b813f8 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -276,6 +276,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
@@ -348,7 +355,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 54c6cb741..ddc80822b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6879,6 +6879,8 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.MinXid = FreezeLimit;
+	cutoffs.MinMulti = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
 	cutoffs.tableagefrac = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6e230b6af..640703cad 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -157,8 +157,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -262,7 +260,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -459,7 +458,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -539,17 +538,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -557,7 +553,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
 		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -626,33 +621,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 get_database_name(MyDatabaseId),
 							 vacrel->relnamespace,
@@ -944,6 +920,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -957,10 +934,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -969,21 +944,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1432,8 +1400,6 @@ lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 	if (force_scan_all)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
 
-	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
-
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
 
@@ -2014,17 +1980,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Otherwise we return false,
+ * indicating that the page must be processed by lazy_scan_prune in the usual
+ * way after all; in that case we acquire a cleanup lock on buf/page for
+ * caller before returning.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2032,7 +2013,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2040,6 +2022,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2049,6 +2032,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2090,34 +2074,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2166,10 +2123,98 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 *
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when it is
+		 * actually expected to preserve current NewRelfrozenXid/NewRelminMxid
+		 * tracker values, and when trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								   vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								 vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+		}
+
+		/* Accept reduced processing for this page after all */
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5085d9407..f4429e320 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -916,13 +916,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1092,6 +1087,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide whether to wait for a
+	 * cleanup lock on a page whose cleanup lock can't be acquired right
+	 * away.  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/MXID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1109,8 +1137,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
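To make the new MinXid arithmetic easier to follow, here is a purely
illustrative query (not something the patch itself runs) that plugs in the
stock GUC defaults; it assumes freeze_table_age is taken straight from
vacuum_freeze_table_age and ignores the various caps and wraparound details:

-- With vacuum_freeze_table_age = 150 million and vacuum_freeze_min_age = 50
-- million, MinXid ends up trailing nextXID by ~142.5 million XIDs, which is
-- already older than FreezeLimit (50 million back), so the "MinXid must be
-- <= FreezeLimit" clamp is a no-op under default settings.
SELECT 0.95 * current_setting('vacuum_freeze_table_age')::bigint AS minxid_distance,
       current_setting('vacuum_freeze_min_age')::bigint AS freezelimit_distance;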
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index f9788c30a..0c80896cc 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table; however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d97284ec8..42ddee182 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8256,7 +8256,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8445,7 +8445,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9195,7 +9195,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         billion, <command>VACUUM</command> will silently limit the
         effective value to <xref
          linkend="guc-autovacuum-freeze-max-age"/>. For more
-        information see <xref linkend="vacuum-for-wraparound"/>.
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
@@ -9228,7 +9228,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9284,7 +9284,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        billion, <command>VACUUM</command> will silently limit the
        effective value to <xref
         linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
-       information see <xref linkend="vacuum-for-wraparound"/>.
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long-term
+    dependencies on the transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long-term
+    storage.  Larger databases are often mostly composed of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID address space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32 bits wide, the
+     system cannot represent a <emphasis>distance</emphasis> between any
+     two XIDs of more than about 2 billion transactions.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by applying rules analogous to those used for
+     transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs.  A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively, it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand, <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well as
+     any page that <command>VACUUM</command> will mark all-visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing each page is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
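For a rough idea of which tables would land on the eager side of this size
cutoff, a query along these lines can be used.  It is purely illustrative:
4GB merely stands in for whatever vacuum_freeze_strategy_threshold is
actually set to, and only the main heap fork is measured:

SELECT c.oid::regclass AS table_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS heap_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
  AND pg_relation_size(c.oid) > pg_size_bytes('4GB')
ORDER BY pg_relation_size(c.oid) DESC;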
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     <command>VACUUM</command> must be run by autovacuum specifically
+     for the purpose of advancing <structfield>relfrozenxid</structfield>,
+     because no <command>VACUUM</command> has been triggered for some
+     time.  In practice most individual tables will consistently have
+     reasonably recent <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> values, thanks to routine
+     vacuuming that cleans up old row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
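To put numbers on that formula: with the stock defaults
(autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2), a
table with 1,000,000 rows reaches its vacuum threshold once roughly 200,050
tuples have been obsoleted.  The same arithmetic can be checked directly;
"my_table" below is only a placeholder:

SELECT current_setting('autovacuum_vacuum_threshold')::float8
       + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
       AS vacuum_threshold
FROM pg_class c
WHERE c.oid = 'my_table'::regclass;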
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before its
+     <structfield>relfrozenxid</structfield> age reaches
+     <varname>autovacuum_freeze_max_age</varname>, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when earlier
+     <command>VACUUM</command> operations against a smaller table
+     lazily opted not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixact members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
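One illustrative way to keep an eye on tables that are approaching either
trigger is a query like the following; the 0.9 multiplier is just an
arbitrary early-warning margin, not anything the server itself applies:

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       mxid_age(c.relminmxid) AS multixact_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm')
  AND (age(c.relfrozenxid) >
         0.9 * current_setting('autovacuum_freeze_max_age')::bigint
    OR mxid_age(c.relminmxid) >
         0.9 * current_setting('autovacuum_multixact_freeze_max_age')::bigint);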
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
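For example, a per-table override can be applied with ALTER TABLE; the table
name and values here are only placeholders:

ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0.05,
                          autovacuum_vacuum_threshold = 1000);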
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c137debb1..d4237ec5d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -156,9 +156,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.
+      all tuples are known to be frozen are always skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -213,7 +215,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/doc/src/sgml/xact.sgml b/doc/src/sgml/xact.sgml
index b467660ee..c4146539f 100644
--- a/doc/src/sgml/xact.sgml
+++ b/doc/src/sgml/xact.sgml
@@ -49,8 +49,8 @@
 
   <para>
    The internal transaction ID type <type>xid</type> is 32 bits wide
-   and <link linkend="vacuum-for-wraparound">wraps around</link> every
-   4 billion transactions. A 32-bit epoch is incremented during each
+   and <link linkend="vacuum-xid-space">wraps around</link> every
+   4 billion transactions. A 32-bit epoch is incremented during each
    wraparound. There is also a 64-bit type <type>xid8</type> which
    includes this epoch and therefore does not wrap around during the
    life of an installation;  it can be converted to xid by casting.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..f9e4194cd 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,21 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), while still using default
+# settings for vacuum_freeze_table_age/autovacuum_freeze_max_age.
+#
+# This makes VACUUM freeze tuples just as aggressively as it would if the
+# VACUUM command's FREEZE option had been specified, for almost all heap pages.
+# However, VACUUM is still unwilling to wait indefinitely for a cleanup lock,
+# just to freeze a few XIDs/MXIDs that still aren't very old.
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +81,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +97,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +108,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +121,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +133,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +141,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.38.1

v13-0001-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/octet-stream)
From e3ede35374549810c52581e0c98c9ec4437af59e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v13 1/3] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  VACUUM determines its freezing
strategy based on the value of the new vacuum_freeze_strategy_threshold
GUC (or reloption) in most cases: Tables that exceed the size threshold
use the eager freezing strategy.  Otherwise VACUUM uses the lazy
freezing strategy, which is essentially the same approach that VACUUM
has always taken to freezing (though not quite, due to the influence of
page level freezing following recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add vmsnap scanning strategies, which are designed
to work in tandem with the freezing strategies from this commit.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 | 10 +++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++++
 src/backend/access/heap/heapam.c              |  1 +
 src/backend/access/heap/vacuumlazy.c          | 43 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 17 +++++++-
 src/backend/postmaster/autovacuum.c           | 10 +++++
 src/backend/utils/misc/guc_tables.c           | 11 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 +++++++-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++++
 doc/src/sgml/ref/vacuum.sgml                  | 16 +++----
 12 files changed, 143 insertions(+), 11 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 2f274f2be..b39178d5b 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -272,6 +275,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -295,6 +304,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f383a2fca..e195b63d7 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b734489..4b680501c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34d83dc70..3a8f50c31 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6879,6 +6879,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.freeze_strategy_threshold = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9923994b5..0444b3f12 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,6 +153,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -242,6 +244,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -470,6 +473,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1249,6 +1256,38 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * totally avoids freezing that turns out to have been wasted effort later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	/*
+	 * Decide freezing strategy.
+	 *
+	 * The eager freezing strategy is used when rel_pages equals or exceeds
+	 * the threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly with an unlogged or temp table, where the total
+	 * cost of freezing each page is just the cycles needed to prepare a set
+	 * of freeze plans.  Executing the freeze plans adds very little cost.
+	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
+	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 */
+	vacrel->eager_freeze_strategy =
+		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 !RelationIsPermanent(vacrel->rel));
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1771,10 +1810,12 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (might not happen until final heap pass).
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will thereby become all-frozen in the visibility map.
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
 		(prunestate->all_visible && prunestate->all_frozen &&
-		 fpi_before != pgWalUsage.wal_fpi))
+		 (fpi_before != pgWalUsage.wal_fpi || vacrel->eager_freeze_strategy)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ba965b8c7..7c68bd8ff 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -926,7 +930,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -939,6 +944,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1053,6 +1059,15 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 0746d8022..23e316e59 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a37c9f984..f6aae528d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2524,6 +2524,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5afdeb04d..447645b73 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 05b3862d0..b1137381a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9161,6 +9161,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the table size (in pages) at or above which
+        <command>VACUUM</command> applies its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9196,7 +9211,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index e14ead882..79595b1cb 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
-- 
2.38.1

#65 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#60)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, 2022-12-26 at 12:53 -0800, Peter Geoghegan wrote:

* v12 merges together the code for the "freeze the page"
lazy_scan_prune path with the block that actually calls
heap_freeze_execute_prepared().

This should make it clear that pagefrz.freeze_required really does
mean that freezing is required. Hopefully that addresses Jeff's
recent concern. It's certainly an improvement, in any case.

Better, thank you.

* On a related note, comments around the same point in
lazy_scan_prune
as well as comments above the HeapPageFreeze struct now explain a
concept I decided to call "nominal freezing". This is the case where
we "freeze a page" without having any freeze plans to execute.

"nominal freezing" is the new name for a concept I invented many
months ago, which helps to resolve subtle problems with the way that
heap_prepare_freeze_tuple is tasked with doing two different things
for its lazy_scan_prune caller: 1. telling lazy_scan_prune how it
would freeze each tuple (were it to freeze the page), and 2. helping
lazy_scan_prune to determine if the page should become all-frozen in
the VM. The latter is always conditioned on page-level freezing
actually going ahead, since everything else in
heap_prepare_freeze_tuple has to work that way.

We always freeze a page with zero freeze plans (or "nominally freeze"
the page) in lazy_scan_prune (which is nothing new in itself). We
thereby avoid breaking heap_prepare_freeze_tuple's working assumption
that all it needs to focus on is what the page will look like after
freezing executes, while also avoiding senselessly throwing away the
ability to set a page all-frozen in the VM in lazy_scan_prune when
it'll cost us nothing extra. That is, by always freezing in the event
of zero freeze plans, we never miss out on setting a page all-frozen
in cases where we don't actually have to execute any freeze plans to
make that safe, while the "freeze the page path versus don't freeze
the page path" dichotomy still works as a high level conceptual
abstraction.
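
Concretely, the caller's per-tuple handling looks roughly like this
(paraphrased from the page-level freezing work on HEAD, so the details shown
here may differ slightly from any given patch version):

    /* 1. Ask how this tuple would be frozen, accumulating a freeze plan */
    if (heap_prepare_freeze_tuple(htup, &vacrel->cutoffs, &pagefrz,
                                  &frozen[tuples_frozen], &totally_frozen))
        frozen[tuples_frozen++].offset = offnum;    /* execute later, maybe */

    /* 2. Track whether the page could be set all-frozen in the VM -- a
     * result that is only trustworthy if the page really does go on to be
     * frozen by the caller */
    if (!totally_frozen)
        prunestate->all_frozen = false;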

I always understood "freezing" to mean that a concrete action was
taken, and associated WAL generated.

"Nominal freezing" is happening when there are no freeze plans at all.
I get that it's to manage control flow so that the right thing happens
later. But I think it should be defined in terms of what state the page
is in so that we know that following a given path is valid. Defining
"nominal freezing" as a case where there are no freeze plans is just
confusing to me.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In reply to: Jeff Davis (#65)
Re: New strategies for freezing, advancing relfrozenxid early

On Fri, Dec 30, 2022 at 12:43 PM Jeff Davis <pgsql@j-davis.com> wrote:

I always understood "freezing" to mean that a concrete action was
taken, and associated WAL generated.

"When I use a word… it means just what I choose it to mean -- neither
more nor less".

I have also always understood freezing that way too. In fact, I still
do understand it that way -- I don't think that it has been undermined
by any of this. I've just invented this esoteric concept of nominal
freezing that can be ignored approximately all the time, to solve one
narrow problem that needed to be solved, that isn't that interesting
anywhere else.

"Nominal freezing" is happening when there are no freeze plans at all.
I get that it's to manage control flow so that the right thing happens
later. But I think it should be defined in terms of what state the page
is in so that we know that following a given path is valid. Defining
"nominal freezing" as a case where there are no freeze plans is just
confusing to me.

What would you prefer? The state that the page is in is not something
that I want to draw much attention to, because it's confusing in a way
that mostly isn't worth talking about. When we do nominal freezing, we
don't necessarily go on to set the page all-frozen. In fact, it's not
particularly likely that that will end up happening!

Bear in mind that the exact definition of "freeze the page" is
somewhat creative, even without bringing nominal freezing into it. It
just has to be in order to support the requirements we have for
MultiXacts (in particular for FRM_NOOP processing). The new concepts
don't quite map directly on to the old ones. At the same time, it
really is very often the case that "freezing the page" will perform
maximally aggressive freezing, in the sense that it does precisely
what a VACUUM FREEZE would do given the same page (in any Postgres
version).

--
Peter Geoghegan

In reply to: Peter Geoghegan (#66)
Re: New strategies for freezing, advancing relfrozenxid early

On Fri, Dec 30, 2022 at 1:12 PM Peter Geoghegan <pg@bowt.ie> wrote:

"Nominal freezing" is happening when there are no freeze plans at all.
I get that it's to manage control flow so that the right thing happens
later. But I think it should be defined in terms of what state the page
is in so that we know that following a given path is valid. Defining
"nominal freezing" as a case where there are no freeze plans is just
confusing to me.

What would you prefer? The state that the page is in is not something
that I want to draw much attention to, because it's confusing in a way
that mostly isn't worth talking about.

I probably should have addressed what you said more directly. Here goes:

Following the path of freezing a page is *always* valid, by
definition. Including when there are zero freeze plans to execute, or
even zero tuples to examine in the first place -- we'll at least be
able to perform nominal freezing, no matter what. OTOH, following the
"no freeze" path is permissible whenever the freeze_required flag
hasn't been set during any call to heap_prepare_freeze_tuple(). It is
never actually mandatory for lazy_scan_prune() to *not* freeze.
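
For reference, here is the test at issue, as it appears in lazy_scan_prune
in the v13-0001 patch posted upthread (the branch bodies are summarized as
comments here rather than shown in full):

    if (pagefrz.freeze_required || tuples_frozen == 0 ||
        (prunestate->all_visible && prunestate->all_frozen &&
         (fpi_before != pgWalUsage.wal_fpi || vacrel->eager_freeze_strategy)))
    {
        /* "Freeze the page" path: execute whatever freeze plans exist
         * (possibly none), which is what allows the page to be set
         * all-frozen in the visibility map later on */
    }
    else
    {
        /* "No freeze" path: the page may still be set all-visible, but
         * never all-frozen */
    }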

It's a bit like how a simple point can be understood as a degenerate
circle of radius 0. It's an abstract definition, which is just a tool
for describing things precisely -- hopefully a useful tool. I welcome
the opportunity to be able to describe things in a way that is clearer
or more useful, in whatever way. But it's not like I haven't already
put in significant effort to this exact question of what "freezing the
page" really means to lazy_scan_prune(). Naming things is hard.

--
Peter Geoghegan

#68 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#67)
Re: New strategies for freezing, advancing relfrozenxid early

On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote:

Following the path of freezing a page is *always* valid, by
definition. Including when there are zero freeze plans to execute, or
even zero tuples to examine in the first place -- we'll at least be
able to perform nominal freezing, no matter what.

This is a much clearer description, in my opinion. Do you think this is
already reflected in the comments (and I missed it)?

Perhaps the comment in the "if (tuples_frozen == 0)" branch could be
something more like:

"We have no freeze plans to execute, so there's no cost to following
the freeze path. This is important in the case where the page is
entirely frozen already, so that the page will be marked as such in the
VM."

I'm not even sure we really want a new concept of "nominal freezing". I
think you are right to just call it a degenerate case where it can be
interpreted as either freezing zero things or not freezing; and the
former is convenient for us because we want to follow that code path.
That would be another good way of writing the comment, in my opinion.

Of course, I'm sure there are some nuances that I'm still missing.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In reply to: Jeff Davis (#68)
Re: New strategies for freezing, advancing relfrozenxid early

On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2022-12-30 at 16:58 -0800, Peter Geoghegan wrote:

Following the path of freezing a page is *always* valid, by
definition. Including when there are zero freeze plans to execute, or
even zero tuples to examine in the first place -- we'll at least be
able to perform nominal freezing, no matter what.

This is a much clearer description, in my opinion. Do you think this is
already reflected in the comments (and I missed it)?

I am arguably the person least qualified to answer this question. :-)

Perhaps the comment in the "if (tuples_frozen == 0)" branch could be
something more like:

"We have no freeze plans to execute, so there's no cost to following
the freeze path. This is important in the case where the page is
entirely frozen already, so that the page will be marked as such in the
VM."

I'm happy to use your wording instead -- I'll come up with a patch for that.

In my mind it's just a restatement of what's there already. I assume
that you're right about it being clearer this way.

Of course, I'm sure there are some nuances that I'm still missing.

I don't think that there is, actually. I now believe that you totally
understand the mechanics involved here. I'm glad that I was able to
ascertain that that's all it was. It's worth going to the trouble of
getting something like this exactly right.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#69)
1 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Sat, Dec 31, 2022 at 12:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Dec 31, 2022 at 11:46 AM Jeff Davis <pgsql@j-davis.com> wrote:

"We have no freeze plans to execute, so there's no cost to following
the freeze path. This is important in the case where the page is
entirely frozen already, so that the page will be marked as such in the
VM."

I'm happy to use your wording instead -- I'll come up with a patch for that.

What do you think of the wording adjustments in the attached patch?
It's based on your suggested wording.

--
Peter Geoghegan

Attachments:

0001-Tweak-page-level-freezing-comments.patch (application/octet-stream)
From 05a682379f4352f62de3d8d639f13ca963d55db2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 2 Jan 2023 11:33:53 -0800
Subject: [PATCH 1/3] Tweak page-level freezing comments.

Clarify what it means when lazy_scan_prune opts to "freeze a page".
---
 src/include/access/heapam.h          | 23 ++++++++---------------
 src/backend/access/heap/vacuumlazy.c | 14 +++++++-------
 2 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 09a1993f4..df496cc40 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -145,16 +145,14 @@ typedef struct HeapPageFreeze
 	/*
 	 * "Freeze" NewRelfrozenXid/NewRelminMxid trackers.
 	 *
-	 * Trackers used when heap_freeze_execute_prepared freezes the page, and
-	 * when page is "nominally frozen", which happens with pages where every
-	 * call to heap_prepare_freeze_tuple produced no usable freeze plan.
-	 *
-	 * "Nominal freezing" enables vacuumlazy.c's approach of setting a page
-	 * all-frozen in the visibility map when every tuple's 'totally_frozen'
-	 * result is true.  That always works in the same way, independent of the
-	 * need to freeze tuples, and without complicating the general rule around
-	 * 'totally_frozen' results (which is that 'totally_frozen' results are
-	 * only to be trusted with a page that goes on to be frozen by caller).
+	 * Trackers used when heap_freeze_execute_prepared freezes, or when there
+	 * are zero freeze plans for a page.  It is always valid for vacuumlazy.c
+	 * to freeze any page, by definition.  This even includes pages that have
+	 * no tuples with storage to consider in the first place.  That way the
+	 * 'totally_frozen' results from heap_prepare_freeze_tuple can always be
+	 * used in the same way, even when no freeze plans need to be executed to
+	 * "freeze the page".  Only the "freeze" path needs to consider the need
+	 * to set pages all-frozen in the visibility map under this scheme.
 	 *
 	 * When we freeze a page, we generally freeze all XIDs < OldestXmin, only
 	 * leaving behind XIDs that are ineligible for freezing, if any.  And so
@@ -178,11 +176,6 @@ typedef struct HeapPageFreeze
 	 * VACUUM scans a page that isn't cleanup locked.  Both code paths are
 	 * based on the same general idea (do less work for this page during the
 	 * ongoing VACUUM, at the cost of having to accept older final values).
-	 *
-	 * When vacuumlazy.c caller decides to do "no freeze" processing, it must
-	 * not go on to set the page all-frozen (setting the page all-visible
-	 * could still be okay).  heap_prepare_freeze_tuple's 'totally_frozen'
-	 * results can only be used on a page that also gets frozen as instructed.
 	 */
 	TransactionId NoFreezePageRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e962b8d72..1c37468c4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1788,13 +1788,13 @@ retry:
 		if (tuples_frozen == 0)
 		{
 			/*
-			 * We're freezing all eligible tuples on the page, but have no
-			 * freeze plans to execute.  This is structured as a case where
-			 * the page is nominally frozen so that we set pages all-frozen
-			 * whenever no freeze plans need to be executed to make it safe.
-			 * If this was handled via "no freeze" processing instead then
-			 * VACUUM would senselessly waste certain opportunities to set
-			 * pages all-frozen (not just all-visible) at no added cost.
+			 * We have no freeze plans to execute, so there's no added cost
+			 * from following the freeze path.  That's why it was chosen.
+			 * This is important in the case where the page only contains
+			 * totally frozen tuples at this point (perhaps only following
+			 * pruning).  Such pages can be marked all-frozen in the VM by our
+			 * caller, even though none of its tuples were newly frozen here
+			 * (note that the "no freeze" path never sets pages all-frozen).
 			 *
 			 * We never increment the frozen_pages instrumentation counter
 			 * here, since it only counts pages with newly frozen tuples
-- 
2.38.1

#71 Jeff Davis
pgsql@j-davis.com
In reply to: Peter Geoghegan (#70)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, 2023-01-02 at 11:45 -0800, Peter Geoghegan wrote:

What do you think of the wording adjustments in the attached patch?
It's based on your suggested wording.

Great, thank you.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

In reply to: Jeff Davis (#71)
3 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 2, 2023 at 6:26 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2023-01-02 at 11:45 -0800, Peter Geoghegan wrote:

What do you think of the wording adjustments in the attached patch?
It's based on your suggested wording.

Great, thank you.

Pushed that today.

Attached is v14.

v14 simplifies the handling of setting the visibility map at the end
of the blkno-wise loop in lazy_scan_heap(). And,
visibilitymap_snap_next() doesn't tell its caller (lazy_scan_heap)
anything about the visibility status of each returned block -- we no
longer need a all_visible_according_to_vm local variable to help with
setting the visibility map.

This new approach to setting the VM is related to hardening that I
plan on adding, which makes the visibility map robust against certain
race conditions that can lead to setting a page all-frozen but not
all-visible. I go into that here:

/messages/by-id/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com

(It's the second patch -- the first patch already became yesterday's
commit 6daeeb1f.)

In general I don't think that we should be using
all_visible_according_to_vm for anything, especially not anything
critical -- it is just information about how the page used to be in
the past, after all. This will be more of a problem with visibility
map snapshots, since all_visible_according_to_vm could be information
that is hours old by the time it's actually used by lazy_scan_heap().
But it is an existing issue.
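
To illustrate the general direction (just a sketch of the idea, not code
taken from the patch): rather than acting on a remembered
all_visible_according_to_vm value, the loop can re-read the VM at the point
where the information is actually needed, along these lines:

    /*
     * Hypothetical illustration only: consult the VM's current idea of the
     * page while we actually have the heap page at hand, instead of
     * trusting a flag captured back when the block was chosen for scanning.
     */
    uint8   vmstatus = visibilitymap_get_status(vacrel->rel, blkno,
                                                &vmbuffer);

    if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
    {
        /* page is not currently all-visible in the VM; act on that */
    }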

BTW, it would be helpful if I could get a +1 to the visibility map
patch posted on that other thread. It's practically a bug fix -- the
VM shouldn't be able to show contradictory information about any given
heap page (i.e. "page is all-frozen but not all-visible"), no matter
what. Just on general principle.

--
Peter Geoghegan

Attachments:

v14-0001-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/x-patch)
From a924e2c5430e4c8f2ad2c9f85ea9824a510e780e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  VACUUM determines its freezing
strategy based on the value of the new vacuum_freeze_strategy_threshold
GUC (or reloption) in most cases: Tables that exceed the size threshold
use the eager freezing strategy.  Otherwise VACUUM uses the lazy
freezing strategy, which is essentially the same approach that VACUUM
has always taken to freezing (though not quite, due to the influence of
page level freezing following recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add vmsnap scanning strategies, which are designed
to work in tandem with the freezing strategies from this commit.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 | 10 +++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++++
 src/backend/access/heap/heapam.c              |  1 +
 src/backend/access/heap/vacuumlazy.c          | 43 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 17 +++++++-
 src/backend/postmaster/autovacuum.c           | 10 +++++
 src/backend/utils/misc/guc_tables.c           | 11 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 +++++++-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++++
 doc/src/sgml/ref/vacuum.sgml                  | 16 +++----
 12 files changed, 143 insertions(+), 11 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5efb94236..2a6f4d771 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -220,6 +220,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -272,6 +275,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -295,6 +304,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index af9785038..bcc5e589a 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 14c23101a..fb6b5a6d6 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bad2a89e4..30dff8d3d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6912,6 +6912,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.freeze_strategy_threshold = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da..b110061b0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,6 +153,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -243,6 +245,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -472,6 +475,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1251,6 +1258,38 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * totally avoids freezing that turns out to have been wasted effort later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	/*
+	 * Decide freezing strategy.
+	 *
+	 * The eager freezing strategy is used when rel_pages equals or exceeds
+	 * the threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly with an unlogged or temp table, where the total
+	 * cost of freezing each page is just the cycles needed to prepare a set
+	 * of freeze plans.  Executing the freeze plans adds very little cost.
+	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
+	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 */
+	vacrel->eager_freeze_strategy =
+		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 !RelationIsPermanent(vacrel->rel));
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1775,10 +1814,12 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (might not happen until final heap pass).
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will thereby become all-frozen in the visibility map.
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
 		(prunestate->all_visible && prunestate->all_frozen &&
-		 fpi_before != pgWalUsage.wal_fpi))
+		 (fpi_before != pgWalUsage.wal_fpi || vacrel->eager_freeze_strategy)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 158b1b497..d9cfe4372 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -263,6 +264,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -926,7 +930,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -939,6 +944,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1053,6 +1059,15 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index e40bd39b3..d48e83532 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2872,6 +2881,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 68328b140..3e463cb42 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2524,6 +2524,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5afdeb04d..447645b73 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 05b3862d0..b1137381a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9161,6 +9161,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the table size (in pages) at or above which
+        <command>VACUUM</command> applies its eager freezing strategy.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9196,7 +9211,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index e14ead882..79595b1cb 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -119,14 +119,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
-- 
2.38.1

Attachment: v14-0003-Finish-removing-aggressive-mode-VACUUM.patch (application/x-patch)
From a505bd73abb02ede1af3efd43733d9b48f2ead7a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v14 3/3] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient (especially after 9.6's commit fd31cd26),
its naive approach had one notable advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of sometimes
waiting for a cleanup lock to make sure that older XIDs get frozen, so
that relfrozenxid can be advanced by a sufficient amount.  All that
really changes is the information driving VACUUM's decision on waiting.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.
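
For illustration, assuming stock settings (vacuum_freeze_table_age of 150
million, vacuum_freeze_min_age of 50 million) and ignoring clamping, the
cutoffs work out roughly as follows:

    FreezeLimit = nextXID -  50,000,000
    MinXid      = nextXID - (150,000,000 * 0.95) = nextXID - 142,500,000

so MinXid trails FreezeLimit by about 90 million XIDs with the defaults.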

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now directly understands the importance of never falling
too far behind on the work of freezing physical heap pages at the level
of the whole table, following recent work to add VM scanning strategies.
It is generally safer for VACUUM to press on with freezing other heap
pages from the table instead.  Even if relfrozenxid can only be advanced
by relatively few XIDs as a consequence, VACUUM should have more than
ample opportunity to catch up next time, since there is bound to be no
more than a small number of problematic unfrozen pages left behind.
VACUUM now tends to consistently advance relfrozenxid (at least by some
small amount) all the time in larger tables, so all that has to happen
for relfrozenxid to fully catch up is for a few remaining unfrozen pages
to get frozen.  Since relfrozenxid is now considered to be no more than
a lagging indicator of freezing, and since relfrozenxid isn't used to
trigger freezing in the way that it once was, time is on our side.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount when it really can be avoided fairly easily, without promising to
do so (VACUUM only promises to advance up to MinXid/MinMulti).

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make this truly safe.  See also the related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/heapam.c              |   2 +
 src/backend/access/heap/vacuumlazy.c          | 221 +++---
 src/backend/commands/vacuum.c                 |  42 +-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/config.sgml                      |  10 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 doc/src/sgml/xact.sgml                        |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  33 +-
 15 files changed, 560 insertions(+), 530 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 3219d0e4c..722602060 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -276,6 +276,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
@@ -348,7 +355,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a73f6023..9801bf0f5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6912,6 +6912,8 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.MinXid = FreezeLimit;
+	cutoffs.MinMulti = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
 	cutoffs.tableagefrac = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 75dd7d0f4..cde9886f5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -157,8 +157,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -263,7 +261,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -461,7 +460,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -541,17 +540,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -559,7 +555,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
 		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -628,33 +623,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 vacrel->dbname,
 							 vacrel->relnamespace,
@@ -943,6 +919,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -956,10 +933,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -968,21 +943,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1394,8 +1362,6 @@ lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 	if (force_scan_all)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
 
-	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
-
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
 
@@ -2000,17 +1966,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Otherwise we return false,
+ * indicating that the page must be processed by lazy_scan_prune in the usual
+ * way after all; in that case we acquire a cleanup lock on buf/page for
+ * caller before returning.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using the *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2018,7 +1999,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2026,6 +2008,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2035,6 +2018,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2076,34 +2060,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2152,10 +2109,98 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 *
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when it is
+		 * actually expected to preserve current NewRelfrozenXid/NewRelminMxid
+		 * tracker values, and when trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								   vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								 vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+		}
+
+		/* Accept reduced processing for this page after all */
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f84f4e2c7..e4a00e3c2 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -916,13 +916,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1092,6 +1087,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/MXID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1109,8 +1137,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de..d03c5fa5d 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table; however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d97284ec8..42ddee182 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8256,7 +8256,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8445,7 +8445,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9195,7 +9195,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         billion, <command>VACUUM</command> will silently limit the
         effective value to <xref
          linkend="guc-autovacuum-freeze-max-age"/>. For more
-        information see <xref linkend="vacuum-for-wraparound"/>.
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
@@ -9228,7 +9228,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9284,7 +9284,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        billion, <command>VACUUM</command> will silently limit the
        effective value to <xref
         linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
-       information see <xref linkend="vacuum-for-wraparound"/>.
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly comprised of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID address space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32-bits, the
+     system is incapable of representing
+     <emphasis>distances</emphasis> between any two XIDs that exceed
+     about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with Transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for any and all pages
+     that are eligible to be frozen under the lazy criteria, as well as
+     all pages that are about to be set all-visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing each page is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
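+    <para>
+     For example, assuming the default threshold of 4GB, a query such
+     as the following gives a rough picture of which tables would
+     currently fall under the eager freezing strategy:
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       pg_size_pretty(pg_relation_size(c.oid::regclass)) as main_fork_size
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm')
+  AND pg_relation_size(c.oid::regclass) > 4::bigint * 1024 * 1024 * 1024;
+</programlisting>
+    </para>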
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only happens when
+     autovacuum must launch a <command>VACUUM</command> specifically to
+     advance <structfield>relfrozenxid</structfield>, because no other
+     <command>VACUUM</command> has been triggered against the table for
+     some time.  In practice most individual tables will consistently have
+     somewhat recent values through routine vacuuming to clean up old
+     row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
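+    <para>
+     As an illustration (using the default base threshold of 50 and
+     scale factor of 0.2), a table containing 100000 tuples is vacuumed
+     once roughly 20050 tuples have been obsoleted:
+<programlisting>
+vacuum threshold = 50 + 0.2 * 100000 = 20050
+</programlisting>
+    </para>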
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
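+    <para>
+     As an illustration (using the default insert base threshold of 1000
+     and insert scale factor of 0.2), the same 100000-tuple table is
+     vacuumed once roughly 21000 tuples have been inserted since the
+     last vacuum:
+<programlisting>
+vacuum insert threshold = 1000 + 0.2 * 100000 = 21000
+</programlisting>
+    </para>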
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued against the table before its
+     age reaches <varname>autovacuum_freeze_max_age</varname>, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when a smaller
+     table's <command>VACUUM</command> operations lazily opted not to
+     advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixacts members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any antiwraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index c137debb1..d4237ec5d 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -156,9 +156,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.
+      all tuples are known to be frozen are always skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -213,7 +215,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/doc/src/sgml/xact.sgml b/doc/src/sgml/xact.sgml
index b467660ee..c4146539f 100644
--- a/doc/src/sgml/xact.sgml
+++ b/doc/src/sgml/xact.sgml
@@ -49,8 +49,8 @@
 
   <para>
    The internal transaction ID type <type>xid</type> is 32 bits wide
-   and <link linkend="vacuum-for-wraparound">wraps around</link> every
-   4 billion transactions. A 32-bit epoch is incremented during each
+   and <link linkend="vacuum-xid-space">wraps around</link> every
+   2 billion transactions. A 32-bit epoch is incremented during each
    wraparound. There is also a 64-bit type <type>xid8</type> which
    includes this epoch and therefore does not wrap around during the
    life of an installation;  it can be converted to xid by casting.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..f9e4194cd 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,21 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), while still using default
+# settings for vacuum_freeze_table_age/autovacuum_freeze_max_age.
+#
+# This makes VACUUM freeze tuples just as aggressively as it would if the
+# VACUUM command's FREEZE option was specified with almost all heap pages.
+# However, VACUUM is still unwilling to wait indefinitely for a cleanup lock,
+# just to freeze a few XIDs/MXIDs that still aren't very old.
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +81,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +97,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +108,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +121,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +133,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +141,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.38.1

Attachment: v14-0002-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/x-patch)
From 718ad2b6e7268738137a81f126994d5fa5c24a85 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will scan (or skip) heap pages (scanning strategy).  The
data structure is a local copy of the visibility map, taken at the start of
VACUUM.  It spills to disk as required, though only for larger tables.

VACUUM decides on its visibility map scanning and freezing strategies
together, shortly before the first pass over the heap begins, since the
concepts are closely related, and work in tandem.  Lazy scanning allows
VACUUM to skip all-visible pages, while eager scanning allows VACUUM to
advance relfrozenxid/relminmxid at the end of the VACUUM operation.
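
As an editorial sketch (not part of the patch), the call sequence implied
by the visibilitymap.h declarations added below looks roughly like this;
the helper name, the hard-coded 5%-of-rel_pages rule, and other details
are illustrative stand-ins for the real logic in lazy_scan_strategy:

    #include "postgres.h"
    #include "access/visibilitymap.h"

    static void
    vmsnap_usage_sketch(Relation rel, BlockNumber rel_pages)
    {
        BlockNumber scanned_lazy,
                    scanned_eager,
                    blkno;
        vmstrategy  strat;
        vmsnapshot *vmsnap;

        /* Immutable VM snapshot; also reports per-strategy scan costs */
        vmsnap = visibilitymap_snap_acquire(rel, rel_pages,
                                            &scanned_lazy, &scanned_eager);

        /*
         * Commit to one scanning strategy, once, up front.  (Crude
         * stand-in for the patch's cost model, which also weighs table
         * age via tableagefrac.)
         */
        strat = (scanned_eager - scanned_lazy <= rel_pages / 20) ?
            VMSNAP_SCAN_EAGER : VMSNAP_SCAN_LAZY;
        visibilitymap_snap_strategy(vmsnap, strat);

        /* Scan exactly the pages the snapshot returns, in block order */
        while ((blkno = visibilitymap_snap_next(vmsnap)) < rel_pages)
        {
            /* ... prune and freeze heap page blkno ... */
        }

        visibilitymap_snap_release(vmsnap);
    }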

This work, combined with recent work to add freezing strategies, results
in VACUUM advancing relfrozenxid at a cadence that is barely influenced
by autovacuum_freeze_max_age at all.  Now antiwraparound autovacuums
will be far less common in practice.  Freezing now drives relfrozenxid,
rather than relfrozenxid driving freezing.  Even tables that always use
lazy freezing will have a decent chance of relfrozenxid advancement long
before table age nears or exceeds autovacuum_freeze_max_age.

This also lays the groundwork for completely removing aggressive mode
VACUUMs in a later commit.  Scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  Later work that makes the
choice to wait for a cleanup lock depend entirely on individual page
characteristics will decouple that "aggressive behavior" from the eager
scanning strategy behavior (a behavior that's not really "aggressive" in
any general sense, since it's chosen based on both costs and benefits).

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.  VACUUM's
final scanned_pages is "locked in" when it decides on scanning strategy
(so scanned_pages is finalized before the first heap pass even begins).

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Also teach VACUUM to use scanned_pages (not rel_pages) to cap the size
of the dead_items array.  This approach is strictly better, since there
is no question of scanning any pages other than the precise set of pages
already locked in by vmsnap by the time dead_items is allocated.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  15 +-
 src/backend/access/heap/heapam.c              |   1 +
 src/backend/access/heap/vacuumlazy.c          | 596 ++++++++++--------
 src/backend/access/heap/visibilitymap.c       | 541 ++++++++++++++++
 src/backend/commands/vacuum.c                 |  68 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 +-
 doc/src/sgml/ref/vacuum.sgml                  |   4 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 12 files changed, 989 insertions(+), 352 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index daaa01a25..d8df744da 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 2a6f4d771..3219d0e4c 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 
 /*
  * Values used by index_cleanup and truncate params.
@@ -281,6 +281,19 @@ struct VacuumCutoffs
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid advancement
+	 * strictly necessary.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * proactively.  It is especially likely with tables where the _added_
+	 * costs happen to be low.
+	 */
+	double		tableagefrac;
 };
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 30dff8d3d..5a73f6023 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6913,6 +6913,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
+	cutoffs.tableagefrac = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b110061b0..75dd7d0f4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -110,10 +110,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -151,8 +159,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -171,7 +177,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *dbname;
@@ -245,11 +253,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  bool force_scan_all);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -279,7 +284,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -311,10 +317,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -461,37 +467,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's freezing and vmsnap scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   (params->options &
+										VACOPT_DISABLE_PAGE_SKIPPING) != 0);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						vacrel->dbname, vacrel->relnamespace,
+						vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -501,13 +499,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -554,12 +553,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
+		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -604,6 +602,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -631,10 +632,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -829,13 +826,11 @@ static void
 lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_fsm_block_to_vacuum = 0;
+	vmsnapshot *vmsnap = vacrel->vmsnap;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -849,46 +844,27 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
-	for (blkno = 0; blkno < rel_pages; blkno++)
+	next_block_to_scan = visibilitymap_snap_next(vmsnap);
+	while (next_block_to_scan < rel_pages)
 	{
+		BlockNumber blkno = next_block_to_scan;
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		next_block_to_scan = visibilitymap_snap_next(vmsnap);
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * visibilitymap_snap_next must always force us to scan the last page
+		 * in rel (in the range of rel_pages) so that VACUUM can avoid useless
+		 * attempts at rel truncation (per should_attempt_truncation comments)
+		 */
+		Assert(next_block_to_scan > blkno);
+		Assert(next_block_to_scan < rel_pages || blkno == rel_pages - 1);
 
 		vacrel->scanned_pages++;
 
-		/* Report as block scanned, update error traceback information */
+		/* Report all blocks < blkno as initial-heap-pass processed */
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
@@ -1091,44 +1067,47 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
 			if (prunestate.all_frozen)
+			{
+				Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 				flags |= VISIBILITYMAP_ALL_FROZEN;
+			}
 
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
+			if (!PageIsAllVisible(page))
+			{
+				/*
+				 * We could get away with avoiding dirtying the heap page just
+				 * to set PD_ALL_VISIBLE in many cases, but not when checksums
+				 * are enabled.  We nevertheless mark the page dirty in all
+				 * cases to keep things simple. (It is very likely that the
+				 * heap page is already dirty by now anyway.)
+				 */
+				PageSetAllVisible(page);
+				MarkBufferDirty(buf);
+			}
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, prunestate.visibility_cutoff_xid,
 							  flags);
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * It should never be the case that the page-level bit is clear while
+		 * corresponding visibility map all-visible/all-frozen bits are set.
+		 * When we haven't updated the visibility map we defensively make sure
+		 * that it is current instead.
+		 *
+		 * Note that it's sometimes possible that PD_ALL_VISIBLE will be set
+		 * even though we don't consider the page to be all-visible, so we
+		 * can't check that here.  See comments in lazy_scan_prune.
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (!PageIsAllVisible(page) &&
+				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1136,50 +1115,6 @@ lazy_scan_heap(LVRelState *vacrel)
 								VISIBILITYMAP_VALID_BITS);
 		}
 
-		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
-		 * set, however.
-		 */
-		else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
-		{
-			elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both prunestate fields.
-		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
-				 prunestate.all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
-
 		/*
 		 * Final steps for block: drop cleanup lock, record free space in the
 		 * FSM
@@ -1216,12 +1151,13 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 	}
 
+	/* initial heap pass finished (final pass may still be required) */
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
 
-	/* report that everything is now scanned */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+	/* report all blocks as initial-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1238,20 +1174,26 @@ lazy_scan_heap(LVRelState *vacrel)
 
 	/*
 	 * Do index vacuuming (call each index's ambulkdelete routine), then do
-	 * related heap vacuuming
+	 * related heap vacuuming in final heap pass
 	 */
 	if (dead_items->num_items > 0)
 		lazy_vacuum(vacrel);
 
 	/*
-	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes, and whether or not we bypassed index vacuuming.
+	 * Now that both our initial heap pass and final heap pass (if any) have
+	 * ended, vacuum the Free Space Map. (Actually, similar FSM vacuuming will
+	 * have taken place earlier when VACUUM needed to call lazy_vacuum to deal
+	 * with running out of dead_items space.  Hopefully that will be rare.)
 	 */
-	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+	if (rel_pages > 0)
+	{
+		Assert(vacrel->scanned_pages > 0);
+		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+								rel_pages);
+	}
 
-	/* report all blocks vacuumed */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+	/* report all blocks as final-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
 
 	/* Do final index cleanup (call each index's amvacuumcleanup routine) */
 	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
@@ -1259,7 +1201,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1267,11 +1209,42 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
+ *
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0, which is the point of antiwraparound autovacuuming.
+ *
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but they are not independent in any practical sense (it's just mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
 
 	/*
 	 * Decide freezing strategy.
@@ -1279,121 +1252,161 @@ lazy_scan_strategy(LVRelState *vacrel)
 	 * The eager freezing strategy is used when rel_pages equals or exceeds
 	 * the threshold controlled by the freeze_strategy_threshold GUC/reloption.
 	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
 	 * Also freeze eagerly with an unlogged or temp table, where the total
 	 * cost of freezing each page is just the cycles needed to prepare a set
 	 * of freeze plans.  Executing the freeze plans adds very little cost.
 	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
 	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
 	vacrel->eager_freeze_strategy =
 		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 vacrel->cutoffs.tableagefrac > TABLEAGEFRAC_HIGHPOINT ||
 		 !RelationIsPermanent(vacrel->rel));
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-			break;
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide vmsnap scanning strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * First acquire a visibility map snapshot, which determines the number of
+	 * pages that each vmsnap scanning strategy is required to scan for us in
+	 * passing.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied based in part on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These thresholds also represent minimum and maximum
+	 * sensible thresholds that can ever make sense for a table of this size
+	 * (when the table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Even small tables (where lazy freezing is used) shouldn't have to do
+	 * dramatically more work than usual when advancing relfrozenxid, which
+	 * our policy of waiting for the right VACUUM largely avoids, in practice.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	Assert(rel_pages >= nextra_scanned_eager && vacrel->scanned_pages == 0);
+	if (vacrel->cutoffs.tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below table age mid point, so table age is
+		 * still of only minimal concern.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan some extra pages not exceeding 5% of rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (vacrel->cutoffs.tableagefrac <= TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages.  The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a percentage basis) by ~8.1% of
+		 * rel_pages for every additional 5%-of-tableagefrac increment added
+		 * (after tableagefrac has crossed the 50%-of-tableagefrac mid point,
+		 * until the 90%-of-tableagefrac high point is reached, when we switch
+		 * over to not caring about the added cost of eager freezing at all).
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - vacrel->cutoffs.tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age surpasses the high point, and so is approaching (or
+		 * may even surpass) the point that an antiwraparound autovacuum is
+		 * required.  Force VMSNAP_SCAN_EAGER, no matter how many extra pages
+		 * we'll be required to scan as a result (costs no longer matter).
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses this
+		 * 90%-of-tableagefrac high point: the threshold set here jumps from
+		 * 70% of rel_pages to 100% of rel_pages (MaxBlockNumber, actually).
+		 * It's useful to only care about table age once it gets this high.
+		 * That way even extreme cases will have at least some chance of using
+		 * eager scanning before an antiwraparound autovacuum is launched.
+		 */
+		nextra_toomany_threshold = MaxBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Make final choice on scanning strategy using final threshold */
+	nextra_toomany_threshold = Max(32, nextra_toomany_threshold);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+	 * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+	 */
+	if (force_scan_all)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+
+	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
 
 /*
@@ -1859,7 +1872,11 @@ retry:
 			 * cutoff by stepping back from OldestXmin.
 			 */
 			if (prunestate->all_visible && prunestate->all_frozen)
+			{
+				/* Do recovery conflict for VM now instead of later on */
 				snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+				prunestate->visibility_cutoff_xid = InvalidTransactionId;
+			}
 			else
 			{
 				/* Avoids false conflicts when hot_standby_feedback in use */
@@ -1885,6 +1902,30 @@ retry:
 		tuples_frozen = 0;		/* avoid miscounts in instrumentation */
 	}
 
+	/*
+	 * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
+	 * set.  Check that here in passing.
+	 *
+	 * We cannot condition this on what the all_visible flag says about the
+	 * page.  The OldestXmin cutoff used by two successive VACUUMs against the
+	 * same table can "move backwards", since it's conservative.  It's quite
+	 * possible that we won't consider a page all-visible now, despite the
+	 * page having its PD_ALL_VISIBLE bit set by some slightly earlier VACUUM.
+	 */
+	if (PageIsAllVisible(page) &&
+		(lpdead_items > 0 || tuples_deleted > 0 || recently_dead_tuples > 0))
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 vacrel->relname, blkno);
+
+		/*
+		 * Clear PD_ALL_VISIBLE now, but leave it up to our caller to correct
+		 * any inconsistencies with the visibility map
+		 */
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+	}
+
 	/*
 	 * VACUUM will call heap_page_is_all_visible() during the second pass over
 	 * the heap to determine all_visible and all_frozen for the page -- this
@@ -2832,6 +2873,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * in pages that it skips using the VM, so we must avoid interpreting skipped
+ * pages as empty pages when it makes little sense.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3122,14 +3171,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3138,15 +3186,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3168,12 +3214,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1d1ca423a..4fd6aabc0 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,81 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	BlockNumber staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -373,6 +455,356 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines the kind of blocks visibilitymap_snap_next should indicate
+ * need to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is just paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Should always have at least as many all_visible pages as all_frozen
+	 * pages.  Even still, we generally only interpret a page as all-frozen
+	 * when both the all-visible and all-frozen bits are set together.  Clamp
+	 * so that we'll avoid giving our caller an obviously bogus summary of the
+	 * visibility map when certain pages only have their all-frozen bit set.
+	 *
+	 * This is just defensive, but it's not a completely hypothetical concern;
+	 * historical vacuumlazy.c bugs allowed such inconsistencies to slip in.
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	all_frozen = Min(all_frozen, all_visible);
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	vmsnap->scanned_pages_lazy = rel_pages - all_visible;
+	vmsnap->scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 *
+	 * As usual we expect that the all-frozen bit can only be set alongside
+	 * the all-visible bit (for any given page), but only interpret a page as
+	 * truly all-frozen when both of its VM bits are set together.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+	{
+		vmsnap->scanned_pages_lazy++;
+		if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+			vmsnap->scanned_pages_eager++;
+	}
+
+	*scanned_pages_lazy = vmsnap->scanned_pages_lazy;
+	*scanned_pages_eager = vmsnap->scanned_pages_eager;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, vmsnap->staged[i]);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.  We always return the final block (rel_pages - 1) here last.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap)
+{
+	BlockNumber next_block_to_scan;
+
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	next_block_to_scan = vmsnap->staged[vmsnap->next_return_idx++];
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(BlockNumber) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		BlockNumber prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -677,3 +1109,112 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				break;
+			}
+
+			/*
+			 * Stop staging blocks just before final page, which must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		vmsnap->staged[vmsnap->first_invalid_idx++] = vmsnap->next_block++;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired we
+	 * defensively assume heapBlk not all-visible or all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d9cfe4372..f84f4e2c7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -933,11 +933,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				effective_multixact_freeze_max_age,
 				freeze_strategy_threshold;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1069,48 +1069,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * XMID table age (whichever is greater currently).
+	 */
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 3e463cb42..33cd2dd60 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2497,10 +2497,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2517,10 +2517,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 447645b73..c44c1c4e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -659,6 +659,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -692,11 +699,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b1137381a..d97284ec8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9184,20 +9184,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9266,19 +9274,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 79595b1cb..c137debb1 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,9 +158,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
       all tuples are known to be frozen can always be skipped, and those
       where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      skipped except when performing an aggressive vacuum.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1

#73Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#72)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v14.

Some reviews (untested; only code review so far) on these versions of
the patches:

[PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

+    /*
+     * Threshold cutoff point (expressed in # of physical heap rel blocks in
+     * rel's main fork) that triggers VACUUM's eager freezing strategy
+     */

I don't think the mention of 'cutoff point' is necessary when it has
'Threshold'.

+    int            freeze_strategy_threshold;    /* threshold to use eager
[...]
+    BlockNumber freeze_strategy_threshold;

Is there a way to disable the 'eager' freezing strategy? `int` cannot
hold the maximum BlockNumber...
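
For concreteness (my arithmetic, not something stated in the patch):
an int can only express thresholds up to INT_MAX blocks, while
BlockNumber goes roughly twice as far:

    INT_MAX        = 0x7FFFFFFF blocks  (~16 TB of heap at 8kB pages)
    MaxBlockNumber = 0xFFFFFFFE blocks  (~32 TB of heap at 8kB pages)

So if the GUC value really is limited to an int's range, a relation
larger than ~16 TB can never be configured to keep using the lazy
freezing strategy.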

+ lazy_scan_strategy(vacrel);
if (verbose)

I'm slightly surprised you didn't update the message for verbose vacuum
to indicate whether we used the eager strategy: there are several GUCs
for tuning this behaviour, so you'd expect to want direct confirmation
that the configuration is effective.
(looks at further patches) I see that the message for verbose vacuum
sees significant changes in patch 2 instead.

---

[PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

General comments:

I don't see anything regarding scan synchronization in the vmsnap scan
system. I understand that VACUUM is a load that is significantly
different from normal SEQSCANs, but are there good reasons to _not_
synchronize the start of VACUUM?

Right now, we don't use syncscan to determine a startpoint. I can't
find the reason why in the available documentation: [0] discusses the
issue, but without clearly describing why it wouldn't be worthwhile
from a 'nothing lost' perspective.

In addition, I noticed that progress reporting of blocks scanned
("heap_blks_scanned", duh) includes skipped pages. Now that we have a
solid grasp of how many blocks we're planning to scan, we can make the
reported stats reflect the blocks we actually plan to scan (and have
scanned), increasing the usefulness of that progress view.

[0]: /messages/by-id/19398.1212328662@sss.pgh.pa.us

+ double tableagefrac;

I think this field could use some extra documentation of its own: that
it expresses how "old" the relfrozenxid and relminmxid fields are, as
a fraction between 0 (latest values; nextXID and nextMXID) and 1
(values that are old by at least freeze_table_age and
multixact_freeze_table_age (multi)transaction ids, respectively).
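
For reference, this is how I read the computation in
vacuum_get_cutoffs() (patch 2); the concrete numbers below are only an
example:

    XIDFrac      = (nextXID - relfrozenxid) / (freeze_table_age + 0.5)
    MXIDFrac     = (nextMXID - relminmxid) / (multixact_freeze_table_age + 0.5)
    tableagefrac = Max(XIDFrac, MXIDFrac)

e.g. with freeze_table_age = 200 million and age(relfrozenxid) = 100
million, tableagefrac comes out at ~0.5, which is what the comments
call the 50% mid point (TABLEAGEFRAC_MIDPOINT).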

-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80    /* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80    /* don't skip using VM */

I'm not super happy with this change. I don't think we should touch
the VM using snapshots _at all_ when disable_page_skipping is set:

+     * Decide vmsnap scanning strategy.
*
-     * This test also enables more frequent relfrozenxid advancement during
-     * non-aggressive VACUUMs.  If the range has any all-visible pages then
-     * skipping makes updating relfrozenxid unsafe, which is a real downside.
+     * First acquire a visibility map snapshot, which determines the number of
+     * pages that each vmsnap scanning strategy is required to scan for us in
+     * passing.

I think we should not take disk-backed VM snapshots when
force_scan_all is set. We need VACUUM to be able to run in very
resource-constrained environments, and this works against that: it
adds a disk space requirement for the VM snapshot for all but the
smallest relation sizes, which is a problem when you consider that we
need VACUUM to be able to clean up things like CLOG.
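
For a rough sense of scale (my arithmetic, not something the patch
states): the VM stores two bits per heap page, so

    1TB heap    = 2^40 / 2^13 bytes per page = ~134 million heap pages
    VM snapshot = ~134M pages * 2 bits       = ~32MB spilled to a temp file

Small in relative terms, but it's disk space that has to be available
at exactly the moment VACUUM is needed most.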

Additionally, it took me several reads of the code and comments to
understand what the decision-making process for lazy vs. eager is, and
why. The comments are interspersed with the code, with no single place
that describes it from a bird's-eye view. I think something like the
following would be appreciated by other readers of the code:

+ We determine whether we choose the eager or lazy scanning strategy
based on how many extra pages the eager strategy would take over the
lazy strategy, and how "old" the table is (as determined in
tableagefrac):
+ When a table is still "young" (tableagefrac <
TABLEAGEFRAC_MIDPOINT), the eager strategy is accepted only if it
requires scanning at most 5% (MAX_PAGES_YOUNG_TABLEAGE) more of the
table.
+ As the table gets "older" (tableagefrac between MIDPOINT and
HIGHPOINT), the threshold for eager scanning is relaxed linearly from
this 5% to 70% (MAX_PAGES_OLD_TABLEAGE) of the table being scanned
extra (over what would be scanned by the lazy strategy).
+ Once the tableagefrac passes HIGHPOINT, we stop considering the lazy
strategy, and always eagerly scan the table.
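
To make the linear ramp concrete, here is a minimal standalone sketch
of the interpolation as I read it from lazy_scan_strategy(). The
constant values come from the 5%/70% and 50%/90% figures in the
patch's comments; the function itself is illustrative only (it ignores
the Max(32, ...) clamp and the DISABLE_PAGE_SKIPPING case):

#include <stdint.h>

typedef uint32_t BlockNumber;	/* as in storage/block.h */
#define MaxBlockNumber ((BlockNumber) 0xFFFFFFFE)

#define TABLEAGEFRAC_MIDPOINT		0.5
#define TABLEAGEFRAC_HIGHPOINT		0.9
#define MAX_PAGES_YOUNG_TABLEAGE	0.05
#define MAX_PAGES_OLD_TABLEAGE		0.70

/* Extra pages (over lazy) we are willing to scan to get eager scanning */
BlockNumber
eager_extra_pages_threshold(BlockNumber rel_pages, double tableagefrac)
{
	double		young = rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
	double		old = rel_pages * MAX_PAGES_OLD_TABLEAGE;
	double		scale;

	if (tableagefrac < TABLEAGEFRAC_MIDPOINT)
		return (BlockNumber) young;		/* only very cheap eagerness */
	if (tableagefrac > TABLEAGEFRAC_HIGHPOINT)
		return MaxBlockNumber;			/* eager scanning is forced */

	/* interpolate linearly between the 5% and 70% thresholds */
	scale = 1.0 - ((TABLEAGEFRAC_HIGHPOINT - tableagefrac) /
				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
	return (BlockNumber) (young * (1.0 - scale) + old * scale);
}

For a 1,000,000 page table with tableagefrac = 0.7 this gives scale =
0.5 and a threshold of ~375,000 extra pages, consistent with the
"~8.1% of rel_pages per 5% of tableagefrac" slope mentioned in the
comments.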

@@ -1885,6 +1902,30 @@ retry:
tuples_frozen = 0; /* avoid miscounts in instrumentation */
}

/*
+     * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
+     * set.  Check that here in passing.
+     *
[...]

I'm not sure this patch is the appropriate place for this added check.
I don't disagree with the change, I just think that it's unrelated to
the rest of the patch. Same with some of the changes in
lazy_scan_heap.

+vm_snap_stage_blocks

Doesn't this waste a lot of cycles on skipping frozen blocks if most
of the relation is frozen? I'd have expected something more like
byte- or word-wise processing of skippable blocks, as opposed to this
per-block loop. I don't think it strictly needs to be addressed in
this patch, but I think it would be a very useful addition for those
with larger tables.
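
To sketch what I mean (standalone, illustrative only; the 0x55 mask
mirrors the patch's VISIBLE_MASK64, and the function names here are
made up):

#include <stdint.h>
#include <stdio.h>
#include <strings.h>			/* ffs() */

#define VISIBLE_MASK8	0x55	/* low bit of each 2-bit pair: all-visible */

/*
 * Stage the heap block numbers that a lazy-strategy VACUUM would have
 * to scan, given one byte of snapshotted VM covering 4 heap pages.
 * Bytes whose pages are all skippable are rejected with one comparison
 * instead of a 4-iteration per-block loop.
 */
int
stage_from_vm_byte(uint8_t vmbyte, uint32_t first_block, uint32_t *staged)
{
	uint8_t		need_scan = ~vmbyte & VISIBLE_MASK8;	/* all-visible unset */
	int			n = 0;

	if (need_scan == 0)
		return 0;				/* all 4 pages skippable: nothing to stage */

	while (need_scan != 0)
	{
		int			bit = ffs(need_scan) - 1;	/* lowest set bit */

		staged[n++] = first_block + bit / 2;	/* 2 VM bits per heap page */
		need_scan &= need_scan - 1;				/* clear that bit */
	}
	return n;
}

int
main(void)
{
	uint32_t	staged[4];
	/* page 0 all-visible, page 2 all-visible+all-frozen, pages 1 and 3 not */
	uint8_t		vmbyte = 0x01 | 0x30;
	int			n = stage_from_vm_byte(vmbyte, 0, staged);

	for (int i = 0; i < n; i++)
		printf("scan block %u\n", staged[i]);	/* prints 1 and 3 */
	return 0;
}

The same idea extends to 64-bit words, where a mostly-frozen relation
skips 32 heap pages per comparison instead of testing them one by one.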

+    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+        ((double) freeze_table_age + 0.5);

I don't quite understand what this `+ 0.5` is used for, could you explain?

+ [...] Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.

This claim that it happens more proactively in "every" VACUUM
operation is false, so I think the removal of "every" would be better.

---

[PATCH v14 3/3] Finish removing aggressive mode VACUUM.

I've not completed a review for this patch - I'll continue on that
tomorrow - but here's a first look:

I don't quite enjoy the refactoring+rewriting of the docs section;
it's difficult to determine what changed when so many things changed
line lengths and were moved around. Tomorrow I'll take a closer look,
but separating content changes from merely moved text would make the
review easier.

+    /*
+     * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+     * set in pg_class at the end of VACUUM.
+     */
+    TransactionId MinXid;
+    MultiXactId MinMulti;

I don't quite like this wording, but I'm not sure what would be better.

+    cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
[...]
+    cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);

Why are these values adjusted down (up?) by 5%? If I configure this
GUC, I'd expect this to be used effectively verbatim; not adjusted by
an arbitrary factor.

---

That's it for now; thanks for working on this,

Kind regards,

Matthias van de Meent

#74 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#73)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 5 Jan 2023 at 02:21, I wrote:

On Tue, 3 Jan 2023 at 21:30, Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v14.
[PATCH v14 3/3] Finish removing aggressive mode VACUUM.

I've not completed a review for this patch - I'll continue on that
tomorrow:

This is that.

@@ -2152,10 +2109,98 @@ lazy_scan_noprune(LVRelState *vacrel,
[...]
+            /* wait 10ms, then 20ms, then 30ms, then give up */
[...]
+                pg_usleep(1000L * 10L * i);

Could this use something like autovacuum_cost_delay? I don't quite
like the use of arbitrary hardcoded millisecond delays - it can slow a
system down by a significant fraction, especially on high-contention
systems, and this potential of 60ms delay per scanned page can limit
the throughput of this new vacuum strategy to < 17 pages/second
(<136kB/sec) for highly contended sections, which is not great.

It is also not unlikely that in the time it was waiting, the page
contents were updated significantly (concurrent prune, DELETEs
committed), which could result in improved bounds. I think we should
redo the dead items check if we waited, but failed to get a lock - any
tuples removed now reduce work we'll have to do later.
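
For reference, the pattern I read from the quoted fragment is roughly
the standalone illustration below. It is not the patch's actual code
(the patch uses ConditionalLockBufferForCleanup() and pg_usleep()), but
it shows where the up-to-60ms per-page delay comes from:

#include <stdbool.h>
#include <unistd.h>

/*
 * Illustration of the backoff under discussion: try once, then retry
 * after 10ms, 20ms and 30ms waits (60ms in total), then give up.
 */
static bool
acquire_with_backoff(bool (*try_cleanup_lock) (void))
{
    if (try_cleanup_lock())
        return true;

    for (int i = 1; i <= 3; i++)
    {
        /* wait 10ms, then 20ms, then 30ms, then give up */
        usleep(1000 * 10 * i);
        if (try_cleanup_lock())
            return true;
    }

    /* give up: accept reduced (no-cleanup-lock) processing of the page */
    return false;
}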

+++ b/doc/src/sgml/ref/vacuum.sgml
[...] Pages where
+      all tuples are known to be frozen are always skipped.

"...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used."

+++ b/doc/src/sgml/maintenance.sgml

There are a lot of details being lost from the previous version of
that document. Some of the details are obsolete (mentions of
aggressive VACUUM and freezing behavior), but others are not
(FrozenTransactionId in rows from a pre-9.4 system, the need for
vacuum for prevention of issues surrounding XID wraparound).

I also am not sure this is the best place to store most of these
mentions, but I can't find a different place where these details on
certain interesting parts of the system are documented, and plain
removal of the information does not sit right with me.

Specifically, I don't like the removal of the following information
from our documentation:

- Size of pg_xact and pg_commit_ts data in relation to autovacuum_freeze_max_age
Although it is less likely with the new behaviour that we'll hit
these limits due to more eager freezing of transactions, it is still
important for users to have easy access to this information, and
tuning this for storage size is not useless information.

- The reason why VACUUM is essential to the long-term consistency of
Postgres' MVCC system
Informing the user about our use of 32-bit transaction IDs and
that we update an epoch when this XID wraps around does not
automatically make the user aware of the issues that surface around
XID wraparound. Retaining the explainer for XID wraparound in the docs
seems like a decent idea - it may be moved, but please don't delete
it.

- Special transaction IDs, their meaning and where they can occur
I can't seem to find this information anywhere else in the docs, and
it is useful to have users understand that certain values are
considered special: FrozenTransactionId and BootstrapTransactionId.

Kind regards,

Matthias van de Meent

In reply to: Matthias van de Meent (#73)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 4, 2023 at 5:21 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Some reviews (untested; only code review so far) on these versions of
the patches:

Thanks for the review!

[PATCH v14 1/3] Add eager and lazy freezing strategies to VACUUM.

I don't think the mention of 'cutoff point' is necessary when it has
'Threshold'.

Fair. Will fix.

+    int            freeze_strategy_threshold;    /* threshold to use eager
[...]
+    BlockNumber freeze_strategy_threshold;

Is there a way to disable the 'eager' freezing strategy? `int` cannot
hold the maximum BlockNumber...

I'm going to fix this by switching over to making the GUC (and the
reloption) GUC_UNIT_MB, while keeping it in ConfigureNamesInt[]. That
approach is a little bit more cumbersome, but not by much. That'll
solve this problem.
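
To give a rough idea of what I mean (the GUC name, group, default, and
limits below are placeholders, not necessarily what the next revision
will use), a ConfigureNamesInt[] entry with GUC_UNIT_MB would look
something like this. Since the setting is then stored in megabytes, a
plain int easily covers even a 32 TB relation (2^32 blocks of 8 kB is
33,554,432 MB):

static int  vacuum_freeze_strategy_threshold;   /* in MB */

/* entry in ConfigureNamesInt[], guc_tables.c */
{
    {"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
        gettext_noop("Table size at which VACUUM switches to its eager freezing strategy."),
        NULL,
        GUC_UNIT_MB
    },
    &vacuum_freeze_strategy_threshold,
    4096, 0, INT_MAX,
    NULL, NULL, NULL
},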

+ lazy_scan_strategy(vacrel);
if (verbose)

I'm slightly surprised you didn't update the message for verbose vacuum
to indicate whether we used the eager strategy: there are several GUCs
for tuning this behaviour, so you'd expect to want direct confirmation
that the configuration is effective.

Perhaps that would be worth doing, but I don't think that it's all
that useful in the grand scheme of things. I wouldn't mind including
it, but I think that it shouldn't be given much prominence. It's
certainly far less important than "aggressive vs non-aggressive" is
right now.

Eagerness is not just a synonym of aggressiveness. For example, every
VACUUM of a table like pgbench_tellers or pgbench_branches will use
eager scanning strategy. More generally, you have to bear in mind that
the actual state of the table is just as important as the GUCs
themselves. We try to avoid obligations that could be very hard or
even impossible for vacuumlazy.c to fulfill.

There are far weaker constraints on things like the final relfrozenxid
value we'll set in pg_class (more on this below, when I talk about
MinXid/MinMulti). It will advance far more frequently and by many more
XIDs than it would today, on average. But occasionally it will allow a
far earlier relfrozenxid than aggressive mode would ever allow, since
making some small amount of progress now is almost always much better
than making no progress at all.

(looks at further patches) I see that the message for verbose vacuum
sees significant changes in patch 2 instead.

It just works out to be slightly simpler that way. I want to add the
scanned_pages stuff to VERBOSE in the vmsnap/scanning strategies
commit, so I need to make significant changes to the initial VERBOSE
message in that commit. There is little point in preserving
information about aggressive mode if it's removed in the very next
commit anyway.

[PATCH v14 2/3] Add eager and lazy VM strategies to VACUUM.

Right now, we don't use syncscan to determine a startpoint. I can't
find the reason why in the available documentation: [0] discusses the
issue, but without clearly describing why it wouldn't be interesting
from a 'nothing lost' perspective.

That's not something I've given much thought to. It's a separate issue, I think.

Though I will say that one reason why I think that the vm snapshot
concept will become important is that working off an immutable
structure makes various things much easier, in fairly obvious ways. It
makes it straightforward to reorder work. So things like parallel heap
vacuuming are a lot more straightforward.

I also think that it would be useful to teach VACUUM to speculatively
scan a random sample of pages, just like a normal VACUUM. We start out
doing a normal VACUUM that just processes scanned_pages in a random
order. At some point we look at the state of pages so far. If it looks
like the table really doesn't urgently need to be vacuumed, then we
can give up before paying much of a cost. If it looks like the table
really needs to be VACUUM'd, we can press on almost like any other
VACUUM would.

This is related to the problem of bad statistics that drive
autovacuum. Deciding as much as possible at runtime, dynamically,
seems promising to me.

In addition, I noticed that progress reporting of blocks scanned
("heap_blocks_scanned", duh) includes skipped pages. Now that we have
a solid grasp of how many blocks we're planning to scan, we can make
the reported stats reflect how many blocks we plan to scan (and how
many we have scanned so far), increasing the user value of that
progress view.

Yeah, that's definitely a natural direction to go with this. Knowing
scanned_pages from the start is a basis for much more useful progress
reporting.

+ double tableagefrac;

I think this can use some extra info on the field itself, that it is
the fraction of how "old" the relfrozenxid and relminmxid fields are,
as a fraction between 0 (latest values; nextXID and nextMXID), and 1
(values that are old by at least freeze_table_age and
multixact_freeze_table_age (multi)transaction ids, respectively).

Agreed that there needs to be more than that in the comments above the
"tableagefrac" struct field.

+     * Decide vmsnap scanning strategy.
*
-     * This test also enables more frequent relfrozenxid advancement during
-     * non-aggressive VACUUMs.  If the range has any all-visible pages then
-     * skipping makes updating relfrozenxid unsafe, which is a real downside.
+     * First acquire a visibility map snapshot, which determines the number of
+     * pages that each vmsnap scanning strategy is required to scan for us in
+     * passing.

I think we should not take disk-backed vm snapshots when
force_scan_all is set. We need VACUUM to be able to run on very
resource-constrained environments, and this does not do that - it adds
a disk space requirement for the VM snapshot for all but the smallest
relation sizes, which is bad when you realize that we need VACUUM when
we want to clean up things like CLOG.

I agree that I still have work to do to make visibility map snapshots
as robust as possible in resource constrained environments, including
in cases where there is simply no disk space at all. They should
gracefully degrade even when there isn't space on disk to store a copy
of the VM in temp files, or even a single page.

Additionally, it took me several reads of the code and comments to
understand what the decision-making process for lazy vs. eager is, and
why. The comments are interspersed with the code, with no single place
that describes it from a bird's-eye view.

You probably have a good point there. I'll try to come up with
something, possibly based on your suggested wording.

@@ -1885,6 +1902,30 @@ retry:
tuples_frozen = 0; /* avoid miscounts in instrumentation */
}

/*
+     * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
+     * set.  Check that here in passing.
+     *
[...]

I'm not sure this patch is the appropriate place for this added check.
I don't disagree with the change, I just think that it's unrelated to
the rest of the patch. Same with some of the changes in
lazy_scan_heap.

This issue is hard to explain. I kind of need to do this in the VM
snapshot/scanning strategies commit, because it removes the
all_visible_according_to_vm local variable used inside lazy_scan_heap.

The change that you highlight is part of that: it detects cases where
PD_ALL_VISIBLE was set incorrectly, earlier on in lazy_scan_prune, and
then unsets the bit, so that once lazy_scan_prune returns and
lazy_scan_heap needs to consider setting the VM, it can trust
PD_ALL_VISIBLE -- the bit is definitely up to date at that point, even
in cases involving corruption. So the step where we consider setting
the VM now always starts from a clean slate.

Now we won't just unset both PD_ALL_VISIBLE and the VM bits in the
event of corruption like this. We'll complain about it in
lazy_scan_prune, then fully fix the issue in the most appropriate way
in lazy_scan_heap (which could even mean setting the page all-visible
now, even though the bit was incorrectly set when we first arrived).
We also won't fail to complain about PD_ALL_VISIBLE corruption because
lazy_scan_prune "destroyed the evidence" before lazy_scan_heap had the
chance to notice the problem. PD_ALL_VISIBLE corruption should never
happen, obviously, so we should make a point of complaining about it
whenever it can be detected. Which is much more often than what you
see on HEAD today.

+vm_snap_stage_blocks

Doesn't this waste a lot of cycles on skipping frozen blocks if most
of the relation is frozen? I'd have expected something more like byte-
or word-wise processing of skippable blocks, as opposed to this
per-block loop. I don't think it's strictly necessary to fix in this
patch, but I think it would be a very useful addition for those with
larger tables.

I agree that the visibility map snapshot stuff could stand to be a bit
more frugal with memory. It's certainly not critical, but it is
probably fairly easy to do better here, and so I should do better.

+    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+        ((double) freeze_table_age + 0.5);

I don't quite understand what this `+ 0.5` is used for, could you explain?

It avoids division by zero (freeze_table_age can be 0). For any
reasonable setting, the extra half an XID in the denominator is
negligible.

+ [...] Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.

This claim that it happens more proactively in "every" VACUUM
operation is false, so I think the removal of "every" would be better.

Good catch. Will fix.

[PATCH v14 3/3] Finish removing aggressive mode VACUUM.

I don't quite enjoy the refactoring+rewriting of the docs section;
it's difficult to determine what actually changed when so many things
had their line lengths changed or were moved around. Tomorrow I'll
take a closer look, but separating changed content from merely moved
content would be useful for review.

I think that I should break out the doc changes some more. The docs
are likely the least worked out thing at this point.

+    cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
[...]
+    cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);

Why are these values adjusted down (up?) by 5%? If I configure this
GUC, I'd expect this to be used effectively verbatim; not adjusted by
an arbitrary factor.

It is kind of arbitrary, but not in the way that you suggest. This
isn't documented in the user docs, and shouldn't really need to be. It
should have very little if any noticeable impact on our final
relfrozenxid/relminmxid in practice. If it does have any noticeable
impact, I strongly suspect it'll be a useful, positive impact.

MinXid/MinMulti control the behavior around whether or not
lazy_scan_noprune is willing to wait the hard way for a cleanup lock,
no matter how long it takes. We do still need something like that, but
it can be far looser than it is right now. The problem with aggressive
mode is that it absolutely insists on a certain outcome, no matter the
cost, and regardless of whether or not a slightly inferior outcome is
acceptable. It's extremely rigid. Rigid things tend to break. Loose,
springy things much less so.

I think that it's an extremely bad idea to wait indefinitely for a
cleanup lock. Sure, it'll work out the vast majority of the time --
it's *very* likely to work. But when it doesn't work right away, there
is no telling how long the wait will be -- all bets are off. Could be
a day, a week, a month -- who knows? The application itself is the
crucial factor here, and in general the application can do whatever it
wants to do -- that is the reality. So we should be willing to kick
the can down the road in almost all cases -- that is actually the
responsible thing to do under the circumstances. We need to get on
with freezing every other page in the table!

There just cannot be very many pages that can't be cleanup locked at
any given time, so waiting indefinitely is a very drastic measure in
response to a problem that is quite likely to go away on its own. A
problem that waiting doesn't really solve anyway. Maybe the only thing
that will work is waiting for a very long time, but we have nothing to
lose (and everything to gain) by waiting to wait.

--
Peter Geoghegan

In reply to: Matthias van de Meent (#74)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 5, 2023 at 10:19 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Could this use something like autovacuum_cost_delay? I don't quite
like the use of arbitrary hardcoded millisecond delays

It's not unlike (say) the way that there can sometimes be hardcoded
waits inside GetMultiXactIdMembers(), which does run during VACUUM.

It's not supposed to be noticeable at all. If it is noticeable in any
practical sense, then the design is flawed, and should be fixed.

it can slow a
system down by a significant fraction, especially on high-contention
systems, and this potential of 60ms delay per scanned page can limit
the throughput of this new vacuum strategy to < 17 pages/second
(<136kB/sec) for highly contended sections, which is not great.

We're only willing to wait the full 60ms when smaller waits don't work
out. And when 60ms doesn't do it, we'll then accept an older final
NewRelfrozenXid value. Our willingness to wait at all is conditioned
on the existing NewRelfrozenXid tracker being affected at all by
whether or not we accept reduced lazy_scan_noprune processing for the
page. So the waits are naturally self-limiting.

You may be right that I need to do more about the possibility of
something like that happening -- it's a legitimate concern. But I
think that this may be enough on its own. I've never seen a workload
where more than a small fraction of all pages couldn't be cleanup
locked right away. But I *have* seen workloads where VACUUM vainly
waited forever for a cleanup lock on one single heap page.

It is also not unlikely that in the time it was waiting, the page
contents were updated significantly (concurrent prune, DELETEs
committed), which could result in improved bounds. I think we should
redo the dead items check if we waited, but failed to get a lock - any
tuples removed now reduce work we'll have to do later.

I don't think that it matters very much. That's always true. It seems
very unlikely that we'll get better bounds here, unless it happens by
getting a full cleanup lock and then doing full lazy_scan_prune
processing after all.

Sure, it's possible that a concurrent opportunistic prune could make
the crucial difference, even though we ourselves couldn't get a
cleanup lock despite going to considerable trouble. I just don't think
that it's worth doing anything about.

+++ b/doc/src/sgml/ref/vacuum.sgml
[...] Pages where
+      all tuples are known to be frozen are always skipped.

"...are always skipped, unless the >DISABLE_PAGE_SKIPPING< option is used."

I'll look into changing this.

+++ b/doc/src/sgml/maintenance.sgml

There are a lot of details being lost from the previous version of
that document. Some of the details are obsolete (mentions of
aggressive VACUUM and freezing behavior), but others are not
(FrozenTransactionId in rows from a pre-9.4 system, the need for
vacuum for prevention of issues surrounding XID wraparound).

I will admit that I really hate the "Routine Vacuuming" docs, and
think that they explain things in just about the worst possible way.

I also think that this needs to be broken up into pieces. As I said
recently, the docs are the part of the patch series that is the least
worked out.

I also am not sure this is the best place to store most of these
mentions, but I can't find a different place where these details on
certain interesting parts of the system are documented, and plain
removal of the information does not sit right with me.

I'm usually the person that argues for describing more implementation
details in the docs. But starting with low-level details here is
deeply confusing. At most these are things that should be discussed in
the context of internals, as part of some completely different
chapter.

I'll see about moving details of things like FrozenTransactionId somewhere else.

Specifically, I don't like the removal of the following information
from our documentation:

- Size of pg_xact and pg_commit_ts data in relation to autovacuum_freeze_max_age
Although it is less likely with the new behaviour that we'll hit
these limits due to more eager freezing of transactions, it is still
important for users to have easy access to this information, and
tuning this for storage size is not useless information.

That is a fair point. Though note that these things have weaker
relationships with settings like autovacuum_freeze_max_age now. Mostly
this is a positive improvement (in the sense that we can truncate
SLRUs much more aggressively on average), but not always.

- The reason why VACUUM is essential to the long-term consistency of
Postgres' MVCC system
Informing the user about our use of 32-bit transaction IDs and
that we update an epoch when this XID wraps around does not
automatically make the user aware of the issues that surface around
XID wraparound. Retaining the explainer for XID wraparound in the docs
seems like a decent idea - it may be moved, but please don't delete
it.

We do need to stop telling users to enter single user mode. It's quite
simply obsolete, bad advice, and has been since Postgres 14. It's the
worst thing that you could do, in fact.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#72)
3 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Jan 3, 2023 at 12:30 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v14.

This has stopped applying due to conflicts with nearby work on VACUUM
from Tom. So I attached a new revision, v15, just to make CFTester
green again.

I didn't have time to incorporate any of the feedback from Matthias
just yet. That will have to wait until v16.

--
Peter Geoghegan

Attachments:

v15-0002-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/x-patch)
From 57bc483b4c8592efb3b3ce0147f465b9892c3565 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v15 2/3] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM, and use the snapshot to determine when
and how VACUUM will scan (or skip) heap pages (scanning strategy).  The
data structure is a local copy of the visibility map, taken at the start
of VACUUM.  It spills to disk as required, though only for larger tables.

VACUUM decides on its visibility map scanning and freezing strategies
together, shortly before the first pass over the heap begins, since the
concepts are closely related, and work in tandem.  Lazy scanning allows
VACUUM to skip all-visible pages, while eager scanning allows VACUUM to
advance relfrozenxid/relminmxid at the end of the VACUUM operation.

This work, combined with recent work to add freezing strategies, results
in VACUUM advancing relfrozenxid at a cadence that is barely influenced
by autovacuum_freeze_max_age at all.  Now antiwraparound autovacuums
will be far less common in practice.  Freezing now drives relfrozenxid,
rather than relfrozenxid driving freezing.  Even tables that always use
lazy freezing will have a decent chance of relfrozenxid advancement long
before table age nears or exceeds autovacuum_freeze_max_age.

This also lays the groundwork for completely removing aggressive mode
VACUUMs in a later commit.  Scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  Later work that makes the
choice to wait for a cleanup lock depend entirely on individual page
characteristics will decouple that "aggressive behavior" from the eager
scanning strategy behavior (a behavior that's not really "aggressive" in
any general sense, since it's chosen based on both costs and benefits).

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.  VACUUM's
final scanned_pages is "locked in" when it decides on scanning strategy
(so scanned_pages is finalized before the first heap pass even begins).

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Also teach VACUUM to use scanned_pages (not rel_pages) to cap the size
of the dead_items array.  This approach is strictly better, since there
is no question of scanning any pages other than the precise set of pages
already locked in by vmsnap by the time dead_items is allocated.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  15 +-
 src/backend/access/heap/heapam.c              |   1 +
 src/backend/access/heap/vacuumlazy.c          | 608 ++++++++++--------
 src/backend/access/heap/visibilitymap.c       | 541 ++++++++++++++++
 src/backend/commands/vacuum.c                 |  68 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 +-
 doc/src/sgml/ref/vacuum.sgml                  |   4 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 12 files changed, 999 insertions(+), 354 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index daaa01a25..d8df744da 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index f4b33f971..d006a5721 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 #define VACOPT_SKIP_DATABASE_STATS 0x100	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200	/* only vac_update_datfrozenxid() */
 
@@ -283,6 +283,19 @@ struct VacuumCutoffs
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid advancement
+	 * strictly necessary.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * proactively.  It is especially likely with tables where the _added_
+	 * costs happen to be low.
+	 */
+	double		tableagefrac;
 };
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4ec5ff02f..08b92d454 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6919,6 +6919,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
+	cutoffs.tableagefrac = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b110061b0..c555174be 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -110,10 +110,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -151,8 +159,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -171,7 +177,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *dbname;
@@ -245,11 +253,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  bool force_scan_all);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -279,7 +284,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -311,10 +317,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -461,37 +467,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's freezing and scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   (params->options &
+										VACOPT_DISABLE_PAGE_SKIPPING) != 0);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						vacrel->dbname, vacrel->relnamespace,
+						vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -501,13 +499,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -554,12 +553,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
+		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -604,6 +602,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -631,10 +632,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -829,13 +826,11 @@ static void
 lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_fsm_block_to_vacuum = 0;
+	vmsnapshot *vmsnap = vacrel->vmsnap;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -849,46 +844,27 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
-	for (blkno = 0; blkno < rel_pages; blkno++)
+	next_block_to_scan = visibilitymap_snap_next(vmsnap);
+	while (next_block_to_scan < rel_pages)
 	{
+		BlockNumber blkno = next_block_to_scan;
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		next_block_to_scan = visibilitymap_snap_next(vmsnap);
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * visibilitymap_snap_next must always force us to scan the last page
+		 * in rel (in the range of rel_pages) so that VACUUM can avoid useless
+		 * attempts at rel truncation (per should_attempt_truncation comments)
+		 */
+		Assert(next_block_to_scan > blkno);
+		Assert(next_block_to_scan < rel_pages || blkno == rel_pages - 1);
 
 		vacrel->scanned_pages++;
 
-		/* Report as block scanned, update error traceback information */
+		/* Report all blocks < blkno as initial-heap-pass processed */
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
@@ -1027,8 +1003,6 @@ lazy_scan_heap(LVRelState *vacrel)
 		 */
 		lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
 
-		Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
-
 		/* Remember the location of the last page with nonremovable tuples */
 		if (prunestate.hastup)
 			vacrel->nonempty_pages = blkno + 1;
@@ -1091,44 +1065,49 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Update visibility map status of this page where required
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
 			if (prunestate.all_frozen)
+			{
+				Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 				flags |= VISIBILITYMAP_ALL_FROZEN;
+			}
 
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
+			if (!PageIsAllVisible(page))
+			{
+				/*
+				 * We could get away with avoiding dirtying the heap page just
+				 * to set PD_ALL_VISIBLE in many cases, but not when checksums
+				 * are enabled.  We nevertheless mark the page dirty in all
+				 * cases to keep things simple. (It is very likely that the
+				 * heap page is already dirty by now anyway.)
+				 */
+				PageSetAllVisible(page);
+				MarkBufferDirty(buf);
+			}
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, prunestate.visibility_cutoff_xid,
 							  flags);
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * When we haven't updated the visibility map we defensively verify
+		 * that it's consistent with the page's PD_ALL_VISIBLE bit instead.
+		 * It should never be the case that the page-level bit is clear while
+		 * corresponding visibility map all-visible/all-frozen bits are set.
+		 * (Though note that the reverse is okay if checksums are disabled.)
+		 *
+		 * lazy_scan_prune will have detected any case where the state of the
+		 * page doesn't agree with the page's own PD_ALL_VISIBLE bit.  It will
+		 * also unset the bit to repair any detected page-level inconsistency.
+		 * That's why PD_ALL_VISIBLE is considered authoritative here.
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (!PageIsAllVisible(page) &&
+				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1136,50 +1115,6 @@ lazy_scan_heap(LVRelState *vacrel)
 								VISIBILITYMAP_VALID_BITS);
 		}
 
-		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
-		 * set, however.
-		 */
-		else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
-		{
-			elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both prunestate fields.
-		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
-				 prunestate.all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-		{
-			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
-			 */
-			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
-
 		/*
 		 * Final steps for block: drop cleanup lock, record free space in the
 		 * FSM
@@ -1216,12 +1151,13 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 	}
 
+	/* initial heap pass finished (final pass may still be required) */
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
 
-	/* report that everything is now scanned */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+	/* report all blocks as initial-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1238,20 +1174,26 @@ lazy_scan_heap(LVRelState *vacrel)
 
 	/*
 	 * Do index vacuuming (call each index's ambulkdelete routine), then do
-	 * related heap vacuuming
+	 * related heap vacuuming in final heap pass
 	 */
 	if (dead_items->num_items > 0)
 		lazy_vacuum(vacrel);
 
 	/*
-	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes, and whether or not we bypassed index vacuuming.
+	 * Now that both our initial heap pass and final heap pass (if any) have
+	 * ended, vacuum the Free Space Map. (Actually, similar FSM vacuuming will
+	 * have taken place earlier when VACUUM needed to call lazy_vacuum to deal
+	 * with running out of dead_items space.  Hopefully that will be rare.)
 	 */
-	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+	if (rel_pages > 0)
+	{
+		Assert(vacrel->scanned_pages > 0);
+		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+								rel_pages);
+	}
 
-	/* report all blocks vacuumed */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+	/* report all blocks as final-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
 
 	/* Do final index cleanup (call each index's amvacuumcleanup routine) */
 	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
@@ -1259,7 +1201,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1267,11 +1209,42 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
+ *
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0, which is the point of antiwraparound autovacuuming.
+ *
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but they are not independent in any practical sense (it's just mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
 
 	/*
 	 * Decide freezing strategy.
@@ -1279,121 +1252,160 @@ lazy_scan_strategy(LVRelState *vacrel)
 	 * The eager freezing strategy is used when the threshold controlled by
 	 * freeze_strategy_threshold GUC/reloption exceeds rel_pages.
 	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
 	 * Also freeze eagerly with an unlogged or temp table, where the total
 	 * cost of freezing each page is just the cycles needed to prepare a set
 	 * of freeze plans.  Executing the freeze plans adds very little cost.
 	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
 	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
 	vacrel->eager_freeze_strategy =
 		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 vacrel->cutoffs.tableagefrac > TABLEAGEFRAC_HIGHPOINT ||
 		 !RelationIsPermanent(vacrel->rel));
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-			break;
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide vmsnap scanning strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * First acquire a visibility map snapshot, which in passing determines
+	 * the number of pages that each vmsnap scanning strategy will require us
+	 * to scan.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied in part based on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These also serve as the minimum and maximum
+	 * thresholds that can ever make sense for a table of this size (when the
+	 * table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Even small tables (where lazy freezing is used) shouldn't have to do
+	 * dramatically more work than usual when advancing relfrozenxid, which
+	 * our policy of waiting for the right VACUUM largely avoids, in practice.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	if (vacrel->cutoffs.tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below the midpoint, so table age is of
+		 * only minimal concern for now.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan a number of extra pages not exceeding 5% of
+		 * rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (vacrel->cutoffs.tableagefrac <= TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages.  The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a percentage basis) by ~8.1% of
+		 * rel_pages for each additional 5% increment in tableagefrac (after
+		 * tableagefrac crosses the 50% mid point, until the 90% high point is
+		 * reached, at which point we switch over to not caring about the
+		 * added cost of eager scanning at all).
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - vacrel->cutoffs.tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age surpasses the high point, and so is approaching
+		 * (or may even have surpassed) the point at which an antiwraparound
+		 * autovacuum is required.  Force VMSNAP_SCAN_EAGER, no matter how
+		 * many extra pages we'll be required to scan as a result (costs no
+		 * longer matter).
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses this
+		 * 90%-of-tableagefrac high point: the threshold set here jumps from
+		 * 70% of rel_pages to 100% of rel_pages (MaxBlockNumber, actually).
+		 * It's useful to only care about table age once it gets this high.
+		 * That way even extreme cases will have at least some chance of using
+		 * eager scanning before an antiwraparound autovacuum is launched.
+		 */
+		nextra_toomany_threshold = MaxBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Apply a floor of 32 pages, then make the final choice on strategy */
+	nextra_toomany_threshold = Max(nextra_toomany_threshold, 32);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+	 * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+	 */
+	if (force_scan_all)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+
+	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
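
To make the interpolation above concrete, here is a minimal standalone C
sketch (illustration only, not part of the patch), assuming the constants
implied by the comments: a tableagefrac midpoint/high point of 0.5/0.9, and
young/old caps of 5%/70% of rel_pages:

    #include <stdio.h>

    int
    main(void)
    {
        double      rel_pages = 100000.0;
        double      tableagefrac = 0.7;         /* halfway between 0.5 and 0.9 */
        double      young = rel_pages * 0.05;   /* 5,000-page cap */
        double      old = rel_pages * 0.70;     /* 70,000-page cap */
        double      scale = 1.0 - ((0.9 - tableagefrac) / (0.9 - 0.5));
        double      threshold = (young * (1.0 - scale)) + (old * scale);

        /*
         * Prints 37500: eager scanning is chosen only when the extra pages it
         * forces VACUUM to scan number fewer than ~37.5% of rel_pages.
         */
        printf("%.0f\n", threshold);
        return 0;
    }

With tableagefrac exactly halfway between the midpoint and the high point, the
threshold lands halfway between the 5% and 70% caps, which matches the ~8.1%
growth per 5% tableagefrac increment noted above.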
 
 /*
@@ -1859,7 +1871,11 @@ retry:
 			 * cutoff by stepping back from OldestXmin.
 			 */
 			if (prunestate->all_visible && prunestate->all_frozen)
+			{
+				/* Using same cutoff when setting VM is now unnecessary */
 				snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+				prunestate->visibility_cutoff_xid = InvalidTransactionId;
+			}
 			else
 			{
 				/* Avoids false conflicts when hot_standby_feedback in use */
@@ -1885,6 +1901,39 @@ retry:
 		tuples_frozen = 0;		/* avoid miscounts in instrumentation */
 	}
 
+	/*
+	 * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
+	 * set.  Check that here in passing.
+	 *
+	 * We cannot condition this on what the all_visible flag says about the
+	 * page.  The OldestXmin cutoff used by two successive VACUUMs against the
+	 * same table can "move backwards", since it's conservative.  It's quite
+	 * possible that we won't consider a page all-visible now, despite the
+	 * page having its PD_ALL_VISIBLE bit set by some slightly earlier VACUUM.
+	 */
+	if (PageIsAllVisible(page) &&
+		(lpdead_items > 0 || tuples_deleted > 0 || recently_dead_tuples > 0))
+	{
+		elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
+			 vacrel->relname, blkno);
+
+		/*
+		 * Clear PD_ALL_VISIBLE now, but leave it up to our caller to correct
+		 * any remaining inconsistency between PD_ALL_VISIBLE and rel's VM.
+		 * The page might be eligible to be set all-visible once we finish, or
+		 * it might just get unset in the VM to bring things into agreement.
+		 *
+		 * Clearing PD_ALL_VISIBLE usually happens at exactly the same point
+		 * as the corresponding VM bit is cleared, since in general the VM bit
+		 * is never supposed to be set unless PD_ALL_VISIBLE is in agreement.
+		 * We prefer this phased approach because the decoupling allows VACUUM
+		 * to keep everything in agreement by following some standard steps in
+		 * all cases, independent of whether corruption is present.
+		 */
+		PageClearAllVisible(page);
+		MarkBufferDirty(buf);
+	}
+
 	/*
 	 * VACUUM will call heap_page_is_all_visible() during the second pass over
 	 * the heap to determine all_visible and all_frozen for the page -- this
@@ -2832,6 +2881,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot advance
+ * vacrel->nonempty_pages for pages that it skips using the VM, so we must
+ * avoid treating skipped pages as if they were known to be empty.  Observing
+ * that the final page has tuples is a simple way of avoiding pathological
+ * locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3122,14 +3179,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3138,15 +3194,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3168,12 +3222,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1d1ca423a..4fd6aabc0 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,81 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot.
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches the
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	BlockNumber staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -373,6 +455,356 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines which blocks visibilitymap_snap_next will report as needing
+ * to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is just paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Should always have at least as many all_visible pages as all_frozen
+	 * pages.  Even so, we generally only interpret a page as all-frozen
+	 * when both the all-visible and all-frozen bits are set together.  Clamp
+	 * so that we'll avoid giving our caller an obviously bogus summary of the
+	 * visibility map when certain pages only have their all-frozen bit set.
+	 *
+	 * This is just defensive, but it's not a completely hypothetical concern;
+	 * historical vacuumlazy.c bugs allowed such inconsistencies to slip in.
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	all_frozen = Min(all_frozen, all_visible);
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * we might use: lazy (skip both all-visible and all-frozen pages) and
+	 * eager (skip only all-frozen pages).
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	vmsnap->scanned_pages_lazy = rel_pages - all_visible;
+	vmsnap->scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 *
+	 * As usual we expect that the all-frozen bit can only be set alongside
+	 * the all-visible bit (for any given page), but only interpret a page as
+	 * truly all-frozen when both of its VM bits are set together.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+	{
+		vmsnap->scanned_pages_lazy++;
+		if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+			vmsnap->scanned_pages_eager++;
+	}
+
+	*scanned_pages_lazy = vmsnap->scanned_pages_lazy;
+	*scanned_pages_eager = vmsnap->scanned_pages_eager;
+
+	return vmsnap;
+}
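
To put the temp file cutoff above in perspective: each visibility map page
stores 2 bits per heap block, so with the standard 8kB block size a single VM
page (8168 usable bytes, assuming the usual MAXALIGN'd page header) covers
32,672 heap blocks, or roughly 255MB of heap.  Only tables larger than that
spill the snapshot to a BufFile; smaller tables are served entirely from the
snapshot's single cached VM page.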
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, vmsnap->staged[i]);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding the returned block.
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.  We always return the final block (rel_pages - 1) here last.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap)
+{
+	BlockNumber next_block_to_scan;
+
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	next_block_to_scan = vmsnap->staged[vmsnap->next_return_idx++];
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(BlockNumber) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		BlockNumber prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
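
Taken together, the functions above form a small API.  What follows is a
minimal caller-side sketch (not taken from the patch; the real loop in
vacuumlazy.c does far more, and keeps the snapshot in vacrel) showing the
intended call order; the scan_rel_with_vmsnap wrapper is purely hypothetical:

    static void
    scan_rel_with_vmsnap(Relation rel, BlockNumber rel_pages, vmstrategy strat)
    {
        BlockNumber scanned_lazy,
                    scanned_eager,
                    blkno;
        vmsnapshot *vmsnap;

        /* 1. Snapshot the VM; learn the scan cost of each strategy */
        vmsnap = visibilitymap_snap_acquire(rel, rel_pages,
                                            &scanned_lazy, &scanned_eager);

        /* 2. Commit to a strategy (this also starts I/O prefetching) */
        visibilitymap_snap_strategy(vmsnap, strat);

        /* 3. Scan exactly the blocks the snapshot says must be scanned */
        while ((blkno = visibilitymap_snap_next(vmsnap)) != InvalidBlockNumber)
        {
            /* ... scan heap block blkno; preceding blocks were skipped ... */
        }

        /* 4. Release the snapshot (closes the temp file, if any) */
        visibilitymap_snap_release(vmsnap);
    }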
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -677,3 +1109,112 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				break;
+			}
+
+			/*
+			 * Stop staging blocks just before final page, which must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		vmsnap->staged[vmsnap->first_invalid_idx++] = vmsnap->next_block++;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired, we
+	 * defensively assume that heapBlk is not all-visible or all-frozen.
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 9b361a08a..4418ed3c4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -965,11 +965,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				effective_multixact_freeze_max_age,
 				freeze_strategy_threshold;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1101,48 +1101,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * MXID table age (whichever is currently greater).
+	 */
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
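
As a worked example of the new bookkeeping (assuming stock settings): with
vacuum_freeze_table_age left at its new default of -1, freeze_table_age falls
back to autovacuum_freeze_max_age, which defaults to 200 million.  A table
whose relfrozenxid is 100 million XIDs old therefore gets a tableagefrac of
about 100000000 / 200000000.5, i.e. ~0.5 -- per the vacuumlazy.c comments
earlier, that is the midpoint where eager scanning starts to be favored more
strongly.  The function itself returns true only once tableagefrac reaches
1.0, which antiwraparound autovacuums also force via params->is_wraparound.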
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 3e463cb42..33cd2dd60 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2497,10 +2497,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2517,10 +2517,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 447645b73..c44c1c4e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -659,6 +659,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -692,11 +699,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b1137381a..d97284ec8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9184,20 +9184,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9266,19 +9274,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, in every
+         <command>VACUUM</command> operation.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 167b20c63..8b078221a 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -160,9 +160,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
       all tuples are known to be frozen can always be skipped, and those
       where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      skipped except when performing an aggressive vacuum.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.38.1

Attachment: v15-0001-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch
From 47a9967a4d1e090b139339c33838d74974b2b192 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v15 1/3] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for tables where eager freezing is
deemed appropriate.  VACUUM determines its freezing strategy based on
the value of the new vacuum_freeze_strategy_threshold GUC (or reloption)
in most cases: tables that exceed the size threshold use the eager
freezing strategy.  Otherwise VACUUM uses the lazy freezing strategy,
which is essentially the same approach that VACUUM has always taken to
freezing (though not exactly the same, due to the influence of
page-level freezing added by recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
of a page's tuples as soon as it notices that the page will at least
become all-visible -- the page can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add vmsnap scanning strategies, which are designed
to work in tandem with the freezing strategies from this commit.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 | 10 +++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++++
 src/backend/access/heap/heapam.c              |  1 +
 src/backend/access/heap/vacuumlazy.c          | 43 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 17 +++++++-
 src/backend/postmaster/autovacuum.c           | 10 +++++
 src/backend/utils/misc/guc_tables.c           | 11 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 +++++++-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++++
 doc/src/sgml/ref/vacuum.sgml                  | 16 +++----
 12 files changed, 143 insertions(+), 11 deletions(-)
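
For a sense of the default threshold described above: the GUC added below
boots to 4GB expressed in heap blocks, which with the standard 8kB BLCKSZ
works out to 524,288 pages.  Per lazy_scan_strategy, tables whose rel_pages
is at or above that figure (and all unlogged or temporary tables) use the
eager freezing strategy, while smaller permanent tables keep the familiar
lazy behavior.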

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb770..f4b33f971 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -222,6 +222,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in total heap blocks,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -274,6 +277,12 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
+	 * rel's main fork) that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold;
 };
 
 /*
@@ -297,6 +306,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index af9785038..bcc5e589a 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 14c23101a..fb6b5a6d6 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, INT_MAX
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 63c4f01f0..4ec5ff02f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6918,6 +6918,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.freeze_strategy_threshold = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da..b110061b0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,6 +153,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -243,6 +245,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -472,6 +475,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1251,6 +1258,38 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * allows us to avoid freezing that would only turn out to be wasted effort
+ * later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	/*
+	 * Decide freezing strategy.
+	 *
+	 * The eager freezing strategy is used when rel_pages is at or above the
+	 * threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly with an unlogged or temp table, where the total
+	 * cost of freezing each page is just the cycles needed to prepare a set
+	 * of freeze plans.  Executing the freeze plans adds very little cost.
+	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
+	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 */
+	vacrel->eager_freeze_strategy =
+		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold ||
+		 !RelationIsPermanent(vacrel->rel));
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1775,10 +1814,12 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (might not happen until final heap pass).
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will thereby become all-frozen in the visibility map.
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
 		(prunestate->all_visible && prunestate->all_frozen &&
-		 fpi_before != pgWalUsage.wal_fpi))
+		 (fpi_before != pgWalUsage.wal_fpi || vacrel->eager_freeze_strategy)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c4ed7efce..9b361a08a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -67,6 +67,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -271,6 +272,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -958,7 +962,8 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -971,6 +976,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1085,6 +1091,15 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+	cutoffs->freeze_strategy_threshold = freeze_strategy_threshold;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c5..ecddde3a1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2877,6 +2886,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 68328b140..3e463cb42 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2524,6 +2524,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&vacuum_freeze_strategy_threshold,
+		(UINT64CONST(4) * 1024 * 1024 * 1024) / BLCKSZ, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5afdeb04d..447645b73 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -693,6 +693,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 05b3862d0..b1137381a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9161,6 +9161,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff table size (in pages) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        Tables whose size equals or exceeds the cutoff are frozen eagerly.
+        The default is 4 gigabytes (<literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9196,7 +9211,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When <command>VACUUM</command>
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index c98223b2a..eabbf9e65 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 8fa842184..167b20c63 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -121,14 +121,14 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
     <term><literal>FREEZE</literal></term>
     <listitem>
      <para>
-      Selects aggressive <quote>freezing</quote> of tuples.
-      Specifying <literal>FREEZE</literal> is equivalent to performing
-      <command>VACUUM</command> with the
-      <xref linkend="guc-vacuum-freeze-min-age"/> and
-      <xref linkend="guc-vacuum-freeze-table-age"/> parameters
-      set to zero.  Aggressive freezing is always performed when the
-      table is rewritten, so this option is redundant when <literal>FULL</literal>
-      is specified.
+      Selects eager <quote>freezing</quote> of tuples.  Specifying
+      <literal>FREEZE</literal> is equivalent to performing
+      <command>VACUUM</command> with the <xref
+       linkend="guc-vacuum-freeze-strategy-threshold"/> and <xref
+       linkend="guc-vacuum-freeze-table-age"/> parameters set
+      to zero.  Eager freezing is always performed when the table is
+      rewritten, so this option is redundant when
+      <literal>FULL</literal> is specified.
      </para>
     </listitem>
    </varlistentry>
-- 
2.38.1
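
For anyone who wants to kick the tires on the freezing strategy patch,
here's a quick sketch of how the new knobs could be exercised from SQL.
It uses the GUC/reloption names as proposed above; the table name is
just a stand-in, and none of this is meant to be definitive:

    -- Make every VACUUM in this session use the eager freezing strategy,
    -- by setting the proposed GUC (which is in block units) to 0
    SET vacuum_freeze_strategy_threshold = 0;
    VACUUM (VERBOSE) pgbench_accounts;

    -- Alternatively, opt a single large append-mostly table into eager
    -- freezing during autovacuum via the proposed reloption, leaving the
    -- 4GB default in place everywhere else
    ALTER TABLE pgbench_accounts
        SET (autovacuum_freeze_strategy_threshold = 0);

With the threshold set to 0, any page that will become all-frozen as a
result of freezing gets frozen during VACUUM's first heap pass, rather
than being left merely all-visible -- the same behavior that tables
larger than the threshold get by default.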

Attachment: v15-0003-Finish-removing-aggressive-mode-VACUUM.patch (application/x-patch)
From 9addc3abb817bb22d8e1b871bf781fcfc3f6b876 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v15 3/3] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient (especially after 9.6's commit fd31cd26),
its naive approach had one notable advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of sometimes
waiting for a cleanup lock to make sure that older XIDs get frozen, so
that relfrozenxid can be advanced by a sufficient amount.  All that
really changes is the information driving VACUUM's decision on waiting.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now directly understands the importance of never falling
too far behind on the work of freezing physical heap pages at the level
of the whole table, following recent work to add VM scanning strategies.
It is generally safer for VACUUM to press on with freezing other heap
pages from the table instead.  Even if relfrozenxid can only be advanced
by relatively few XIDs as a consequence, VACUUM should have more than
ample opportunity to catch up next time, since there is bound to be no
more than a small number of problematic unfrozen pages left behind.
VACUUM now tends to consistently advance relfrozenxid (at least by some
small amount) all the time in larger tables, so all that has to happen
for relfrozenxid to fully catch up is for a few remaining unfrozen pages
to get frozen.  Since relfrozenxid is now considered to be no more than
a lagging indicator of freezing, and since relfrozenxid isn't used to
trigger freezing in the way that it once was, time is on our side.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have, had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount whenever that can be achieved fairly easily, without ever
promising to do so (VACUUM only promises to advance up to
MinXid/MinMulti).

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make this truly safe.  See also, related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/heapam.c              |   2 +
 src/backend/access/heap/vacuumlazy.c          | 220 +++---
 src/backend/commands/vacuum.c                 |  42 +-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/config.sgml                      |  10 +-
 doc/src/sgml/logicaldecoding.sgml             |   2 +-
 doc/src/sgml/maintenance.sgml                 | 721 ++++++++----------
 doc/src/sgml/ref/create_table.sgml            |   2 +-
 doc/src/sgml/ref/prepare_transaction.sgml     |   2 +-
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 doc/src/sgml/ref/vacuumdb.sgml                |   4 +-
 doc/src/sgml/xact.sgml                        |   4 +-
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  33 +-
 15 files changed, 559 insertions(+), 530 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d006a5721..4db70ac55 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -278,6 +278,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold cutoff point (expressed in # of physical heap rel blocks in
 	 * rel's main fork) that triggers VACUUM's eager freezing strategy
@@ -350,7 +357,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 08b92d454..400e90894 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6918,6 +6918,8 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.MinXid = FreezeLimit;
+	cutoffs.MinMulti = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold = 0;
 	cutoffs.tableagefrac = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c555174be..845fde4ee 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -157,8 +157,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -263,7 +261,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -461,7 +460,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -541,17 +540,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -559,7 +555,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
 		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -628,33 +623,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 vacrel->dbname,
 							 vacrel->relnamespace,
@@ -943,6 +919,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -956,10 +933,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -968,21 +943,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1393,8 +1361,6 @@ lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 	if (force_scan_all)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
 
-	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
-
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
 
@@ -2008,17 +1974,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Otherwise we return false,
+ * indicating that the page must be processed by lazy_scan_prune in the usual
+ * way after all; a cleanup lock on buf/page is acquired for caller first.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2026,7 +2007,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2034,6 +2016,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2043,6 +2026,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2084,34 +2068,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2160,10 +2117,97 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 *
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when it is
+		 * actually expected to preserve current NewRelfrozenXid/NewRelminMxid
+		 * tracker values, and when trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								   vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								 vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+			/* Accept reduced processing for this page after all */
+		}
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 4418ed3c4..ef36e99e9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -948,13 +948,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1124,6 +1119,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/MXID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * MXID table age (whichever is greater currently).
@@ -1141,8 +1169,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 1730425de..d03c5fa5d 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table; however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d97284ec8..42ddee182 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8256,7 +8256,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Note that even when this parameter is disabled, the system
         will launch autovacuum processes if necessary to
         prevent transaction ID wraparound.  See <xref
-        linkend="vacuum-for-wraparound"/> for more information.
+        linkend="vacuum-xid-space"/> for more information.
        </para>
       </listitem>
      </varlistentry>
@@ -8445,7 +8445,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter can only be set at server start, but the setting
         can be reduced for individual tables by
         changing table storage parameters.
-        For more information see <xref linkend="vacuum-for-wraparound"/>.
+        For more information see <xref linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9195,7 +9195,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         billion, <command>VACUUM</command> will silently limit the
         effective value to <xref
          linkend="guc-autovacuum-freeze-max-age"/>. For more
-        information see <xref linkend="vacuum-for-wraparound"/>.
+        information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
@@ -9228,7 +9228,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the value of <xref linkend="guc-autovacuum-freeze-max-age"/>, so
         that there is not an unreasonably short time between forced
         autovacuums.  For more information see <xref
-        linkend="vacuum-for-wraparound"/>.
+        linkend="vacuum-xid-space"/>.
        </para>
       </listitem>
      </varlistentry>
@@ -9284,7 +9284,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        billion, <command>VACUUM</command> will silently limit the
        effective value to <xref
         linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
-       information see <xref linkend="vacuum-for-wraparound"/>.
+       information see <xref linkend="vacuum-xid-space"/>.
        </para>
        <note>
         <para>
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 38ee69dcc..380da3c1e 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -324,7 +324,7 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
       because neither required WAL nor required rows from the system catalogs
       can be removed by <command>VACUUM</command> as long as they are required by a replication
       slot.  In extreme cases this could cause the database to shut down to prevent
-      transaction ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+      transaction ID wraparound (see <xref linkend="vacuum-xid-space"/>).
       So if a slot is no longer required it should be dropped.
      </para>
     </caution>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..ed54a2988 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -400,202 +400,73 @@
    </para>
   </sect2>
 
-  <sect2 id="vacuum-for-wraparound">
-   <title>Preventing Transaction ID Wraparound Failures</title>
+  <sect2 id="freezing">
+   <title>Freezing tuples</title>
 
-   <indexterm zone="vacuum-for-wraparound">
-    <primary>transaction ID</primary>
-    <secondary>wraparound</secondary>
-   </indexterm>
+   <para>
+    <command>VACUUM</command> freezes a page's tuples (by processing
+    the tuple header fields described in <xref
+     linkend="storage-tuple-layout"/>) as a way of avoiding long term
+    dependencies on transaction status metadata referenced therein.
+    Heap pages that only contain frozen tuples are suitable for long
+    term storage.  Larger databases are often mostly comprised of cold
+    data that is modified very infrequently, plus a relatively small
+    amount of hot data that is updated far more frequently.
+    <command>VACUUM</command> applies a variety of techniques that
+    allow it to concentrate most of its efforts on hot data.
+   </para>
+
+   <sect3 id="vacuum-xid-space">
+    <title>Managing the 32-bit Transaction ID address space</title>
 
     <indexterm>
      <primary>wraparound</primary>
      <secondary>of transaction IDs</secondary>
     </indexterm>
 
-   <para>
-    <productname>PostgreSQL</productname>'s
-    <link linkend="mvcc-intro">MVCC</link> transaction semantics
-    depend on being able to compare transaction ID (<acronym>XID</acronym>)
-    numbers: a row version with an insertion XID greater than the current
-    transaction's XID is <quote>in the future</quote> and should not be visible
-    to the current transaction.  But since transaction IDs have limited size
-    (32 bits) a cluster that runs for a long time (more
-    than 4 billion transactions) would suffer <firstterm>transaction ID
-    wraparound</firstterm>: the XID counter wraps around to zero, and all of a sudden
-    transactions that were in the past appear to be in the future &mdash; which
-    means their output become invisible.  In short, catastrophic data loss.
-    (Actually the data is still there, but that's cold comfort if you cannot
-    get at it.)  To avoid this, it is necessary to vacuum every table
-    in every database at least once every two billion transactions.
-   </para>
-
-   <para>
-    The reason that periodic vacuuming solves the problem is that
-    <command>VACUUM</command> will mark rows as <emphasis>frozen</emphasis>, indicating that
-    they were inserted by a transaction that committed sufficiently far in
-    the past that the effects of the inserting transaction are certain to be
-    visible to all current and future transactions.
-    Normal XIDs are
-    compared using modulo-2<superscript>32</superscript> arithmetic. This means
-    that for every normal XID, there are two billion XIDs that are
-    <quote>older</quote> and two billion that are <quote>newer</quote>; another
-    way to say it is that the normal XID space is circular with no
-    endpoint. Therefore, once a row version has been created with a particular
-    normal XID, the row version will appear to be <quote>in the past</quote> for
-    the next two billion transactions, no matter which normal XID we are
-    talking about. If the row version still exists after more than two billion
-    transactions, it will suddenly appear to be in the future. To
-    prevent this, <productname>PostgreSQL</productname> reserves a special XID,
-    <literal>FrozenTransactionId</literal>, which does not follow the normal XID
-    comparison rules and is always considered older
-    than every normal XID.
-    Frozen row versions are treated as if the inserting XID were
-    <literal>FrozenTransactionId</literal>, so that they will appear to be
-    <quote>in the past</quote> to all normal transactions regardless of wraparound
-    issues, and so such row versions will be valid until deleted, no matter
-    how long that is.
-   </para>
-
-   <note>
     <para>
-     In <productname>PostgreSQL</productname> versions before 9.4, freezing was
-     implemented by actually replacing a row's insertion XID
-     with <literal>FrozenTransactionId</literal>, which was visible in the
-     row's <structname>xmin</structname> system column.  Newer versions just set a flag
-     bit, preserving the row's original <structname>xmin</structname> for possible
-     forensic use.  However, rows with <structname>xmin</structname> equal
-     to <literal>FrozenTransactionId</literal> (2) may still be found
-     in databases <application>pg_upgrade</application>'d from pre-9.4 versions.
+     <productname>PostgreSQL</productname>'s <link
+      linkend="mvcc-intro">MVCC</link> transaction semantics depend on
+     being able to compare transaction ID (<acronym>XID</acronym>)
+     numbers: a row version with an insertion XID greater than the
+     current transaction's XID is <quote>in the future</quote> and
+     should not be visible to the current transaction.  But since the
+     on-disk representation of transaction IDs is only 32 bits wide,
+     the system cannot represent a <emphasis>distance</emphasis>
+     between two XIDs that exceeds about 2 billion transaction IDs.
     </para>
+
     <para>
-     Also, system catalogs may contain rows with <structname>xmin</structname> equal
-     to <literal>BootstrapTransactionId</literal> (1), indicating that they were
-     inserted during the first phase of <application>initdb</application>.
-     Like <literal>FrozenTransactionId</literal>, this special XID is treated as
-     older than every normal XID.
+     One of the purposes of periodic vacuuming is to manage the
+     transaction ID address space.  <command>VACUUM</command> will
+     mark rows as <emphasis>frozen</emphasis>, indicating that they
+     were inserted by a transaction that committed sufficiently far in
+     the past that the effects of the inserting transaction are
+     certain to be visible to all current and future transactions.
+     There is, in effect, an infinite distance between a frozen
+     transaction ID and any unfrozen transaction ID.  This allows the
+     on-disk representation of transaction IDs to recycle the 32-bit
+     address space efficiently.
     </para>
-   </note>
 
-   <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
-   </para>
-
-   <para>
-    <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
-    to determine which pages of a table must be scanned.  Normally, it
-    will skip pages that don't have any dead row versions even if those pages
-    might still have row versions with old XID values.  Therefore, normal
-    <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
-   </para>
-
-   <para>
-    The maximum time that a table can go unvacuumed is two billion
-    transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
-    XIDs older than the age specified by the configuration parameter <xref
-    linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
-    autovacuum is disabled.)
-   </para>
-
-   <para>
-    This implies that if a table is not otherwise vacuumed,
-    autovacuum will be invoked on it approximately once every
-    <varname>autovacuum_freeze_max_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname> transactions.
-    For tables that are regularly vacuumed for space reclamation purposes,
-    this is of little importance.  However, for static tables
-    (including tables that receive inserts, but no updates or deletes),
-    there is no need to vacuum for space reclamation, so it can
-    be useful to try to maximize the interval between forced autovacuums
-    on very large static tables.  Obviously one can do this either by
-    increasing <varname>autovacuum_freeze_max_age</varname> or decreasing
-    <varname>vacuum_freeze_min_age</varname>.
-   </para>
-
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
-   <para>
-    The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
-    subdirectories of the database cluster will take more space, because it
-    must store the commit status and (if <varname>track_commit_timestamp</varname> is
-    enabled) timestamp of all transactions back to
-    the <varname>autovacuum_freeze_max_age</varname> horizon.  The commit status uses
-    two bits per transaction, so if
-    <varname>autovacuum_freeze_max_age</varname> is set to its maximum allowed value
-    of two billion, <filename>pg_xact</filename> can be expected to grow to about half
-    a gigabyte and <filename>pg_commit_ts</filename> to about 20GB.  If this
-    is trivial compared to your total database size,
-    setting <varname>autovacuum_freeze_max_age</varname> to its maximum allowed value
-    is recommended.  Otherwise, set it depending on what you are willing to
-    allow for <filename>pg_xact</filename> and <filename>pg_commit_ts</filename> storage.
-    (The default, 200 million transactions, translates to about 50MB
-    of <filename>pg_xact</filename> storage and about 2GB of <filename>pg_commit_ts</filename>
-    storage.)
-   </para>
-
-   <para>
-    One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
-    it might cause <command>VACUUM</command> to do useless work: freezing a row
-    version is a waste of time if the row is modified
-    soon thereafter (causing it to acquire a new XID).  So the setting should
-    be large enough that rows are not frozen until they are unlikely to change
-    any more.
-   </para>
-
-   <para>
-    To track the age of the oldest unfrozen XIDs in a database,
-    <command>VACUUM</command> stores XID
-    statistics in the system tables <structname>pg_class</structname> and
-    <structname>pg_database</structname>.  In particular,
-    the <structfield>relfrozenxid</structfield> column of a table's
-    <structname>pg_class</structname> row contains the oldest remaining unfrozen
-    XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
-    <structfield>datfrozenxid</structfield> column of a database's
-    <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
-    appearing in that database &mdash; it is just the minimum of the
-    per-table <structfield>relfrozenxid</structfield> values within the database.
-    A convenient way to
-    examine this information is to execute queries such as:
+    <para>
+     To track the age of the oldest unfrozen XIDs in a database,
+     <command>VACUUM</command> stores XID statistics in the system
+     tables <structname>pg_class</structname> and
+     <structname>pg_database</structname>.  In particular, the
+     <structfield>relfrozenxid</structfield> column of a table's
+     <structname>pg_class</structname> row contains the oldest
+     remaining unfrozen XID at the end of the most recent
+     <command>VACUUM</command>.  All rows inserted by transactions
+     older than this cutoff XID are guaranteed to have been frozen.
+     Similarly, the <structfield>datfrozenxid</structfield> column of
+     a database's <structname>pg_database</structname> row is a lower
+     bound on the unfrozen XIDs appearing in that database &mdash; it
+     is just the minimum of the per-table
+     <structfield>relfrozenxid</structfield> values within the
+     database.  A convenient way to examine this information is to
+     execute queries such as:
 
 <programlisting>
 SELECT c.oid::regclass as table_name,
@@ -607,83 +478,13 @@ WHERE c.relkind IN ('r', 'm');
 SELECT datname, age(datfrozenxid) FROM pg_database;
 </programlisting>
 
-    The <literal>age</literal> column measures the number of transactions from the
-    cutoff XID to the current transaction's XID.
-   </para>
-
-   <tip>
-    <para>
-     When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
-     parameter is specified, <command>VACUUM</command> prints various
-     statistics about the table.  This includes information about how
-     <structfield>relfrozenxid</structfield> and
-     <structfield>relminmxid</structfield> advanced.  The same details appear
-     in the server log when autovacuum logging (controlled by <xref
-      linkend="guc-log-autovacuum-min-duration"/>) reports on a
-     <command>VACUUM</command> operation executed by autovacuum.
+     The <literal>age</literal> column measures the number of transactions from the
+     cutoff XID to the current transaction's XID.
     </para>
-   </tip>
-
-   <para>
-    <command>VACUUM</command> normally only scans pages that have been modified
-    since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
-    advanced when every page of the table
-    that might contain unfrozen XIDs is scanned.  This happens when
-    <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
-    <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
-    pages that are not already all-frozen happen to
-    require vacuuming to remove dead row versions. When <command>VACUUM</command>
-    scans every page in the table that is not already all-frozen, it should
-    set <literal>age(relfrozenxid)</literal> to a value just a little more than the
-    <varname>vacuum_freeze_min_age</varname> setting
-    that was used (more by the number of transactions started since the
-    <command>VACUUM</command> started).  <command>VACUUM</command>
-    will set <structfield>relfrozenxid</structfield> to the oldest XID
-    that remains in the table, so it's possible that the final value
-    will be much more recent than strictly required.
-    If no <structfield>relfrozenxid</structfield>-advancing
-    <command>VACUUM</command> is issued on the table until
-    <varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
-    be forced for the table.
-   </para>
-
-   <para>
-    If for some reason autovacuum fails to clear old XIDs from a table, the
-    system will begin to emit warning messages like this when the database's
-    oldest XIDs reach forty million transactions from the wraparound point:
-
-<programlisting>
-WARNING:  database "mydb" must be vacuumed within 39985967 transactions
-HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
-</programlisting>
-
-    (A manual <command>VACUUM</command> should fix the problem, as suggested by the
-    hint; but note that the <command>VACUUM</command> must be performed by a
-    superuser, else it will fail to process system catalogs and thus not
-    be able to advance the database's <structfield>datfrozenxid</structfield>.)
-    If these warnings are
-    ignored, the system will shut down and refuse to start any new
-    transactions once there are fewer than three million transactions left
-    until wraparound:
-
-<programlisting>
-ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
-HINT:  Stop the postmaster and vacuum that database in single-user mode.
-</programlisting>
-
-    The three-million-transaction safety margin exists to let the
-    administrator recover without data loss, by manually executing the
-    required <command>VACUUM</command> commands.  However, since the system will not
-    execute commands once it has gone into the safety shutdown mode,
-    the only way to do this is to stop the server and start the server in single-user
-    mode to execute <command>VACUUM</command>.  The shutdown mode is not enforced
-    in single-user mode.  See the <xref linkend="app-postgres"/> reference
-    page for details about using single-user mode.
-   </para>
+   </sect3>
 
    <sect3 id="vacuum-for-multixact-wraparound">
-    <title>Multixacts and Wraparound</title>
+    <title>Managing the 32-bit MultiXactId address space</title>
 
     <indexterm>
      <primary>MultiXactId</primary>
@@ -704,47 +505,109 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      particular multixact ID is stored separately in
      the <filename>pg_multixact</filename> subdirectory, and only the multixact ID
      appears in the <structfield>xmax</structfield> field in the tuple header.
-     Like transaction IDs, multixact IDs are implemented as a
-     32-bit counter and corresponding storage, all of which requires
-     careful aging management, storage cleanup, and wraparound handling.
-     There is a separate storage area which holds the list of members in
-     each multixact, which also uses a 32-bit counter and which must also
-     be managed.
+     Like transaction IDs, multixact IDs are implemented as a 32-bit
+     counter and corresponding storage.
     </para>
 
     <para>
-     Whenever <command>VACUUM</command> scans any part of a table, it will replace
-     any multixact ID it encounters which is older than
-     <xref linkend="guc-vacuum-multixact-freeze-min-age"/>
-     by a different value, which can be the zero value, a single
-     transaction ID, or a newer multixact ID.  For each table,
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
-     possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
+     A separate <structfield>relminmxid</structfield> field can be
+     advanced any time <structfield>relfrozenxid</structfield> is
+     advanced.  <command>VACUUM</command> manages the MultiXactId
+     address space by implementing rules that are analogous to the
+     approach taken with Transaction IDs.  Many of the XID-based
+     settings that influence <command>VACUUM</command>'s behavior have
+     direct MultiXactId analogs. A convenient way to examine
+     information about the MultiXactId address space is to execute
+     queries such as:
+    </para>
+<programlisting>
+SELECT c.oid::regclass as table_name,
+       mxid_age(c.relminmxid)
+FROM pg_class c
+WHERE c.relkind IN ('r', 'm');
+
+SELECT datname, mxid_age(datminmxid) FROM pg_database;
+</programlisting>
+   </sect3>
+
+   <sect3 id="freezing-strategies">
+    <title>Lazy and eager freezing strategies</title>
+    <para>
+     When <command>VACUUM</command> is configured to freeze more
+     aggressively it will typically set the table's
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> fields to relatively recent
+     values.  However, there can be significant variation among tables
+     with varying workload characteristics.  There can even be
+     variation in how <structfield>relfrozenxid</structfield>
+     advancement takes place over time for the same table, across
+     successive <command>VACUUM</command> operations.  Sometimes
+     <command>VACUUM</command> will be able to advance
+     <structfield>relfrozenxid</structfield> and
+     <structfield>relminmxid</structfield> by relatively many
+     XIDs/MXIDs despite performing relatively little freezing work.  On
+     the other hand <command>VACUUM</command> can sometimes freeze many
+     individual pages while only advancing
+     <structfield>relfrozenxid</structfield> by as few as one or two
+     XIDs (this is typically seen following bulk loading).
     </para>
 
-    <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
+    <tip>
+     <para>
+      When the <command>VACUUM</command> command's <literal>VERBOSE</literal>
+      parameter is specified, <command>VACUUM</command> prints various
+      statistics about the table.  This includes information about how
+      <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> advanced, as well as
+      information about how many pages were newly frozen.  The same
+      details appear in the server log when autovacuum logging
+      (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+      reports on a <command>VACUUM</command> operation executed by
+      autovacuum.
+     </para>
+    </tip>
 
     <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a general rule, the design of <command>VACUUM</command>
+     prioritizes stable and predictable performance characteristics
+     over time, while still leaving some scope for freezing lazily when
+     a lazy strategy is likely to avoid unnecessary work altogether.  Tables
+     whose heap relation on-disk size is less than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> at the start of
+     <command>VACUUM</command> will have page freezing triggered based
+     on <quote>lazy</quote> criteria.  Freezing will only take place
+     when one or more XIDs attain an age greater than <xref
+      linkend="guc-vacuum-freeze-min-age"/>, or when one or more MXIDs
+     attain an age greater than <xref
+      linkend="guc-vacuum-multixact-freeze-min-age"/>.
+    </para>
+    <para>
+     Tables that are larger than <xref
+      linkend="guc-vacuum-freeze-strategy-threshold"/> will have
+     <command>VACUUM</command> trigger freezing for all pages that are
+     eligible to be frozen under the lazy criteria, as well as any page
+     that <command>VACUUM</command> will set all-visible.
+     This is the eager freezing strategy.  The design makes the soft
+     assumption that larger tables will tend to consist of pages that
+     will only need to be processed by <command>VACUUM</command> once.
+     The overhead of freezing each page is expected to be slightly
+     higher in the short term, but much lower in the long term, at
+     least on average.  Eager freezing also limits the accumulation of
+     unfrozen pages, which tends to improve performance
+     <emphasis>stability</emphasis> over time.
+    </para>
+    <para>
+     Occasionally, <command>VACUUM</command> is required to advance
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> up to a specific value
+     to ensure the system always has a healthy amount of usable
+     transaction ID address space.  This usually only occurs when
+     <command>VACUUM</command> must be run by autovacuum specifically
+     for the purpose of advancing <structfield>relfrozenxid</structfield>,
+     when no <command>VACUUM</command> has been triggered for some
+     time.  In practice most individual tables will consistently have
+     somewhat recent values through routine vacuuming to clean up old
+     row versions.
     </para>
    </sect3>
   </sect2>
@@ -802,117 +665,197 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     <xref linkend="guc-superuser-reserved-connections"/> limits.
    </para>
 
-   <para>
-    Tables whose <structfield>relfrozenxid</structfield> value is more than
-    <xref linkend="guc-autovacuum-freeze-max-age"/> transactions old are always
-    vacuumed (this also applies to those tables whose freeze max age has
-    been modified via storage parameters; see below).  Otherwise, if the
-    number of tuples obsoleted since the last
-    <command>VACUUM</command> exceeds the <quote>vacuum threshold</quote>, the
-    table is vacuumed.  The vacuum threshold is defined as:
+   <sect3 id="triggering-thresholds">
+    <title>Triggering thresholds</title>
+    <para>
+     Tables whose <structfield>relfrozenxid</structfield> value is
+     more than <xref linkend="guc-autovacuum-freeze-max-age"/>
+     transactions old are always vacuumed (this also applies to those
+     tables whose freeze max age has been modified via storage
+     parameters; see below).  Otherwise, if the number of tuples
+     obsoleted since the last <command>VACUUM</command> exceeds the
+     <quote>vacuum threshold</quote>, the table is vacuumed.  The
+     vacuum threshold is defined as:
 <programlisting>
 vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
 </programlisting>
-    where the vacuum base threshold is
-    <xref linkend="guc-autovacuum-vacuum-threshold"/>,
-    the vacuum scale factor is
-    <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
+    where the vacuum base threshold is <xref
+     linkend="guc-autovacuum-vacuum-threshold"/>, the vacuum scale
+    factor is <xref linkend="guc-autovacuum-vacuum-scale-factor"/>,
     and the number of tuples is
     <structname>pg_class</structname>.<structfield>reltuples</structfield>.
-   </para>
+    </para>
 
-   <para>
-    The table is also vacuumed if the number of tuples inserted since the last
-    vacuum has exceeded the defined insert threshold, which is defined as:
+    <para>
+     The table is also vacuumed if the number of tuples inserted since
+     the last vacuum has exceeded the defined insert threshold, which
+     is defined as:
 <programlisting>
 vacuum insert threshold = vacuum base insert threshold + vacuum insert scale factor * number of tuples
 </programlisting>
-    where the vacuum insert base threshold is
-    <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>,
-    and vacuum insert scale factor is
-    <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.
-    Such vacuums may allow portions of the table to be marked as
-    <firstterm>all visible</firstterm> and also allow tuples to be frozen, which
-    can reduce the work required in subsequent vacuums.
-    For tables which receive <command>INSERT</command> operations but no or
-    almost no <command>UPDATE</command>/<command>DELETE</command> operations,
-    it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the cumulative statistics system;
-    it is a semi-accurate count updated by each <command>UPDATE</command>,
-    <command>DELETE</command> and <command>INSERT</command> operation.  (It is
-    only semi-accurate because some information might be lost under heavy
-    load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
-   </para>
+     where the vacuum insert base threshold
+     is <xref linkend="guc-autovacuum-vacuum-insert-threshold"/>, and
+     vacuum insert scale factor is <xref
+      linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  Such
+     vacuums may allow portions of the table to be marked as
+     <firstterm>all visible</firstterm> and also allow tuples to be
+     frozen.  The number of obsolete tuples and the number of inserted
+     tuples are obtained from the cumulative statistics system; it is
+     a semi-accurate count updated by each <command>UPDATE</command>,
+     <command>DELETE</command> and <command>INSERT</command>
+     operation.  (It is only semi-accurate because some information
+     might be lost under heavy load.)
+    </para>
 
-   <para>
-    For analyze, a similar condition is used: the threshold, defined as:
+    <para>
+     For analyze, a similar condition is used: the threshold, defined as:
 <programlisting>
 analyze threshold = analyze base threshold + analyze scale factor * number of tuples
 </programlisting>
-    is compared to the total number of tuples inserted, updated, or deleted
-    since the last <command>ANALYZE</command>.
-   </para>
-
-   <para>
-    Partitioned tables are not processed by autovacuum.  Statistics
-    should be collected by running a manual <command>ANALYZE</command> when it is
-    first populated, and again whenever the distribution of data in its
-    partitions changes significantly.
-   </para>
-
-   <para>
-    Temporary tables cannot be accessed by autovacuum.  Therefore,
-    appropriate vacuum and analyze operations should be performed via
-    session SQL commands.
-   </para>
-
-   <para>
-    The default thresholds and scale factors are taken from
-    <filename>postgresql.conf</filename>, but it is possible to override them
-    (and many other autovacuum control parameters) on a per-table basis; see
-    <xref linkend="sql-createtable-storage-parameters"/> for more information.
-    If a setting has been changed via a table's storage parameters, that value
-    is used when processing that table; otherwise the global settings are
-    used. See <xref linkend="runtime-config-autovacuum"/> for more details on
-    the global settings.
-   </para>
-
-   <para>
-    When multiple workers are running, the autovacuum cost delay parameters
-    (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
-    <quote>balanced</quote> among all the running workers, so that the
-    total I/O impact on the system is the same regardless of the number
-    of workers actually running.  However, any workers processing tables whose
-    per-table <literal>autovacuum_vacuum_cost_delay</literal> or
-    <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
-    are not considered in the balancing algorithm.
-   </para>
-
-   <para>
-    Autovacuum workers generally don't block other commands.  If a process
-    attempts to acquire a lock that conflicts with the
-    <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
-    acquisition will interrupt the autovacuum.  For conflicting lock modes,
-    see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
-    is running to prevent transaction ID wraparound (i.e., the autovacuum query
-    name in the <structname>pg_stat_activity</structname> view ends with
-    <literal>(to prevent wraparound)</literal>), the autovacuum is not
-    automatically interrupted.
-   </para>
-
-   <warning>
-    <para>
-     Regularly running commands that acquire locks conflicting with a
-     <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
-     effectively prevent autovacuums from ever completing.
+     is compared to the total number of tuples inserted, updated, or
+     deleted since the last <command>ANALYZE</command>.
     </para>
-   </warning>
+
+   </sect3>
+
+   <sect3 id="anti-wraparound">
+    <title>Anti-wraparound autovacuum</title>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of transaction IDs</secondary>
+    </indexterm>
+
+    <indexterm>
+     <primary>wraparound</primary>
+     <secondary>of multixact IDs</secondary>
+    </indexterm>
+
+    <para>
+     If no <structfield>relfrozenxid</structfield>-advancing
+     <command>VACUUM</command> is issued on the table before
+     <varname>autovacuum_freeze_max_age</varname> is reached, an
+     anti-wraparound autovacuum will soon be launched against the
+     table.  This reliably advances
+     <structfield>relfrozenxid</structfield> when there is no other
+     reason for <command>VACUUM</command> to run, or when a smaller
+     table had <command>VACUUM</command> operations that lazily opted
+     not to advance <structfield>relfrozenxid</structfield>.
+    </para>
+
+    <para>
+     An anti-wraparound autovacuum will also be triggered for any
+     table whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.  However,
+     if the storage occupied by multixacts members exceeds 2GB,
+     anti-wraparound vacuum might occur more often than this.
+    </para>
+
+    <para>
+     If for some reason autovacuum fails to clear old XIDs from a table, the
+     system will begin to emit warning messages like this when the database's
+     oldest XIDs reach forty million transactions from the wraparound point:
+
+<programlisting>
+WARNING:  database "mydb" must be vacuumed within 39985967 transactions
+HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
+</programlisting>
+
+     (A manual <command>VACUUM</command> should fix the problem, as suggested by the
+     hint; but note that the <command>VACUUM</command> must be performed by a
+     superuser, else it will fail to process system catalogs and thus not
+     be able to advance the database's <structfield>datfrozenxid</structfield>.)
+     If these warnings are
+     ignored, the system will shut down and refuse to start any new
+     transactions once there are fewer than three million transactions left
+     until wraparound:
+
+<programlisting>
+ERROR:  database is not accepting commands to avoid wraparound data loss in database "mydb"
+HINT:  Stop the postmaster and vacuum that database in single-user mode.
+</programlisting>
+
+     The three-million-transaction safety margin exists to let the
+     administrator recover by manually executing the required
+     <command>VACUUM</command> commands.  It is usually sufficient to
+     allow autovacuum to finish against the table with the oldest
+     <structfield>relfrozenxid</structfield> and/or
+     <structfield>relminmxid</structfield> value.  The wraparound
+     failsafe mechanism controlled by <xref
+      linkend="guc-vacuum-failsafe-age"/> and <xref
+      linkend="guc-vacuum-multixact-failsafe-age"/> will typically
+     trigger before warning messages are first emitted.  This happens
+     dynamically, in any anti-wraparound autovacuum worker that is
+     tasked with advancing very old table ages.  It will also happen
+     during manual <command>VACUUM</command> operations.
+    </para>
+
+    <para>
+     The shutdown mode is not enforced in single-user mode, which can
+     be useful in some disaster recovery scenarios.  See the <xref
+      linkend="app-postgres"/> reference page for details about using
+     single-user mode.
+    </para>
+   </sect3>
+
+   <sect3 id="Limitations">
+    <title>Limitations</title>
+
+    <para>
+     Partitioned tables are not processed by autovacuum.  Statistics
+     should be collected by running a manual <command>ANALYZE</command> when it is
+     first populated, and again whenever the distribution of data in its
+     partitions changes significantly.
+    </para>
+
+    <para>
+     Temporary tables cannot be accessed by autovacuum.  Therefore,
+     appropriate vacuum and analyze operations should be performed via
+     session SQL commands.
+    </para>
+
+    <para>
+     The default thresholds and scale factors are taken from
+     <filename>postgresql.conf</filename>, but it is possible to override them
+     (and many other autovacuum control parameters) on a per-table basis; see
+     <xref linkend="sql-createtable-storage-parameters"/> for more information.
+     If a setting has been changed via a table's storage parameters, that value
+     is used when processing that table; otherwise the global settings are
+     used. See <xref linkend="runtime-config-autovacuum"/> for more details on
+     the global settings.
+    </para>
+
+    <para>
+     When multiple workers are running, the autovacuum cost delay parameters
+     (see <xref linkend="runtime-config-resource-vacuum-cost"/>) are
+     <quote>balanced</quote> among all the running workers, so that the
+     total I/O impact on the system is the same regardless of the number
+     of workers actually running.  However, any workers processing tables whose
+     per-table <literal>autovacuum_vacuum_cost_delay</literal> or
+     <literal>autovacuum_vacuum_cost_limit</literal> storage parameters have been set
+     are not considered in the balancing algorithm.
+    </para>
+
+    <para>
+     Autovacuum workers generally don't block other commands.  If a process
+     attempts to acquire a lock that conflicts with the
+     <literal>SHARE UPDATE EXCLUSIVE</literal> lock held by autovacuum, lock
+     acquisition will interrupt the autovacuum.  For conflicting lock modes,
+     see <xref linkend="table-lock-compatibility"/>.  However, if the autovacuum
+     is running to prevent transaction ID wraparound (i.e., the autovacuum query
+     name in the <structname>pg_stat_activity</structname> view ends with
+     <literal>(to prevent wraparound)</literal>), the autovacuum is not
+     automatically interrupted.
+    </para>
+
+    <warning>
+     <para>
+      Regularly running commands that acquire locks conflicting with a
+      <literal>SHARE UPDATE EXCLUSIVE</literal> lock (e.g., ANALYZE) can
+      effectively prevent autovacuums from ever completing.
+     </para>
+    </warning>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index eabbf9e65..859175718 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1503,7 +1503,7 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
      and/or <command>ANALYZE</command> operations on this table following the rules
      discussed in <xref linkend="autovacuum"/>.
      If false, this table will not be autovacuumed, except to prevent
-     transaction ID wraparound. See <xref linkend="vacuum-for-wraparound"/> for
+     transaction ID wraparound. See <xref linkend="vacuum-xid-space"/> for
      more about wraparound prevention.
      Note that the autovacuum daemon does not run at all (except to prevent
      transaction ID wraparound) if the <xref linkend="guc-autovacuum"/>
diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac..1817ed1e3 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -128,7 +128,7 @@ PREPARE TRANSACTION <replaceable class="parameter">transaction_id</replaceable>
     This will interfere with the ability of <command>VACUUM</command> to reclaim
     storage, and in extreme cases could cause the database to shut down
     to prevent transaction ID wraparound (see <xref
-    linkend="vacuum-for-wraparound"/>).  Keep in mind also that the transaction
+    linkend="vacuum-xid-space"/>).  Keep in mind also that the transaction
     continues to hold whatever locks it held.  The intended usage of the
     feature is that a prepared transaction will normally be committed or
     rolled back as soon as an external transaction manager has verified that
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 8b078221a..3cb4668ee 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,9 +158,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.
+      all tuples are known to be frozen are always skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
@@ -215,7 +217,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
       there are many dead tuples in the table.  This may be useful
       when it is necessary to make <command>VACUUM</command> run as
       quickly as possible to avoid imminent transaction ID wraparound
-      (see <xref linkend="vacuum-for-wraparound"/>).  However, the
+      (see <xref linkend="vacuum-xid-space"/>).  However, the
       wraparound failsafe mechanism controlled by <xref
        linkend="guc-vacuum-failsafe-age"/>  will generally trigger
       automatically to avoid transaction ID wraparound failure, and
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 841aced3b..48942c58f 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -180,7 +180,7 @@ PostgreSQL documentation
       <term><option>--freeze</option></term>
       <listitem>
        <para>
-        Aggressively <quote>freeze</quote> tuples.
+        Eagerly <quote>freeze</quote> tuples.
        </para>
       </listitem>
      </varlistentry>
@@ -259,7 +259,7 @@ PostgreSQL documentation
         transaction ID age of at least
         <replaceable class="parameter">xid_age</replaceable>.  This setting
         is useful for prioritizing tables to process to prevent transaction
-        ID wraparound (see <xref linkend="vacuum-for-wraparound"/>).
+        ID wraparound (see <xref linkend="vacuum-xid-space"/>).
        </para>
        <para>
         For the purposes of this option, the transaction ID age of a relation
diff --git a/doc/src/sgml/xact.sgml b/doc/src/sgml/xact.sgml
index b467660ee..c4146539f 100644
--- a/doc/src/sgml/xact.sgml
+++ b/doc/src/sgml/xact.sgml
@@ -49,8 +49,8 @@
 
   <para>
    The internal transaction ID type <type>xid</type> is 32 bits wide
-   and <link linkend="vacuum-for-wraparound">wraps around</link> every
-   4 billion transactions. A 32-bit epoch is incremented during each
+   and <link linkend="vacuum-xid-space">wraps around</link> every
+   2 billion transactions. A 32-bit epoch is incremented during each
    wraparound. There is also a 64-bit type <type>xid8</type> which
    includes this epoch and therefore does not wrap around during the
    life of an installation;  it can be converted to xid by casting.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..f9e4194cd 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,21 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), while still using default
+# settings for vacuum_freeze_table_age/autovacuum_freeze_max_age.
+#
+# This makes VACUUM freeze tuples just as aggressively as it would if the
+# VACUUM command's FREEZE option had been specified, for almost all heap pages.
+# However, VACUUM is still unwilling to wait indefinitely for a cleanup lock,
+# just to freeze a few XIDs/MXIDs that still aren't very old.
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +81,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +97,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +108,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +121,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +133,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +141,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.38.1

#78 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Peter Geoghegan (#77)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 9, 2023 at 7:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jan 3, 2023 at 12:30 PM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v14.

This has stopped applying due to conflicts with nearby work on VACUUM
from Tom. So I attached a new revision, v15, just to make CFTester
green again.

I didn't have time to incorporate any of the feedback from Matthias
just yet. That will have to wait until v16.

I have looked into the patch set. I think 0001 looks good to me, I have
a few questions about 0002, and I haven't looked at 0003 yet.

1.
+    /*
+     * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+     * XMID table age (whichever is greater currently).
+     */
+    XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+        ((double) freeze_table_age + 0.5);

I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
be the actual fraction, right? What is the point of adding 0.5 to the
divisor? If there is a logical reason, maybe we can explain it in the
comments.

2.
While looking into the logic of 'lazy_scan_strategy', I think the idea
looks very good. The only thing is that we have kept eager freezing and
eager scanning completely independent. Don't you think that if a table
is chosen for an eager scan, then we should force eager freezing as
well?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In reply to: Dilip Kumar (#78)
Re: New strategies for freezing, advancing relfrozenxid early

On Sun, Jan 15, 2023 at 9:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have looked into the patch set, I think 0001 looks good to me about
0002 I have a few questions, 0003 I haven't yet looked at

Thanks for taking a look.

I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
be the actual fraction, right? What is the point of adding 0.5 to the
divisor? If there is a logical reason, maybe we can explain it in the
comments.

It's just a way of avoiding division by zero.
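
To spell that out with a tiny standalone sketch (not the patch's actual
code; the names just mirror the fragment you quoted), the + 0.5 keeps
the result finite even if freeze_table_age were somehow 0, and for any
realistic setting it changes nothing measurable:

static double
table_age_frac(double xids_since_relfrozenxid, double freeze_table_age)
{
	/*
	 * Without the + 0.5, a freeze_table_age of 0 would divide by zero.
	 * With it, the result merely becomes a very large fraction, and for
	 * normal settings (millions of XIDs) the extra 0.5 is negligible.
	 */
	return xids_since_relfrozenxid / (freeze_table_age + 0.5);
}

/*
 * table_age_frac(50000, 150000000) is ~0.00033, essentially the plain
 * ratio; table_age_frac(50000, 0) is 100000, finite rather than a
 * division-by-zero.
 */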

While looking into the logic of 'lazy_scan_strategy', I think the idea
looks very good. The only thing is that we have kept eager freezing and
eager scanning completely independent. Don't you think that if a table
is chosen for an eager scan, then we should force eager freezing as
well?

Earlier versions of the patch kind of worked that way.
lazy_scan_strategy would actually use twice the GUC setting to
determine scanning strategy. That approach could make our "transition
from lazy to eager strategies" involve an excessive amount of
"catch-up freezing" in the VACUUM operation that advanced relfrozenxid
for the first time, which you see an example of here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch

Now we treat the scanning and freezing strategies as two independent
choices. Of course they're not independent in any practical sense, but
I think it's slightly simpler and more elegant that way -- it makes
the GUC vacuum_freeze_strategy_threshold strictly about freezing
strategy, while still leading to VACUUM advancing relfrozenxid in a
way that makes sense. It just happens as a second order effect. Why
add a special case?

In principle the break-even point for eager scanning strategy (i.e.
advancing relfrozenxid) is based on the added cost only under this
scheme. There is no reason for lazy_scan_strategy to care about what
happened in the past to make the eager scanning strategy look like a
good idea. Similarly, there isn't any practical reason why
lazy_scan_strategy needs to anticipate what will happen in the near
future with freezing.
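
To illustrate the shape of that separation (only a sketch of the
independence, with made-up field names and placeholder rules, not the
actual lazy_scan_strategy logic), the two choices boil down to two
predicates evaluated once, from different inputs, at the start of each
VACUUM:

#include <stdbool.h>
#include <stdint.h>

typedef struct StrategyInputs
{
	double		tableagefrac;		/* table age, as a fraction of the point
									 * where advancing relfrozenxid becomes
									 * urgent */
	int64_t		rel_nblocks;		/* heap size at the start of VACUUM */
	int64_t		freeze_threshold;	/* vacuum_freeze_strategy_threshold,
									 * expressed in blocks */
} StrategyInputs;

/* Scanning choice: is advancing relfrozenxid worth the added scan cost now? */
static bool
scan_all_visible_pages(const StrategyInputs *in)
{
	/* placeholder rule: scan eagerly once table age makes it worthwhile */
	return in->tableagefrac >= 1.0;
}

/* Freezing choice: driven purely by physical table size, nothing else */
static bool
freeze_eagerly(const StrategyInputs *in)
{
	return in->rel_nblocks >= in->freeze_threshold;
}

Neither predicate consults the other's inputs; eager freezing simply
makes the eager scanning choice pay off as a second order effect, as
described above.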

--
Peter Geoghegan

In reply to: Peter Geoghegan (#77)
3 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Sun, Jan 8, 2023 at 5:45 PM Peter Geoghegan <pg@bowt.ie> wrote:

I didn't have time to incorporate any of the feedback from Matthias
just yet. That will have to wait until v16.

Attached is v16, which incorporates some of Matthias' feedback.

I've rolled back the major restructuring to the "Routine Vacuuming"
docs that previously appeared in 0003, preferring to take a much more
incremental approach. I do still think that somebody needs to do some
major reworking of that, just in general. That can be done by a
separate patch. There are now only fairly mechanical doc updates in
all 3 patches.

Other changes:

* vacuum_freeze_strategy_threshold is now MB-based, and can be set up to 512TB.

* Various refinements to comments.

--
Peter Geoghegan

Attachments:

v16-0003-Finish-removing-aggressive-mode-VACUUM.patch (application/x-patch)
From eaeaab3c76b97f8fe1b96ff938aabbb7dd622960 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v16 3/3] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient (especially after 9.6's commit fd31cd26),
its naive approach had one notable advantage: users only had to think
about a single kind of lazy vacuum (the only kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of sometimes
waiting for a cleanup lock to make sure that older XIDs get frozen, so
that relfrozenxid can be advanced by a sufficient amount.  All that
really changes is the information driving VACUUM's decision on waiting.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now directly understands the importance of never falling
too far behind on the work of freezing physical heap pages at the level
of the whole table, following recent work to add VM scanning strategies.
It is generally safer for VACUUM to press on with freezing other heap
pages from the table instead.  Even if relfrozenxid can only be advanced
by relatively few XIDs as a consequence, VACUUM should have more than
ample opportunity to catch up next time, since there is bound to be no
more than a small number of problematic unfrozen pages left behind.
VACUUM now tends to consistently advance relfrozenxid (at least by some
small amount) all the time in larger tables, so all that has to happen
for relfrozenxid to fully catch up is for a few remaining unfrozen pages
to get frozen.  Since relfrozenxid is now considered to be no more than
a lagging indicator of freezing, and since relfrozenxid isn't used to
trigger freezing in the way that it once was, time is on our side.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount when it really can be avoided fairly easily, without promising to
do so (VACUUM only promises to advance up to MinXid/MinMulti).

XXX We also need to avoid the special auto-cancellation behavior for
antiwraparound autovacuums to make this truly safe.  See also, related
patch for this: https://commitfest.postgresql.org/41/4027/

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/heapam.c              |   2 +
 src/backend/access/heap/vacuumlazy.c          | 220 +++++++++++-------
 src/backend/commands/vacuum.c                 |  42 +++-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/maintenance.sgml                 |  37 +--
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  33 +--
 8 files changed, 220 insertions(+), 151 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 16642b8b7..af1aedb80 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -278,6 +278,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold that triggers VACUUM's eager freezing strategy
 	 */
@@ -354,7 +361,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 351b822b6..a34493c46 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7056,6 +7056,8 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.MinXid = FreezeLimit;
+	cutoffs.MinMulti = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold_nblocks = 0;
 	cutoffs.tableagefrac = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ecf4d7e05..702a9767f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -157,8 +157,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -264,7 +262,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -462,7 +461,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -542,17 +541,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -560,7 +556,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
 		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -629,33 +624,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 vacrel->dbname,
 							 vacrel->relnamespace,
@@ -941,6 +917,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -954,10 +931,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -966,21 +941,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1404,8 +1372,6 @@ lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 	if (force_scan_all)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
 
-	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
-
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
 
@@ -1999,17 +1965,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Else returns false, indicating
+ * that page must be processed by lazy_scan_prune in the usual way after all.
+ * Acquires a cleanup lock on buf/page for caller before returning false.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2017,7 +1998,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2025,6 +2007,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2034,6 +2017,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2075,34 +2059,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2151,10 +2108,97 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 *
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when waiting is
+		 * expected to preserve the current NewRelfrozenXid/NewRelminMxid tracker
+		 * values, and when those trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								   vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								 vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+			/* Accept reduced processing for this page after all */
+		}
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index de2e98368..1664511d0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -952,13 +952,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1137,6 +1132,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age > effective_multixact_freeze_max_age)
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied to
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when relfrozenxid/relminmxid ages already
+	 * approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
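+
+	/*
+	 * Rough worked example (purely illustrative, not enforced anywhere): with
+	 * vacuum_freeze_table_age left at its new default of -1 it falls back on
+	 * autovacuum_freeze_max_age (200 million by default), so MinXid lands 190
+	 * million XIDs behind nextXID -- 95% of the way to the point where an
+	 * antiwraparound autovacuum would be triggered.  MinMulti is derived the
+	 * same way from multixact_freeze_table_age.
+	 */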
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1154,8 +1182,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c2..056ef0178 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table; however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 18c32983f..1c4bd1450 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -493,19 +493,13 @@
     will skip pages that don't have any dead row versions even if those pages
     might still have row versions with old XID values.  Therefore, normal
     <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
    </para>
 
    <para>
     The maximum time that a table can go unvacuumed is two billion
     transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
+    the time of the last vacuum that advanced <structfield>relfrozenxid</structfield>.
+    Autovacuum is invoked on any table that might contain unfrozen rows with
     XIDs older than the age specified by the configuration parameter <xref
     linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
     autovacuum is disabled.)
@@ -563,8 +557,7 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
+    advanced <structfield>relfrozenxid</structfield>.  Similarly, the
     <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
@@ -721,22 +714,14 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     </para>
 
     <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
-
-    <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a safety device, a vacuum to advance
+     <structfield>relminmxid</structfield> will occur for any table
+     whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.
+     Also, if the storage occupied by multixacts members exceeds 2GB,
+     vacuum scans will occur more often for all tables, starting with those that
+     have the oldest multixact-age.  This will occur even if
+     autovacuum is nominally disabled.
     </para>
    </sect3>
   </sect2>
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..f9e4194cd 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,21 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), while still using default
+# settings for vacuum_freeze_table_age/autovacuum_freeze_max_age.
+#
+# This makes VACUUM freeze tuples on almost all heap pages just as aggressively
+# as it would if the VACUUM command's FREEZE option had been specified.
+# However, VACUUM is still unwilling to wait indefinitely for a cleanup lock,
+# just to freeze a few XIDs/MXIDs that still aren't very old.
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +81,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +97,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +108,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +121,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +133,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +141,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.39.0

Attachment: v16-0002-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/x-patch)
From cb7775cae4af79ff43de22e9ed0a9af992fe58da Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v16 2/3] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM.  This is a local copy of the visibility
map, which can spill to a temp file as and when required.  Tables that are
small enough to only need a single visibility map page don't need to use a
temp file.  VACUUM now uses its VM snapshot
(not the authoritative VM) to determine which pages to scan.  VACUUM no
longer scans pages that were concurrently unset in the VM, since all of
the pages it will scan are known and fixed before scanning even begins.
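
In rough outline, the intended calling pattern looks something like this
(a simplified sketch pieced together from the declarations and vacuumlazy.c
changes below; prefetching, pruning, progress reporting, and error handling
are all omitted, and rel/rel_pages are as in heap_vacuum_rel):

    BlockNumber scanned_pages_lazy,
                scanned_pages_eager,
                blkno,
                next;
    vmsnapshot *vmsnap;

    /* Once, at the start of VACUUM: capture the VM as it is right now */
    vmsnap = visibilitymap_snap_acquire(rel, rel_pages,
                                        &scanned_pages_lazy,
                                        &scanned_pages_eager);

    /* Pick a strategy using the two page counts, table age, etc */
    visibilitymap_snap_strategy(vmsnap, VMSNAP_SCAN_EAGER);   /* or _LAZY/_ALL */

    /* Scan exactly the pages that the snapshot hands back, in order */
    next = visibilitymap_snap_next(vmsnap);
    while (next < rel_pages)
    {
        blkno = next;
        next = visibilitymap_snap_next(vmsnap);

        /* ... prune and freeze block blkno ... */
    }

    visibilitymap_snap_release(vmsnap);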

VACUUM decides on its VM snapshot scanning strategy up-front, based on
information about costs taken from the snapshot, and relfrozenxid age.
Lazy scanning allows VACUUM to skip all-visible pages, whereas eager
scanning allows VACUUM to advance relfrozenxid.  This works in tandem
with VACUUM's freezing strategies.

This work often results in VACUUM advancing relfrozenxid at a cadence
that is driven by underlying physical costs, not table age (through
settings like autovacuum_freeze_max_age).  Antiwraparound autovacuums
will be far less common as a result.  Freezing now drives relfrozenxid,
rather than relfrozenxid driving freezing.  Even tables that always use
lazy freezing will have a decent chance of relfrozenxid advancement long
before table age nears autovacuum_freeze_max_age.

This also lays the groundwork for completely removing aggressive mode
VACUUMs in a later commit.  Scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  For now VACUUM will still
condition its cleanup lock wait behavior on being in aggressive mode.

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  20 +-
 src/backend/access/heap/heapam.c              |   1 +
 src/backend/access/heap/vacuumlazy.c          | 618 +++++++++---------
 src/backend/access/heap/visibilitymap.c       | 539 +++++++++++++++
 src/backend/commands/vacuum.c                 |  68 +-
 src/backend/utils/misc/guc_tables.c           |   8 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 +-
 doc/src/sgml/maintenance.sgml                 |  78 +--
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 13 files changed, 1037 insertions(+), 413 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index daaa01a25..d8df744da 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d900b1be1..16642b8b7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 #define VACOPT_SKIP_DATABASE_STATS 0x100	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200	/* only vac_update_datfrozenxid() */
 
@@ -282,6 +282,24 @@ struct VacuumCutoffs
 	 * Threshold that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold_nblocks;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid/relminmxid
+	 * advancement strictly necessary.  Values near 0.0 mean that both
+	 * relfrozenxid and relminmxid are recently allocated XID/MXID values.
+	 *
+	 * We don't need separate relfrozenxid and relminmxid tableagefrac
+	 * variants.  We base tableagefrac on whichever pg_class field is closer
+	 * to the point of having autovacuum.c launch an autovacuum to advance the
+	 * field's value.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * and/or relminmxid proactively.
+	 */
+	double		tableagefrac;
 };
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 95f4d59e3..351b822b6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7057,6 +7057,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold_nblocks = 0;
+	cutoffs.tableagefrac = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f9536e522..ecf4d7e05 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -110,10 +110,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
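+
+/*
+ * Illustrative example only (assuming straight-line interpolation between the
+ * two thresholds above, applied between TABLEAGEFRAC_MIDPOINT and
+ * TABLEAGEFRAC_HIGHPOINT): at tableagefrac = 0.7, eager scanning is accepted
+ * when it adds extra scanned pages of up to ~37.5% of rel_pages; beyond that
+ * we fall back on lazy scanning.
+ */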
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -151,8 +159,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -171,7 +177,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *dbname;
@@ -223,6 +231,7 @@ typedef struct LVPagePruneState
 {
 	bool		hastup;			/* Page prevents rel truncation? */
 	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
+	bool		pd_allvis_corrupt;	/* PD_ALL_VISIBLE bit spuriously set? */
 
 	/*
 	 * State describes the proper VM bit states to set for the page following
@@ -245,11 +254,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  bool force_scan_all);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -279,7 +285,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -311,10 +318,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -461,37 +468,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's freezing and scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   (params->options &
+										VACOPT_DISABLE_PAGE_SKIPPING) != 0);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						vacrel->dbname, vacrel->relnamespace,
+						vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -501,13 +500,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -554,12 +554,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
+		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -604,6 +603,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -631,10 +633,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -829,13 +827,10 @@ static void
 lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_fsm_block_to_vacuum = 0;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -849,46 +844,27 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
-	for (blkno = 0; blkno < rel_pages; blkno++)
+	next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap);
+	while (next_block_to_scan < rel_pages)
 	{
+		BlockNumber blkno = next_block_to_scan;
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap);
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * visibilitymap_snap_next must always force us to scan the last page
+		 * in rel (in the range of rel_pages) so that VACUUM can avoid useless
+		 * attempts at rel truncation (per should_attempt_truncation comments)
+		 */
+		Assert(next_block_to_scan > blkno);
+		Assert(next_block_to_scan < rel_pages || blkno == rel_pages - 1);
 
 		vacrel->scanned_pages++;
 
-		/* Report as block scanned, update error traceback information */
+		/* Report all blocks < blkno as initial-heap-pass processed */
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
@@ -1025,12 +1001,24 @@ lazy_scan_heap(LVRelState *vacrel)
 		 */
 		lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
 
-		Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
-
 		/* Remember the location of the last page with nonremovable tuples */
 		if (prunestate.hastup)
 			vacrel->nonempty_pages = blkno + 1;
 
+		/*
+		 * Clear PD_ALL_VISIBLE (and page's visibility map bits) in the event
+		 * of lazy_scan_prune detecting an inconsistency
+		 */
+		if (unlikely(prunestate.pd_allvis_corrupt))
+		{
+			elog(WARNING, "page containing dead tuples has PD_ALL_VISIBLE set in relation \"%s\" page %u",
+				 vacrel->relname, blkno);
+			PageClearAllVisible(page);
+			MarkBufferDirty(buf);
+			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
+								VISIBILITYMAP_VALID_BITS);
+		}
+
 		if (vacrel->nindexes == 0)
 		{
 			/*
@@ -1089,10 +1077,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Set visibility map bits based on prunestate's instructions
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1102,34 +1089,36 @@ lazy_scan_heap(LVRelState *vacrel)
 				flags |= VISIBILITYMAP_ALL_FROZEN;
 			}
 
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
+			if (!PageIsAllVisible(page))
+			{
+				/*
+				 * We could avoid dirtying the page just to set PD_ALL_VISIBLE
+				 * when checksums are disabled.  It is very likely that the
+				 * heap page is already dirty anyway, so keep the rule simple:
+				 * always dirty a page when setting its PD_ALL_VISIBLE bit.
+				 */
+				PageSetAllVisible(page);
+				MarkBufferDirty(buf);
+			}
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, prunestate.visibility_cutoff_xid,
 							  flags);
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * When the page isn't eligible to become all-visible, we defensively
+		 * check that PD_ALL_VISIBLE agrees with the visibility map instead.
+		 * If there is disagreement then we clear both VM bits to repair.
+		 *
+		 * We don't expect (and deliberately avoid testing) mutual agreement;
+		 * it's okay for PD_ALL_VISIBLE to be set while both visibility map
+		 * bits remain unset (iff checksums are disabled).  It's even okay for
+		 * prunestate's all_visible flag to disagree with PD_ALL_VISIBLE here
+		 * (lazy_scan_prune's pd_allvis_corrupt comments explain why that is).
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
-				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
+		else if (!PageIsAllVisible(page) &&
+				 unlikely(visibilitymap_get_status(vacrel->rel, blkno,
+												   &vmbuffer) != 0))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1137,65 +1126,6 @@ lazy_scan_heap(LVRelState *vacrel)
 								VISIBILITYMAP_VALID_BITS);
 		}
 
-		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
-		 * set, however.
-		 */
-		else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
-		{
-			elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both prunestate fields.
-		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
-				 prunestate.all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-		{
-			/*
-			 * Avoid relying on all_visible_according_to_vm as a proxy for the
-			 * page-level PD_ALL_VISIBLE bit being set, since it might have
-			 * become stale -- even when all_visible is set in prunestate
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				PageSetAllVisible(page);
-				MarkBufferDirty(buf);
-			}
-
-			/*
-			 * Set the page all-frozen (and all-visible) in the VM.
-			 *
-			 * We can pass InvalidTransactionId as our visibility_cutoff_xid,
-			 * since a snapshotConflictHorizon sufficient to make everything
-			 * safe for REDO was logged when the page's tuples were frozen.
-			 */
-			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
-			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_VISIBLE |
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
-
 		/*
 		 * Final steps for block: drop cleanup lock, record free space in the
 		 * FSM
@@ -1232,12 +1162,13 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 	}
 
+	/* initial heap pass finished (final pass may still be required) */
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
 
-	/* report that everything is now scanned */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+	/* report all blocks as initial-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1254,20 +1185,26 @@ lazy_scan_heap(LVRelState *vacrel)
 
 	/*
 	 * Do index vacuuming (call each index's ambulkdelete routine), then do
-	 * related heap vacuuming
+	 * related heap vacuuming in final heap pass
 	 */
 	if (dead_items->num_items > 0)
 		lazy_vacuum(vacrel);
 
 	/*
-	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes, and whether or not we bypassed index vacuuming.
+	 * Now that both our initial heap pass and final heap pass (if any) have
+	 * ended, vacuum the Free Space Map. (Actually, similar FSM vacuuming will
+	 * have taken place earlier when VACUUM needed to call lazy_vacuum to deal
+	 * with running out of dead_items space.  Hopefully that will be rare.)
 	 */
-	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+	if (rel_pages > 0)
+	{
+		Assert(vacrel->scanned_pages > 0);
+		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+								rel_pages);
+	}
 
-	/* report all blocks vacuumed */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+	/* report all blocks as final-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
 
 	/* Do final index cleanup (call each index's amvacuumcleanup routine) */
 	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
@@ -1275,7 +1212,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1283,11 +1220,42 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
+ *
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0, which is the point of antiwraparound autovacuuming.
+ *
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but in practice they are not independent; the split is purely mechanical.
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
 
 	/*
 	 * Decide freezing strategy.
@@ -1295,125 +1263,160 @@ lazy_scan_strategy(LVRelState *vacrel)
 	 * The eager freezing strategy is used when the threshold controlled by
 	 * freeze_strategy_threshold GUC/reloption exceeds rel_pages.
 	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
 	 * Also freeze eagerly with an unlogged or temp table, where the total
 	 * cost of freezing each page is just the cycles needed to prepare a set
 	 * of freeze plans.  Executing the freeze plans adds very little cost.
 	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
 	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
 	vacrel->eager_freeze_strategy =
 		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold_nblocks ||
+		 vacrel->cutoffs.tableagefrac > TABLEAGEFRAC_HIGHPOINT ||
 		 !RelationIsPermanent(vacrel->rel));
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-		{
-			/* Caller shouldn't rely on all_visible_according_to_vm */
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide vmsnap scanning strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * First acquire a visibility map snapshot, which determines the number of
+	 * pages that each vmsnap scanning strategy is required to scan for us in
+	 * passing.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied based in part on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These also represent the minimum and maximum
+	 * thresholds that can ever make sense for a table of this size
+	 * (when the table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Even small tables (where lazy freezing is used) shouldn't have to do
+	 * dramatically more work than usual when advancing relfrozenxid, which
+	 * our policy of waiting for the right VACUUM largely avoids, in practice.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	if (vacrel->cutoffs.tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below table age mid point, so table age is
+		 * still of only minimal concern.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan some extra pages not exceeding 5% of rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (vacrel->cutoffs.tableagefrac <= TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages.  The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a percentage basis) by ~8.1% of
+		 * rel_pages for every additional 5%-of-tableagefrac increment added
+		 * (after tableagefrac has crossed the 50%-of-tableagefrac mid point,
+		 * until the 90%-of-tableagefrac high point is reached, when we switch
+		 * over to not caring about the added cost of eager freezing at all).
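+		 *
+		 * For example (purely illustrative, taking the young and old
+		 * thresholds to be 5% and ~70% of rel_pages, as described above):
+		 * with rel_pages = 1000 and tableagefrac = 0.7, nextra_scale works
+		 * out to 0.5, so nextra_toomany_threshold lands halfway between the
+		 * two thresholds: (50 * 0.5) + (700 * 0.5) = 375 pages.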
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - vacrel->cutoffs.tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age surpasses the high point, and so is approaching (or
+		 * may even surpass) the point that an antiwraparound autovacuum is
+		 * required.  Force VMSNAP_SCAN_EAGER, no matter how many extra pages
+		 * we'll be required to scan as a result (costs no longer matter).
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses this
+		 * 90%-of-tableagefrac high point: the threshold set here jumps from
+		 * 70% of rel_pages to 100% of rel_pages (MaxBlockNumber, actually).
+		 * It's useful to only care about table age once it gets this high.
+		 * That way even extreme cases will have at least some chance of using
+		 * eager scanning before an antiwraparound autovacuum is launched.
+		 */
+		nextra_toomany_threshold = MaxBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Make final choice on scanning strategy using final threshold */
+	nextra_toomany_threshold = Max(nextra_toomany_threshold, 32);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+	 * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+	 */
+	if (force_scan_all)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+
+	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
 
 /*
@@ -1633,6 +1636,7 @@ retry:
 	 */
 	prunestate->hastup = false;
 	prunestate->has_lpdead_items = false;
+	prunestate->pd_allvis_corrupt = false;
 	prunestate->all_visible = true;
 	prunestate->all_frozen = true;
 	prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1966,12 +1970,26 @@ retry:
 		prunestate->all_visible = false;
 	}
 
-	/* Finally, add page-local counts to whole-VACUUM counts */
+	/* Add page-local counts to whole-VACUUM counts */
 	vacrel->tuples_deleted += tuples_deleted;
 	vacrel->tuples_frozen += tuples_frozen;
 	vacrel->lpdead_items += lpdead_items;
 	vacrel->live_tuples += live_tuples;
 	vacrel->recently_dead_tuples += recently_dead_tuples;
+
+	/*
+	 * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
+	 * already set.  Check that now, to help caller maintain the VM correctly.
+	 *
+	 * We deliberately avoid indicating corruption when a tuple was found to
+	 * be HEAPTUPLE_INSERT_IN_PROGRESS on a page that has PD_ALL_VISIBLE set.
+	 * That would lead to false positives, since OldestXmin is conservative.
+	 * (It's possible that this VACUUM has an earlier OldestXmin than a VACUUM
+	 * that ran against the same table at some point in the recent past.)
+	 */
+	if (PageIsAllVisible(page) &&
+		(lpdead_items > 0 || tuples_deleted > 0 || recently_dead_tuples > 0))
+		prunestate->pd_allvis_corrupt = true;
 }
 
 /*
@@ -2503,6 +2521,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vacuumed_pages++;
 	}
 
+	/* final heap pass finished */
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
@@ -2846,6 +2865,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * for pages that it skips using the VM, so we must avoid wrongly interpreting
+ * skipped pages as empty pages.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3136,14 +3163,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3152,15 +3178,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
@@ -3182,12 +3206,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 74ff01bb1..379c1ba5b 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
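+ *
+ * A typical (purely illustrative) usage sequence on the VACUUM side looks
+ * like this (variable names here are just for illustration):
+ *
+ *		vmsnap = visibilitymap_snap_acquire(rel, rel_pages, &lazy, &eager);
+ *		visibilitymap_snap_strategy(vmsnap, chosen_strategy);
+ *		while ((blkno = visibilitymap_snap_next(vmsnap)) != InvalidBlockNumber)
+ *			... scan heap block blkno ...
+ *		visibilitymap_snap_release(vmsnap);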
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,81 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	BlockNumber staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -376,6 +458,354 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines the kind of blocks visibilitymap_snap_next should indicate
+ * need to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is just paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Should always have at least as many all_visible pages as all_frozen
+	 * pages.  Even still, we generally only interpret a page as all-frozen
+	 * when both the all-visible and all-frozen bits are set together.  Clamp
+	 * so that we'll avoid giving our caller an obviously bogus summary of the
+	 * visibility map when certain pages only have their all-frozen bit set.
+	 * More paranoia.
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	all_frozen = Min(all_frozen, all_visible);
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	vmsnap->scanned_pages_lazy = rel_pages - all_visible;
+	vmsnap->scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 *
+	 * As usual we expect that the all-frozen bit can only be set alongside
+	 * the all-visible bit (for any given page), but only interpret a page as
+	 * truly all-frozen when both of its VM bits are set together.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+	{
+		vmsnap->scanned_pages_lazy++;
+		if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+			vmsnap->scanned_pages_eager++;
+	}
+
+	*scanned_pages_lazy = vmsnap->scanned_pages_lazy;
+	*scanned_pages_eager = vmsnap->scanned_pages_eager;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, vmsnap->staged[i]);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.  We always return the final block (rel_pages - 1) here last.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap)
+{
+	BlockNumber next_block_to_scan;
+
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	next_block_to_scan = vmsnap->staged[vmsnap->next_return_idx++];
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(BlockNumber) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		BlockNumber prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -680,3 +1110,112 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				break;
+			}
+
+			/*
+			 * Never skip the final page, which must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		vmsnap->staged[vmsnap->first_invalid_idx++] = vmsnap->next_block++;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired we
+	 * defensively assume heapBlk not all-visible or all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		size_t		nread;
+
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		nread = BufFileRead(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		if (nread != BLCKSZ)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u of vmsnap temporary file: read only %zu of %zu bytes",
+							mapBlock, nread, (size_t) BLCKSZ)));
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index dcdccea03..de2e98368 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -970,11 +970,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				freeze_strategy_threshold;
 	uint64		threshold_nblocks;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1114,48 +1114,48 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold_nblocks = threshold_nblocks;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
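+	 * For example (purely illustrative): freeze_table_age = 300 million
+	 * with autovacuum_freeze_max_age = 200 million is clamped to 200
+	 * million.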
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
-	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
-	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * MXID table age (whichever is greater currently).
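+	 *
+	 * For example (purely illustrative): a table whose relfrozenxid is
+	 * 100 million XIDs old when freeze_table_age is 200 million gets a
+	 * tableagefrac of about 0.5 (assuming its MXID age is no higher).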
+	 */
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 615bee883..e8c6c13da 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2497,10 +2497,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2517,10 +2517,10 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
 			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6d8c76cf6..acdf7be61 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -660,6 +660,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -693,11 +700,9 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #bytea_output = 'hex'			# hex, escape
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b995c3824..c98e6c306 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9210,20 +9210,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, based on criteria that consider both
+         costs and benefits.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9292,19 +9300,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, based on criteria that consider both
+         costs and benefits.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..18c32983f 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -497,13 +497,6 @@
     <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
     XID and MXID values, including those from all-visible but not all-frozen pages.
     In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
    </para>
 
    <para>
@@ -533,27 +526,9 @@
     <varname>vacuum_freeze_min_age</varname>.
    </para>
 
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
    <para>
     The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
+    is that the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
     subdirectories of the database cluster will take more space, because it
     must store the commit status and (if <varname>track_commit_timestamp</varname> is
     enabled) timestamp of all transactions back to
@@ -630,7 +605,7 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
     advanced when every page of the table
     that might contain unfrozen XIDs is scanned.  This happens when
     <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
+    <varname>autovacuum_freeze_max_age</varname> transactions old, when
     <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
     pages that are not already all-frozen happen to
     require vacuuming to remove dead row versions. When <command>VACUUM</command>
@@ -648,6 +623,29 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
     be forced for the table.
    </para>
 
+   <tip>
+   <para>
+    <varname>vacuum_freeze_table_age</varname> can be used to override
+    <varname>autovacuum_freeze_max_age</varname> locally.
+    <command>VACUUM</command> will advance
+    <structfield>relfrozenxid</structfield> in the same way as it
+    would if <varname>autovacuum_freeze_max_age</varname> had been set to
+    the same value, without any direct impact on autovacuum
+    scheduling.
+   </para>
+   <para>
+    Prior to <productname>PostgreSQL</productname> 16,
+    <command>VACUUM</command> did not apply a cost model to decide
+    when to advance <structfield>relfrozenxid</structfield>, which
+    made <varname>vacuum_freeze_table_age</varname> an important
+    tunable setting.  This is no longer the case.  The revised
+    <varname>vacuum_freeze_table_age</varname> default of
+    <literal>-1</literal> makes <command>VACUUM</command> use
+    <varname>autovacuum_freeze_max_age</varname> as an input to its
+    cost model, which should be adequate in most environments.
+   </para>
+   </tip>
+
    <para>
     If for some reason autovacuum fails to clear old XIDs from a table, the
     system will begin to emit warning messages like this when the database's
@@ -720,12 +718,6 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      transaction ID, or a newer multixact ID.  For each table,
      <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
      possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
     </para>
 
     <para>
@@ -844,10 +836,22 @@ vacuum insert threshold = vacuum base insert threshold + vacuum insert scale fac
     <command>DELETE</command> and <command>INSERT</command> operation.  (It is
     only semi-accurate because some information might be lost under heavy
     load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
+    is more than <varname>autovacuum_freeze_max_age</varname> transactions old,
+    vacuum must freeze old tuples from existing all-visible pages to
+    be able to advance <structfield>relfrozenxid</structfield>;
+    otherwise, vacuum applies a cost model that advances
+    <structfield>relfrozenxid</structfield> whenever the added cost of
+    doing so during the ongoing operation is sufficiently low.
+    <varname>autovacuum_freeze_max_age</varname> only determines when
+    <command>VACUUM</command> must advance
+    <structfield>relfrozenxid</structfield> in the
+    worst case, which is often only weakly predictive of the actual
+    rate of advancement.  Much depends on workload characteristics.  A cost model
+    dynamically determines whether or not to advance
+    <structfield>relfrozenxid</structfield> at the start of each
+    <command>VACUUM</command>.  The model finds the most opportune
+    time by weighing the added cost of advancement against the age
+    that <structfield>relfrozenxid</structfield> has already attained.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 545b23b54..6ba4385a0 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,11 +158,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      all tuples are known to be frozen can always be skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.39.0

Attachment: v16-0001-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/x-patch)
From 5b4d2cdc6b96cf038d800552aa359bfeb0c48a32 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v16 1/3] Add eager and lazy freezing strategies to VACUUM.

Avoid large build-ups of all-visible pages by making non-aggressive
VACUUMs freeze pages proactively for VACUUMs/tables where eager
vacuuming is deemed appropriate.  VACUUM determines its freezing
strategy based on the value of the new vacuum_freeze_strategy_threshold
GUC (or reloption) in most cases: Tables that exceeds the size threshold
use the eager freezing strategy.  Otherwise VACUUM uses the lazy
freezing strategy,  which is essentially the same approach that VACUUM
has always taken to freezing (though not quite, due to the influence of
page level freezing following recent work).

When the eager strategy is in use, lazy_scan_prune will trigger freezing
a page's tuples at the point that it notices that it will at least
become all-visible -- it can be made all-frozen instead.  We still
respect FreezeLimit, though: the presence of any XID < FreezeLimit also
triggers page-level freezing (just as it would with the lazy strategy).
Eager freezing is generally only applied when vacuuming larger tables,
where freezing most individual heap pages at the first opportunity (in
the first VACUUM operation where they can definitely be set all-visible)
will improve performance stability.

A later commit will add vmsnap scanning strategies, which are designed
to work in tandem with the freezing strategies from this commit.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 |  9 ++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 11 +++++
 src/backend/access/heap/heapam.c              |  1 +
 src/backend/access/heap/vacuumlazy.c          | 43 ++++++++++++++++++-
 src/backend/commands/vacuum.c                 | 26 ++++++++++-
 src/backend/postmaster/autovacuum.c           | 10 +++++
 src/backend/utils/misc/guc_tables.c           | 11 +++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 +++++++-
 doc/src/sgml/ref/create_table.sgml            | 14 ++++++
 11 files changed, 143 insertions(+), 3 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb770..d900b1be1 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -222,6 +222,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in megabytes,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -274,6 +277,11 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold_nblocks;
 };
 
 /*
@@ -297,6 +305,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index af9785038..bcc5e589a 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -308,6 +308,7 @@ typedef struct AutoVacOpts
 	int			vacuum_ins_threshold;
 	int			analyze_threshold;
 	int			vacuum_cost_limit;
+	int			freeze_strategy_threshold;
 	int			freeze_min_age;
 	int			freeze_max_age;
 	int			freeze_table_age;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 14c23101a..e982d0e76 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -260,6 +260,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, 10000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy, in megabytes.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, 536870912
+	},
 	{
 		{
 			"autovacuum_freeze_min_age",
@@ -1851,6 +1860,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_cost_limit)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_min_age)},
 		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 388df94a4..95f4d59e3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7056,6 +7056,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.freeze_strategy_threshold_nblocks = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f..f9536e522 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,6 +153,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -243,6 +245,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -472,6 +475,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1267,6 +1274,38 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * totally avoids freezing that turns out to have been wasted effort later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	/*
+	 * Decide freezing strategy.
+	 *
+	 * The eager freezing strategy is used when rel_pages reaches the
+	 * threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly with an unlogged or temp table, where the total
+	 * cost of freezing each page is just the cycles needed to prepare a set
+	 * of freeze plans.  Executing the freeze plans adds very little cost.
+	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
+	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 */
+	vacrel->eager_freeze_strategy =
+		(rel_pages >= vacrel->cutoffs.freeze_strategy_threshold_nblocks ||
+		 !RelationIsPermanent(vacrel->rel));
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1795,10 +1834,12 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (might not happen until final heap pass).
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will thereby become all-frozen in the visibility map.
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
 		(prunestate->all_visible && prunestate->all_frozen &&
-		 fpi_before != pgWalUsage.wal_fpi))
+		 (fpi_before != pgWalUsage.wal_fpi || vacrel->eager_freeze_strategy)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127..dcdccea03 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -68,6 +68,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -273,6 +274,9 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.multixact_freeze_table_age = -1;
 	}
 
+	/* Determine freezing strategy later on using GUC or reloption */
+	params.freeze_strategy_threshold = -1;
+
 	/* user-invoked vacuum is never "for wraparound" */
 	params.is_wraparound = false;
 
@@ -962,7 +966,9 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
+	uint64		threshold_nblocks;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -975,6 +981,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1089,6 +1096,23 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+
+	/*
+	 * Convert MB-based GUC to nblocks value used within vacuumlazy.c, while
+	 * being careful to avoid overflow
+	 */
+	threshold_nblocks =
+		(uint64) freeze_strategy_threshold * 1024L * 1024L / BLCKSZ;
+	threshold_nblocks = Min(threshold_nblocks, MaxBlockNumber);
+	cutoffs->freeze_strategy_threshold_nblocks = threshold_nblocks;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c5..ecddde3a1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2877,6 +2886,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f8..615bee883 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2524,6 +2524,17 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy, in megabytes."),
+			NULL,
+			GUC_UNIT_MB
+		},
+		&vacuum_freeze_strategy_threshold,
+		4096, 0, 536870912,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cceda416..6d8c76cf6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -694,6 +694,7 @@
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
 #vacuum_freeze_table_age = 150000000
+#vacuum_freeze_strategy_threshold = 4GB
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
 #vacuum_multixact_freeze_table_age = 150000000
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77574e2d4..b995c3824 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9187,6 +9187,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in megabytes) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4096 megabytes (equivalent to <literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9222,7 +9237,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index a03dee4af..f97cc7084 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.39.0

In reply to: Peter Geoghegan (#80)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 16, 2023 at 10:10 AM Peter Geoghegan <pg@bowt.ie> wrote:

Attached is v16, which incorporates some of Matthias' feedback.

0001 (the freezing strategies patch) is now committable IMV. Or at
least will be once I polish the docs a bit more. I plan on committing
0001 some time next week, barring any objections.

I should point out that 0001 is far shorter and simpler than the
page-level freezing commit that already went in (commit 1de58df4). The
only thing in 0001 that seems like it might be a bit controversial
(when considered on its own) is the addition of the
vacuum_freeze_strategy_threshold GUC/reloption. Note in particular
that vacuum_freeze_strategy_threshold doesn't look like any other
existing GUC; it gets applied as a threshold on the size of the rel's
main fork at the beginning of vacuumlazy.c processing. As far as I
know there are no objections to that approach at this time, but it
does still seem worth drawing attention to now.
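
To make that concrete, the MB-based setting is converted into a number
of heap blocks once per VACUUM, and lazy_scan_strategy() then compares
that against the size of the table's main fork.  Roughly (this just
restates what the patch does, using its names):

    /* Once per VACUUM: convert the MB-based threshold into heap blocks */
    threshold_nblocks = (uint64) freeze_strategy_threshold * 1024 * 1024 / BLCKSZ;

    /* lazy_scan_strategy(): eager freezing for larger and non-permanent rels */
    vacrel->eager_freeze_strategy =
        (rel_pages >= threshold_nblocks ||
         !RelationIsPermanent(vacrel->rel));

With the default of 4096MB and 8KB blocks, that works out to 524288
heap pages.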

0001 also makes unlogged tables and temp tables always use eager
freezing strategy, no matter how the GUC/reloption are set. This seems
*very* easy to justify, since the potential downside of such a policy
is obviously extremely low, even when we make very pessimistic
assumptions. The usual cost we need to worry about when it comes to
freezing is the added WAL overhead -- that clearly won't apply when
we're vacuuming non-permanent tables. That really just leaves the cost
of dirtying extra pages, which in general could have a noticeable
system-level impact in the case of unlogged tables.

Dirtying extra pages when vacuuming an unlogged table is also a
non-issue. Even the eager freezing strategy only freezes "extra" pages
("extra" relative to the lazy strategy behavior) given a page that
will be set all-visible in any case [1]. Such a page will need to have
its page-level PD_ALL_VISIBLE bit set in any case -- which is already
enough to dirty the page. And so there can never be any additional
pages dirtied as a result of the special policy 0001 adds for
non-permanent relations.

[1]: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2
--
Peter Geoghegan

In reply to: Peter Geoghegan (#79)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 16, 2023 at 10:00 AM Peter Geoghegan <pg@bowt.ie> wrote:

Now we treat the scanning and freezing strategies as two independent
choices. Of course they're not independent in any practical sense, but
I think it's slightly simpler and more elegant that way -- it makes
the GUC vacuum_freeze_strategy_threshold strictly about freezing
strategy, while still leading to VACUUM advancing relfrozenxid in a
way that makes sense. It just happens as a second order effect. Why
add a special case?

This might be a better way to explain it:

The main page-level freezing commit (commit 1de58df4) already added an
optimization that triggers page-level freezing "early" (early relative
to vacuum_freeze_min_age). This happens whenever a page already needs
to have an FPI logged inside lazy_scan_prune -- even when we're using
the lazy freezing strategy. The optimization isn't configurable, and
gets applied regardless of freezing strategy (technically there is no
such thing as freezing strategies on HEAD just yet, though HEAD still
has this optimization).
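
In code terms, the trigger condition in lazy_scan_prune ends up looking
roughly like this once 0001 is applied (this just restates the condition
from the diff, shown here for clarity):

    if (pagefrz.freeze_required ||      /* XID/MXID from before FreezeLimit/MultiXactCutoff */
        tuples_frozen == 0 ||           /* nothing to freeze anyway */
        (prunestate->all_visible && prunestate->all_frozen &&
         (fpi_before != pgWalUsage.wal_fpi ||   /* pruning just wrote an FPI */
          vacrel->eager_freeze_strategy)))      /* eager strategy: page would become all-frozen */
    {
        /* freeze all eligible tuples on the page */
    }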

There will be workloads where the FPI optimization will result in
freezing many more pages -- especially when data checksums are in use
(since then we could easily need to log an FPI just so pruning can set
a hint bit). As a result, certain VACUUMs that use the lazy freezing
strategy will freeze in almost the same way as an equivalent VACUUM
using the eager freezing strategy. Such a "nominally lazy but actually
quite eager" VACUUM operation should get the same benefit in terms of
relfrozenxid advancement as it would if it really had used the eager
freezing strategy instead. It's fairly obvious that we'll get the same
benefit in relfrozenxid advancement (comparable relfrozenxid results
for comparable freezing work), since the way that VACUUM decides on
its scanning strategy is not conditioned on freezing strategy (whether
by the ongoing VACUUM or any other VACUUM against the same table).

All that matters is the conditions in the table (in particular the
added cost of opting for eager scanning over lazy scanning) as
indicated by the visibility map at the start of each VACUUM -- how
those conditions came about really isn't interesting at that point.
And so lazy_scan_strategy doesn't care about them when it chooses
VACUUM's scanning strategy.

There are even tables/workloads where relfrozenxid will be able to
jump forward by a huge amount whenever VACUUM chooses the eager
scanning strategy, despite the fact that VACUUM generally does little
or no freezing to make that possible:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3

--
Peter Geoghegan

#83 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Peter Geoghegan (#79)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 16, 2023 at 11:31 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think '(nextXID - cutoffs->relfrozenxid) / freeze_table_age' should
be the actual fraction right? What is the point of adding 0.5 to the
divisor? If there is a logical reason, maybe we can explain in the
comments.

It's just a way of avoiding division by zero.

oh, correct :)
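
So, roughly speaking (ignoring XID wraparound for the sake of the
example), the fraction works out to something like:

    tableagefrac = (double) (nextXID - cutoffs->relfrozenxid) /
                   (freeze_table_age + 0.5);

where adding 0.5 to the divisor just keeps a freeze_table_age of 0 from
dividing by zero.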

While looking into the logic of 'lazy_scan_strategy', I think the idea
looks very good but the only thing is that
we have kept eager freeze and eager scan completely independent.
Don't you think that if a table is chosen for an eager scan
then we should force the eager freezing as well?

Earlier versions of the patch kind of worked that way.
lazy_scan_strategy would actually use twice the GUC setting to
determine scanning strategy. That approach could make our "transition
from lazy to eager strategies" involve an excessive amount of
"catch-up freezing" in the VACUUM operation that advanced relfrozenxid
for the first time, which you see an example of here:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch

Now we treat the scanning and freezing strategies as two independent
choices. Of course they're not independent in any practical sense, but
I think it's slightly simpler and more elegant that way -- it makes
the GUC vacuum_freeze_strategy_threshold strictly about freezing
strategy, while still leading to VACUUM advancing relfrozenxid in a
way that makes sense. It just happens as a second order effect. Why
add a special case?

I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
strictly for freezing. But the point is that the eager scanning
strategy is driven by the table's freezing needs (tableagefrac), which
makes sense; but if we have selected eager scanning based on the
table's age and its freezing needs, then why don't we force eager
freezing as well, since eager scanning is selected to satisfy the
freezing need in the first place? But OTOH, eager scanning might get
selected because it appears that we would not have to scan too many
extra pages compared to a lazy scan, and in those cases forcing eager
freezing might not be wise. So maybe it is a good idea to keep them the
way you have in your patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In reply to: Dilip Kumar (#83)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
strictly for freezing. But the point is that the eager scanning
strategy is driven by the table's freezing needs (tableagefrac), which
makes sense; but if we have selected eager scanning based on the
table's age and its freezing needs, then why don't we force eager
freezing as well, since eager scanning is selected to satisfy the
freezing need in the first place?

Don't think of eager scanning as the new name for aggressive mode --
it's a fairly different concept, because we care about costs now.
Eager scanning can be chosen just because it's very cheap relative to
the alternative of lazy scanning, even when relfrozenxid is still very
recent. (This kind of behavior isn't really new [1], but the exact
implementation from the patch is new.)

Tables such as pgbench_branches and pgbench_tellers will reliably use
eager scanning strategy, no matter how any GUC has been set -- just
because the added cost is always zero (relative to lazy scanning). It
really doesn't matter how far along tableagefrac is here, ever. These
same tables will never use eager freezing strategy, unless the
vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
another example of how scanning strategy and freezing strategy may
differ for the same table.)

You do have a good point, though. I think that I know what you mean.
Note that antiwraparound autovacuums (or VACUUMs of tables very near
to that point) *will* always use both the eager freezing strategy and
the eager scanning strategy -- which is probably close to what you
meant.

The important point is that there can be more than one reason to
prefer one strategy to another -- and the reasons can be rather
different. Occasionally it'll be a combination of two factors together
that push things in favor of one strategy over the other -- even
though either factor on its own would not have resulted in the same
choice.

[1]: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Constantly_updated_tables_.28usually_smaller_tables.29
--
Peter Geoghegan

#85 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Peter Geoghegan (#84)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jan 16, 2023 at 8:13 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think that it makes sense to keep 'vacuum_freeze_strategy_threshold'
strictly for freezing. But the point is that the eager scanning
strategy is driven by the table's freezing needs (tableagefrac), which
makes sense; but if we have selected eager scanning based on the
table's age and its freezing needs, then why don't we force eager
freezing as well, since eager scanning is selected to satisfy the
freezing need in the first place?

Don't think of eager scanning as the new name for aggressive mode --
it's a fairly different concept, because we care about costs now.
Eager scanning can be chosen just because it's very cheap relative to
the alternative of lazy scanning, even when relfrozenxid is still very
recent. (This kind of behavior isn't really new [1], but the exact
implementation from the patch is new.)

Tables such as pgbench_branches and pgbench_tellers will reliably use
eager scanning strategy, no matter how any GUC has been set -- just
because the added cost is always zero (relative to lazy scanning). It
really doesn't matter how far along tableagefrac is here, ever. These
same tables will never use eager freezing strategy, unless the
vacuum_freeze_strategy_threshold GUC is misconfigured. (This is
another example of how scanning strategy and freezing strategy may
differ for the same table.)

Yes, I agree with that. Thanks for explaining in detail.

You do have a good point, though. I think that I know what you mean.
Note that antiwraparound autovacuums (or VACUUMs of tables very near
to that point) *will* always use both the eager freezing strategy and
the eager scanning strategy -- which is probably close to what you
meant.

Right

The important point is that there can be more than one reason to
prefer one strategy to another -- and the reasons can be rather
different. Occasionally it'll be a combination of two factors together
that push things in favor of one strategy over the other -- even
though either factor on its own would not have resulted in the same
choice.

Yes, that makes sense to me.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#86 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#85)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 18, 2023 at 1:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 17, 2023 at 10:05 AM Peter Geoghegan <pg@bowt.ie> wrote:

My final set of comments for 0002

1.
+struct vmsnapshot
+{
+    /* Target heap rel */
+    Relation    rel;
+    /* Scanning strategy used by VACUUM operation */
+    vmstrategy    strat;
+    /* Per-strategy final scanned_pages */
+    BlockNumber rel_pages;
+    BlockNumber scanned_pages_lazy;
+    BlockNumber scanned_pages_eager;

I do not see much use in maintaining these two 'scanned_pages_lazy'
and 'scanned_pages_eager' variables; I think just maintaining
'scanned_pages' should be sufficient. I also do not see where they are
really used in the patches. lazy_scan_strategy() uses these variables,
but it just gets their values as out parameters from
visibilitymap_snap_acquire(). And visibilitymap_snap_strategy() also
uses them, but there it seems we just need the final 'scanned_pages'
result instead of these two variables.

2.

+#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE        0.70    /* 70% of rel_pages */

What is the logic behind 5% and 70%? Are those based on some
experiments? Should those be tuning parameters, so that if we realise
with real world use cases that it would be good for the eager scan to
be selected more or less frequently, we can tune those parameters?

3.
+    /*
+     * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+     * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+     */
+    if (force_scan_all)
+        vacrel->vmstrat = VMSNAP_SCAN_ALL;

I think this should be moved to be the first if case; I mean, why do
all the calculations based on 'tableagefrac' and 'TABLEAGEFRAC_XXPOINT'
if we are forced to scan them all? I agree the extra computation we are
doing might not really matter compared to the vacuum work we are going
to perform, but it still seems logical to me to do the simple check
first.

4. Should we move prefetching into a separate patch, instead of
merging it with the scanning strategy?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In reply to: Dilip Kumar (#86)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 23, 2023 at 3:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

My final set of comments for 0002

Thanks for the review!

I do not see much use in maintaining these two 'scanned_pages_lazy'
and 'scanned_pages_eager' variables; I think just maintaining
'scanned_pages' should be sufficient. I also do not see where they are
really used in the patches.

I agree that the visibility map snapshot struct could stand to be
cleaned up -- some of that state may not be needed, and it wouldn't be
that hard to use memory a little more economically, particularly with
very small tables. It's on my TODO list already.

+#define MAX_PAGES_YOUNG_TABLEAGE    0.05    /* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE        0.70    /* 70% of rel_pages */

What is the logic behind 5% and 70%? Are those based on some
experiments? Should those be tuning parameters, so that if we realise
with real world use cases that it would be good for the eager scan to
be selected more or less frequently, we can tune those parameters?

The specific multiplier constants chosen (for
MAX_PAGES_YOUNG_TABLEAGE and MAX_PAGES_OLD_TABLEAGE) were based on
both experiments and intuition. The precise values could be somewhat
different without it really mattering, though. For example, with a
table like pgbench_history (which is a really important case for the
patch in general), there won't be any all-visible pages at all (at
least after a short while), so it won't matter what these constants
are -- eager scanning will always be chosen.

I don't think that they should be parameters. The useful parameter for
users remains vacuum_freeze_table_age/autovacuum_freeze_max_age (note
that vacuum_freeze_table_age usually gets its value from
autovacuum_freeze_max_age due to changes in 0002). Like today,
vacuum_freeze_table_age forces VACUUM to scan all not-all-frozen pages
so that relfrozenxid can be advanced. Unlike today, it forces eager
scanning (not aggressive mode). But even long before eager scanning is
*forced*, pressure to use eager scanning gradually builds. That
pressure will usually cause some VACUUM to use eager scanning before
it's strictly necessary. Overall,
vacuum_freeze_table_age/autovacuum_freeze_max_age now provide loose
guidance.

It really has to be loose in this sense in order for
lazy_scan_strategy() to have the freedom to do the right thing based
on the characteristics of the table as a whole, according to its
visibility map snapshot. This allows lazy_scan_strategy() to stumble
upon once-off opportunities to advance relfrozenxid inexpensively,
including cases where it could never happen with the current model.
These opportunities are side-effects of workload characteristics that
can be hard to predict [1][2].

I think this should be moved to be the first if case; I mean, why do
all the calculations based on 'tableagefrac' and 'TABLEAGEFRAC_XXPOINT'
if we are forced to scan them all? I agree the extra computation we are
doing might not really matter compared to the vacuum work we are going
to perform, but it still seems logical to me to do the simple check
first.

This is only needed for DISABLE_PAGE_SKIPPING, which is an escape
hatch option that is never supposed to be needed. I don't think that
it's worth going to the trouble of indenting the code more just so
this is avoided -- it really is an afterthought. Besides, the compiler
might well be doing this for us.

4. Should we move prefetching into a separate patch, instead of
merging it with the scanning strategy?

I don't think that breaking that out would be an improvement. A lot of
the prefetching stuff informs how the visibility map code is
structured.

[1]: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_3
[2]: https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Opportunistically_advancing_relfrozenxid_with_bursty.2C_real-world_workloads
--
Peter Geoghegan

In reply to: Peter Geoghegan (#81)
3 attachment(s)
Re: New strategies for freezing, advancing relfrozenxid early

On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

0001 (the freezing strategies patch) is now committable IMV. Or at
least will be once I polish the docs a bit more. I plan on committing
0001 some time next week, barring any objections.

I plan on committing 0001 (the freezing strategies commit) tomorrow
morning, US Pacific time.

Attached is v17. There are no significant differences compared to v16.
I decided to post a new version now, ahead of commit, to show how I've
cleaned up the docs in 0001 -- docs describing the new GUC, freeze
strategies, and so on.

--
Peter Geoghegan

Attachments:

v17-0001-Add-eager-and-lazy-freezing-strategies-to-VACUUM.patch (application/x-patch)
From e41d3f45fcd6f639b768c22139006ad11422575f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v17 1/3] Add eager and lazy freezing strategies to VACUUM.

Eager freezing strategy avoids large build-ups of all-visible pages.  It
makes VACUUM trigger page-level freezing whenever doing so will enable
the page to become all-frozen in the visibility map.  This is useful for
tables that experience continual growth, particularly strict append-only
tables such as pgbench's history table.  Eager freezing significantly
improves performance stability by spreading out the cost of freezing
over time, rather than doing most freezing during aggressive VACUUMs.
It complements the insert autovacuum mechanism added by commit b07642db.

VACUUM determines its freezing strategy based on the value of the new
vacuum_freeze_strategy_threshold GUC (or reloption) for logged tables;
tables that exceed the size threshold use the eager freezing strategy.
Unlogged tables and temp tables will always use eager freezing strategy,
since there is essentially no downside.  Our policy for non-permanent
relations results in no extra WAL writes, and no extra dirtying of pages
(freezing doesn't need to be WAL-logged here, plus eager freezing can
only affect pages that already need to have PD_ALL_VISIBLE set).

VACUUM uses lazy freezing strategy for logged tables that fall under the
GUC size threshold.  Page-level freezing triggers based on the criteria
established in commit 1de58df4, which added basic page-level freezing.
Note that even lazy freezing strategy will trigger freezing whenever a
page happens to have required that an FPI be written during pruning.

Eager freezing is strictly more aggressive than lazy freezing.  Settings
like vacuum_freeze_min_age still get applied in just the same way in
every VACUUM, independent of the strategy in use.  The only mechanical
difference between eager and lazy freezing strategies is that only the
former applies its own additional criteria to trigger freezing pages.

The vacuum_freeze_strategy_threshold default is 4096 megabytes (4 GiB).
This relatively low default setting prioritizes performance stability.
It will be reviewed at the end of the Postgres 16 beta period.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/commands/vacuum.h                 | 12 +++++
 src/include/utils/rel.h                       |  1 +
 src/backend/access/common/reloptions.c        | 12 +++++
 src/backend/access/heap/heapam.c              |  1 +
 src/backend/access/heap/vacuumlazy.c          | 43 +++++++++++++++-
 src/backend/commands/vacuum.c                 | 25 +++++++++-
 src/backend/postmaster/autovacuum.c           | 10 ++++
 src/backend/utils/misc/guc_tables.c           | 14 ++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 19 ++++++-
 doc/src/sgml/maintenance.sgml                 | 50 +++++++++++++++----
 doc/src/sgml/ref/create_table.sgml            | 14 ++++++
 12 files changed, 190 insertions(+), 12 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb770..50cc6fce5 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -191,6 +191,9 @@ typedef struct VacAttrStats
 #define VACOPT_SKIP_DATABASE_STATS 0x100	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200	/* only vac_update_datfrozenxid() */
 
+/* Absolute maximum of VacuumParams->freeze_strategy_threshold is 512TB */
+#define MAX_VACUUM_THRESHOLD 536870912
+
 /*
  * Values used by index_cleanup and truncate params.
  *
@@ -222,6 +225,9 @@ typedef struct VacuumParams
 											 * use default */
 	int			multixact_freeze_table_age; /* multixact age at which to scan
 											 * whole table */
+	int			freeze_strategy_threshold;	/* threshold to use eager
+											 * freezing, in megabytes,
+											 * -1 to use default */
 	bool		is_wraparound;	/* force a for-wraparound vacuum */
 	int			log_min_duration;	/* minimum execution threshold in ms at
 									 * which autovacuum is logged, -1 to use
@@ -274,6 +280,11 @@ struct VacuumCutoffs
 	 */
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
+
+	/*
+	 * Threshold that triggers VACUUM's eager freezing strategy
+	 */
+	BlockNumber freeze_strategy_threshold_pages;
 };
 
 /*
@@ -297,6 +308,7 @@ extern PGDLLIMPORT int vacuum_freeze_min_age;
 extern PGDLLIMPORT int vacuum_freeze_table_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
 extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
+extern PGDLLIMPORT int vacuum_freeze_strategy_threshold;
 extern PGDLLIMPORT int vacuum_failsafe_age;
 extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index af9785038..39c7ccf0c 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -314,6 +314,7 @@ typedef struct AutoVacOpts
 	int			multixact_freeze_min_age;
 	int			multixact_freeze_max_age;
 	int			multixact_freeze_table_age;
+	int			freeze_strategy_threshold;
 	int			log_min_duration;
 	float8		vacuum_cost_delay;
 	float8		vacuum_scale_factor;
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 14c23101a..54ac90ff1 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -27,6 +27,7 @@
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/tablespace.h"
+#include "commands/vacuum.h"
 #include "commands/view.h"
 #include "nodes/makefuncs.h"
 #include "postmaster/postmaster.h"
@@ -312,6 +313,15 @@ static relopt_int intRelOpts[] =
 			ShareUpdateExclusiveLock
 		}, -1, 0, 2000000000
 	},
+	{
+		{
+			"autovacuum_freeze_strategy_threshold",
+			"Table size at which VACUUM freezes using eager strategy, in megabytes.",
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
+		},
+		-1, 0, MAX_VACUUM_THRESHOLD
+	},
 	{
 		{
 			"log_autovacuum_min_duration",
@@ -1863,6 +1873,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_max_age)},
 		{"autovacuum_multixact_freeze_table_age", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, multixact_freeze_table_age)},
+		{"autovacuum_freeze_strategy_threshold", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, freeze_strategy_threshold)},
 		{"log_autovacuum_min_duration", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, log_min_duration)},
 		{"toast_tuple_target", RELOPT_TYPE_INT,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 388df94a4..152f6c2d6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7056,6 +7056,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.freeze_strategy_threshold_pages = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f..03ea36624 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -153,6 +153,8 @@ typedef struct LVRelState
 	bool		aggressive;
 	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
 	bool		skipwithvm;
+	/* Eagerly freeze all tuples on pages about to be set all-visible? */
+	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
 	bool		failsafe_active;
 	/* Consider index vacuuming bypass optimization? */
@@ -243,6 +245,7 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
+static void lazy_scan_strategy(LVRelState *vacrel);
 static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
 								  BlockNumber next_block,
 								  bool *next_unskippable_allvis,
@@ -472,6 +475,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 
 	vacrel->skipwithvm = skipwithvm;
 
+	/*
+	 * Now determine VACUUM's freezing strategy.
+	 */
+	lazy_scan_strategy(vacrel);
 	if (verbose)
 	{
 		if (vacrel->aggressive)
@@ -1267,6 +1274,38 @@ lazy_scan_heap(LVRelState *vacrel)
 		lazy_cleanup_all_indexes(vacrel);
 }
 
+/*
+ *	lazy_scan_strategy() -- Determine freezing strategy.
+ *
+ * Our lazy freezing strategy is useful when putting off the work of freezing
+ * totally avoids freezing that turns out to have been wasted effort later on.
+ * Our eager freezing strategy is useful with larger tables that experience
+ * continual growth, where freezing pages proactively is needed just to avoid
+ * falling behind on freezing (eagerness is also likely to be cheaper in the
+ * short/medium term for such tables, but the long term picture matters most).
+ */
+static void
+lazy_scan_strategy(LVRelState *vacrel)
+{
+	BlockNumber rel_pages = vacrel->rel_pages;
+
+	/*
+	 * Decide freezing strategy.
+	 *
+	 * The eager freezing strategy is used whenever rel_pages exceeds a
+	 * threshold controlled by the freeze_strategy_threshold GUC/reloption.
+	 *
+	 * Also freeze eagerly with an unlogged or temp table, where the total
+	 * cost of freezing pages is mostly just the cycles needed to prepare a
+	 * set of freeze plans.  Executing the freeze plans adds very little cost.
+	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
+	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 */
+	vacrel->eager_freeze_strategy =
+		(rel_pages > vacrel->cutoffs.freeze_strategy_threshold_pages ||
+		 !RelationIsPermanent(vacrel->rel));
+}
+
 /*
  *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
  *
@@ -1795,10 +1834,12 @@ retry:
 	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
 	 * freeze when pruning generated an FPI, if doing so means that we set the
 	 * page all-frozen afterwards (might not happen until final heap pass).
+	 * When ongoing VACUUM opted to use the eager freezing strategy, we freeze
+	 * any page that will thereby become all-frozen in the visibility map.
 	 */
 	if (pagefrz.freeze_required || tuples_frozen == 0 ||
 		(prunestate->all_visible && prunestate->all_frozen &&
-		 fpi_before != pgWalUsage.wal_fpi))
+		 (fpi_before != pgWalUsage.wal_fpi || vacrel->eager_freeze_strategy)))
 	{
 		/*
 		 * We're freezing the page.  Our final NewRelfrozenXid doesn't need to
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127..62bb87846 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -68,6 +68,7 @@ int			vacuum_freeze_min_age;
 int			vacuum_freeze_table_age;
 int			vacuum_multixact_freeze_min_age;
 int			vacuum_multixact_freeze_table_age;
+int			vacuum_freeze_strategy_threshold;
 int			vacuum_failsafe_age;
 int			vacuum_multixact_failsafe_age;
 
@@ -264,6 +265,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.freeze_table_age = 0;
 		params.multixact_freeze_min_age = 0;
 		params.multixact_freeze_table_age = 0;
+		params.freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -271,6 +273,7 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		params.freeze_table_age = -1;
 		params.multixact_freeze_min_age = -1;
 		params.multixact_freeze_table_age = -1;
+		params.freeze_strategy_threshold = -1;
 	}
 
 	/* user-invoked vacuum is never "for wraparound" */
@@ -962,7 +965,9 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				multixact_freeze_min_age,
 				freeze_table_age,
 				multixact_freeze_table_age,
-				effective_multixact_freeze_max_age;
+				effective_multixact_freeze_max_age,
+				freeze_strategy_threshold;
+	uint64		threshold_strategy_pages;
 	TransactionId nextXID,
 				safeOldestXmin,
 				aggressiveXIDCutoff;
@@ -975,6 +980,7 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	multixact_freeze_min_age = params->multixact_freeze_min_age;
 	freeze_table_age = params->freeze_table_age;
 	multixact_freeze_table_age = params->multixact_freeze_table_age;
+	freeze_strategy_threshold = params->freeze_strategy_threshold;
 
 	/* Set pg_class fields in cutoffs */
 	cutoffs->relfrozenxid = rel->rd_rel->relfrozenxid;
@@ -1089,6 +1095,23 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	if (MultiXactIdPrecedes(cutoffs->OldestMxact, cutoffs->MultiXactCutoff))
 		cutoffs->MultiXactCutoff = cutoffs->OldestMxact;
 
+	/*
+	 * Determine the freeze_strategy_threshold to use: as specified by the
+	 * caller, or vacuum_freeze_strategy_threshold
+	 */
+	if (freeze_strategy_threshold < 0)
+		freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
+	Assert(freeze_strategy_threshold >= 0);
+
+	/*
+	 * Convert MB-based GUC to page-based value used within vacuumlazy.c,
+	 * while being careful to avoid overflow
+	 */
+	threshold_strategy_pages =
+		(uint64) freeze_strategy_threshold * 1024 * 1024 / BLCKSZ;
+	threshold_strategy_pages = Min(threshold_strategy_pages, MaxBlockNumber);
+	cutoffs->freeze_strategy_threshold_pages = threshold_strategy_pages;
+
 	/*
 	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
 	 *
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c5..ecddde3a1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -151,6 +151,7 @@ static int	default_freeze_min_age;
 static int	default_freeze_table_age;
 static int	default_multixact_freeze_min_age;
 static int	default_multixact_freeze_table_age;
+static int	default_freeze_strategy_threshold;
 
 /* Memory context for long-lived data */
 static MemoryContext AutovacMemCxt;
@@ -2010,6 +2011,7 @@ do_autovacuum(void)
 		default_freeze_table_age = 0;
 		default_multixact_freeze_min_age = 0;
 		default_multixact_freeze_table_age = 0;
+		default_freeze_strategy_threshold = 0;
 	}
 	else
 	{
@@ -2017,6 +2019,7 @@ do_autovacuum(void)
 		default_freeze_table_age = vacuum_freeze_table_age;
 		default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
 		default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
+		default_freeze_strategy_threshold = vacuum_freeze_strategy_threshold;
 	}
 
 	ReleaseSysCache(tuple);
@@ -2801,6 +2804,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		int			freeze_table_age;
 		int			multixact_freeze_min_age;
 		int			multixact_freeze_table_age;
+		int			freeze_strategy_threshold;
 		int			vac_cost_limit;
 		double		vac_cost_delay;
 		int			log_min_duration;
@@ -2850,6 +2854,11 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 			? avopts->multixact_freeze_table_age
 			: default_multixact_freeze_table_age;
 
+		freeze_strategy_threshold = (avopts &&
+									 avopts->freeze_strategy_threshold >= 0)
+			? avopts->freeze_strategy_threshold
+			: default_freeze_strategy_threshold;
+
 		tab = palloc(sizeof(autovac_table));
 		tab->at_relid = relid;
 		tab->at_sharedrel = classForm->relisshared;
@@ -2877,6 +2886,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 		tab->at_params.freeze_table_age = freeze_table_age;
 		tab->at_params.multixact_freeze_min_age = multixact_freeze_min_age;
 		tab->at_params.multixact_freeze_table_age = multixact_freeze_table_age;
+		tab->at_params.freeze_strategy_threshold = freeze_strategy_threshold;
 		tab->at_params.is_wraparound = wraparound;
 		tab->at_params.log_min_duration = log_min_duration;
 		tab->at_vacuum_cost_limit = vac_cost_limit;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4ac808ed2..7a78d98d3 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2535,6 +2535,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"vacuum_freeze_strategy_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Table size at which VACUUM freezes using eager strategy, in megabytes."),
+			gettext_noop("This is applied by comparing it to the size of a table's main fork at "
+						 "the beginning of each VACUUM. Eager freezing strategy is used when size "
+						 "the beginning of each VACUUM. The eager freezing strategy is used when "
+						 "the size exceeds the threshold, or when the table is temporary or unlogged. "
+						 "Otherwise the lazy freezing strategy is used."),
+		},
+		&vacuum_freeze_strategy_threshold,
+		4096, 0, MAX_VACUUM_THRESHOLD,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_defer_cleanup_age", PGC_SIGHUP, REPLICATION_PRIMARY,
 			gettext_noop("Number of transactions by which VACUUM and HOT cleanup should be deferred, if any."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index d06074b86..fda695e75 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -700,6 +700,7 @@
 #vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
+#vacuum_freeze_strategy_threshold = 4GB
 #bytea_output = 'hex'			# hex, escape
 #xmlbinary = 'base64'
 #xmloption = 'content'
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f985afc00..39480c653 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9225,6 +9225,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-vacuum-freeze-strategy-threshold" xreflabel="vacuum_freeze_strategy_threshold">
+      <term><varname>vacuum_freeze_strategy_threshold</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>vacuum_freeze_strategy_threshold</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the cutoff size (in megabytes) that <command>VACUUM</command>
+        should use to decide whether to apply its eager freezing strategy.
+        The default is 4096 megabytes (equivalent to <literal>4GB</literal>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-vacuum-freeze-table-age" xreflabel="vacuum_freeze_table_age">
       <term><varname>vacuum_freeze_table_age</varname> (<type>integer</type>)
       <indexterm>
@@ -9260,7 +9275,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        <para>
         Specifies the cutoff age (in transactions) that
         <command>VACUUM</command> should use to decide whether to
-        trigger freezing of pages that have an older XID.
+        trigger freezing of pages that have an older XID.  When VACUUM
+        uses its eager freezing strategy, freezing a page can also be
+        triggered when the page contains only all-visible tuples.
         The default is 50 million transactions.  Although
         users can set this value anywhere from zero to one billion,
         <command>VACUUM</command> will silently limit the effective value to half
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 759ea5ac9..8d762bad2 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -478,15 +478,47 @@
    </note>
 
    <para>
-    <xref linkend="guc-vacuum-freeze-min-age"/>
-    controls how old an XID value has to be before rows bearing that XID will be
-    frozen.  Increasing this setting may avoid unnecessary work if the
-    rows that would otherwise be frozen will soon be modified again,
-    but decreasing this setting increases
-    the number of transactions that can elapse before the table must be
-    vacuumed again.
+    <xref linkend="guc-vacuum-freeze-strategy-threshold"/> controls
+    <command>VACUUM</command>'s freezing strategy.  The
+    <firstterm>eager freezing strategy</firstterm> freezes all tuples
+    on a page when they are considered visible to all current
+    transactions.  The goal is to freeze pages earlier, in batches, to
+    spread out the overhead of freezing over time, improving
+    system-level performance stability.  The <firstterm>lazy freezing
+     strategy</firstterm> determines whether each page is to be frozen
+    largely on the basis of the age of the oldest extant XID on the
+    page.  The goal is to avoid wholly unnecessary freezing.
+    Increasing <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+    may avoid unnecessary work if the pages that would otherwise be
+    frozen will soon be modified again, but decreasing this setting
+    increases the risk of an eventual <command>VACUUM</command> that
+    must perform an excessive amount of <quote>catch up</quote>
+    freezing.
    </para>
 
+   <para>
+    <xref linkend="guc-vacuum-freeze-min-age"/> controls how old an
+    XID value has to be before pages with rows bearing that XID are
+    frozen.  This setting is an additional trigger criterion for
+    freezing a page's tuples, and is used by both freezing strategies.
+    Unlogged relations always use the eager freezing strategy.  There is
+    also an optimization that makes <command>VACUUM</command> trigger
+    freezing of pages whenever a full page image is logged (see <xref
+     linkend="wal-reliability"/>), which aims to avoid another full
+    page image for the same page later on.
+    </para>
+
+   <note>
+    <para>
+     In <productname>PostgreSQL</productname> versions before 16, all
+     freezing was triggered by
+     <varname>vacuum_freeze_min_age</varname>.  Newer versions trigger
+     freezing with the goal of finding the most opportune time to
+     freeze, spreading out the cost over multiple
+     <command>VACUUM</command> operations.
+    </para>
+   </note>
+
    <para>
     <command>VACUUM</command> uses the <link linkend="storage-vm">visibility map</link>
     to determine which pages of a table must be scanned.  Normally, it
@@ -837,8 +869,8 @@ vacuum insert threshold = vacuum base insert threshold + vacuum insert scale fac
     For tables which receive <command>INSERT</command> operations but no or
     almost no <command>UPDATE</command>/<command>DELETE</command> operations,
     it may be beneficial to lower the table's
-    <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
-    tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
+    <xref linkend="reloption-autovacuum-freeze-strategy-threshold"/>
+    to allow freezing to take place proactively.  The number of obsolete tuples and
     the number of inserted tuples are obtained from the cumulative statistics system;
     it is a semi-accurate count updated by each <command>UPDATE</command>,
     <command>DELETE</command> and <command>INSERT</command> operation.  (It is
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index a03dee4af..f97cc7084 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1682,6 +1682,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
     </listitem>
    </varlistentry>
 
+   <varlistentry id="reloption-autovacuum-freeze-strategy-threshold" xreflabel="autovacuum_freeze_strategy_threshold">
+    <term><literal>autovacuum_freeze_strategy_threshold</literal>, <literal>toast.autovacuum_freeze_strategy_threshold</literal> (<type>integer</type>)
+    <indexterm>
+     <primary><varname>autovacuum_freeze_strategy_threshold</varname> storage parameter</primary>
+    </indexterm>
+    </term>
+    <listitem>
+     <para>
+      Per-table value for <xref linkend="guc-vacuum-freeze-strategy-threshold"/>
+      parameter.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="reloption-autovacuum-freeze-min-age" xreflabel="autovacuum_freeze_min_age">
     <term><literal>autovacuum_freeze_min_age</literal>, <literal>toast.autovacuum_freeze_min_age</literal> (<type>integer</type>)
     <indexterm>
-- 
2.39.0

Attachment: v17-0002-Add-eager-and-lazy-VM-strategies-to-VACUUM.patch (application/x-patch)
From 13be272fb94d90d1bf0828e6513eb85bd138636d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 18 Jul 2022 14:35:44 -0700
Subject: [PATCH v17 2/3] Add eager and lazy VM strategies to VACUUM.

Acquire an in-memory immutable "snapshot" of the target rel's visibility
map at the start of each VACUUM.  This is a local copy of the visibility
map, which can spill to a temp file as and when required.  Tables that
are small enough to only need a single visibility map page don't need to
use a temp file.  VACUUM now uses its VM snapshot
(not the authoritative VM) to determine which pages to scan.  VACUUM no
longer scans pages that were concurrently unset in the VM, since all of
the pages it will scan are known and fixed before scanning even begins.

VACUUM decides on its VM snapshot scanning strategy up-front, based on
information about costs taken from the snapshot, and relfrozenxid age.
Lazy scanning allows VACUUM to skip all-visible pages, whereas eager
scanning allows VACUUM to advance relfrozenxid.  This works in tandem
with VACUUM's freezing strategies.

This work often results in VACUUM advancing relfrozenxid at a cadence
that is driven by underlying physical costs, not table age (through
settings like autovacuum_freeze_max_age).  Antiwraparound autovacuums
will be far less common as a result.  Freezing now drives relfrozenxid,
rather than relfrozenxid driving freezing.  Even tables that always use
lazy freezing will have a decent chance of relfrozenxid advancement long
before table age nears autovacuum_freeze_max_age.

This also lays the groundwork for completely removing aggressive mode
VACUUMs in a later commit.  Scanning strategies now supersede the "early
aggressive VACUUM" concept implemented by vacuum_freeze_table_age, which
is now just a compatibility option (its new default of -1 is interpreted
as "just use autovacuum_freeze_max_age").  For now VACUUM will still
condition its cleanup lock wait behavior on being in aggressive mode.

Also add explicit I/O prefetching of heap pages, which is controlled by
maintenance_io_concurrency.  We prefetch at the point that the next
block in line is requested by VACUUM.  Prefetching is under the direct
control of the visibility map snapshot code, since VACUUM's vmsnap is
now an authoritative guide to which pages VACUUM will scan.

Prefetching should totally avoid the loss of performance that might
otherwise result from removing SKIP_PAGES_THRESHOLD in this commit.
SKIP_PAGES_THRESHOLD was intended to force OS readahead and encourage
relfrozenxid advancement.  See commit bf136cf6 from around the time the
visibility map first went in for full details.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: John Naylor <john.naylor@enterprisedb.com>
Reviewed-By: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com
---
 src/include/access/visibilitymap.h            |  17 +
 src/include/commands/vacuum.h                 |  20 +-
 src/backend/access/heap/heapam.c              |   1 +
 src/backend/access/heap/vacuumlazy.c          | 617 +++++++++---------
 src/backend/access/heap/visibilitymap.c       | 532 +++++++++++++++
 src/backend/commands/vacuum.c                 |  66 +-
 src/backend/utils/misc/guc_tables.c           |  12 +-
 src/backend/utils/misc/postgresql.conf.sample |   9 +-
 doc/src/sgml/config.sgml                      |  66 +-
 doc/src/sgml/maintenance.sgml                 |  78 +--
 doc/src/sgml/ref/vacuum.sgml                  |  10 +-
 src/test/regress/expected/reloptions.out      |   8 +-
 src/test/regress/sql/reloptions.sql           |   8 +-
 13 files changed, 1031 insertions(+), 413 deletions(-)

diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index daaa01a25..d8df744da 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -26,6 +26,17 @@
 #define VM_ALL_FROZEN(r, b, v) \
 	((visibilitymap_get_status((r), (b), (v)) & VISIBILITYMAP_ALL_FROZEN) != 0)
 
+/* Snapshot of visibility map at a point in time */
+typedef struct vmsnapshot vmsnapshot;
+
+/* VACUUM scanning strategy */
+typedef enum vmstrategy
+{
+	VMSNAP_SCAN_LAZY,			/* Skip all-visible and all-frozen pages */
+	VMSNAP_SCAN_EAGER,			/* Only skip all-frozen pages */
+	VMSNAP_SCAN_ALL				/* Don't skip any pages (scan them instead) */
+} vmstrategy;
+
 extern bool visibilitymap_clear(Relation rel, BlockNumber heapBlk,
 								Buffer vmbuf, uint8 flags);
 extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk,
@@ -35,6 +46,12 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 							  XLogRecPtr recptr, Buffer vmBuf, TransactionId cutoff_xid,
 							  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+extern vmsnapshot *visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+											  BlockNumber *scanned_pages_lazy,
+											  BlockNumber *scanned_pages_eager);
+extern void visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat);
+extern BlockNumber visibilitymap_snap_next(vmsnapshot *vmsnap);
+extern void visibilitymap_snap_release(vmsnapshot *vmsnap);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
 extern BlockNumber visibilitymap_prepare_truncate(Relation rel,
 												  BlockNumber nheapblocks);
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 50cc6fce5..18a56efbd 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
 #define VACOPT_FULL 0x10		/* FULL (non-concurrent) vacuum */
 #define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
 #define VACOPT_PROCESS_TOAST 0x40	/* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80	/* don't skip using VM */
 #define VACOPT_SKIP_DATABASE_STATS 0x100	/* skip vac_update_datfrozenxid() */
 #define VACOPT_ONLY_DATABASE_STATS 0x200	/* only vac_update_datfrozenxid() */
 
@@ -285,6 +285,24 @@ struct VacuumCutoffs
 	 * Threshold that triggers VACUUM's eager freezing strategy
 	 */
 	BlockNumber freeze_strategy_threshold_pages;
+
+	/*
+	 * The tableagefrac value 1.0 represents the point that autovacuum.c
+	 * scheduling (and VACUUM itself) considers relfrozenxid/relminmxid
+	 * advancement strictly necessary.  Values near 0.0 mean that both
+	 * relfrozenxid and relminmxid are a recently allocated XID/MXID.
+	 *
+	 * We don't need separate relfrozenxid and relminmxid tableagefrac
+	 * variants.  We base tableagefrac on whichever pg_class field is closer
+	 * to the point of having autovacuum.c launch an autovacuum to advance the
+	 * field's value.
+	 *
+	 * Lower values provide useful context, and influence whether VACUUM will
+	 * opt to advance relfrozenxid before the point that it is strictly
+	 * necessary.  VACUUM can (and often does) opt to advance relfrozenxid
+	 * and/or relminmxid proactively.
+	 */
+	double		tableagefrac;
 };
 
 /*
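
As a hypothetical sketch (not the patch's actual code, which derives
tableagefrac in vacuum.c outside this hunk), the fraction described in the
struct comment above might be computed along these lines, taking whichever of
the XID and MXID ages is closer to forcing an antiwraparound autovacuum.  The
function name and the direct age/max-age division are assumptions:

/*
 * Hypothetical sketch only -- not the patch's code.  tableagefrac is
 * described as reaching 1.0 at the point where autovacuum.c considers
 * relfrozenxid/relminmxid advancement strictly necessary, based on
 * whichever of the two pg_class fields is closer to that point.
 */
#include <stdio.h>

static double
tableagefrac_sketch(unsigned int xid_age, unsigned int freeze_max_age,
					unsigned int mxid_age, unsigned int mxid_freeze_max_age)
{
	double		xidfrac = (double) xid_age / freeze_max_age;
	double		mxidfrac = (double) mxid_age / mxid_freeze_max_age;

	return (xidfrac > mxidfrac) ? xidfrac : mxidfrac;
}

int
main(void)
{
	/* relfrozenxid 150M XIDs old against the 200M default cutoff */
	printf("tableagefrac ~= %.2f\n",
		   tableagefrac_sketch(150000000, 200000000, 1000000, 400000000));
	return 0;
}
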
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 152f6c2d6..602befa1d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7057,6 +7057,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold_pages = 0;
+	cutoffs.tableagefrac = 0;
 
 	pagefrz.freeze_required = true;
 	pagefrz.FreezePageRelfrozenXid = FreezeLimit;
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 03ea36624..b5a4094ba 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -11,8 +11,8 @@
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead TIDs.  We initially
  * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables).  If the array threatens to overflow, we must call
+ * the number of pages we'll scan (this limit ensures we don't allocate a huge
+ * area for TIDs uselessly).  If the array threatens to overflow, we must call
  * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
  * This frees up the memory space dedicated to storing dead TIDs.
  *
@@ -110,10 +110,18 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Before we consider skipping a page that's marked as clean in
- * visibility map, we must've seen at least this many clean pages.
+ * tableagefrac-wise cutoffs influencing VACUUM's choice of scanning strategy
  */
-#define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
+#define TABLEAGEFRAC_MIDPOINT		0.5 /* half way to antiwraparound AV */
+#define TABLEAGEFRAC_HIGHPOINT		0.9 /* Eagerness now mandatory */
+
+/*
+ * Thresholds (expressed as a proportion of rel_pages) that determine the
+ * cutoff (in extra pages scanned) for eager vmsnap scanning behavior at
+ * particular tableagefrac-wise table ages
+ */
+#define MAX_PAGES_YOUNG_TABLEAGE	0.05	/* 5% of rel_pages */
+#define MAX_PAGES_OLD_TABLEAGE		0.70	/* 70% of rel_pages */
 
 /*
  * Size of the prefetch window for lazy vacuum backwards truncation scan.
@@ -151,8 +159,6 @@ typedef struct LVRelState
 
 	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
 	bool		aggressive;
-	/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
-	bool		skipwithvm;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -171,7 +177,9 @@ typedef struct LVRelState
 	/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
 	TransactionId NewRelfrozenXid;
 	MultiXactId NewRelminMxid;
-	bool		skippedallvis;
+	/* Immutable snapshot of visibility map (as of time that VACUUM began) */
+	vmsnapshot *vmsnap;
+	vmstrategy	vmstrat;
 
 	/* Error reporting state */
 	char	   *dbname;
@@ -223,6 +231,7 @@ typedef struct LVPagePruneState
 {
 	bool		hastup;			/* Page prevents rel truncation? */
 	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
+	bool		pd_allvis_corrupt;	/* PD_ALL_VISIBLE bit spuriously set? */
 
 	/*
 	 * State describes the proper VM bit states to set for the page following
@@ -245,11 +254,8 @@ typedef struct LVSavedErrInfo
 
 /* non-export function prototypes */
 static void lazy_scan_heap(LVRelState *vacrel);
-static void lazy_scan_strategy(LVRelState *vacrel);
-static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
-								  BlockNumber next_block,
-								  bool *next_unskippable_allvis,
-								  bool *skipping_current_range);
+static BlockNumber lazy_scan_strategy(LVRelState *vacrel,
+									  bool force_scan_all);
 static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
 								   BlockNumber blkno, Page page,
 								   bool sharelock, Buffer vmbuffer);
@@ -279,7 +285,8 @@ static bool should_attempt_truncation(LVRelState *vacrel);
 static void lazy_truncate_heap(LVRelState *vacrel);
 static BlockNumber count_nondeletable_pages(LVRelState *vacrel,
 											bool *lock_waiter_detected);
-static void dead_items_alloc(LVRelState *vacrel, int nworkers);
+static void dead_items_alloc(LVRelState *vacrel, int nworkers,
+							 BlockNumber scanned_pages);
 static void dead_items_cleanup(LVRelState *vacrel);
 static bool heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 									 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -311,10 +318,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	LVRelState *vacrel;
 	bool		verbose,
 				instrument,
-				skipwithvm,
 				frozenxid_updated,
 				minmulti_updated;
 	BlockNumber orig_rel_pages,
+				scanned_pages,
 				new_rel_pages,
 				new_rel_allvisible;
 	PGRUsage	ru0;
@@ -461,37 +468,29 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/* Initialize state used to track oldest extant XID/MXID */
 	vacrel->NewRelfrozenXid = vacrel->cutoffs.OldestXmin;
 	vacrel->NewRelminMxid = vacrel->cutoffs.OldestMxact;
-	vacrel->skippedallvis = false;
-	skipwithvm = true;
-	if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
-	{
-		/*
-		 * Force aggressive mode, and disable skipping blocks using the
-		 * visibility map (even those set all-frozen)
-		 */
-		vacrel->aggressive = true;
-		skipwithvm = false;
-	}
-
-	vacrel->skipwithvm = skipwithvm;
 
 	/*
-	 * Now determine VACUUM's freezing strategy.
+	 * Now determine VACUUM's freezing and scanning strategies.
+	 *
+	 * This process is driven in part by information from VACUUM's visibility
+	 * map snapshot, which will be acquired in passing.  lazy_scan_heap will
+	 * use the same immutable VM snapshot to determine which pages to scan.
+	 * Using an immutable structure (instead of the live visibility map) makes
+	 * VACUUM avoid scanning concurrently modified pages.  These pages can
+	 * only have deleted tuples that OldestXmin will consider RECENTLY_DEAD.
 	 */
-	lazy_scan_strategy(vacrel);
+	scanned_pages = lazy_scan_strategy(vacrel,
+									   (params->options &
+										VACOPT_DISABLE_PAGE_SKIPPING) != 0);
 	if (verbose)
-	{
-		if (vacrel->aggressive)
-			ereport(INFO,
-					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-		else
-			ereport(INFO,
-					(errmsg("vacuuming \"%s.%s.%s\"",
-							vacrel->dbname, vacrel->relnamespace,
-							vacrel->relname)));
-	}
+		ereport(INFO,
+				(errmsg("vacuuming \"%s.%s.%s\"",
+						vacrel->dbname, vacrel->relnamespace,
+						vacrel->relname),
+				 errdetail("Table has %u pages in total, of which %u pages (%.2f%% of total) will be scanned.",
+						   orig_rel_pages, scanned_pages,
+						   orig_rel_pages == 0 ? 100.0 :
+						   100.0 * scanned_pages / orig_rel_pages)));
 
 	/*
 	 * Allocate dead_items array memory using dead_items_alloc.  This handles
@@ -501,13 +500,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * is already dangerously old.)
 	 */
 	lazy_check_wraparound_failsafe(vacrel);
-	dead_items_alloc(vacrel, params->nworkers);
+	dead_items_alloc(vacrel, params->nworkers, scanned_pages);
 
 	/*
 	 * Call lazy_scan_heap to perform all required heap pruning, index
 	 * vacuuming, and heap vacuuming (plus related processing)
 	 */
 	lazy_scan_heap(vacrel);
+	Assert(vacrel->scanned_pages == scanned_pages);
 
 	/*
 	 * Free resources managed by dead_items_alloc.  This ends parallel mode in
@@ -554,12 +554,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
 									   vacrel->cutoffs.relminmxid,
 									   vacrel->NewRelminMxid));
-	if (vacrel->skippedallvis)
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
 		/*
-		 * Must keep original relfrozenxid in a non-aggressive VACUUM that
-		 * chose to skip an all-visible page range.  The state that tracks new
-		 * values will have missed unfrozen XIDs from the pages we skipped.
+		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
+		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
 		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
@@ -604,6 +603,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 vacrel->missed_dead_tuples);
 	pgstat_progress_end_command();
 
+	/* Done with rel's visibility map snapshot */
+	visibilitymap_snap_release(vacrel->vmsnap);
+
 	if (instrument)
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
@@ -631,10 +633,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			initStringInfo(&buf);
 			if (verbose)
 			{
-				/*
-				 * Aggressiveness already reported earlier, in dedicated
-				 * VACUUM VERBOSE ereport
-				 */
 				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
 			}
@@ -829,13 +827,10 @@ static void
 lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				blkno,
-				next_unskippable_block,
+				next_block_to_scan,
 				next_fsm_block_to_vacuum = 0;
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -849,46 +844,27 @@ lazy_scan_heap(LVRelState *vacrel)
 	initprog_val[2] = dead_items->max_items;
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
-	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
-	for (blkno = 0; blkno < rel_pages; blkno++)
+	next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap);
+	while (next_block_to_scan < rel_pages)
 	{
+		BlockNumber blkno = next_block_to_scan;
 		Buffer		buf;
 		Page		page;
-		bool		all_visible_according_to_vm;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		next_block_to_scan = visibilitymap_snap_next(vacrel->vmsnap);
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
-
-			if (skipping_current_range)
-				continue;
-
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/*
+		 * visibilitymap_snap_next must always force us to scan the last page
+		 * in rel (in the range of rel_pages) so that VACUUM can avoid useless
+		 * attempts at rel truncation (per should_attempt_truncation comments)
+		 */
+		Assert(next_block_to_scan > blkno);
+		Assert(next_block_to_scan < rel_pages || blkno == rel_pages - 1);
 
 		vacrel->scanned_pages++;
 
-		/* Report as block scanned, update error traceback information */
+		/* Report all blocks < blkno as initial-heap-pass processed */
 		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
 								 blkno, InvalidOffsetNumber);
@@ -1025,12 +1001,24 @@ lazy_scan_heap(LVRelState *vacrel)
 		 */
 		lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
 
-		Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
-
 		/* Remember the location of the last page with nonremovable tuples */
 		if (prunestate.hastup)
 			vacrel->nonempty_pages = blkno + 1;
 
+		/*
+		 * Clear PD_ALL_VISIBLE (and page's visibility map bits) in the event
+		 * of lazy_scan_prune detecting an inconsistency
+		 */
+		if (unlikely(prunestate.pd_allvis_corrupt))
+		{
+			elog(WARNING, "page containing dead tuples has PD_ALL_VISIBLE set in relation \"%s\" page %u",
+				 vacrel->relname, blkno);
+			PageClearAllVisible(page);
+			MarkBufferDirty(buf);
+			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
+								VISIBILITYMAP_VALID_BITS);
+		}
+
 		if (vacrel->nindexes == 0)
 		{
 			/*
@@ -1089,10 +1077,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
+		 * Set visibility map bits based on prunestate's instructions
 		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
+		if (prunestate.all_visible)
 		{
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
@@ -1102,34 +1089,36 @@ lazy_scan_heap(LVRelState *vacrel)
 				flags |= VISIBILITYMAP_ALL_FROZEN;
 			}
 
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
+			if (!PageIsAllVisible(page))
+			{
+				/*
+				 * We could avoid dirtying the page just to set PD_ALL_VISIBLE
+				 * when checksums are disabled.  It is very likely that the
+				 * heap page is already dirty anyway, so keep the rule simple:
+				 * always dirty a page when setting its PD_ALL_VISIBLE bit.
+				 */
+				PageSetAllVisible(page);
+				MarkBufferDirty(buf);
+			}
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, prunestate.visibility_cutoff_xid,
 							  flags);
 		}
 
 		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
+		 * When the page isn't eligible to become all-visible, we defensively
+		 * check that PD_ALL_VISIBLE agrees with the visibility map instead.
+		 * If there is disagreement then we clear both VM bits to repair.
+		 *
+		 * We don't expect (and deliberately avoid testing) mutual agreement;
+		 * it's okay for PD_ALL_VISIBLE to be set while both visibility map
+		 * bits remain unset (iff checksums are disabled).  It's even okay for
+		 * prunestate's all_visible flag to disagree with PD_ALL_VISIBLE here
+		 * (lazy_scan_prune's pd_allvis_corrupt comments explain why that is).
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
-				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
+		else if (!PageIsAllVisible(page) &&
+				 unlikely(visibilitymap_get_status(vacrel->rel, blkno,
+												   &vmbuffer) != 0))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1137,65 +1126,6 @@ lazy_scan_heap(LVRelState *vacrel)
 								VISIBILITYMAP_VALID_BITS);
 		}
 
-		/*
-		 * It's possible for the value returned by
-		 * GetOldestNonRemovableTransactionId() to move backwards, so it's not
-		 * wrong for us to see tuples that appear to not be visible to
-		 * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
-		 * xmin value never moves backwards, but
-		 * GetOldestNonRemovableTransactionId() is conservative and sometimes
-		 * returns a value that's unnecessarily small, so if we see that
-		 * contradiction it just means that the tuples that we think are not
-		 * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
-		 * is correct.
-		 *
-		 * There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
-		 * set, however.
-		 */
-		else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
-		{
-			elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
-				 vacrel->relname, blkno);
-			PageClearAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
-								VISIBILITYMAP_VALID_BITS);
-		}
-
-		/*
-		 * If the all-visible page is all-frozen but not marked as such yet,
-		 * mark it as all-frozen.  Note that all_frozen is only valid if
-		 * all_visible is true, so we must check both prunestate fields.
-		 */
-		else if (all_visible_according_to_vm && prunestate.all_visible &&
-				 prunestate.all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
-		{
-			/*
-			 * Avoid relying on all_visible_according_to_vm as a proxy for the
-			 * page-level PD_ALL_VISIBLE bit being set, since it might have
-			 * become stale -- even when all_visible is set in prunestate
-			 */
-			if (!PageIsAllVisible(page))
-			{
-				PageSetAllVisible(page);
-				MarkBufferDirty(buf);
-			}
-
-			/*
-			 * Set the page all-frozen (and all-visible) in the VM.
-			 *
-			 * We can pass InvalidTransactionId as our visibility_cutoff_xid,
-			 * since a snapshotConflictHorizon sufficient to make everything
-			 * safe for REDO was logged when the page's tuples were frozen.
-			 */
-			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
-			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
-							  VISIBILITYMAP_ALL_VISIBLE |
-							  VISIBILITYMAP_ALL_FROZEN);
-		}
-
 		/*
 		 * Final steps for block: drop cleanup lock, record free space in the
 		 * FSM
@@ -1232,12 +1162,13 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 	}
 
+	/* initial heap pass finished (final pass may still be required) */
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
 
-	/* report that everything is now scanned */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+	/* report all blocks as initial-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, rel_pages);
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1254,20 +1185,26 @@ lazy_scan_heap(LVRelState *vacrel)
 
 	/*
 	 * Do index vacuuming (call each index's ambulkdelete routine), then do
-	 * related heap vacuuming
+	 * related heap vacuuming in final heap pass
 	 */
 	if (dead_items->num_items > 0)
 		lazy_vacuum(vacrel);
 
 	/*
-	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
-	 * not there were indexes, and whether or not we bypassed index vacuuming.
+	 * Now that both our initial heap pass and final heap pass (if any) have
+	 * ended, vacuum the Free Space Map. (Actually, similar FSM vacuuming will
+	 * have taken place earlier when VACUUM needed to call lazy_vacuum to deal
+	 * with running out of dead_items space.  Hopefully that will be rare.)
 	 */
-	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+	if (rel_pages > 0)
+	{
+		Assert(vacrel->scanned_pages > 0);
+		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
+								rel_pages);
+	}
 
-	/* report all blocks vacuumed */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+	/* report all blocks as final-heap-pass processed */
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, rel_pages);
 
 	/* Do final index cleanup (call each index's amvacuumcleanup routine) */
 	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
@@ -1275,7 +1212,7 @@ lazy_scan_heap(LVRelState *vacrel)
 }
 
 /*
- *	lazy_scan_strategy() -- Determine freezing strategy.
+ *	lazy_scan_strategy() -- Determine freezing/vmsnap scanning strategies.
  *
  * Our lazy freezing strategy is useful when putting off the work of freezing
  * totally avoids freezing that turns out to have been wasted effort later on.
@@ -1283,11 +1220,42 @@ lazy_scan_heap(LVRelState *vacrel)
  * continual growth, where freezing pages proactively is needed just to avoid
  * falling behind on freezing (eagerness is also likely to be cheaper in the
  * short/medium term for such tables, but the long term picture matters most).
+ *
+ * Our lazy vmsnap scanning strategy is useful when we can save a significant
+ * amount of work in the short term by not advancing relfrozenxid/relminmxid.
+ * Our eager vmsnap scanning strategy is useful when there is hardly any work
+ * avoided by being lazy anyway, and/or when tableagefrac is nearing or has
+ * already surpassed 1.0, which is the point of antiwraparound autovacuuming.
+ *
+ * Freezing and scanning strategies are structured as two independent choices,
+ * but they are not independent in any practical sense (it's just mechanical).
+ * Eager and lazy behaviors go hand in hand, since the choice of each strategy
+ * is driven by similar considerations about the needs of the target table.
+ * Moreover, choosing eager scanning strategy can easily result in freezing
+ * many more pages (compared to an equivalent lazy scanning strategy VACUUM),
+ * since VACUUM can only freeze pages that it actually scans.  (All-visible
+ * pages may well have XIDs < FreezeLimit by now, but VACUUM has no way of
+ * noticing that it should freeze such pages besides just scanning them.)
+ *
+ * The single most important justification for the eager behaviors is system
+ * level performance stability.  It is often better to freeze all-visible
+ * pages before we're truly forced to (just to advance relfrozenxid) as a way
+ * of avoiding big spikes, where VACUUM has to freeze many pages all at once.
+ *
+ * Returns final scanned_pages for the VACUUM operation.  The exact number of
+ * pages that lazy_scan_heap scans depends in part on which vmsnap scanning
+ * strategy we choose (only eager scanning will scan rel's all-visible pages).
  */
-static void
-lazy_scan_strategy(LVRelState *vacrel)
+static BlockNumber
+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 {
-	BlockNumber rel_pages = vacrel->rel_pages;
+	BlockNumber rel_pages = vacrel->rel_pages,
+				scanned_pages_lazy,
+				scanned_pages_eager,
+				nextra_scanned_eager,
+				nextra_young_threshold,
+				nextra_old_threshold,
+				nextra_toomany_threshold;
 
 	/*
 	 * Decide freezing strategy.
@@ -1295,125 +1263,159 @@ lazy_scan_strategy(LVRelState *vacrel)
 	 * The eager freezing strategy is used whenever rel_pages exceeds a
 	 * threshold controlled by the freeze_strategy_threshold GUC/reloption.
 	 *
+	 * Also freeze eagerly whenever table age is close to requiring (or is
+	 * actually undergoing) an antiwraparound autovacuum.  This may delay the
+	 * next antiwraparound autovacuum against the table.  We avoid relying on
+	 * them, if at all possible (mostly-static tables tend to rely on them).
+	 *
 	 * Also freeze eagerly with an unlogged or temp table, where the total
 	 * cost of freezing pages is mostly just the cycles needed to prepare a
 	 * set of freeze plans.  Executing the freeze plans adds very little cost.
 	 * Dirtying extra pages isn't a concern, either; VACUUM will definitely
 	 * set PD_ALL_VISIBLE on affected pages, regardless of freezing strategy.
+	 *
+	 * Once a table first becomes big enough for eager freezing, it's almost
+	 * inevitable that it will also naturally settle into a cadence where
+	 * relfrozenxid is advanced during every VACUUM (barring rel truncation).
+	 * This is a consequence of eager freezing strategy avoiding creating new
+	 * all-visible pages: if there never are any all-visible pages (if all
+	 * skippable pages are fully all-frozen), then there is no way that lazy
+	 * scanning strategy can ever look better than eager scanning strategy.
+	 * There are still ways that the occasional all-visible page could slip
+	 * into a table that we always freeze eagerly (at least when its tuples
+	 * tend to contain MultiXacts), but that should have negligible impact.
 	 */
 	vacrel->eager_freeze_strategy =
 		(rel_pages > vacrel->cutoffs.freeze_strategy_threshold_pages ||
+		 vacrel->cutoffs.tableagefrac > TABLEAGEFRAC_HIGHPOINT ||
 		 !RelationIsPermanent(vacrel->rel));
-}
-
-/*
- *	lazy_scan_skip() -- set up range of skippable blocks using visibility map.
- *
- * lazy_scan_heap() calls here every time it needs to set up a new range of
- * blocks to skip via the visibility map.  Caller passes the next block in
- * line.  We return a next_unskippable_block for this range.  When there are
- * no skippable blocks we just return caller's next_block.  The all-visible
- * status of the returned block is set in *next_unskippable_allvis for caller,
- * too.  Block usually won't be all-visible (since it's unskippable), but it
- * can be during aggressive VACUUMs (as well as in certain edge cases).
- *
- * Sets *skipping_current_range to indicate if caller should skip this range.
- * Costs and benefits drive our decision.  Very small ranges won't be skipped.
- *
- * Note: our opinion of which blocks can be skipped can go stale immediately.
- * It's okay if caller "misses" a page whose all-visible or all-frozen marking
- * was concurrently cleared, though.  All that matters is that caller scan all
- * pages whose tuples might contain XIDs < OldestXmin, or MXIDs < OldestMxact.
- * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
- * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
- * choice to skip such a range is actually made, making everything safe.)
- */
-static BlockNumber
-lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
-			   bool *next_unskippable_allvis, bool *skipping_current_range)
-{
-	BlockNumber rel_pages = vacrel->rel_pages,
-				next_unskippable_block = next_block,
-				nskippable_blocks = 0;
-	bool		skipsallvis = false;
-
-	*next_unskippable_allvis = true;
-	while (next_unskippable_block < rel_pages)
-	{
-		uint8		mapbits = visibilitymap_get_status(vacrel->rel,
-													   next_unskippable_block,
-													   vmbuffer);
-
-		if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
-		{
-			Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Caller must scan the last page to determine whether it has tuples
-		 * (caller must have the opportunity to set vacrel->nonempty_pages).
-		 * This rule avoids having lazy_truncate_heap() take access-exclusive
-		 * lock on rel to attempt a truncation that fails anyway, just because
-		 * there are tuples on the last page (it is likely that there will be
-		 * tuples on other nearby pages as well, but those can be skipped).
-		 *
-		 * Implement this by always treating the last block as unsafe to skip.
-		 */
-		if (next_unskippable_block == rel_pages - 1)
-			break;
-
-		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
-		if (!vacrel->skipwithvm)
-		{
-			/* Caller shouldn't rely on all_visible_according_to_vm */
-			*next_unskippable_allvis = false;
-			break;
-		}
-
-		/*
-		 * Aggressive VACUUM caller can't skip pages just because they are
-		 * all-visible.  They may still skip all-frozen pages, which can't
-		 * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
-		 */
-		if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
-		{
-			if (vacrel->aggressive)
-				break;
-
-			/*
-			 * All-visible block is safe to skip in non-aggressive case.  But
-			 * remember that the final range contains such a block for later.
-			 */
-			skipsallvis = true;
-		}
-
-		vacuum_delay_point();
-		next_unskippable_block++;
-		nskippable_blocks++;
-	}
 
 	/*
-	 * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
-	 * pages.  Since we're reading sequentially, the OS should be doing
-	 * readahead for us, so there's no gain in skipping a page now and then.
-	 * Skipping such a range might even discourage sequential detection.
+	 * Decide vmsnap scanning strategy.
 	 *
-	 * This test also enables more frequent relfrozenxid advancement during
-	 * non-aggressive VACUUMs.  If the range has any all-visible pages then
-	 * skipping makes updating relfrozenxid unsafe, which is a real downside.
+	 * First acquire a visibility map snapshot, which determines the number of
+	 * pages that each vmsnap scanning strategy is required to scan for us in
+	 * passing.
+	 *
+	 * The number of "extra" scanned_pages added by choosing VMSNAP_SCAN_EAGER
+	 * over VMSNAP_SCAN_LAZY is a key input into the decision making process.
+	 * It is a good proxy for the added cost of applying our eager vmsnap
+	 * strategy during this particular VACUUM.  (We may or may not have to
+	 * dirty/freeze the extra pages when we scan them, which isn't something
+	 * that we try to model.  It shouldn't matter very much at this level.)
 	 */
-	if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
-		*skipping_current_range = false;
+	vacrel->vmsnap = visibilitymap_snap_acquire(vacrel->rel, rel_pages,
+												&scanned_pages_lazy,
+												&scanned_pages_eager);
+	nextra_scanned_eager = scanned_pages_eager - scanned_pages_lazy;
+
+	/*
+	 * Next determine guideline "nextra_scanned_eager" thresholds, which are
+	 * applied based in part on tableagefrac (when nextra_toomany_threshold is
+	 * determined below).  These thresholds also represent the minimum and
+	 * maximum thresholds that can ever make sense for a table of this size
+	 * (when the table's age isn't old enough to make eagerness mandatory).
+	 *
+	 * For the most part we only care about relative (not absolute) costs.  We
+	 * want to advance relfrozenxid at an opportune time, during a VACUUM that
+	 * has to scan relatively many pages either way (whether due to the need
+	 * to remove dead tuples from many pages, or due to the table containing
+	 * lots of existing all-frozen pages, or due to a combination of both).
+	 * Even small tables (where lazy freezing is used) shouldn't have to do
+	 * dramatically more work than usual when advancing relfrozenxid, which
+	 * our policy of waiting for the right VACUUM largely avoids, in practice.
+	 */
+	nextra_young_threshold = (double) rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
+	nextra_old_threshold = (double) rel_pages * MAX_PAGES_OLD_TABLEAGE;
+
+	/*
+	 * Next determine nextra_toomany_threshold, which represents how many
+	 * extra scanned_pages are deemed too high a cost to pay for eagerness,
+	 * given present conditions.  This is our model's break-even point.
+	 */
+	if (vacrel->cutoffs.tableagefrac < TABLEAGEFRAC_MIDPOINT)
+	{
+		/*
+		 * The table's age is still below table age mid point, so table age is
+		 * still of only minimal concern.  We're still willing to act eagerly
+		 * when it's _very_ cheap to do so: when use of VMSNAP_SCAN_EAGER will
+		 * force us to scan some extra pages not exceeding 5% of rel_pages.
+		 */
+		nextra_toomany_threshold = nextra_young_threshold;
+	}
+	else if (vacrel->cutoffs.tableagefrac < TABLEAGEFRAC_HIGHPOINT)
+	{
+		double		nextra_scale;
+
+		/*
+		 * The table's age is starting to become a concern, but not to the
+		 * extent that we'll force the use of VMSNAP_SCAN_EAGER strategy.
+		 * We'll need to interpolate to get an nextra_scanned_eager-based
+		 * threshold.
+		 *
+		 * If tableagefrac is only barely over the midway point, then we'll
+		 * choose an nextra_scanned_eager threshold of ~5% of rel_pages.  The
+		 * opposite extreme occurs when tableagefrac is very near to the high
+		 * point.  That will make our nextra_scanned_eager threshold very
+		 * aggressive: we'll go with VMSNAP_SCAN_EAGER when doing so requires
+		 * we scan a number of extra blocks as high as ~70% of rel_pages.
+		 *
+		 * Note that the threshold grows (on a percentage basis) by ~8.1% of
+		 * rel_pages for every additional 5%-of-tableagefrac increment added
+		 * (after tableagefrac has crossed the 50%-of-tableagefrac mid point,
+		 * until the 90%-of-tableagefrac high point is reached, when we switch
+		 * over to not caring about the added cost of eager freezing at all).
+		 */
+		nextra_scale =
+			1.0 - ((TABLEAGEFRAC_HIGHPOINT - vacrel->cutoffs.tableagefrac) /
+				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
+
+		nextra_toomany_threshold =
+			(nextra_young_threshold * (1.0 - nextra_scale)) +
+			(nextra_old_threshold * nextra_scale);
+	}
 	else
 	{
-		*skipping_current_range = true;
-		if (skipsallvis)
-			vacrel->skippedallvis = true;
+		/*
+		 * The table's age is approaching (or may even surpass) the point that
+		 * an antiwraparound autovacuum is required.  Force VMSNAP_SCAN_EAGER,
+		 * no matter how expensive it is compared to VMSNAP_SCAN_LAZY.
+		 *
+		 * Note that there is a discontinuity when tableagefrac crosses this
+		 * 90%-of-tableagefrac high point: the threshold set here jumps from
+		 * 70% of rel_pages to 100% of rel_pages (InvalidBlockNumber, really).
+		 * It's useful to only care about table age once it gets this high.
+		 * That way even extreme cases will have at least some chance of using
+		 * eager scanning before an antiwraparound autovacuum is launched.
+		 */
+		nextra_toomany_threshold = InvalidBlockNumber;
 	}
 
-	return next_unskippable_block;
+	/* Make final choice on scanning strategy using final threshold */
+	nextra_toomany_threshold = Max(nextra_toomany_threshold, 32);
+	vacrel->vmstrat = (nextra_scanned_eager >= nextra_toomany_threshold ?
+					   VMSNAP_SCAN_LAZY : VMSNAP_SCAN_EAGER);
+
+	/*
+	 * VACUUM's DISABLE_PAGE_SKIPPING option overrides our decision by forcing
+	 * VACUUM to scan every page (VACUUM effectively distrusts rel's VM)
+	 */
+	if (force_scan_all)
+		vacrel->vmstrat = VMSNAP_SCAN_ALL;
+
+	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
+
+	/* Inform vmsnap infrastructure of our chosen strategy */
+	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
+
+	/* Return appropriate scanned_pages for final strategy chosen */
+	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
+		return scanned_pages_lazy;
+	if (vacrel->vmstrat == VMSNAP_SCAN_EAGER)
+		return scanned_pages_eager;
+
+	/* DISABLE_PAGE_SKIPPING/VMSNAP_SCAN_ALL case */
+	return rel_pages;
 }
 
 /*
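
To make the interpolation above concrete, here is a small standalone sketch
(an illustration, not the patch's code) that reproduces the
nextra_toomany_threshold calculation described in lazy_scan_strategy's
comments.  The real code additionally clamps the threshold to at least 32
pages and represents the forced-eager case as InvalidBlockNumber; the helper
name used here is made up:

#include <stdio.h>

#define TABLEAGEFRAC_MIDPOINT		0.5
#define TABLEAGEFRAC_HIGHPOINT		0.9
#define MAX_PAGES_YOUNG_TABLEAGE	0.05
#define MAX_PAGES_OLD_TABLEAGE		0.70

/* Break-even point: more extra eagerly scanned pages than this means "be lazy" */
static double
toomany_threshold(double rel_pages, double tableagefrac)
{
	double		young = rel_pages * MAX_PAGES_YOUNG_TABLEAGE;
	double		old = rel_pages * MAX_PAGES_OLD_TABLEAGE;
	double		scale;

	if (tableagefrac < TABLEAGEFRAC_MIDPOINT)
		return young;			/* table age of only minimal concern */
	if (tableagefrac >= TABLEAGEFRAC_HIGHPOINT)
		return rel_pages;		/* eager scanning effectively forced */

	/* Linear interpolation between 5% and 70% of rel_pages */
	scale = 1.0 - ((TABLEAGEFRAC_HIGHPOINT - tableagefrac) /
				   (TABLEAGEFRAC_HIGHPOINT - TABLEAGEFRAC_MIDPOINT));
	return young * (1.0 - scale) + old * scale;
}

int
main(void)
{
	double		rel_pages = 1000000;	/* ~8GB heap with 8KB pages */

	for (int i = 0; i <= 10; i++)
	{
		double		frac = i / 10.0;

		printf("tableagefrac %.1f -> threshold %.0f pages\n",
			   frac, toomany_threshold(rel_pages, frac));
	}
	return 0;
}

The jump from 70% of rel_pages straight to rel_pages at tableagefrac 0.9 is
the discontinuity that the comments call out.
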
@@ -1633,6 +1635,7 @@ retry:
 	 */
 	prunestate->hastup = false;
 	prunestate->has_lpdead_items = false;
+	prunestate->pd_allvis_corrupt = false;
 	prunestate->all_visible = true;
 	prunestate->all_frozen = true;
 	prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1966,12 +1969,26 @@ retry:
 		prunestate->all_visible = false;
 	}
 
-	/* Finally, add page-local counts to whole-VACUUM counts */
+	/* Add page-local counts to whole-VACUUM counts */
 	vacrel->tuples_deleted += tuples_deleted;
 	vacrel->tuples_frozen += tuples_frozen;
 	vacrel->lpdead_items += lpdead_items;
 	vacrel->live_tuples += live_tuples;
 	vacrel->recently_dead_tuples += recently_dead_tuples;
+
+	/*
+	 * There should never be dead or deleted tuples when PD_ALL_VISIBLE is
+	 * already set.  Check that now, to help caller maintain the VM correctly.
+	 *
+	 * We deliberately avoid indicating corruption when a tuple was found to
+	 * be HEAPTUPLE_INSERT_IN_PROGRESS on a page that has PD_ALL_VISIBLE set.
+	 * That would lead to false positives, since OldestXmin is conservative.
+	 * (It's possible that this VACUUM has an earlier OldestXmin than a VACUUM
+	 * that ran against the same table at some point in the recent past.)
+	 */
+	if (PageIsAllVisible(page) &&
+		(lpdead_items > 0 || tuples_deleted > 0 || recently_dead_tuples > 0))
+		prunestate->pd_allvis_corrupt = true;
 }
 
 /*
@@ -2503,6 +2520,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		vacuumed_pages++;
 	}
 
+	/* final heap pass finished */
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
@@ -2846,6 +2864,14 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
  * Also don't attempt it if we are doing early pruning/vacuuming, because a
  * scan which cannot find a truncated heap page cannot determine that the
  * snapshot is too old to read that page.
+ *
+ * Note that we effectively rely on visibilitymap_snap_next() having forced
+ * VACUUM to scan the final page (rel_pages - 1) in all cases.  Without that,
+ * we'd tend to needlessly acquire an AccessExclusiveLock just to attempt rel
+ * truncation that is bound to fail.  VACUUM cannot set vacrel->nonempty_pages
+ * in pages that it skips using the VM, so we must avoid interpreting skipped
+ * pages as empty pages when it makes little sense.  Observing that the final
+ * page has tuples is a simple way of avoiding pathological locking behavior.
  */
 static bool
 should_attempt_truncation(LVRelState *vacrel)
@@ -3136,14 +3162,13 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 
 /*
  * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
+ * store, given the expected scanned_pages for this VACUUM operation,
+ * and given current maintenance_work_mem/autovacuum_work_mem setting.
  *
  * See the comments at the head of this file for rationale.
  */
 static int
-dead_items_max_items(LVRelState *vacrel)
+dead_items_max_items(LVRelState *vacrel, BlockNumber scanned_pages)
 {
 	int64		max_items;
 	int			vac_work_mem = IsAutoVacuumWorkerProcess() &&
@@ -3152,15 +3177,13 @@ dead_items_max_items(LVRelState *vacrel)
 
 	if (vacrel->nindexes > 0)
 	{
-		BlockNumber rel_pages = vacrel->rel_pages;
-
 		max_items = MAXDEADITEMS(vac_work_mem * 1024L);
 		max_items = Min(max_items, INT_MAX);
 		max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
 
 		/* curious coding here to ensure the multiplication can't overflow */
-		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
-			max_items = rel_pages * MaxHeapTuplesPerPage;
+		if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > scanned_pages)
+			max_items = scanned_pages * MaxHeapTuplesPerPage;
 
 		/* stay sane if small maintenance_work_mem */
 		max_items = Max(max_items, MaxHeapTuplesPerPage);
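
Since dead_items_max_items now sizes the dead-TID array from scanned_pages
rather than rel_pages, a simplified standalone sketch of the sizing rule may
help.  The constants below are approximations (MAXDEADITEMS and
MaxHeapTuplesPerPage depend on header sizes not shown in this excerpt):

#include <stdio.h>

#define ITEMPOINTER_SIZE	6		/* sizeof(ItemPointerData), roughly */
#define MAX_TUPLES_PER_PAGE	291		/* MaxHeapTuplesPerPage for 8KB pages */

static long
dead_items_max_items_sketch(long vac_work_mem_kb, long scanned_pages)
{
	long		max_items = vac_work_mem_kb * 1024L / ITEMPOINTER_SIZE;

	/* Never allocate more than the scanned pages could possibly need */
	if (max_items / MAX_TUPLES_PER_PAGE > scanned_pages)
		max_items = scanned_pages * MAX_TUPLES_PER_PAGE;

	/* Stay sane with a tiny maintenance_work_mem */
	if (max_items < MAX_TUPLES_PER_PAGE)
		max_items = MAX_TUPLES_PER_PAGE;
	return max_items;
}

int
main(void)
{
	/* 64MB of work memory, but only 1000 pages will be scanned */
	printf("max dead items: %ld\n", dead_items_max_items_sketch(65536, 1000));
	return 0;
}
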
@@ -3182,12 +3205,12 @@ dead_items_max_items(LVRelState *vacrel)
  * DSM when required.
  */
 static void
-dead_items_alloc(LVRelState *vacrel, int nworkers)
+dead_items_alloc(LVRelState *vacrel, int nworkers, BlockNumber scanned_pages)
 {
 	VacDeadItems *dead_items;
 	int			max_items;
 
-	max_items = dead_items_max_items(vacrel);
+	max_items = dead_items_max_items(vacrel, scanned_pages);
 	Assert(max_items >= MaxHeapTuplesPerPage);
 
 	/*
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 74ff01bb1..7e39c6a70 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -16,6 +16,10 @@
  *		visibilitymap_pin_ok - check whether correct map page is already pinned
  *		visibilitymap_set	 - set a bit in a previously pinned page
  *		visibilitymap_get_status - get status of bits
+ *		visibilitymap_snap_acquire - acquire snapshot of visibility map
+ *		visibilitymap_snap_strategy - set VACUUM's scanning strategy
+ *		visibilitymap_snap_next - get next block to scan from vmsnap
+ *		visibilitymap_snap_release - release previously acquired snapshot
  *		visibilitymap_count  - count number of bits set in visibility map
  *		visibilitymap_prepare_truncate -
  *			prepare for truncation of the visibility map
@@ -52,6 +56,10 @@
  *
  * VACUUM will normally skip pages for which the visibility map bit is set;
  * such pages can't contain any dead tuples and therefore don't need vacuuming.
+ * VACUUM uses a snapshot of the visibility map to avoid scanning pages whose
+ * visibility map bit gets concurrently unset.  This also provides us with a
+ * convenient way of performing I/O prefetching on behalf of VACUUM, since the
+ * pages that VACUUM's first heap pass will scan are fully predetermined.
  *
  * LOCKING
  *
@@ -92,10 +100,12 @@
 #include "access/xlogutils.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/inval.h"
+#include "utils/spccache.h"
 
 
 /*#define TRACE_VISIBILITYMAP */
@@ -124,9 +134,81 @@
 #define FROZEN_MASK64	UINT64CONST(0xaaaaaaaaaaaaaaaa) /* The upper bit of each
 														 * bit pair */
 
+/*
+ * Prefetching of heap pages takes place as VACUUM requests the next block in
+ * line from its visibility map snapshot
+ *
+ * XXX MIN_PREFETCH_SIZE of 32 is a little on the high side, but matches
+ * hard-coded constant used by vacuumlazy.c when prefetching for rel
+ * truncation.  Might be better to increase the maintenance_io_concurrency
+ * default, or to do nothing like this at all.
+ */
+#define STAGED_BUFSIZE			(MAX_IO_CONCURRENCY * 2)
+#define MIN_PREFETCH_SIZE		((BlockNumber) 32)
+
+/*
+ * Snapshot of visibility map at the start of a VACUUM operation
+ */
+struct vmsnapshot
+{
+	/* Target heap rel */
+	Relation	rel;
+	/* Scanning strategy used by VACUUM operation */
+	vmstrategy	strat;
+	/* Per-strategy final scanned_pages */
+	BlockNumber rel_pages;
+	BlockNumber scanned_pages_lazy;
+	BlockNumber scanned_pages_eager;
+
+	/*
+	 * Materialized visibility map state.
+	 *
+	 * VM snapshots spill to a temp file when required.
+	 */
+	BlockNumber nvmpages;
+	BufFile    *file;
+
+	/*
+	 * Prefetch distance, used to perform I/O prefetching of heap pages
+	 */
+	int			prefetch_distance;
+
+	/* Current VM page cached */
+	BlockNumber curvmpage;
+	char	   *rawmap;
+	PGAlignedBlock vmpage;
+
+	/* Staging area for blocks returned to VACUUM */
+	BlockNumber staged[STAGED_BUFSIZE];
+	int			current_nblocks_staged;
+
+	/*
+	 * Next block from range of rel_pages to consider placing in staged block
+	 * array (it will be placed there if it's going to be scanned by VACUUM)
+	 */
+	BlockNumber next_block;
+
+	/*
+	 * Number of blocks that we still need to return, and number of blocks
+	 * that we still need to prefetch
+	 */
+	BlockNumber scanned_pages_to_return;
+	BlockNumber scanned_pages_to_prefetch;
+
+	/* offset of next block in line to return (from staged) */
+	int			next_return_idx;
+	/* offset of next block in line to prefetch (from staged) */
+	int			next_prefetch_idx;
+	/* offset of first garbage/invalid element (from staged) */
+	int			first_invalid_idx;
+};
+
+
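
The staged[] array and the three index fields above implement a simple
prefetch window.  As a rough sketch of that idea only (the patch's actual
visibilitymap_snap_next logic appears further down and may differ in detail;
prefetch_block here merely stands in for PrefetchBuffer), the prefetch cursor
is kept a fixed distance ahead of the cursor that hands blocks back to VACUUM:

#include <stdio.h>

/* Stand-in for PrefetchBuffer(); just records the request here */
static void
prefetch_block(unsigned int blkno)
{
	printf("prefetch block %u\n", blkno);
}

/*
 * Return blocks from "staged" one at a time, keeping prefetching
 * "distance" blocks ahead of what has been returned so far.
 */
static void
drain_staged(const unsigned int *staged, int nstaged, int distance)
{
	int			next_return = 0;
	int			next_prefetch = 0;

	/* Initial window, as visibilitymap_snap_strategy does */
	while (next_prefetch < nstaged && next_prefetch < distance)
		prefetch_block(staged[next_prefetch++]);

	while (next_return < nstaged)
	{
		printf("return block %u\n", staged[next_return++]);

		/* Top up the window so it stays "distance" blocks ahead */
		if (next_prefetch < nstaged)
			prefetch_block(staged[next_prefetch++]);
	}
}

int
main(void)
{
	unsigned int staged[] = {2, 5, 6, 9, 10, 11, 17};

	drain_staged(staged, 7, 3);
	return 0;
}
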
 /* prototypes for internal routines */
 static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
 static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static void vm_snap_stage_blocks(vmsnapshot *vmsnap);
+static uint8 vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk);
 
 
 /*
@@ -376,6 +458,354 @@ visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf)
 	return result;
 }
 
+/*
+ *	visibilitymap_snap_acquire - get read-only snapshot of visibility map
+ *
+ * Initializes VACUUM caller's snapshot, allocating memory in current context.
+ * Used by VACUUM to determine which pages it must scan up front.
+ *
+ * Set scanned_pages_lazy and scanned_pages_eager to help VACUUM decide on its
+ * scanning strategy.  These are VACUUM's scanned_pages when it opts to skip
+ * all eligible pages and scanned_pages when it opts to just skip all-frozen
+ * pages, respectively.
+ *
+ * Caller finalizes scanning strategy by calling visibilitymap_snap_strategy.
+ * This determines the kind of blocks visibilitymap_snap_next should indicate
+ * need to be scanned by VACUUM.
+ */
+vmsnapshot *
+visibilitymap_snap_acquire(Relation rel, BlockNumber rel_pages,
+						   BlockNumber *scanned_pages_lazy,
+						   BlockNumber *scanned_pages_eager)
+{
+	BlockNumber nvmpages = 0,
+				mapBlockLast = 0,
+				all_visible = 0,
+				all_frozen = 0;
+	uint8		mapbits_last_page = 0;
+	vmsnapshot *vmsnap;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_acquire %s %u",
+		 RelationGetRelationName(rel), rel_pages);
+#endif
+
+	/*
+	 * Allocate space for VM pages up to and including those required to have
+	 * bits for the would-be heap block that is just beyond rel_pages
+	 */
+	if (rel_pages > 0)
+	{
+		mapBlockLast = HEAPBLK_TO_MAPBLOCK(rel_pages - 1);
+		nvmpages = mapBlockLast + 1;
+	}
+
+	/* Allocate and initialize VM snapshot state */
+	vmsnap = palloc0(sizeof(vmsnapshot));
+	vmsnap->rel = rel;
+	vmsnap->strat = VMSNAP_SCAN_ALL;	/* for now */
+	vmsnap->rel_pages = rel_pages;	/* scanned_pages for VMSNAP_SCAN_ALL */
+	vmsnap->scanned_pages_lazy = 0;
+	vmsnap->scanned_pages_eager = 0;
+
+	/*
+	 * vmsnap temp file state.
+	 *
+	 * Only relations large enough to need more than one visibility map page
+	 * use a temp file (cannot wholly rely on vmsnap's single page cache).
+	 */
+	vmsnap->nvmpages = nvmpages;
+	vmsnap->file = NULL;
+	if (nvmpages > 1)
+		vmsnap->file = BufFileCreateTemp(false);
+	vmsnap->prefetch_distance = 0;
+#ifdef USE_PREFETCH
+	vmsnap->prefetch_distance =
+		get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
+#endif
+	vmsnap->prefetch_distance = Max(vmsnap->prefetch_distance, MIN_PREFETCH_SIZE);
+
+	/* cache of VM pages read from temp file */
+	vmsnap->curvmpage = 0;
+	vmsnap->rawmap = NULL;
+
+	/* staged blocks array state */
+	vmsnap->current_nblocks_staged = 0;
+	vmsnap->next_block = 0;
+	vmsnap->scanned_pages_to_return = 0;
+	vmsnap->scanned_pages_to_prefetch = 0;
+	/* Offsets into staged blocks array */
+	vmsnap->next_return_idx = 0;
+	vmsnap->next_prefetch_idx = 0;
+	vmsnap->first_invalid_idx = 0;
+
+	for (BlockNumber mapBlock = 0; mapBlock <= mapBlockLast; mapBlock++)
+	{
+		Buffer		mapBuffer;
+		char	   *map;
+		uint64	   *umap;
+
+		mapBuffer = vm_readbuf(rel, mapBlock, false);
+		if (!BufferIsValid(mapBuffer))
+		{
+			/*
+			 * Not all VM pages available.  Remember that, so that we'll treat
+			 * relevant heap pages as not all-visible/all-frozen when asked.
+			 */
+			vmsnap->nvmpages = mapBlock;
+			break;
+		}
+
+		/* Cache page locally */
+		LockBuffer(mapBuffer, BUFFER_LOCK_SHARE);
+		memcpy(vmsnap->vmpage.data, BufferGetPage(mapBuffer), BLCKSZ);
+		UnlockReleaseBuffer(mapBuffer);
+
+		/* Finish off this VM page using snapshot's vmpage cache */
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = map = PageGetContents(vmsnap->vmpage.data);
+		umap = (uint64 *) map;
+
+		if (mapBlock == mapBlockLast)
+		{
+			uint32		mapByte;
+			uint8		mapOffset;
+
+			/*
+			 * The last VM page requires some extra steps.
+			 *
+			 * First get the status of the last heap page (page in the range
+			 * of rel_pages) in passing.
+			 */
+			Assert(mapBlock == HEAPBLK_TO_MAPBLOCK(rel_pages - 1));
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages - 1);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages - 1);
+			mapbits_last_page = ((map[mapByte] >> mapOffset) &
+								 VISIBILITYMAP_VALID_BITS);
+
+			/*
+			 * Also defensively "truncate" our local copy of the last page in
+			 * order to reliably exclude heap pages beyond the range of
+			 * rel_pages.  This is just paranoia.
+			 */
+			mapByte = HEAPBLK_TO_MAPBYTE(rel_pages);
+			mapOffset = HEAPBLK_TO_OFFSET(rel_pages);
+			if (mapByte != 0 || mapOffset != 0)
+			{
+				MemSet(&map[mapByte + 1], 0, MAPSIZE - (mapByte + 1));
+				map[mapByte] &= (1 << mapOffset) - 1;
+			}
+		}
+
+		/* Maintain count of all-frozen and all-visible pages */
+		for (int i = 0; i < MAPSIZE / sizeof(uint64); i++)
+		{
+			all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
+			all_frozen += pg_popcount64(umap[i] & FROZEN_MASK64);
+		}
+
+		/* Finally, write out vmpage cache VM page to vmsnap's temp file */
+		if (vmsnap->file)
+			BufFileWrite(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+	}
+
+	/*
+	 * Should always have at least as many all_visible pages as all_frozen
+	 * pages.  Even so, we generally only interpret a page as all-frozen
+	 * when both the all-visible and all-frozen bits are set together.  Clamp
+	 * so that we'll avoid giving our caller an obviously bogus summary of the
+	 * visibility map when certain pages only have their all-frozen bit set.
+	 * More paranoia.
+	 */
+	Assert(all_frozen <= all_visible && all_visible <= rel_pages);
+	all_frozen = Min(all_frozen, all_visible);
+
+	/*
+	 * Done copying all VM pages from authoritative VM into a VM snapshot.
+	 *
+	 * Figure out the final scanned_pages for the two skipping policies that
+	 * we might use: skipallvis (skip both all-frozen and all-visible) and
+	 * skipallfrozen (just skip all-frozen).
+	 */
+	vmsnap->scanned_pages_lazy = rel_pages - all_visible;
+	vmsnap->scanned_pages_eager = rel_pages - all_frozen;
+
+	/*
+	 * When the last page is skippable in principle, it still won't be treated
+	 * as skippable by visibilitymap_snap_next, which recognizes the last page
+	 * as a special case.  Compensate by incrementing each scanning strategy's
+	 * scanned_pages as needed to avoid counting the last page as skippable.
+	 *
+	 * As usual we expect that the all-frozen bit can only be set alongside
+	 * the all-visible bit (for any given page), but only interpret a page as
+	 * truly all-frozen when both of its VM bits are set together.
+	 */
+	if (mapbits_last_page & VISIBILITYMAP_ALL_VISIBLE)
+	{
+		vmsnap->scanned_pages_lazy++;
+		if (mapbits_last_page & VISIBILITYMAP_ALL_FROZEN)
+			vmsnap->scanned_pages_eager++;
+	}
+
+	*scanned_pages_lazy = vmsnap->scanned_pages_lazy;
+	*scanned_pages_eager = vmsnap->scanned_pages_eager;
+
+	return vmsnap;
+}
+
+/*
+ *	visibilitymap_snap_strategy -- determine VACUUM's scanning strategy.
+ *
+ * VACUUM chooses a vmsnap strategy according to priorities around advancing
+ * relfrozenxid.  See visibilitymap_snap_acquire.
+ */
+void
+visibilitymap_snap_strategy(vmsnapshot *vmsnap, vmstrategy strat)
+{
+	int			nprefetch;
+
+	/* Remember final scanning strategy */
+	vmsnap->strat = strat;
+
+	if (vmsnap->strat == VMSNAP_SCAN_LAZY)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_lazy;
+	else if (vmsnap->strat == VMSNAP_SCAN_EAGER)
+		vmsnap->scanned_pages_to_return = vmsnap->scanned_pages_eager;
+	else
+		vmsnap->scanned_pages_to_return = vmsnap->rel_pages;
+
+	vmsnap->scanned_pages_to_prefetch = vmsnap->scanned_pages_to_return;
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_strategy %s %d %u",
+		 RelationGetRelationName(vmsnap->rel), (int) strat,
+		 vmsnap->scanned_pages_to_return);
+#endif
+
+	/*
+	 * Stage blocks (may have to read from temp file).
+	 *
+	 * We rely on the assumption that we'll always have a large enough staged
+	 * blocks array to accommodate any possible prefetch distance.
+	 */
+	vm_snap_stage_blocks(vmsnap);
+
+	nprefetch = Min(vmsnap->current_nblocks_staged, vmsnap->prefetch_distance);
+#ifdef USE_PREFETCH
+	for (int i = 0; i < nprefetch; i++)
+	{
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, vmsnap->staged[i]);
+	}
+#endif
+
+	vmsnap->scanned_pages_to_prefetch -= nprefetch;
+	vmsnap->next_prefetch_idx += nprefetch;
+}
+
+/*
+ *	visibilitymap_snap_next -- get next block to scan from vmsnap.
+ *
+ * Returns next block in line for VACUUM to scan according to vmsnap.  Caller
+ * skips any and all blocks preceding returned block.
+ *
+ * VACUUM always scans the last page to determine whether it has tuples.  This
+ * is useful as a way of avoiding certain pathological cases with heap rel
+ * truncation.  We always return the final block (rel_pages - 1) here last.
+ */
+BlockNumber
+visibilitymap_snap_next(vmsnapshot *vmsnap)
+{
+	BlockNumber next_block_to_scan;
+
+	if (vmsnap->scanned_pages_to_return == 0)
+		return InvalidBlockNumber;
+
+	/* Prepare to return this block */
+	next_block_to_scan = vmsnap->staged[vmsnap->next_return_idx++];
+	vmsnap->current_nblocks_staged--;
+	vmsnap->scanned_pages_to_return--;
+
+	/*
+	 * Did the staged blocks array just run out of blocks to return to caller,
+	 * or do we need to stage more blocks for I/O prefetching purposes?
+	 */
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+	if ((vmsnap->current_nblocks_staged == 0 &&
+		 vmsnap->scanned_pages_to_return > 0) ||
+		(vmsnap->next_prefetch_idx == vmsnap->first_invalid_idx &&
+		 vmsnap->scanned_pages_to_prefetch > 0))
+	{
+		if (vmsnap->current_nblocks_staged > 0)
+		{
+			/*
+			 * We've run out of prefetchable blocks, but still have some
+			 * non-returned blocks.  Shift existing blocks to the start of the
+			 * array.  The newly staged blocks go after these ones.
+			 */
+			memmove(&vmsnap->staged[0],
+					&vmsnap->staged[vmsnap->next_return_idx],
+					sizeof(BlockNumber) * vmsnap->current_nblocks_staged);
+		}
+
+		/*
+		 * Reset offsets in staged blocks array, while accounting for likely
+		 * presence of preexisting blocks that have already been prefetched
+		 * but have yet to be returned to VACUUM caller
+		 */
+		vmsnap->next_prefetch_idx -= vmsnap->next_return_idx;
+		vmsnap->first_invalid_idx -= vmsnap->next_return_idx;
+		vmsnap->next_return_idx = 0;
+
+		/* Stage more blocks (may have to read from temp file) */
+		vm_snap_stage_blocks(vmsnap);
+	}
+
+	/*
+	 * By here we're guaranteed to have at least one prefetchable block in the
+	 * staged blocks array (unless we've already prefetched all blocks that
+	 * will ever be returned to VACUUM caller)
+	 */
+	if (vmsnap->next_prefetch_idx < vmsnap->first_invalid_idx)
+	{
+#ifdef USE_PREFETCH
+		/* Still have remaining blocks to prefetch, so prefetch next one */
+		BlockNumber prefetch = vmsnap->staged[vmsnap->next_prefetch_idx++];
+
+		PrefetchBuffer(vmsnap->rel, MAIN_FORKNUM, prefetch);
+#else
+		vmsnap->next_prefetch_idx++;
+#endif
+		Assert(vmsnap->current_nblocks_staged > 1);
+		Assert(vmsnap->scanned_pages_to_prefetch > 0);
+		vmsnap->scanned_pages_to_prefetch--;
+	}
+	else
+	{
+		Assert(vmsnap->scanned_pages_to_prefetch == 0);
+	}
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "visibilitymap_snap_next %s %u",
+		 RelationGetRelationName(vmsnap->rel), next_block_to_scan);
+#endif
+
+	return next_block_to_scan;
+}
+
+/*
+ *	visibilitymap_snap_release - release previously acquired snapshot
+ *
+ * Frees resources allocated in visibilitymap_snap_acquire for VACUUM.
+ */
+void
+visibilitymap_snap_release(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->scanned_pages_to_return == 0);
+	if (vmsnap->file)
+		BufFileClose(vmsnap->file);
+	pfree(vmsnap);
+}
+
 /*
  *	visibilitymap_count  - count number of bits set in visibility map
  *
@@ -680,3 +1110,105 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
 
 	UnlockRelationForExtension(rel, ExclusiveLock);
 }
+
+/*
+ * Stage some heap blocks from vmsnap to return to VACUUM caller.
+ *
+ * Called when we completely run out of staged blocks to return to VACUUM, or
+ * when vmsnap still has some pending staged blocks, but too few to be able to
+ * prefetch incrementally as the remaining blocks are returned to VACUUM.
+ */
+static void
+vm_snap_stage_blocks(vmsnapshot *vmsnap)
+{
+	Assert(vmsnap->current_nblocks_staged < STAGED_BUFSIZE);
+	Assert(vmsnap->first_invalid_idx < STAGED_BUFSIZE);
+	Assert(vmsnap->next_return_idx <= vmsnap->first_invalid_idx);
+	Assert(vmsnap->next_prefetch_idx <= vmsnap->first_invalid_idx);
+
+	while (vmsnap->next_block < vmsnap->rel_pages &&
+		   vmsnap->current_nblocks_staged < STAGED_BUFSIZE)
+	{
+		for (;;)
+		{
+			uint8		mapbits = vm_snap_get_status(vmsnap,
+													 vmsnap->next_block);
+
+			if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+			{
+				Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+				break;
+			}
+
+			/*
+			 * Stop staging blocks just before final page, which must always
+			 * be scanned by VACUUM
+			 */
+			if (vmsnap->next_block == vmsnap->rel_pages - 1)
+				break;
+
+			/* VMSNAP_SCAN_ALL forcing VACUUM to scan every page? */
+			if (vmsnap->strat == VMSNAP_SCAN_ALL)
+				break;
+
+			/*
+			 * Check if VACUUM must scan this page because it's not all-frozen
+			 * and VACUUM opted to use VMSNAP_SCAN_EAGER strategy
+			 */
+			if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
+				vmsnap->strat == VMSNAP_SCAN_EAGER)
+				break;
+
+			/* VACUUM will skip this block -- so don't stage it for later */
+			vmsnap->next_block++;
+		}
+
+		/* VACUUM will scan this block, so stage it for later */
+		vmsnap->staged[vmsnap->first_invalid_idx++] = vmsnap->next_block++;
+		vmsnap->current_nblocks_staged++;
+	}
+}
+
+/*
+ * Get status of bits from vm snapshot
+ */
+static uint8
+vm_snap_get_status(vmsnapshot *vmsnap, BlockNumber heapBlk)
+{
+	BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+	uint32		mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+	uint8		mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
+
+#ifdef TRACE_VISIBILITYMAP
+	elog(DEBUG1, "vm_snap_get_status %u", heapBlk);
+#endif
+
+	/*
+	 * If we didn't see the VM page when the snapshot was first acquired we
+	 * defensively assume heapBlk is not all-visible or all-frozen
+	 */
+	Assert(heapBlk <= vmsnap->rel_pages);
+	if (unlikely(mapBlock >= vmsnap->nvmpages))
+		return 0;
+
+	/*
+	 * Read from temp file when required.
+	 *
+	 * Although this routine supports random access, sequential access is
+	 * expected.  We should only need to read each temp file page into cache
+	 * at most once per VACUUM.
+	 */
+	if (unlikely(mapBlock != vmsnap->curvmpage))
+	{
+		if (BufFileSeekBlock(vmsnap->file, mapBlock) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not seek to block %u of vmsnap temporary file",
+							mapBlock)));
+		BufFileReadExact(vmsnap->file, vmsnap->vmpage.data, BLCKSZ);
+		vmsnap->curvmpage = mapBlock;
+		vmsnap->rawmap = PageGetContents(vmsnap->vmpage.data);
+	}
+
+	return ((vmsnap->rawmap[mapByte] >> mapOffset) & VISIBILITYMAP_VALID_BITS);
+}
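
For reference, the calling pattern that the vmsnap interface expects from
vacuumlazy.c looks roughly like the sketch below.  This is illustrative only
(it is not part of the patch): the wrapper function is hypothetical, and the
exact visibilitymap_snap_acquire() argument order is inferred from the
function body above.

static void
vmsnap_usage_sketch(Relation rel, BlockNumber rel_pages)
{
	BlockNumber scanned_pages_lazy,
				scanned_pages_eager,
				blkno;
	vmsnapshot *vmsnap;

	/* Take an immutable snapshot of the VM, learning the cost of each skipping policy */
	vmsnap = visibilitymap_snap_acquire(rel, rel_pages,
										&scanned_pages_lazy,
										&scanned_pages_eager);

	/* Caller weighs scanned_pages_lazy against scanned_pages_eager, then commits */
	visibilitymap_snap_strategy(vmsnap, VMSNAP_SCAN_EAGER);

	/* Blocks are returned in ascending order; prefetching happens as a side effect */
	while ((blkno = visibilitymap_snap_next(vmsnap)) != InvalidBlockNumber)
	{
		/* ... read and process heap block blkno ... */
	}

	visibilitymap_snap_release(vmsnap);
}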
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 62bb87846..4d1e13d51 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -969,11 +969,11 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				freeze_strategy_threshold;
 	uint64		threshold_strategy_pages;
 	TransactionId nextXID,
-				safeOldestXmin,
-				aggressiveXIDCutoff;
+				safeOldestXmin;
 	MultiXactId nextMXID,
-				safeOldestMxact,
-				aggressiveMXIDCutoff;
+				safeOldestMxact;
+	double		XIDFrac,
+				MXIDFrac;
 
 	/* Use mutable copies of freeze age parameters */
 	freeze_min_age = params->freeze_min_age;
@@ -1113,48 +1113,50 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	cutoffs->freeze_strategy_threshold_pages = threshold_strategy_pages;
 
 	/*
-	 * Finally, figure out if caller needs to do an aggressive VACUUM or not.
-	 *
 	 * Determine the table freeze age to use: as specified by the caller, or
-	 * the value of the vacuum_freeze_table_age GUC, but in any case not more
-	 * than autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
-	 * VACUUM schedule, the nightly VACUUM gets a chance to freeze XIDs before
-	 * anti-wraparound autovacuum is launched.
+	 * the value of the vacuum_freeze_table_age GUC.  The GUC's default value
+	 * of -1 is interpreted as "just use autovacuum_freeze_max_age value".
+	 * Also clamp using autovacuum_freeze_max_age.
 	 */
 	if (freeze_table_age < 0)
 		freeze_table_age = vacuum_freeze_table_age;
-	freeze_table_age = Min(freeze_table_age, autovacuum_freeze_max_age * 0.95);
+	if (freeze_table_age < 0 || freeze_table_age > autovacuum_freeze_max_age)
+		freeze_table_age = autovacuum_freeze_max_age;
 	Assert(freeze_table_age >= 0);
-	aggressiveXIDCutoff = nextXID - freeze_table_age;
-	if (!TransactionIdIsNormal(aggressiveXIDCutoff))
-		aggressiveXIDCutoff = FirstNormalTransactionId;
-	if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
-									  aggressiveXIDCutoff))
-		return true;
 
 	/*
 	 * Similar to the above, determine the table freeze age to use for
 	 * multixacts: as specified by the caller, or the value of the
-	 * vacuum_multixact_freeze_table_age GUC, but in any case not more than
-	 * effective_multixact_freeze_max_age * 0.95, so that if you have e.g.
-	 * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
-	 * multixacts before anti-wraparound autovacuum is launched.
+	 * vacuum_multixact_freeze_table_age GUC.  The GUC's default value of -1
+	 * is interpreted as "just use effective_multixact_freeze_max_age value".
+	 * Also clamp using effective_multixact_freeze_max_age.
 	 */
 	if (multixact_freeze_table_age < 0)
 		multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
-	multixact_freeze_table_age =
-		Min(multixact_freeze_table_age,
-			effective_multixact_freeze_max_age * 0.95);
+	if (multixact_freeze_table_age < 0 ||
+		multixact_freeze_table_age > effective_multixact_freeze_max_age)
+		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 	Assert(multixact_freeze_table_age >= 0);
-	aggressiveMXIDCutoff = nextMXID - multixact_freeze_table_age;
-	if (aggressiveMXIDCutoff < FirstMultiXactId)
-		aggressiveMXIDCutoff = FirstMultiXactId;
-	if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
-									aggressiveMXIDCutoff))
-		return true;
 
-	/* Non-aggressive VACUUM */
-	return false;
+	/*
+	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
+	 * XMID table age (whichever is greater currently).
+	 * MXID table age (whichever is greater currently).
+	XIDFrac = (double) (nextXID - cutoffs->relfrozenxid) /
+		((double) freeze_table_age + 0.5);
+	MXIDFrac = (double) (nextMXID - cutoffs->relminmxid) /
+		((double) multixact_freeze_table_age + 0.5);
+	cutoffs->tableagefrac = Max(XIDFrac, MXIDFrac);
+
+	/*
+	 * Make sure that antiwraparound autovacuums reliably advance relfrozenxid
+	 * to the satisfaction of autovacuum.c, even when the reloption version of
+	 * autovacuum_freeze_max_age happens to be in use
+	 */
+	if (params->is_wraparound)
+		cutoffs->tableagefrac = 1.0;
+
+	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
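
To make the new tableagefrac cutoff concrete, the toy standalone program below
(illustrative only, not part of the patch) shows how the fraction scales with
table age when autovacuum_freeze_max_age has its stock value of 200 million
and vacuum_freeze_table_age is left at its new default of -1, so that
freeze_table_age resolves to 200 million:

#include <stdio.h>

int
main(void)
{
	double		freeze_table_age = 200000000;	/* autovacuum_freeze_max_age default */
	double		table_age[] = {10e6, 50e6, 100e6, 200e6};

	for (int i = 0; i < 4; i++)
	{
		/* Mirrors the XIDFrac arithmetic in vacuum_get_cutoffs */
		double		tableagefrac = table_age[i] / (freeze_table_age + 0.5);

		/*
		 * tableagefrac >= 1.0 is the point where advancing relfrozenxid
		 * becomes mandatory (what used to force an aggressive VACUUM)
		 */
		printf("table age %.0f -> tableagefrac %.3f\n",
			   table_age[i], tableagefrac);
	}

	return 0;
}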
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 7a78d98d3..22c762c79 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2507,11 +2507,11 @@ struct config_int ConfigureNamesInt[] =
 
 	{
 		{"vacuum_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
-			gettext_noop("Age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("Age at which VACUUM must scan whole table to freeze tuples."),
+			gettext_noop("-1 to use autovacuum_freeze_max_age value.")
 		},
 		&vacuum_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
@@ -2527,11 +2527,11 @@ struct config_int ConfigureNamesInt[] =
 
 	{
 		{"vacuum_multixact_freeze_table_age", PGC_USERSET, CLIENT_CONN_STATEMENT,
-			gettext_noop("Multixact age at which VACUUM should scan whole table to freeze tuples."),
-			NULL
+			gettext_noop("Multixact age at which VACUUM must scan whole table to freeze tuples."),
+			gettext_noop("-1 to use autovacuum_multixact_freeze_max_age value.")
 		},
 		&vacuum_multixact_freeze_table_age,
-		150000000, 0, 2000000000,
+		-1, -1, 2000000000,
 		NULL, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index fda695e75..a0c4fc213 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -661,6 +661,13 @@
 					# autovacuum, -1 means use
 					# vacuum_cost_limit
 
+# - AUTOVACUUM compatibility options (legacy) -
+
+#vacuum_freeze_table_age = -1	# target maximum XID age, or -1 to
+					# use autovacuum_freeze_max_age
+#vacuum_multixact_freeze_table_age = -1	# target maximum MXID age, or -1 to
+					# use autovacuum_multixact_freeze_max_age
+
 
 #------------------------------------------------------------------------------
 # CLIENT CONNECTION DEFAULTS
@@ -694,10 +701,8 @@
 #lock_timeout = 0			# in milliseconds, 0 is disabled
 #idle_in_transaction_session_timeout = 0	# in milliseconds, 0 is disabled
 #idle_session_timeout = 0		# in milliseconds, 0 is disabled
-#vacuum_freeze_table_age = 150000000
 #vacuum_freeze_min_age = 50000000
 #vacuum_failsafe_age = 1600000000
-#vacuum_multixact_freeze_table_age = 150000000
 #vacuum_multixact_freeze_min_age = 5000000
 #vacuum_multixact_failsafe_age = 1600000000
 #vacuum_freeze_strategy_threshold = 4GB
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 39480c653..898659a10 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9248,20 +9248,28 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million transactions.  Although users can
-        set this value anywhere from zero to two billion, <command>VACUUM</command>
-        will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound autovacuum is launched for the table. For more
-        information see
-        <xref linkend="vacuum-for-wraparound"/>.
+        <command>VACUUM</command> reliably advances
+        <structfield>relfrozenxid</structfield> to a recent value if
+        the table's
+        <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+        field has reached the age specified by this setting.
+        The default is -1.  If -1 is specified, the value
+        of <xref linkend="guc-autovacuum-freeze-max-age"/> is used.
+        Although users can set this value anywhere from zero to two
+        billion, <command>VACUUM</command> will silently limit the
+        effective value to <xref
+         linkend="guc-autovacuum-freeze-max-age"/>. For more
+        information see <xref linkend="vacuum-for-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+         now take place more proactively, based on criteria that consider both
+         costs and benefits.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
@@ -9330,19 +9338,27 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        <command>VACUUM</command> performs an aggressive scan if the table's
-        <structname>pg_class</structname>.<structfield>relminmxid</structfield> field has reached
-        the age specified by this setting.  An aggressive scan differs from
-        a regular <command>VACUUM</command> in that it visits every page that might
-        contain unfrozen XIDs or MXIDs, not just those that might contain dead
-        tuples.  The default is 150 million multixacts.
-        Although users can set this value anywhere from zero to two billion,
-        <command>VACUUM</command> will silently limit the effective value to 95% of
-        <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>, so that a
-        periodic manual <command>VACUUM</command> has a chance to run before an
-        anti-wraparound is launched for the table.
-        For more information see <xref linkend="vacuum-for-multixact-wraparound"/>.
+       <command>VACUUM</command> reliably advances
+       <structfield>relminmxid</structfield> to a recent value if the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field has reached the age specified by this setting.
+       The default is -1.  If -1 is specified, the value of <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/> is used.
+       Although users can set this value anywhere from zero to two
+       billion, <command>VACUUM</command> will silently limit the
+       effective value to <xref
+        linkend="guc-autovacuum-multixact-freeze-max-age"/>. For more
+       information see <xref linkend="vacuum-for-multixact-wraparound"/>.
        </para>
+       <note>
+        <para>
+         The meaning of this parameter, and its default value, changed
+         in <productname>PostgreSQL</productname> 16.  Freezing and advancing
+         <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+         now take place more proactively, based on criteria that consider both
+         costs and benefits.
+        </para>
+       </note>
       </listitem>
      </varlistentry>
 
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 8d762bad2..63bd10f2b 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -529,13 +529,6 @@
     <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
     XID and MXID values, including those from all-visible but not all-frozen pages.
     In practice most tables require periodic aggressive vacuuming.
-    <xref linkend="guc-vacuum-freeze-table-age"/>
-    controls when <command>VACUUM</command> does that: all-visible but not all-frozen
-    pages are scanned if the number of transactions that have passed since the
-    last such scan is greater than <varname>vacuum_freeze_table_age</varname> minus
-    <varname>vacuum_freeze_min_age</varname>. Setting
-    <varname>vacuum_freeze_table_age</varname> to 0 forces <command>VACUUM</command> to
-    always use its aggressive strategy.
    </para>
 
    <para>
@@ -565,27 +558,9 @@
     <varname>vacuum_freeze_min_age</varname>.
    </para>
 
-   <para>
-    The effective maximum for <varname>vacuum_freeze_table_age</varname> is 0.95 *
-    <varname>autovacuum_freeze_max_age</varname>; a setting higher than that will be
-    capped to the maximum. A value higher than
-    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an
-    anti-wraparound autovacuum would be triggered at that point anyway, and
-    the 0.95 multiplier leaves some breathing room to run a manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
-    <command>vacuum_freeze_table_age</command> should be set to a value somewhat
-    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that
-    a regularly scheduled <command>VACUUM</command> or an autovacuum triggered by
-    normal delete and update activity is run in that window.  Setting it too
-    close could lead to anti-wraparound autovacuums, even though the table
-    was recently vacuumed to reclaim space, whereas lower values lead to more
-    frequent aggressive vacuuming.
-   </para>
-
    <para>
     The sole disadvantage of increasing <varname>autovacuum_freeze_max_age</varname>
-    (and <varname>vacuum_freeze_table_age</varname> along with it) is that
-    the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
+    is that the <filename>pg_xact</filename> and <filename>pg_commit_ts</filename>
     subdirectories of the database cluster will take more space, because it
     must store the commit status and (if <varname>track_commit_timestamp</varname> is
     enabled) timestamp of all transactions back to
@@ -662,7 +637,7 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
     advanced when every page of the table
     that might contain unfrozen XIDs is scanned.  This happens when
     <structfield>relfrozenxid</structfield> is more than
-    <varname>vacuum_freeze_table_age</varname> transactions old, when
+    <varname>autovacuum_freeze_max_age</varname> transactions old, when
     <command>VACUUM</command>'s <literal>FREEZE</literal> option is used, or when all
     pages that are not already all-frozen happen to
     require vacuuming to remove dead row versions. When <command>VACUUM</command>
@@ -680,6 +655,29 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
     be forced for the table.
    </para>
 
+   <tip>
+   <para>
+    <varname>vacuum_freeze_table_age</varname> can be used to override
+    <varname>autovacuum_freeze_max_age</varname> locally.
+    <command>VACUUM</command> will advance
+    <structfield>relfrozenxid</structfield> in the same way as it
+    would had <varname>autovacuum_freeze_max_age</varname> been set to
+    the same value, without any direct impact on autovacuum
+    scheduling.
+   </para>
+   <para>
+    Prior to <productname>PostgreSQL</productname> 16,
+    <command>VACUUM</command> did not apply a cost model to decide
+    when to advance <structfield>relfrozenxid</structfield>, which
+    made <varname>vacuum_freeze_table_age</varname> an important
+    tunable setting.  This is no longer the case.  The revised
+    <varname>vacuum_freeze_table_age</varname> default of
+    <literal>-1</literal> makes <command>VACUUM</command> use
+    <varname>autovacuum_freeze_max_age</varname> as an input to its
+    cost model, which should be adequate in most environments.
+   </para>
+   </tip>
+
    <para>
     If for some reason autovacuum fails to clear old XIDs from a table, the
     system will begin to emit warning messages like this when the database's
@@ -752,12 +750,6 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
      transaction ID, or a newer multixact ID.  For each table,
      <structname>pg_class</structname>.<structfield>relminmxid</structfield> stores the oldest
      possible multixact ID still appearing in any tuple of that table.
-     If this value is older than
-     <xref linkend="guc-vacuum-multixact-freeze-table-age"/>, an aggressive
-     vacuum is forced.  As discussed in the previous section, an aggressive
-     vacuum means that only those pages which are known to be all-frozen will
-     be skipped.  <function>mxid_age()</function> can be used on
-     <structname>pg_class</structname>.<structfield>relminmxid</structfield> to find its age.
     </para>
 
     <para>
@@ -876,10 +868,22 @@ vacuum insert threshold = vacuum base insert threshold + vacuum insert scale fac
     <command>DELETE</command> and <command>INSERT</command> operation.  (It is
     only semi-accurate because some information might be lost under heavy
     load.)  If the <structfield>relfrozenxid</structfield> value of the table
-    is more than <varname>vacuum_freeze_table_age</varname> transactions old,
-    an aggressive vacuum is performed to freeze old tuples and advance
-    <structfield>relfrozenxid</structfield>; otherwise, only pages that have been modified
-    since the last vacuum are scanned.
+    is more than <varname>autovacuum_freeze_max_age</varname> transactions old,
+    vacuum must freeze old tuples from existing all-visible pages to
+    be able to advance <structfield>relfrozenxid</structfield>;
+    otherwise, vacuum applies a cost model that advances
+    <structfield>relfrozenxid</structfield> whenever the added cost of
+    doing so during the ongoing operation is sufficiently low.
+    <varname>autovacuum_freeze_max_age</varname> is used to guide
+    <command>VACUUM</command> on how often
+    <structfield>relfrozenxid</structfield> must be advanced in the
+    worst case, which is often only weakly predictive of the actual
+    rate of advancement.  Much depends on workload characteristics.  A
+    cost model
+    dynamically determines whether or not to advance
+    <structfield>relfrozenxid</structfield> at the start of each
+    <command>VACUUM</command>.  The model finds the most opportune
+    time by weighing the added cost of advancement against the age
+    that <structfield>relfrozenxid</structfield> has already attained.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 545b23b54..6ba4385a0 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -158,11 +158,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
      <para>
       Normally, <command>VACUUM</command> will skip pages based on the <link
       linkend="vacuum-for-visibility-map">visibility map</link>.  Pages where
-      all tuples are known to be frozen can always be skipped, and those
-      where all tuples are known to be visible to all transactions may be
-      skipped except when performing an aggressive vacuum.  Furthermore,
-      except when performing an aggressive vacuum, some pages may be skipped
-      in order to avoid waiting for other sessions to finish using them.
+      all tuples are known to be frozen can always be skipped.  Pages
+      where all tuples are known to be visible to all transactions are
+      skipped whenever <command>VACUUM</command> determines that
+      advancing <structfield>relfrozenxid</structfield> and
+      <structfield>relminmxid</structfield> is unnecessary.
       This option disables all page-skipping behavior, and is intended to
       be used only when the contents of the visibility map are
       suspect, which should happen only if there is a hardware or software
diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index b6aef6f65..0e569d300 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,8 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -128,8 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 4252b0202..b2bed8ed8 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,8 +61,8 @@ CREATE TEMP TABLE reloptions_test(i INT NOT NULL, j text)
 	autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -72,8 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
--- Do an aggressive vacuum to prevent page-skipping.
-VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;
+-- Do a VACUUM FREEZE to prevent skipping any pruning.
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options
-- 
2.39.0

Attachment: v17-0003-Finish-removing-aggressive-mode-VACUUM.patch (application/x-patch)
From f6e0daeeae16af641e30e703969bfa6d67376f79 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 5 Sep 2022 17:46:34 -0700
Subject: [PATCH v17 3/3] Finish removing aggressive mode VACUUM.

The concept of aggressive/scan_all VACUUM dates back to the introduction
of the visibility map in Postgres 8.4.  Although pre-visibility-map
VACUUM was far less efficient than what came later (especially after
9.6's commit fd31cd26), its naive approach had one notable advantage:
users only had to think about a single kind of lazy vacuum (the only
kind that existed).

Break the final remaining dependency on aggressive mode: replace the
rules governing when VACUUM will wait for a cleanup lock with a new set
of rules more attuned to the needs of the table.  With that last
dependency gone, there is no need for aggressive mode, so get rid of it.
Users once again only have to think about one kind of lazy vacuum.

In general, all of the behaviors associated with aggressive mode prior
to Postgres 16 have been retained; they just get applied selectively, on
a more dynamic timeline.  For example, the aforementioned change to
VACUUM's cleanup lock behavior retains the general idea of sometimes
waiting for a cleanup lock to make sure that older XIDs get frozen, so
that relfrozenxid can be advanced by a sufficient amount.  All that
really changes is the information driving VACUUM's decision on waiting.

We use new, dedicated cutoffs, rather than applying the FreezeLimit and
MultiXactCutoff used when deciding whether we should trigger freezing on
the basis of XID/MXID age.  These minimum fallback cutoffs (which are
called MinXid and MinMulti) are typically far older than the standard
FreezeLimit/MultiXactCutoff cutoffs.  VACUUM doesn't need an aggressive
mode to decide on whether to wait for a cleanup lock anymore; it can
decide everything at the level of individual heap pages.

It is okay to aggressively punt on waiting for a cleanup lock like this
because VACUUM now directly understands the importance of never falling
too far behind on the work of freezing physical heap pages at the level
of the whole table, following recent work to add VM scanning strategies.
It is generally safer for VACUUM to press on with freezing other heap
pages from the table instead.  Even if relfrozenxid can only be advanced
by relatively few XIDs as a consequence, VACUUM should have more than
ample opportunity to catch up next time, since there is bound to be no
more than a small number of problematic unfrozen pages left behind.
VACUUM now tends to consistently advance relfrozenxid (at least by some
small amount) all the time in larger tables, so all that has to happen
for relfrozenxid to fully catch up is for a few remaining unfrozen pages
to get frozen.  Since relfrozenxid is now considered to be no more than
a lagging indicator of freezing, and since relfrozenxid isn't used to
trigger freezing in the way that it once was, time is on our side.

Also teach VACUUM to wait for a short while for cleanup locks when doing
so has a decent chance of preserving its ability to advance relfrozenxid
up to FreezeLimit (and/or to advance relminmxid up to MultiXactCutoff).
As a result, VACUUM typically manages to advance relfrozenxid by just as
much as it would have had it promised to advance it up to FreezeLimit
(i.e. had it made the traditional aggressive VACUUM guarantee), even
when vacuuming a table that happens to have relatively many cleanup lock
conflicts affecting pages with older XIDs/MXIDs.  VACUUM thereby avoids
missing out on advancing relfrozenxid up to the traditional target
amount when it really can be avoided fairly easily, without promising to
do so (VACUUM only promises to advance up to MinXid/MinMulti).

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkU42GzrsHhL2BiC1QMhaVGmVdb5HR0_qczz0Gu2aSn=A@mail.gmail.com
---
 src/include/commands/vacuum.h                 |   9 +-
 src/backend/access/heap/heapam.c              |   2 +
 src/backend/access/heap/vacuumlazy.c          | 230 +++++++++++-------
 src/backend/commands/vacuum.c                 |  42 +++-
 src/backend/utils/activity/pgstat_relation.c  |   4 +-
 doc/src/sgml/maintenance.sgml                 |  37 +--
 .../expected/vacuum-no-cleanup-lock.out       |  24 +-
 .../specs/vacuum-no-cleanup-lock.spec         |  33 +--
 8 files changed, 230 insertions(+), 151 deletions(-)

diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 18a56efbd..12f0704f9 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -281,6 +281,13 @@ struct VacuumCutoffs
 	TransactionId FreezeLimit;
 	MultiXactId MultiXactCutoff;
 
+	/*
+	 * Earliest permissible NewRelfrozenXid/NewRelminMxid values that can be
+	 * set in pg_class at the end of VACUUM.
+	 */
+	TransactionId MinXid;
+	MultiXactId MinMulti;
+
 	/*
 	 * Threshold that triggers VACUUM's eager freezing strategy
 	 */
@@ -357,7 +364,7 @@ extern void vac_update_relstats(Relation relation,
 								bool *frozenxid_updated,
 								bool *minmulti_updated,
 								bool in_outer_xact);
-extern bool vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
+extern void vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 							   struct VacuumCutoffs *cutoffs);
 extern bool vacuum_xid_failsafe_check(const struct VacuumCutoffs *cutoffs);
 extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 602befa1d..22a2e3028 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7056,6 +7056,8 @@ heap_freeze_tuple(HeapTupleHeader tuple,
 	cutoffs.OldestMxact = MultiXactCutoff;
 	cutoffs.FreezeLimit = FreezeLimit;
 	cutoffs.MultiXactCutoff = MultiXactCutoff;
+	cutoffs.MinXid = FreezeLimit;
+	cutoffs.MinMulti = MultiXactCutoff;
 	cutoffs.freeze_strategy_threshold_pages = 0;
 	cutoffs.tableagefrac = 0;
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b5a4094ba..9e5a7be23 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -157,8 +157,6 @@ typedef struct LVRelState
 	BufferAccessStrategy bstrategy;
 	ParallelVacuumState *pvs;
 
-	/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
-	bool		aggressive;
 	/* Eagerly freeze all tuples on pages about to be set all-visible? */
 	bool		eager_freeze_strategy;
 	/* Wraparound failsafe has been triggered? */
@@ -264,7 +262,8 @@ static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 							LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 							  BlockNumber blkno, Page page,
-							  bool *hastup, bool *recordfreespace);
+							  bool *hastup, bool *recordfreespace,
+							  Size *freespace);
 static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
@@ -462,7 +461,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * future we might want to teach lazy_scan_prune to recompute vistest from
 	 * time to time, to increase the number of dead tuples it can prune away.)
 	 */
-	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
+	vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
 	vacrel->vistest = GlobalVisTestFor(rel);
 	/* Initialize state used to track oldest extant XID/MXID */
@@ -542,17 +541,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	/*
 	 * Prepare to update rel's pg_class entry.
 	 *
-	 * Aggressive VACUUMs must always be able to advance relfrozenxid to a
-	 * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
-	 * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+	 * VACUUM can only advance relfrozenxid to a value >= MinXid, and
+	 * relminmxid to a value >= MinMulti.
 	 */
 	Assert(vacrel->NewRelfrozenXid == vacrel->cutoffs.OldestXmin ||
-		   TransactionIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.FreezeLimit :
-										 vacrel->cutoffs.relfrozenxid,
+		   TransactionIdPrecedesOrEquals(vacrel->cutoffs.MinXid,
 										 vacrel->NewRelfrozenXid));
 	Assert(vacrel->NewRelminMxid == vacrel->cutoffs.OldestMxact ||
-		   MultiXactIdPrecedesOrEquals(vacrel->aggressive ? vacrel->cutoffs.MultiXactCutoff :
-									   vacrel->cutoffs.relminmxid,
+		   MultiXactIdPrecedesOrEquals(vacrel->cutoffs.MinMulti,
 									   vacrel->NewRelminMxid));
 	if (vacrel->vmstrat == VMSNAP_SCAN_LAZY)
 	{
@@ -560,7 +556,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		 * Must keep original relfrozenxid/relminmxid when lazy_scan_strategy
 		 * decided to skip all-visible pages containing unfrozen XIDs/MXIDs
 		 */
-		Assert(!vacrel->aggressive);
 		vacrel->NewRelfrozenXid = InvalidTransactionId;
 		vacrel->NewRelminMxid = InvalidMultiXactId;
 	}
@@ -629,33 +624,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
 			memset(&walusage, 0, sizeof(WalUsage));
 			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
 			initStringInfo(&buf);
+
 			if (verbose)
-			{
-				Assert(!params->is_wraparound);
 				msgfmt = _("finished vacuuming \"%s.%s.%s\": index scans: %d\n");
-			}
 			else if (params->is_wraparound)
-			{
-				/*
-				 * While it's possible for a VACUUM to be both is_wraparound
-				 * and !aggressive, that's just a corner-case -- is_wraparound
-				 * implies aggressive.  Produce distinct output for the corner
-				 * case all the same, just in case.
-				 */
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			else
-			{
-				if (vacrel->aggressive)
-					msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
-				else
-					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
-			}
+				msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			appendStringInfo(&buf, msgfmt,
 							 vacrel->dbname,
 							 vacrel->relnamespace,
@@ -941,6 +917,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			bool		hastup,
 						recordfreespace;
+			Size		freespace;
 
 			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
@@ -954,10 +931,8 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Collect LP_DEAD items in dead_items array, count tuples */
 			if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
-								  &recordfreespace))
+								  &recordfreespace, &freespace))
 			{
-				Size		freespace = 0;
-
 				/*
 				 * Processed page successfully (without cleanup lock) -- just
 				 * need to perform rel truncation and FSM steps, much like the
@@ -966,21 +941,14 @@ lazy_scan_heap(LVRelState *vacrel)
 				 */
 				if (hastup)
 					vacrel->nonempty_pages = blkno + 1;
-				if (recordfreespace)
-					freespace = PageGetHeapFreeSpace(page);
-				UnlockReleaseBuffer(buf);
 				if (recordfreespace)
 					RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+
+				/* lock and pin released by lazy_scan_noprune */
 				continue;
 			}
 
-			/*
-			 * lazy_scan_noprune could not do all required processing.  Wait
-			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
-			 */
-			Assert(vacrel->aggressive);
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockBufferForCleanup(buf);
+			/* cleanup lock acquired by lazy_scan_noprune */
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
@@ -1403,8 +1371,6 @@ lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
 	if (force_scan_all)
 		vacrel->vmstrat = VMSNAP_SCAN_ALL;
 
-	Assert(!vacrel->aggressive || vacrel->vmstrat != VMSNAP_SCAN_LAZY);
-
 	/* Inform vmsnap infrastructure of our chosen strategy */
 	visibilitymap_snap_strategy(vacrel->vmsnap, vacrel->vmstrat);
 
@@ -1998,17 +1964,32 @@ retry:
  * lazy_scan_prune, which requires a full cleanup lock.  While pruning isn't
  * performed here, it's quite possible that an earlier opportunistic pruning
  * operation left LP_DEAD items behind.  We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items array for removal from indexes (assuming caller's page
+ * can be processed successfully here).
  *
- * For aggressive VACUUM callers, we may return false to indicate that a full
- * cleanup lock is required for processing by lazy_scan_prune.  This is only
- * necessary when the aggressive VACUUM needs to freeze some tuple XIDs from
- * one or more tuples on the page.  We always return true for non-aggressive
- * callers.
+ * We return true to indicate that processing succeeded, in which case we'll
+ * have dropped the lock and pin on buf/page.  Otherwise we return false,
+ * indicating that the page must be processed by lazy_scan_prune in the usual
+ * way after all; in that case we'll have acquired a cleanup lock on buf/page
+ * for caller before returning.
+ *
+ * We go to considerable trouble to get a cleanup lock on any page that has
+ * XIDs/MXIDs that need to be frozen in order for VACUUM to be able to set
+ * relfrozenxid/relminmxid to values >= FreezeLimit/MultiXactCutoff cutoffs.
+ * But we don't strictly guarantee it; we only guarantee that final values
+ * will be >= MinXid/MinMulti cutoffs in the worst case.
+ *
+ * We prefer to "under promise and over deliver" like this because a strong
+ * guarantee has the potential to make a bad situation even worse.  VACUUM
+ * should avoid waiting for a cleanup lock for an indefinitely long time until
+ * it has already exhausted every available alternative.  It's quite possible
+ * (and perhaps even likely) that the problem will go away on its own.  But
+ * even when it doesn't, our approach at least makes it likely that the first
+ * VACUUM that encounters the issue will catch up on whatever freezing may
+ * still be required for every other page in the target rel.
  *
  * See lazy_scan_prune for an explanation of hastup return flag.
  * recordfreespace flag instructs caller on whether or not it should do
- * generic FSM processing for page.
+ * generic FSM processing for page, using *freespace value set here.
  */
 static bool
 lazy_scan_noprune(LVRelState *vacrel,
@@ -2016,7 +1997,8 @@ lazy_scan_noprune(LVRelState *vacrel,
 				  BlockNumber blkno,
 				  Page page,
 				  bool *hastup,
-				  bool *recordfreespace)
+				  bool *recordfreespace,
+				  Size *freespace)
 {
 	OffsetNumber offnum,
 				maxoff;
@@ -2024,6 +2006,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 				live_tuples,
 				recently_dead_tuples,
 				missed_dead_tuples;
+	bool		should_freeze = false;
 	HeapTupleHeader tupleheader;
 	TransactionId NoFreezePageRelfrozenXid = vacrel->NewRelfrozenXid;
 	MultiXactId NoFreezePageRelminMxid = vacrel->NewRelminMxid;
@@ -2033,6 +2016,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 
 	*hastup = false;			/* for now */
 	*recordfreespace = false;	/* for now */
+	*freespace = PageGetHeapFreeSpace(page);
 
 	lpdead_items = 0;
 	live_tuples = 0;
@@ -2074,34 +2058,7 @@ lazy_scan_noprune(LVRelState *vacrel,
 		if (heap_tuple_should_freeze(tupleheader, &vacrel->cutoffs,
 									 &NoFreezePageRelfrozenXid,
 									 &NoFreezePageRelminMxid))
-		{
-			/* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
-			if (vacrel->aggressive)
-			{
-				/*
-				 * Aggressive VACUUMs must always be able to advance rel's
-				 * relfrozenxid to a value >= FreezeLimit (and be able to
-				 * advance rel's relminmxid to a value >= MultiXactCutoff).
-				 * The ongoing aggressive VACUUM won't be able to do that
-				 * unless it can freeze an XID (or MXID) from this tuple now.
-				 *
-				 * The only safe option is to have caller perform processing
-				 * of this page using lazy_scan_prune.  Caller might have to
-				 * wait a while for a cleanup lock, but it can't be helped.
-				 */
-				vacrel->offnum = InvalidOffsetNumber;
-				return false;
-			}
-
-			/*
-			 * Non-aggressive VACUUMs are under no obligation to advance
-			 * relfrozenxid (even by one XID).  We can be much laxer here.
-			 *
-			 * Currently we always just accept an older final relfrozenxid
-			 * and/or relminmxid value.  We never make caller wait or work a
-			 * little harder, even when it likely makes sense to do so.
-			 */
-		}
+			should_freeze = true;
 
 		ItemPointerSet(&(tuple.t_self), blkno, offnum);
 		tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
@@ -2150,10 +2107,107 @@ lazy_scan_noprune(LVRelState *vacrel,
 	vacrel->offnum = InvalidOffsetNumber;
 
 	/*
-	 * By here we know for sure that caller can put off freezing and pruning
-	 * this particular page until the next VACUUM.  Remember its details now.
-	 * (lazy_scan_prune expects a clean slate, so we have to do this last.)
+	 * Release lock (but not pin) on page now.  Then consider if we should
+	 * back out of accepting reduced processing for this page.
+	 *
+	 * Our caller's initial inability to get a cleanup lock will often turn
+	 * out to have been nothing more than a momentary blip, and it would be a
+	 * shame if relfrozenxid/relminmxid values < FreezeLimit/MultiXactCutoff
+	 * were used without good reason.  For example, the checkpointer might
+	 * have been writing out this page a moment ago, in which case its buffer
+	 * pin might have already been released by now.
+	 *
+	 * It's also possible that the conflicting buffer pin will continue to
+	 * block cleanup lock acquisition on the buffer for an extended period.
+	 * For example, it isn't uncommon for heap_lock_tuple to sleep while
+	 * holding a buffer pin, in which case a conflicting pin could easily be
+	 * held for much longer than VACUUM can reasonably be expected to wait.
+	 * There are also truly pathological cases to worry about.  For example,
+	 * the case where buggy application code holds open a cursor forever.
 	 */
+	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	if (should_freeze)
+	{
+		TransactionId RelfrozenXid;
+		MultiXactId RelminMxid;
+
+		/*
+		 * If page has tuple with a dangerously old XID/MXID (an XID < MinXid,
+		 * or an MXID < MinMulti), then we wait for however long it takes to
+		 * get a cleanup lock.
+		 *
+		 * Check for that first (get it out of the way).
+		 */
+		if (TransactionIdPrecedes(NoFreezePageRelfrozenXid,
+								  vacrel->cutoffs.MinXid) ||
+			MultiXactIdPrecedes(NoFreezePageRelminMxid,
+								vacrel->cutoffs.MinMulti))
+		{
+			/*
+			 * MinXid/MinMulti are considered to be only barely adequate final
+			 * values, so we only expect to end up here when previous VACUUMs
+			 * put off processing by lazy_scan_prune in the hope that it would
+			 * never come to this.  That hasn't worked out, so we must wait.
+			 */
+			LockBufferForCleanup(buf);
+			return false;
+		}
+
+		/*
+		 * Page has tuple with XID < FreezeLimit, or MXID < MultiXactCutoff,
+		 * but they're not so old that we're _strictly_ obligated to freeze.
+		 *
+		 * We are willing to go to the trouble of waiting for a cleanup lock
+		 * for a short while for such a page -- just not indefinitely long.
+		 * This avoids squandering opportunities to advance relfrozenxid or
+		 * relminmxid by the target amount during any one VACUUM, which is
+		 * particularly important with larger tables that only get vacuumed
+		 * when autovacuum.c is concerned about table age.  It would not be
+		 * okay if the number of autovacuums such a table ended up requiring
+		 * noticeably exceeded the expected autovacuum_freeze_max_age cadence.
+		 */
+		RelfrozenXid = NoFreezePageRelfrozenXid;
+		if (TransactionIdPrecedes(vacrel->cutoffs.FreezeLimit, RelfrozenXid))
+			RelfrozenXid = vacrel->cutoffs.FreezeLimit;
+		RelminMxid = NoFreezePageRelminMxid;
+		if (MultiXactIdPrecedes(vacrel->cutoffs.MultiXactCutoff, RelminMxid))
+			RelminMxid = vacrel->cutoffs.MultiXactCutoff;
+
+		/*
+		 * We are willing to wait and try again a total of 3 times.  If that
+		 * doesn't work then we just give up.  We only wait here when it is
+		 * actually expected to preserve current NewRelfrozenXid/NewRelminMxid
+		 * tracker values, and when trackers will actually be used to update
+		 * pg_class later on.  This also tends to limit the impact of waiting
+		 * for VACUUMs that experience relatively many cleanup lock conflicts.
+		 */
+		if (vacrel->vmstrat != VMSNAP_SCAN_LAZY &&
+			(TransactionIdPrecedes(RelfrozenXid, vacrel->NewRelfrozenXid) ||
+			 MultiXactIdPrecedes(RelminMxid, vacrel->NewRelminMxid)))
+		{
+			/* wait 10ms, then 20ms, then 30ms, then give up */
+			for (int i = 1; i <= 3; i++)
+			{
+				CHECK_FOR_INTERRUPTS();
+
+				pg_usleep(1000L * 10L * i);
+				if (ConditionalLockBufferForCleanup(buf))
+				{
+					/* Go process page in lazy_scan_prune after all */
+					return false;
+				}
+			}
+
+			/* Give up, accepting reduced processing for this page */
+		}
+	}
+
+	/*
+	 * By here we know for sure that caller will put off freezing and pruning
+	 * this particular page until the next VACUUM.  Remember its details now.
+	 * Also drop the buffer pin that we held onto during cleanup lock steps.
+	 */
+	ReleaseBuffer(buf);
 	vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
 	vacrel->NewRelminMxid = NoFreezePageRelminMxid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 4d1e13d51..6f3bfa1e1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -951,13 +951,8 @@ get_all_vacuum_rels(int options)
  * The target relation and VACUUM parameters are our inputs.
  *
  * Output parameters are the cutoffs that VACUUM caller should use.
- *
- * Return value indicates if vacuumlazy.c caller should make its VACUUM
- * operation aggressive.  An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit (at a minimum), and relminmxid up to MultiXactCutoff (at a
- * minimum).
  */
-bool
+void
 vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 				   struct VacuumCutoffs *cutoffs)
 {
@@ -1138,6 +1133,39 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 		multixact_freeze_table_age = effective_multixact_freeze_max_age;
 	Assert(multixact_freeze_table_age >= 0);
 
+	/*
+	 * Determine the cutoffs used by VACUUM to decide on whether to wait for a
+	 * cleanup lock on a page (that it can't cleanup lock right away).  These
+	 * are the MinXid and MinMulti cutoffs.  They are related to the cutoffs
+	 * for freezing (FreezeLimit and MultiXactCutoff), and are only applied on
+	 * pages that we cannot freeze right away.  See vacuumlazy.c for details.
+	 *
+	 * VACUUM can ratchet back NewRelfrozenXid and/or NewRelminMxid instead of
+	 * waiting indefinitely for a cleanup lock in almost all cases.  The high
+	 * level goal is to create as many opportunities as possible to freeze
+	 * (across many successive VACUUM operations), while avoiding waiting for
+	 * a cleanup lock whenever possible.  Any time spent waiting is time spent
+	 * not freezing other eligible pages, which is typically a bad trade-off.
+	 *
+	 * As a consequence of all this, MinXid and MinMulti also act as limits on
+	 * the oldest acceptable values that can ever be set in pg_class by VACUUM
+	 * (though this is only relevant when they have already attained XID/MXID
+	 * ages that approach freeze_table_age and/or multixact_freeze_table_age).
+	 */
+	cutoffs->MinXid = nextXID - (freeze_table_age * 0.95);
+	if (!TransactionIdIsNormal(cutoffs->MinXid))
+		cutoffs->MinXid = FirstNormalTransactionId;
+	/* MinXid must always be <= FreezeLimit */
+	if (TransactionIdPrecedes(cutoffs->FreezeLimit, cutoffs->MinXid))
+		cutoffs->MinXid = cutoffs->FreezeLimit;
+
+	cutoffs->MinMulti = nextMXID - (multixact_freeze_table_age * 0.95);
+	if (cutoffs->MinMulti < FirstMultiXactId)
+		cutoffs->MinMulti = FirstMultiXactId;
+	/* MinMulti must always be <= MultiXactCutoff */
+	if (MultiXactIdPrecedes(cutoffs->MultiXactCutoff, cutoffs->MinMulti))
+		cutoffs->MinMulti = cutoffs->MultiXactCutoff;
+
 	/*
 	 * Finally, set tableagefrac for VACUUM.  This can come from either XID or
 	 * XMID table age (whichever is greater currently).
@@ -1155,8 +1183,6 @@ vacuum_get_cutoffs(Relation rel, const VacuumParams *params,
 	 */
 	if (params->is_wraparound)
 		cutoffs->tableagefrac = 1.0;
-
-	return (cutoffs->tableagefrac >= 1.0);
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 2e20b93c2..056ef0178 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -235,8 +235,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
 	tabentry->dead_tuples = deadtuples;
 
 	/*
-	 * It is quite possible that a non-aggressive VACUUM ended up skipping
-	 * various pages, however, we'll zero the insert counter here regardless.
+	 * It is quite possible that VACUUM will skip all-visible pages for a
+	 * smaller table, however, we'll zero the insert counter here regardless.
 	 * It's currently used only to track when we need to perform an "insert"
 	 * autovacuum, which are mainly intended to freeze newly inserted tuples.
 	 * Zeroing this may just mean we'll not try to vacuum the table again
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 63bd10f2b..7899c27ea 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -525,19 +525,13 @@
     will skip pages that don't have any dead row versions even if those pages
     might still have row versions with old XID values.  Therefore, normal
     <command>VACUUM</command>s won't always freeze every old row version in the table.
-    When that happens, <command>VACUUM</command> will eventually need to perform an
-    <firstterm>aggressive vacuum</firstterm>, which will freeze all eligible unfrozen
-    XID and MXID values, including those from all-visible but not all-frozen pages.
-    In practice most tables require periodic aggressive vacuuming.
    </para>
 
    <para>
     The maximum time that a table can go unvacuumed is two billion
     transactions minus the <varname>vacuum_freeze_min_age</varname> value at
-    the time of the last aggressive vacuum. If it were to go
-    unvacuumed for longer than
-    that, data loss could result.  To ensure that this does not happen,
-    autovacuum is invoked on any table that might contain unfrozen rows with
+    the time of the last vacuum that advanced <structfield>relfrozenxid</structfield>.
+    Autovacuum is invoked on any table that might contain unfrozen rows with
     XIDs older than the age specified by the configuration parameter <xref
     linkend="guc-autovacuum-freeze-max-age"/>.  (This will happen even if
     autovacuum is disabled.)
@@ -595,8 +589,7 @@
     the <structfield>relfrozenxid</structfield> column of a table's
     <structname>pg_class</structname> row contains the oldest remaining unfrozen
     XID at the end of the most recent <command>VACUUM</command> that successfully
-    advanced <structfield>relfrozenxid</structfield> (typically the most recent
-    aggressive VACUUM).  Similarly, the
+    advanced <structfield>relfrozenxid</structfield>.  Similarly, the
     <structfield>datfrozenxid</structfield> column of a database's
     <structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
     appearing in that database &mdash; it is just the minimum of the
@@ -753,22 +746,14 @@ HINT:  Stop the postmaster and vacuum that database in single-user mode.
     </para>
 
     <para>
-     Aggressive <command>VACUUM</command>s, regardless of what causes
-     them, are <emphasis>guaranteed</emphasis> to be able to advance
-     the table's <structfield>relminmxid</structfield>.
-     Eventually, as all tables in all databases are scanned and their
-     oldest multixact values are advanced, on-disk storage for older
-     multixacts can be removed.
-    </para>
-
-    <para>
-     As a safety device, an aggressive vacuum scan will
-     occur for any table whose multixact-age is greater than <xref
-     linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Also, if the
-     storage occupied by multixacts members exceeds 2GB, aggressive vacuum
-     scans will occur more often for all tables, starting with those that
-     have the oldest multixact-age.  Both of these kinds of aggressive
-     scans will occur even if autovacuum is nominally disabled.
+     As a safety device, a vacuum to advance
+     <structfield>relminmxid</structfield> will occur for any table
+     whose multixact-age is greater than <xref
+      linkend="guc-autovacuum-multixact-freeze-max-age"/>.
+     Also, if the storage occupied by multixacts members exceeds 2GB,
+     vacuum scans will occur more often for all tables, starting with those that
+     have the oldest multixact-age.  This will occur even if
+     autovacuum is nominally disabled.
     </para>
    </sect3>
   </sect2>
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
index f7bc93e8f..076fe07ab 100644
--- a/src/test/isolation/expected/vacuum-no-cleanup-lock.out
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -1,6 +1,6 @@
 Parsed test spec with 4 sessions
 
-starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -12,7 +12,7 @@ relpages|reltuples
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -24,7 +24,7 @@ relpages|reltuples
 (1 row)
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -46,7 +46,7 @@ dummy
     1
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -61,7 +61,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -89,7 +89,7 @@ step dml_delete:
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -104,7 +104,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_vacuum_noprune vacuumer_pg_class_stats pinholder_commit
 step vacuumer_pg_class_stats: 
   SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
 
@@ -132,7 +132,7 @@ dummy
 step dml_insert: 
   INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step vacuumer_pg_class_stats: 
@@ -147,7 +147,7 @@ step pinholder_commit:
   COMMIT;
 
 
-starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_vacuum_noprune pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_vacuum_noprune pinholder_commit vacuumer_vacuum_noprune
 step dml_begin: BEGIN;
 step dml_other_begin: BEGIN;
 step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
@@ -162,7 +162,7 @@ id
  3
 (1 row)
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_cursor: 
@@ -178,12 +178,12 @@ dummy
 step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
 step dml_commit: COMMIT;
 step dml_other_commit: COMMIT;
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
 step pinholder_commit: 
   COMMIT;
 
-step vacuumer_nonaggressive_vacuum: 
+step vacuumer_vacuum_noprune: 
   VACUUM smalltbl;
 
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
index 05fd280f6..f9e4194cd 100644
--- a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -55,15 +55,21 @@ step dml_other_key_share  { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
 step dml_other_update     { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
 step dml_other_commit     { COMMIT; }
 
-# This session runs non-aggressive VACUUM, but with maximally aggressive
-# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+# This session runs VACUUM with maximally aggressive cutoffs for tuple
+# freezing (e.g., FreezeLimit == OldestXmin), while still using default
+# settings for vacuum_freeze_table_age/autovacuum_freeze_max_age.
+#
+# This makes VACUUM freeze tuples just as aggressively as it would if the
+# VACUUM command's FREEZE option was specified with almost all heap pages.
+# However, VACUUM is still unwilling to wait indefinitely for a cleanup lock,
+# just to freeze a few XIDs/MXIDs that still aren't very old.
 session vacuumer
 setup
 {
   SET vacuum_freeze_min_age = 0;
   SET vacuum_multixact_freeze_min_age = 0;
 }
-step vacuumer_nonaggressive_vacuum
+step vacuumer_vacuum_noprune
 {
   VACUUM smalltbl;
 }
@@ -75,15 +81,14 @@ step vacuumer_pg_class_stats
 # Test VACUUM's reltuples counting mechanism.
 #
 # Final pg_class.reltuples should never be affected by VACUUM's inability to
-# get a cleanup lock on any page, except to the extent that any cleanup lock
-# contention changes the number of tuples that remain ("missed dead" tuples
-# are counted in reltuples, much like "recently dead" tuples).
+# get a cleanup lock on any page.  Note that "missed dead" tuples are counted
+# in reltuples, much like "recently dead" tuples.
 
 # Easy case:
 permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
 
 # Harder case -- count 21 tuples at the end (like last time), but with cleanup
@@ -92,7 +97,7 @@ permutation
     vacuumer_pg_class_stats  # Start with 20 tuples
     dml_insert
     pinholder_cursor
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     vacuumer_pg_class_stats  # End with 21 tuples
     pinholder_commit  # order doesn't matter
 
@@ -103,7 +108,7 @@ permutation
     dml_insert
     dml_delete
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "recently dead" tuple won't be included in
     # count here:
     vacuumer_pg_class_stats
@@ -116,7 +121,7 @@ permutation
     dml_delete
     pinholder_cursor
     dml_insert
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
     # concurrent activity held back VACUUM's OldestXmin) won't be included in
     # count here:
@@ -128,7 +133,7 @@ permutation
 # This provides test coverage for code paths that are only hit when we need to
 # freeze, but inability to acquire a cleanup lock on a heap page makes
 # freezing some XIDs/MXIDs < FreezeLimit/MultiXactCutoff impossible (without
-# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+# waiting for a cleanup lock, which won't ever happen here).
 permutation
     dml_begin
     dml_other_begin
@@ -136,15 +141,15 @@ permutation
     dml_other_key_share
     # Will get cleanup lock, can't advance relminmxid yet:
     # (though will usually advance relfrozenxid by ~2 XIDs)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_cursor
     dml_other_update
     dml_commit
     dml_other_commit
     # Can't cleanup lock, so still can't advance relminmxid here:
     # (relfrozenxid held back by XIDs in MultiXact too)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
     pinholder_commit
     # Pin was dropped, so will advance relminmxid, at long last:
     # (ditto for relfrozenxid advancement)
-    vacuumer_nonaggressive_vacuum
+    vacuumer_vacuum_noprune
-- 
2.39.0

#89Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#88)
Re: New strategies for freezing, advancing relfrozenxid early

On Tue, 24 Jan 2023 at 23:50, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

0001 (the freezing strategies patch) is now committable IMV. Or at
least will be once I polish the docs a bit more. I plan on committing
0001 some time next week, barring any objections.

I plan on committing 0001 (the freezing strategies commit) tomorrow
morning, US Pacific time.

Attached is v17. There are no significant differences compared to v16.
I decided to post a new version now, ahead of commit, to show how I've
cleaned up the docs in 0001 -- docs describing the new GUC, freeze
strategies, and so on.

LGTM, +1 on 0001

Some more comments on 0002:

+lazy_scan_strategy(LVRelState *vacrel, bool force_scan_all)
scanned_pages_lazy & scanned_pages_eager

We have not yet scanned the pages, so I suggest plan/scan_pages_eager
and *_lazy as variable names instead, to minimize confusion about the
naming.

I'll await the next iteration of 0002 in which you've completed more
TODOs before I'll dig deeper into that patch.

Kind regards,

Matthias van de Meent

#90Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#88)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote:

On Mon, Jan 16, 2023 at 5:55 PM Peter Geoghegan <pg@bowt.ie> wrote:

0001 (the freezing strategies patch) is now committable IMV. Or at
least will be once I polish the docs a bit more. I plan on committing
0001 some time next week, barring any objections.

I plan on committing 0001 (the freezing strategies commit) tomorrow
morning, US Pacific time.

I unfortunately haven't been able to keep up with the thread and saw this just
now. But I've expressed the concern below several times before, so it
shouldn't come as a surprise.

I think, as committed, this will cause serious issues for some reasonably
common workloads, due to substantially increased WAL traffic.

The most common problematic scenario I see are tables full of rows with
limited lifetime. E.g. because rows get aggregated up after a while. Before
those rows practically never got frozen - but now we'll freeze them all the
time.

I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while loop
deletes older rows.

Workload fits in s_b:

Autovacuum on average generates between 1.5x-7x as much WAL as before,
depending on how things interact with checkpoints. And not just that, each
autovac cycle also takes substantially longer than before - the average time
for an autovacuum roughly doubled. Which of course increases the amount of
bloat.

When workload doesn't fit in s_b:

Time for vacuuming goes up to ~5x. WAL volume to ~9x. Autovacuum can't keep up
with bloat, every vacuum takes longer than the prior one:
65s->78s->139s->176s
And that's with autovac cost limits removed! Relation size nearly doubles due
to bloat.

After I disabled the new strategy autovac started to catch up again:
124s->101s->103->46s->20s->28s->24s

This is significantly worse than I predicted. This was my first attempt at
coming up with a problematic workload. There'll likely be way worse in
production.

I think as-is this logic will cause massive issues.

Andres

#91Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#88)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-24 14:49:38 -0800, Peter Geoghegan wrote:

From e41d3f45fcd6f639b768c22139006ad11422575f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Nov 2022 18:20:36 -0800
Subject: [PATCH v17 1/3] Add eager and lazy freezing strategies to VACUUM.

Eager freezing strategy avoids large build-ups of all-visible pages. It
makes VACUUM trigger page-level freezing whenever doing so will enable
the page to become all-frozen in the visibility map. This is useful for
tables that experience continual growth, particularly strict append-only
tables such as pgbench's history table. Eager freezing significantly
improves performance stability by spreading out the cost of freezing
over time, rather than doing most freezing during aggressive VACUUMs.
It complements the insert autovacuum mechanism added by commit b07642db.

However, it significantly increases the overall work when rows have a somewhat
limited lifetime. That is the documented reason why vacuum_freeze_min_age exists -
although I think it doesn't really achieve its documented goal anymore, after
the recent page-level freezing changes.

VACUUM determines its freezing strategy based on the value of the new
vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
tables that exceed the size threshold use the eager freezing strategy.

I think that's not a sufficient guard at all. The size of a table doesn't say
much about how a table is used.

Unlogged tables and temp tables will always use eager freezing strategy,
since there is essentially no downside.

I somewhat doubt that that is true, but certainly the cost is lower.

Eager freezing is strictly more aggressive than lazy freezing. Settings
like vacuum_freeze_min_age still get applied in just the same way in
every VACUUM, independent of the strategy in use. The only mechanical
difference between eager and lazy freezing strategies is that only the
former applies its own additional criteria to trigger freezing pages.

That's only true because vacuum_freeze_min_age has been fairly radically
redefined recently.

Greetings,

Andres Freund

In reply to: Andres Freund (#90)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:

I unfortunately haven't been able to keep up with the thread and saw this just
now. But I've expressed the concern below several times before, so it
shouldn't come as a surprise.

You missed the announcement 9 days ago, and the similar clear
signalling of a commit from yesterday. I guess I'll need to start
personally reaching out to you any time I commit anything in this area
in the future. I almost considered doing that here, in fact.

The most common problematic scenario I see are tables full of rows with
limited lifetime. E.g. because rows get aggregated up after a while. Before
those rows practically never got frozen - but now we'll freeze them all the
time.

Fundamentally, the choice to freeze or not freeze is driven by
speculation about the needs of the table, with some guidance from the
user. That isn't new. It seems to me that it will always be possible
for you to come up with an adversarial case that makes any given
approach look bad, no matter how good it is. Of course that doesn't
mean that this particular complaint has no validity; but it does mean
that you need to be willing to draw the line somewhere.

In particular, it would be very useful to know what the parameters of
the discussion are. Obviously I cannot come up with an algorithm that
can literally predict the future. But I may be able to handle specific
cases of concern better, or to better help users cope in whatever way.

I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while loop
deletes older rows.

Can you post the script? And what setting did you use?

Workload fits in s_b:

Autovacuum on average generates between 1.5x-7x as much WAL as before,
depending on how things interact with checkpoints. And not just that, each
autovac cycle also takes substantially longer than before - the average time
for an autovacuum roughly doubled. Which of course increases the amount of
bloat.

Anything that causes an autovacuum to take longer will effectively
make autovacuum think that it has removed more bloat than it really
has, which will then make autovacuum less aggressive when it really
should be more aggressive. That's a preexisting issue, that needs to
be accounted for in the context of this discussion.

This is significantly worse than I predicted. This was my first attempt at
coming up with a problematic workload. There'll likely be way worse in
production.

As I said in the commit message, the current default for
vacuum_freeze_strategy_threshold is considered low, and was always
intended to be provisional. Something that I explicitly noted would be
reviewed after the beta period is over, once we gained more experience
with the setting.

I think that a far higher setting could be almost as effective. 32GB,
or even 64GB could work quite well, since you'll still have the FPI
optimization.
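
For illustration only, a sketch of what that might look like (the GUC name
comes from the patch under discussion; the exact accepted units depend on
how the GUC is defined, so treat this as an assumption rather than a
recommendation):

ALTER SYSTEM SET vacuum_freeze_strategy_threshold = '32GB';
SELECT pg_reload_conf();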

--
Peter Geoghegan

#93Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#90)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-25 16:43:47 -0800, Andres Freund wrote:

I think, as committed, this will cause serious issues for some reasonably
common workloads, due to substantially increased WAL traffic.

The most common problematic scenario I see are tables full of rows with
limited lifetime. E.g. because rows get aggregated up after a while. Before
those rows practically never got frozen - but now we'll freeze them all the
time.

Another bad scenario: Some longrunning / hung transaction caused us to get
close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
we wouldn't have frozen the portion of the table that was actively changing,
now we will. Consequence: We get closer to the "no write" limit / the outage
lasts longer.

I don't see an alternative to reverting this for now.

Greetings,

Andres Freund

In reply to: Andres Freund (#91)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 5:15 PM Andres Freund <andres@anarazel.de> wrote:

However, it significantly increases the overall work when rows have a somewhat
limited lifetime. That is the documented reason why vacuum_freeze_min_age exists -
although I think it doesn't really achieve its documented goal anymore, after
the recent page-level freezing changes.

Huh? vacuum_freeze_min_age hasn't done that, at all. At least not
since the visibility map went in back in 8.4:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Today.2C_on_Postgres_HEAD_2

That's why we literally do ~100% of all freezing in aggressive mode
VACUUM with append-only or append-mostly tables.

VACUUM determines its freezing strategy based on the value of the new
vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
tables that exceed the size threshold use the eager freezing strategy.

I think that's not a sufficient guard at all. The size of a table doesn't say
much about how a table is used.

Sufficient for what purpose?

Eager freezing is strictly more aggressive than lazy freezing. Settings
like vacuum_freeze_min_age still get applied in just the same way in
every VACUUM, independent of the strategy in use. The only mechanical
difference between eager and lazy freezing strategies is that only the
former applies its own additional criteria to trigger freezing pages.

That's only true because vacuum_freeze_min_age has been fairly radically
redefined recently.

So? This part of the commit message is a simple statement of fact.

--
Peter Geoghegan

In reply to: Andres Freund (#93)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 5:26 PM Andres Freund <andres@anarazel.de> wrote:

Another bad scenario: Some longrunning / hung transaction caused us to get
close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
we wouldn't have frozen the portion of the table that was actively changing,
now we will. Consequence: We get closer to the "no write" limit / the outage
lasts longer.

Obviously it isn't difficult to just invent a new rule that gets
applied by lazy_scan_strategy. For example, it would take me less than
5 minutes to write a patch that disables eager freezing when the
failsafe is in effect.

I don't see an alternative to reverting this for now.

I want to see your test case before acting.

--
Peter Geoghegan

#96Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#92)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-25 17:22:32 -0800, Peter Geoghegan wrote:

On Wed, Jan 25, 2023 at 4:43 PM Andres Freund <andres@anarazel.de> wrote:

I unfortunately haven't been able to keep up with the thread and saw this just
now. But I've expressed the concern below several times before, so it
shouldn't come as a surprise.

You missed the announcement 9 days ago, and the similar clear
signalling of a commit from yesterday. I guess I'll need to start
personally reaching out to you any time I commit anything in this area
in the future. I almost considered doing that here, in fact.

There's just too much email on -hackers to keep up with, if I ever want to do
any development of my own. I raised this concern before though, so it's not
like it's a surprise.

The most common problematic scenario I see are tables full of rows with
limited lifetime. E.g. because rows get aggregated up after a while. Before
those rows practically never got frozen - but now we'll freeze them all the
time.

Fundamentally, the choice to freeze or not freeze is driven by
speculation about the needs of the table, with some guidance from the
user. That isn't new. It seems to me that it will always be possible
for you to come up with an adversarial case that makes any given
approach look bad, no matter how good it is. Of course that doesn't
mean that this particular complaint has no validity; but it does mean
that you need to be willing to draw the line somewhere.

Sure. But significantly regressing plausible if not common workloads is
different than knowing that there'll be some edge case where we'll do
something worse.

I whipped up a quick test: 15 pgbench threads insert rows, 1 psql \while loop
deletes older rows.

Can you post the script? And what setting did you use?

prep:
CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT NULL, filla int8 NOT NULL, fillb int8 not null, fillc int8 not null);
CREATE INDEX ON pgbench_time_data(ts);
ALTER SYSTEM SET autovacuum_naptime = '10s';
ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed

pgbench script, with 15 clients:
INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES (:client_id, now(), 0, 0, 0);

psql session deleting old data:
EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - '120s'::interval \watch 1

Realistically the time should be longer, but I didn't want to wait that long
for the deletions to actually start.

I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
impatient me.

I switched between vacuum_freeze_strategy_threshold=0 and
vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
warmup to set up something with smaller tables.

shared_buffers=32GB for fits in s_b, 1GB otherwise.

max_wal_size=150GB, log_autovacuum_min_duration=0, and a bunch of logging
settings.

Workload fits in s_b:

Autovacuum on average generates between 1.5x-7x as much WAL as before,
depending on how things interact with checkpoints. And not just that, each
autovac cycle also takes substantially longer than before - the average time
for an autovacuum roughly doubled. Which of course increases the amount of
bloat.

Anything that causes an autovacuum to take longer will effectively
make autovacuum think that it has removed more bloat than it really
has, which will then make autovacuum less aggressive when it really
should be more aggressive. That's a preexisting issue, that needs to
be accounted for in the context of this discussion.

That's not the problem here - on my system autovac starts again very
quickly. The problem is that we accumulate bloat while autovacuum is
running. Wasting time/WAL volume on freezing pages that don't need to be
frozen is an issue.

In particular, it would be very useful to know what the parameters of
the discussion are. Obviously I cannot come up with an algorithm that
can literally predict the future. But I may be able to handle specific
cases of concern better, or to better help users cope in whatever way.

This is significantly worse than I predicted. This was my first attempt at
coming up with a problematic workload. There'll likely be way worse in
production.

As I said in the commit message, the current default for
vacuum_freeze_strategy_threshold is considered low, and was always
intended to be provisional. Something that I explicitly noted would be
reviewed after the beta period is over, once we gained more experience
with the setting.

I think that a far higher setting could be almost as effective. 32GB,
or even 64GB could work quite well, since you'll still have the FPI
optimization.

The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
Table size simply isn't a usable proxy for whether eager freezing is a good
idea or not.

You can have a 1TB table full of transient data, or you can have a 1TB table
where part of the data is transient and only settles after a time. In neither
case eager freezing is ok.

Or you can have an append-only table. In which case eager freezing is great.

Greetings,

Andres Freund

#97Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#95)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-25 17:37:17 -0800, Peter Geoghegan wrote:

On Wed, Jan 25, 2023 at 5:26 PM Andres Freund <andres@anarazel.de> wrote:

Another bad scenario: Some longrunning / hung transaction caused us to get
close to the xid wraparound. Problem was resolved, autovacuum runs. Previously
we wouldn't have frozen the portion of the table that was actively changing,
now we will. Consequence: We get closer to the "no write" limit / the outage
lasts longer.

Obviously it isn't difficult to just invent a new rule that gets
applied by lazy_scan_strategy. For example, it would take me less than
5 minutes to write a patch that disables eager freezing when the
failsafe is in effect.

Sure. I'm not saying that these issues cannot be addressed. Of course no patch
of a meaningful size is perfect and we all can't predict the future. But this
is a very significant behavioural change to vacuum, and there are pretty
simple scenarios in which it causes significant regressions. And at least some
of the issues have been pointed out before.

Greetings,

Andres Freund

In reply to: Andres Freund (#96)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:

Sure. But significantly regressing plausible if not common workloads is
different than knowing that there'll be some edge case where we'll do
something worse.

That's very vague. Significant to whom, for what purpose?

prep:
CREATE TABLE pgbench_time_data(client_id int8 NOT NULL, ts timestamptz NOT NULL, filla int8 NOT NULL, fillb int8 not null, fillc int8 not null);
CREATE INDEX ON pgbench_time_data(ts);
ALTER SYSTEM SET autovacuum_naptime = '10s';
ALTER SYSTEM SET autovacuum_vacuum_cost_delay TO -1;
ALTER SYSTEM SET synchronous_commit = off; -- otherwise more clients are needed

pgbench script, with 15 clients:
INSERT INTO pgbench_time_data(client_id, ts, filla, fillb, fillc) VALUES (:client_id, now(), 0, 0, 0);

psql session deleting old data:
EXPLAIN ANALYZE DELETE FROM pgbench_time_data WHERE ts < now() - '120s'::interval \watch 1

Realistically the time should be longer, but I didn't want to wait that long
for the deletions to actually start.

I'll review this tomorrow.

I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
impatient me.

You said "Autovacuum on average generates between 1.5x-7x as much WAL
as before". Why stop there, though? There's a *big* multiplicative
effect in play here from FPIs, obviously, so the sky's the limit. Why
not set checkpoint_timeout to 30s?

I switched between vacuum_freeze_strategy_threshold=0 and
vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
warmup to set up something with smaller tables.

This makes no sense to me, at all.

The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
Table size simply isn't a usable proxy for whether eager freezing is a good
idea or not.

It's not supposed to be - you have it backwards. It's intended to work
as a proxy for whether lazy freezing is a bad idea, particularly in
the worst case.

There is also an effect that likely would have been protective with
your test case had you used a larger table with the same test case
(and had you not lowered vacuum_freeze_strategy_threshold from its
already low default). In general there'd be a much better chance of
concurrent reuse of space by new inserts discouraging page-level
freezing, since VACUUM would take much longer relative to everything
else, as compared to a small table.

You can have a 1TB table full of transient data, or you can have a 1TB table
where part of the data is transient and only settles after a time. In neither
case eager freezing is ok.

It sounds like you're not willing to accept any kind of trade-off.
How, in general, can we detect what kind of 1TB table it will be, in
the absence of user input? And in the absence of user input, why would
we prefer to default to a behavior that is highly destabilizing when
we get it wrong?

--
Peter Geoghegan

#99Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#94)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-25 17:28:48 -0800, Peter Geoghegan wrote:

On Wed, Jan 25, 2023 at 5:15 PM Andres Freund <andres@anarazel.de> wrote:

However, it significantly increases the overall work when rows have a somewhat
limited lifetime. That is the documented reason why vacuum_freeze_min_age exists -
although I think it doesn't really achieve its documented goal anymore, after
the recent page-level freezing changes.

Huh? vacuum_freeze_min_age hasn't done that, at all. At least not
since the visibility map went in back in 8.4:

My point was the other way round. That vacuum_freeze_min_age *prevented* us
from freezing rows "too soon" - obviously a very blunt instrument.

Since page level freezing, it only partially does that, because we'll freeze
even newer rows, if pruning triggered an FPI (I don't think that's quite the
right check, but that's a separate discussion).

As far as I can tell, with the eager strategy, the only thing
vacuum_freeze_min_age really influences is whether we'll block waiting for a
cleanup lock. IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
now a slightly less-blocking version of VACUUM FREEZE.

The paragraph I was referencing:
<para>
One disadvantage of decreasing <varname>vacuum_freeze_min_age</varname> is that
it might cause <command>VACUUM</command> to do useless work: freezing a row
version is a waste of time if the row is modified
soon thereafter (causing it to acquire a new XID). So the setting should
be large enough that rows are not frozen until they are unlikely to change
any more.
</para>

But now vacuum_freeze_min_age doesn't reliably influence whether we'll freeze
rows anymore.

Am I missing something here?

VACUUM determines its freezing strategy based on the value of the new
vacuum_freeze_strategy_threshold GUC (or reloption) with logged tables;
tables that exceed the size threshold use the eager freezing strategy.

I think that's not a sufficient guard at all. The size of a table doesn't say
much about how a table is used.

Sufficient for what purpose?

To not regress a substantial portion of our userbase.

Greetings,

Andres Freund

In reply to: Andres Freund (#99)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 6:33 PM Andres Freund <andres@anarazel.de> wrote:

My point was the other way round. That vacuum_freeze_min_age *prevented* us
from freezing rows "too soon" - obviously a very blunt instrument.

Yes, not freezing at all until aggressive vacuum is definitely good
when you don't really need to freeze at all.

Since page level freezing, it only partially does that, because we'll freeze
even newer rows, if pruning triggered an FPI (I don't think that's quite the
right check, but that's a separate discussion).

But the added cost is very low, and it might well make all the difference.

As far as I can tell, with the eager strategy, the only thing
vacuum_freeze_min_age really influences is whether we'll block waiting for a
cleanup lock. IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
now a slightly less-blocking version of VACUUM FREEZE.

That's simply not true, at all. I'm very surprised that you think
that. The commit message very clearly addresses this. You know, the
part that you specifically quoted to complain about today!

Once again I'll refer you to my Wiki page on this:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2

The difference between this and VACUUM FREEZE is described here:

"Note how we freeze most pages, but still leave a significant number
unfrozen each time, despite using an eager approach to freezing
(2981204 scanned - 2355230 frozen = 625974 pages scanned but left
unfrozen). Again, this is because we don't freeze pages unless they're
already eligible to be set all-visible. We saw the same effect with
the first pgbench_history example, but it was hardly noticeable at all
there. Whereas here we see that even eager freezing opts to hold off
on freezing relatively many individual heap pages, due to the observed
conditions on those particular heap pages."

If it was true that eager freezing strategy behaved just the same as
VACUUM FREEZE (at least as far as freezing is concerned) then
scenarios like this one would show that VACUUM froze practically all
of the pages it scanned -- maybe fully 100% of all scanned pages would
be frozen. This effect is absent from small tables, and I suspect that
it's absent from your test case in part because you used a table that
was too small.

Obviously the way that eager freezing strategy avoids freezing
concurrently modified pages isn't perfect. It's one approach to
limiting the downside from eager freezing, in tables (or even
individual pages) where it's inappropriate. Of course that isn't
perfect, but it's a significant factor.

--
Peter Geoghegan

#101Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#98)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-25 18:31:16 -0800, Peter Geoghegan wrote:

On Wed, Jan 25, 2023 at 5:49 PM Andres Freund <andres@anarazel.de> wrote:

Sure. But significantly regressing plausible if not common workloads is
different than knowing that there'll be some edge case where we'll do
something worse.

That's very vague. Significant to whom, for what purpose?

Sure it's vague. But you can't tell me that it's uncommon to use postgres to
store rows that aren't retained for > 50 million xids.

I reproduced both with checkpoint_timeout=5min and 1min. 1min is easier for
impatient me.

You said "Autovacuum on average generates between 1.5x-7x as much WAL
as before". Why stop there, though? There's a *big* multiplicative
effect in play here from FPIs, obviously, so the sky's the limit. Why
not set checkpoint_timeout to 30s?

The amount of WAL increases substantially even with 5min, the degree of the
increase varies more though. But that largely vanishes if you increase the
time after which rows are deleted a bit. I just am not patient enough to wait
for that.

I switched between vacuum_freeze_strategy_threshold=0 and
vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
warmup to set up something with smaller tables.

This makes no sense to me, at all.

It's quicker to run the workload with a table that initially is below 4GB,
while still being able to test the eager strategy. It wouldn't change anything
fundamental to just make the rows a bit wider, or to have a static portion of
the table.

And changing between vacuum_freeze_strategy_threshold=0/very-large (or I
assume -1, didn't check) while the workload is running avoids having to wait
until the 120s before deletions start have passed.

The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
Table size simply isn't a usable proxy for whether eager freezing is a good
idea or not.

It's not supposed to be - you have it backwards. It's intended to work
as a proxy for whether lazy freezing is a bad idea, particularly in
the worst case.

That's a distinction without a difference.

There is also an effect that likely would have been protective with
your test case had you used a larger table with the same test case
(and had you not lowered vacuum_freeze_strategy_threshold from its
already low default).

Again, you just need a less heavily changing portion of the table or a
slightly larger "deletion delay" and you end up with a table well over
4GB. Even as stated I end up with > 4GB after a bit of running.

It's just a shortcut to make testing this easier.

You can have a 1TB table full of transient data, or you can have a 1TB table
where part of the data is transient and only settles after a time. In neither
case eager freezing is ok.

It sounds like you're not willing to accept any kind of trade-off.

I am. Just not every tradeoff. I just don't see any useful tradeoffs purely
based on the relation size.

How, in general, can we detect what kind of 1TB table it will be, in the
absence of user input?

I suspect we'll need some form of heuristics to differentiate between tables
that are more append heavy and tables that are changing more heavily. I think
it might be preferable to not have a hard cliff but a gradual changeover -
hard cliffs tend to lead to issues one can't see coming.

I think several of the heuristics below become easier once we introduce "xid
age" vacuums.

One idea is to start tracking the number of all-frozen pages in pg_class. If
there's a significant percentage of all-visible but not all-frozen pages,
vacuum should be more eager. If only a small portion of the table is not
frozen, there's no need to be eager. If only a small portion of the table is
all-visible, there similarly is no need to freeze eagerly.
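
Just to illustrate the kind of signal such a heuristic would key off: the
all-visible vs. all-frozen split can already be eyeballed today with the
pg_visibility extension (a sketch, assuming CREATE EXTENSION pg_visibility
and the pgbench_time_data table from upthread):

SELECT all_visible, all_frozen, all_visible - all_frozen AS visible_not_frozen
FROM pg_visibility_map_summary('pgbench_time_data');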

IIRC I previously was handwaving at keeping track of the average age of tuples
on all-visible pages. That could extend the prior heuristic. A heavily
changing table will have a relatively young average, a more append only table
will have an increasing average age.

It might also make sense to look at the age of relfrozenxid - there's really
no point in being overly eager if the relation is quite young. And a very
heavily changing table will tend to be younger. But likely the approach of
tracking the age of all-visible pages will be more accurate.

The heuristics don't have to be perfect. If we get progressively more eager,
an occasional somewhat eager vacuum isn't a huge issue, as long as it then
leads to the next few vacuums to become less eager.

And in the absence of user input, why would we prefer to default to a
behavior that is highly destabilizing when we get it wrong?

Users know the current behaviour. Introducing significant issues that didn't
previously exist will cause new issues and new frustrations.

Greetings,

Andres Freund

#102Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#96)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 8:49 PM Andres Freund <andres@anarazel.de> wrote:

The concrete setting of vacuum_freeze_strategy_threshold doesn't matter.
Table size simply isn't a usable proxy for whether eager freezing is a good
idea or not.

I strongly agree. I can't imagine how a size-based threshold can make
any sense at all.

Both Andres and I have repeatedly expressed concern about how much is
being changed in the behavior of vacuum, and how quickly, and IMHO on
the basis of very limited evidence that the changes are improvements.
The fact that Andres was very quickly able to find cases where the
patch produces large regressions is just more evidence of that. It's
also hard to even understand what has been changed, because the
descriptions are so theoretical.

I think we're on a very dangerous path here. I want VACUUM to be
better as much as the next person, but I really don't believe that's the
direction we're headed. I think if we release like this, we're going
to experience more VACUUM pain, not less. And worse still, I don't
think anyone other than Peter and Andres is going to understand why
it's happening.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Andres Freund (#101)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 7:11 PM Andres Freund <andres@anarazel.de> wrote:

I switched between vacuum_freeze_strategy_threshold=0 and
vacuum_freeze_strategy_threshold=too-high, because it's quicker/takes less
warmup to set up something with smaller tables.

This makes no sense to me, at all.

It's quicker to run the workload with a table that initially is below 4GB, but
still be able to test the eager strategy. It wouldn't change anything
fundamental to just make the rows a bit wider, or to have a static portion of
the table.

What does that actually mean? Wouldn't change anything fundamental?

What it would do is significantly reduce the write amplification
effect that you encountered. You came up with numbers of up to 7x, a
number that you used without any mention of checkpoint_timeout being
lowered to only 1 minute (I had to push to get that information). Had
you done things differently (larger table, larger setting) then that
would have made the regression far smaller. So yeah, "nothing
fundamental".

How, in general, can we detect what kind of 1TB table it will be, in the
absence of user input?

I suspect we'll need some form of heuristics to differentiate between tables
that are more append heavy and tables that are changing more heavily.

The TPC-C tables are actually a perfect adversarial case for this,
because it's both, together. What then?

I think
it might be preferable to not have a hard cliff but a gradual changeover -
hard cliffs tend to lead to issues one can't see coming.

As soon as you change your behavior you have to account for the fact
that you behaved lazily up until all prior VACUUMs. I think that
you're better off just being eager with new pages and modified pages,
while not specifically going

IIRC I previously was handwaving at keeping track of the average age of tuples
on all-visible pages. That could extend the prior heuristic. A heavily
changing table will have a relatively young average, a more append only table
will have an increasing average age.

It might also make sense to look at the age of relfrozenxid - there's really
no point in being overly eager if the relation is quite young.

I don't think that's true. What about bulk loading? It's a totally
valid and common requirement.

--
Peter Geoghegan

#104Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#100)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-25 18:43:10 -0800, Peter Geoghegan wrote:

On Wed, Jan 25, 2023 at 6:33 PM Andres Freund <andres@anarazel.de> wrote:

As far as I can tell, with the eager strategy, the only thing
vacuum_freeze_min_age really influences is whether we'll block waiting for a
cleanup lock. IOW, VACUUM on a table > vacuum_freeze_strategy_threshold is
now a slightly less-blocking version of VACUUM FREEZE.

That's simply not true, at all. I'm very surprised that you think
that. The commit message very clearly addresses this.

It says something like that, but it's not really true:

Looking at the results of
DROP TABLE IF EXISTS frak;
-- autovac disabled so we see just the result of the vacuum below
CREATE TABLE frak WITH (autovacuum_enabled=0) AS SELECT generate_series(1, 10000000);
VACUUM frak;
SELECT pg_relation_size('frak') / 8192 AS relsize_pages, SUM(all_visible::int) all_vis_pages, SUM(all_frozen::int) all_frozen_pages FROM pg_visibility('frak');

across releases.

In < 16 you'll get:
┌───────────────┬───────────────┬──────────────────┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───────────────┼───────────────┼──────────────────┤
│         44248 │         44248 │                0 │
└───────────────┴───────────────┴──────────────────┘

You simply can't freeze these rows, because they're not vacuum_freeze_min_age
xids old.

With 16 and the default vacuum_freeze_strategy_threshold you'll get the same
(even though we wouldn't actually trigger an FPW).

With 16 and vacuum_freeze_strategy_threshold=0, you'll get:
┌───────────────┬───────────────┬──────────────────┐
│ relsize_pages │ all_vis_pages │ all_frozen_pages │
├───────────────┼───────────────┼──────────────────┤
│         44248 │         44248 │            44248 │
└───────────────┴───────────────┴──────────────────┘

IOW, basically what you get with VACUUM FREEZE.

That's actually what I was complaining about. The commit message in a way is
right that
Settings
like vacuum_freeze_min_age still get applied in just the same way in
every VACUUM, independent of the strategy in use. The only mechanical
difference between eager and lazy freezing strategies is that only the
former applies its own additional criteria to trigger freezing pages.

but that's only true because page level freezing neutered
vacuum_freeze_min_age. Compared to <16, it's a *huge* change.

Yes, it's true that VACUUM still is less aggressive than VACUUM FREEZE, even
disregarding cleanup locks, because it won't freeze if there are non-removable
rows on the page. But more often than not that's a pretty small difference.

Once again I'll refer you to my Wiki page on this:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2

The difference between this and VACUUM FREEZE is described here:

"Note how we freeze most pages, but still leave a significant number
unfrozen each time, despite using an eager approach to freezing
(2981204 scanned - 2355230 frozen = 625974 pages scanned but left
unfrozen). Again, this is because we don't freeze pages unless they're
already eligible to be set all-visible.

The only reason there is a substantial difference is because of pgbench's
uniform access pattern. Most real-world applications don't have that.

Greetings,

Andres Freund

#105John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#101)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 10:11 AM Andres Freund <andres@anarazel.de> wrote:

I am. Just not every tradeoff. I just don't see any useful tradeoffs purely
based on the relation size.

I expressed reservations about relation size six weeks ago:

On Wed, Dec 14, 2022 at 12:16 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Dec 13, 2022 at 12:29 AM John Naylor
<john.naylor@enterprisedb.com> wrote:

If the number of unfrozen heap pages is the thing we care about,

perhaps that, and not the total size of the table, should be the parameter
that drives freezing strategy?

That's not the only thing we care about, though.

That was followed by several paragraphs that never got around to explaining
why table size should drive freezing strategy. Review is a feedback
mechanism alerting the patch author to possible problems. Listening to
feedback is like vacuum, in a way: If it hurts, you're not doing it enough.

--
John Naylor
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#102)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

Both Andres and I have repeatedly expressed concern about how much is
being changed in the behavior of vacuum, and how quickly, and IMHO on
the basis of very limited evidence that the changes are improvements.
The fact that Andres was very quickly able to find cases where the
patch produces large regressions is just more evidence of that. It's
also hard to even understand what has been changed, because the
descriptions are so theoretical.

Did you actually read the motivating examples Wiki page?

I think we're on a very dangerous path here. I want VACUUM to be
better as much as the next person, but I really don't believe that's the
direction we're headed. I think if we release like this, we're going
to experience more VACUUM pain, not less. And worse still, I don't
think anyone other than Peter and Andres is going to understand why
it's happening.

I think that the only sensible course of action at this point is for
me to revert the page-level freezing commit from today, and abandon
all outstanding work on VACUUM. I will still stand by the basic
page-level freezing work, at least to the extent that I am able to.
Honestly, just typing that makes me feel a big sense of relief.

I am a proud, stubborn man. While the experience of working on the
earlier related stuff for Postgres 15 was itself enough to make me
seriously reassess my choice to work on VACUUM in general, I still
wanted to finish off what I'd started. I don't see how that'll be
possible now -- I'm just not in a position to be in the center of
another controversy, and I just don't seem to be able to avoid them
here, as a practical matter. I will resolve to be a less stubborn
person. I don't have the constitution for it anymore.

--
Peter Geoghegan

In reply to: John Naylor (#105)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 8:12 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

That was followed by several paragraphs that never got around to explaining why table size should drive freezing strategy.

You were talking about the system level view of freeze debt, and how
the table view might not be a sufficient proxy for that. What does
that have to do with anything that we've discussed on this thread
recently?

Review is a feedback mechanism alerting the patch author to possible problems. Listening to feedback is like vacuum, in a way: If it hurts, you're not doing it enough.

An elegant analogy.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#106)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 8:24 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think we're on a very dangerous path here. I want VACUUM to be
better as much as the next person, but I really don't believe that's the
direction we're headed. I think if we release like this, we're going
to experience more VACUUM pain, not less. And worse still, I don't
think anyone other than Peter and Andres is going to understand why
it's happening.

I think that the only sensible course of action at this point is for
me to revert the page-level freezing commit from today, and abandon
all outstanding work on VACUUM. I will still stand by the basic
page-level freezing work, at least to the extent that I am able to.

I have now reverted today's commit. I have also withdrawn all
remaining work from the patch series as a whole, which is reflected in
the CF app. Separately, I have withdrawn 2 other VACUUM related
patches of mine via the CF app: the antiwraparound autovacuum patch
series, plus a patch that did some further work on freezing
MultiXacts.

I have no intention of picking any of these patches back up again. I
also intend to completely avoid new work on both VACUUM and
autovacuum, not including ambulkdelete() code run by index access
methods. I will continue to do maintenance and bugfix work when it
happens to involve VACUUM, though.

For the record, in case it matters: I certainly have no objection to
anybody else picking up any of this unfinished work for themselves, in
part or in full.

--
Peter Geoghegan

#109Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#106)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jan 25, 2023 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

Both Andres and I have repeatedly expressed concern about how much is
being changed in the behavior of vacuum, and how quickly, and IMHO on
the basis of very limited evidence that the changes are improvements.
The fact that Andres was very quickly able to find cases where the
patch produces large regressions is just more evidence of that. It's
also hard to even understand what has been changed, because the
descriptions are so theoretical.

Did you actually read the motivating examples Wiki page?

I don't know. I've read a lot of stuff that you've written on this
topic, which has taken a significant amount of time, and I still don't
understand a lot of what you're changing, and I don't agree with all
of the things that I do understand. I can't state with confidence that
the motivating examples wiki page was or was not among the things that
I read. But, you know, when people start running PostgreSQL 16, and
have some problem, they're not going to read the motivating examples
wiki page. They're going to read the documentation. If they can't find
the answer there, they (or some hacker that they contact) will
probably read the code comments and the relevant commit messages.
Those either clearly explain what was changed in a way that somebody
can understand, or they don't. If they don't, *the commits are not
good enough*, regardless of what other information may exist in any
other place.

--
Robert Haas
EDB: http://www.enterprisedb.com

#110Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#104)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 10:56 PM Andres Freund <andres@anarazel.de> wrote:

but that's only true because page level freezing neutered
vacuum_freeze_min_age. Compared to <16, it's a *huge* change.

Do you think that page-level freezing
(1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
committed?

I have always been a bit skeptical of vacuum_freeze_min_age as a
mechanism. It's certainly true that it is a waste of energy to freeze
tuples that will soon be removed anyway, but on the other hand,
repeatedly dirtying the same page for various different freezing and
visibility related reasons *really sucks*, and even repeatedly reading
the page because we kept deciding not to do anything yet isn't great.
It seems possible that the page-level freezing mechanism could help
with that quite a bit, and I think that the heuristic that patch
proposes is basically reasonable: if there's at least one tuple on the
page that is old enough to justify freezing, it doesn't seem like a
bad bet to freeze all the others that can be frozen at the same time,
at least if it means that we can mark the page all-visible or
all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
deferring as much work as possible to a time when we *can* mark the
page all-visible or all-frozen.

In short, I think that neutering vacuum_freeze_min_age at least to
some degree might be a good thing, but that's not to say that I'm
altogether confident in that patch, either.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Andres Freund (#104)
Re: New strategies for freezing, advancing relfrozenxid early

On Wed, Jan 25, 2023 at 7:56 PM Andres Freund <andres@anarazel.de> wrote:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples#Patch_2

The difference between this and VACUUM FREEZE is described here:

"Note how we freeze most pages, but still leave a significant number
unfrozen each time, despite using an eager approach to freezing
(2981204 scanned - 2355230 frozen = 625974 pages scanned but left
unfrozen). Again, this is because we don't freeze pages unless they're
already eligible to be set all-visible.

The only reason there is a substantial difference is because of pgbench's
uniform access pattern. Most real-world applications don't have that.

It's not pgbench! It's TPC-C. It's actually an adversarial case for
the patch series.

--
Peter Geoghegan

In reply to: Robert Haas (#109)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 5:41 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 25, 2023 at 11:25 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Jan 25, 2023 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

Both Andres and I have repeatedly expressed concern about how much is
being changed in the behavior of vacuum, and how quickly, and IMHO on
the basis of very limited evidence that the changes are improvements.
The fact that Andres was very quickly able to find cases where the
patch produces large regressions is just more evidence of that. It's
also hard to even understand what has been changed, because the
descriptions are so theoretical.

Did you actually read the motivating examples Wiki page?

I don't know. I've read a lot of stuff that you've written on this
topic, which has taken a significant amount of time, and I still don't
understand a lot of what you're changing, and I don't agree with all
of the things that I do understand.

You complained about the descriptions being theoretical. But there's
nothing theoretical about the fact that we more or less do *all*
freezing in an eventual aggressive VACUUM in many important cases,
including very simple cases like pgbench_history -- the simplest
possible append-only table case. We'll merrily rewrite the entire
table, all at once, for no good reason at all. Consistently, reliably.
It's so incredibly obvious that this makes zero sense! And yet I don't
think you've ever engaged with such basic points as that one.

--
Peter Geoghegan

#113Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#110)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 09:20:57 -0500, Robert Haas wrote:

On Wed, Jan 25, 2023 at 10:56 PM Andres Freund <andres@anarazel.de> wrote:

but that's only true because page level freezing neutered
vacuum_freeze_min_age. Compared to <16, it's a *huge* change.

Do you think that page-level freezing
(1de58df4fec7325d91f5a8345757314be7ac05da) was improvidently
committed?

I think it's probably ok, but perhaps deserves a bit more thought about when
to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
now.

With "opportunistic freezing" I mean freezing the page, even though we don't
*have* to freeze any of the tuples.

The overall condition gating freezing is:
if (pagefrz.freeze_required || tuples_frozen == 0 ||
    (prunestate->all_visible && prunestate->all_frozen &&
     fpi_before != pgWalUsage.wal_fpi))

fpi_before is set before the heap_page_prune() call.
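For reference, here's a self-contained model of that gating condition
(just an illustration, not an excerpt from lazy_scan_prune; the caller is
assumed to sample pgWalUsage.wal_fpi before and after heap_page_prune()):

#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the condition above: freeze when required, when there is
 * nothing to freeze anyway, or when the page can be set all-frozen and
 * pruning already emitted a full-page image for it.
 */
static bool
freeze_page_now(bool freeze_required, int tuples_frozen,
                bool all_visible, bool all_frozen,
                int64_t fpi_before, int64_t fpi_after)
{
    return freeze_required ||
           tuples_frozen == 0 ||
           (all_visible && all_frozen && fpi_before != fpi_after);
}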

To me the
    fpi_before != pgWalUsage.wal_fpi
part doesn't make a whole lot of sense. For one, it won't work at all if
full_page_writes=off. But more importantly, it also means we'll not freeze
when VACUUMing a recently modified page, even if pruning already emitted a WAL
record and we'd not emit an FPI if we froze the page now.

To me a condition that checked if the buffer is already dirty and if another
XLogInsert() would be likely to generate an FPI would make more sense. The
rare race case of a checkpoint starting concurrently doesn't matter IMO.

A minor complaint I have about the code is that the "tuples_frozen == 0" path
imo is confusing. We go into the "freeze" path, which then inside has another
if for the tuples_frozen == 0 part. I get that this deduplicates the
NewRelFrozenXid handling, but it still looks odd.

I have always been a bit skeptical of vacuum_freeze_min_age as a
mechanism. It's certainly true that it is a waste of energy to freeze
tuples that will soon be removed anyway, but on the other hand,
repeatedly dirtying the same page for various different freezing and
visibility related reasons *really sucks*, and even repeatedly reading
the page because we kept deciding not to do anything yet isn't great.
It seems possible that the page-level freezing mechanism could help
with that quite a bit, and I think that the heuristic that patch
proposes is basically reasonable: if there's at least one tuple on the
page that is old enough to justify freezing, it doesn't seem like a
bad bet to freeze all the others that can be frozen at the same time,
at least if it means that we can mark the page all-visible or
all-frozen. If it doesn't, then I'm not so sure; maybe we're best off
deferring as much work as possible to a time when we *can* mark the
page all-visible or all-frozen.

Agreed. Freezing everything if we need to freeze some things seems quite safe
to me.

In short, I think that neutering vacuum_freeze_min_age at least to
some degree might be a good thing, but that's not to say that I'm
altogether confident in that patch, either.

I am not too worried about the neutering in the page level freezing patch.

The combination of the page level work with the eager strategy is where the
sensibly-more-aggressive freeze_min_age got turbocharged to an imo dangerous
degree.

Greetings,

Andres Freund

In reply to: Andres Freund (#113)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 8:35 AM Andres Freund <andres@anarazel.de> wrote:

I think it's probably ok, but perhaps deserves a bit more thought about when
to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
now.

With "opportunistic freezing" I mean freezing the page, even though we don't
*have* to freeze any of the tuples.

The overall condition gating freezing is:
if (pagefrz.freeze_required || tuples_frozen == 0 ||
    (prunestate->all_visible && prunestate->all_frozen &&
     fpi_before != pgWalUsage.wal_fpi))

fpi_before is set before the heap_page_prune() call.

Have you considered page-level checksums, and how the impact on hint
bits needs to be accounted for here?

All RDS customers use page-level checksums. And I've noticed that it's
very common for the number of FPIs to only be very slightly less than
the number of pages dirtied. Much of which is just hint bits. The
"fpi_before != pgWalUsage.wal_fpi" test catches that.

To me a condition that checked if the buffer is already dirty and if another
XLogInsert() would be likely to generate an FPI would make more sense. The
rare race case of a checkpoint starting concurrently doesn't matter IMO.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

--
Peter Geoghegan

#115Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#114)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 08:54:55 -0800, Peter Geoghegan wrote:

On Thu, Jan 26, 2023 at 8:35 AM Andres Freund <andres@anarazel.de> wrote:

I think it's probably ok, but perhaps deserves a bit more thought about when
to "opportunistically" freeze. Perhaps to make it *more* aggressive than it's
now.

With "opportunistic freezing" I mean freezing the page, even though we don't
*have* to freeze any of the tuples.

The overall condition gating freezing is:
if (pagefrz.freeze_required || tuples_frozen == 0 ||
    (prunestate->all_visible && prunestate->all_frozen &&
     fpi_before != pgWalUsage.wal_fpi))

fpi_before is set before the heap_page_prune() call.

Have you considered page-level checksums, and how the impact on hint
bits needs to be accounted for here?

All RDS customers use page-level checksums. And I've noticed that it's
very common for the number of FPIs to only be very slightly less than
the number of pages dirtied. Much of which is just hint bits. The
"fpi_before != pgWalUsage.wal_fpi" test catches that.

I assume the case you're thinking of is that pruning did *not* do any changes,
but in the process of figuring out that nothing needed to be pruned, we did a
MarkBufferDirtyHint(), and as part of that emitted an FPI?

To me a condition that checked if the buffer is already dirty and if another
XLogInsert() would be likely to generate an FPI would make more sense. The
rare race case of a checkpoint starting concurrently doesn't matter IMO.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Thus a modification of the above logic could be to opportunistically freeze if
a ) it won't cause an FPI and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
replacement in that case
or
b2) We wrote a WAL record during pruning, as the difference in flush position
is marginal
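
A minimal sketch of that condition, with the three inputs reduced to
booleans the caller would have to supply (illustrative only, not existing
PostgreSQL code):

#include <stdbool.h>

/*
 * Opportunistically freeze only if (a) it won't cause an FPI and either
 * (b1) the page was already dirty before pruning or (b2) pruning wrote a
 * WAL record, so the extra movement of the flush position is marginal.
 */
static bool
opportunistic_freeze_ok(bool would_cause_fpi,
                        bool page_dirty_before_prune,
                        bool prune_wrote_wal_record)
{
    if (would_cause_fpi)
        return false;                       /* condition (a) */
    return page_dirty_before_prune          /* condition (b1) */
        || prune_wrote_wal_record;          /* condition (b2) */
}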

An even more aggressive version would be to replace b1) with logic that'd
allow newly dirtying the page if it wasn't read through the ringbuffer. But
newly dirtying the page feels like it'd be more dangerous.

A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hint bits is on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Greetings,

Andres Freund

In reply to: Andres Freund (#115)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 9:53 AM Andres Freund <andres@anarazel.de> wrote:

I assume the case you're thinking of is that pruning did *not* do any changes,
but in the process of figuring out that nothing needed to be pruned, we did a
MarkBufferDirtyHint(), and as part of that emitted an FPI?

Yes.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

Thus a modification of the above logic could be to opportunistically freeze if
a ) it won't cause an FPI and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
replacement in that case
or
b2) We wrote a WAL record during pruning, as the difference in flush position
is marginal

An even more aggressive version would be to replace b1) with logic that'd
allow newly dirtying the page if it wasn't read through the ringbuffer. But
newly dirtying the page feels like it'd be more dangerous.

In many cases we'll have to dirty the page anyway, just to set
PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
triggered by an FPI or triggered by my now-reverted GUC) on being able
to set the whole page all-frozen in the VM.

A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hint bits is on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Which part is the odd part? Is it odd that page-level freezing works
that way, or is it odd that page-level checksums work that way?

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

--
Peter Geoghegan

#117Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#116)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, 26 Jan 2023 at 19:45, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Jan 26, 2023 at 9:53 AM Andres Freund <andres@anarazel.de> wrote:

I assume the case you're thinking of is that pruning did *not* do any changes,
but in the process of figuring out that nothing needed to be pruned, we did a
MarkBufferDirtyHint(), and as part of that emitted an FPI?

Yes.

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

Could someone explain to me why we don't currently (optionally)
include the functionality of page freezing in the PRUNE records? I
think they're quite closely related (in that they both execute in
VACUUM and are required for long-term system stability), and are even
more related now that we have opportunistic page-level freezing. I
think adding a "freeze this page as well"-flag in PRUNE records would
go a long way to reducing the WAL overhead of aggressive and more
opportunistic freezing.

-Matthias

#118Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#112)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 11:35 AM Peter Geoghegan <pg@bowt.ie> wrote:

You complained about the descriptions being theoretical. But there's
nothing theoretical about the fact that we more or less do *all*
freezing in an eventual aggressive VACUUM in many important cases,
including very simple cases like pgbench_history -- the simplest
possible append-only table case. We'll merrily rewrite the entire
table, all at once, for no good reason at all. Consistently, reliably.
It's so incredibly obvious that this makes zero sense! And yet I don't
think you've ever engaged with such basic points as that one.

I'm aware that that's a problem, and I agree that it sucks. I think
that what this patch does is make vacuum more aggressive, and I
expect that would help this problem. I haven't said much about that
because I don't think it's controversial. However, the patch also has
a cost, and that's what I think is controversial.

I think it's pretty much impossible to freeze more aggressively
without losing in some scenario or other. If waiting longer to freeze
would have resulted in the data getting updated again or deleted
before we froze it, then waiting longer reduces the total amount of
freezing work that ever has to be done. Freezing more aggressively
inevitably gives up some amount of that potential benefit in order to
try to secure some other benefit. It's a trade-off.

I think that the goal of a patch that makes vacuum more (or less)
aggressive should be to make the cases where we lose as obscure as
possible, and the cases where we win as broad as possible. I think
that, in order to be a good patch, it needs to be relatively difficult
to find cases where we incur a big loss. If it's easy to find a big
loss, then I think it's better to stick with the current behavior,
even if it's also easy to find a big gain. There's nothing wonderful
about the current behavior, but (to paraphrase what I think Andres has
already said several times) it's better to keep shipping code with the
same bad behavior than to put out a new major release with behaviors
that are just as bad, but different.

I feel like your emails sometimes seem to suppose that I think that
you're a bad person, or a bad developer, or that you have no good
ideas, or that you have no good ideas about this topic, or that this
topic is not important, or that we don't need to do better than we are
currently doing. I think none of those things. However, I'm also not
prepared to go all the way to the other end of the spectrum and say
that all of your ideas and everything in this patch are great. I don't
think either of those things, either.

I certainly think that freezing more aggressively in some scenarios
could be a great idea, but it seems like the patch's theory is to be
very nearly maximally aggressive in every vacuum run if the table size
is greater than some threshold, and I don't think that's right at all.
I'm not exactly sure what information we should use to decide how
aggressive to be, but I am pretty sure that the size of the table is
not it. It's true that, for a small table, the cost of having to
eventually vacuum the whole table at once isn't going to be very high,
whereas for a large table, it will be. That line of reasoning makes a
size threshold sound reasonable. However, the amount of extra work
that we can potentially do by vacuuming more aggressively *also*
increases with the table size, which to me means using that as a
criterion actually isn't sensible at all.

One idea that I've had about how to solve this problem is to try to
make vacuum try to aggressively freeze some portion of the table on
each pass, and to behave less aggressively on the rest of the table so
that, hopefully, no single vacuum does too much work. Unfortunately, I
don't really know how to do that effectively. If we knew that the
table was going to see 10 vacuums before we hit
autovacuum_freeze_max_age, we could try to have each one do 10% of the
amount of freezing that was going to need to be done rather than
letting any single vacuum do all of it, but we don't have that sort of
information. Also, even if we did have that sort of information, the
idea only works if the pages that we freeze sooner are ones that we're
not about to update or delete again, and we don't have any idea what
is likely there. In theory we could have some system that tracks how
recently each page range in a table has been modified, and direct our
freezing activity toward the ones less-recently modified on the theory
that they're not so likely to be modified again in the near future,
but in reality we have no such system. So I don't really feel like I
know what the right answer is here, yet.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#118)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 11:28 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think it's pretty much impossible to freeze more aggressively
without losing in some scenario or other. If waiting longer to freeze
would have resulted in the data getting updated again or deleted
before we froze it, then waiting longer reduces the total amount of
freezing work that ever has to be done. Freezing more aggressively
inevitably gives up some amount of that potential benefit in order to
try to secure some other benefit. It's a trade-off.

There is no question about that.

I think that the goal of a patch that makes vacuum more (or less)
aggressive should be to make the cases where we lose as obscure as
possible, and the cases where we win as broad as possible. I think
that, in order to be a good patch, it needs to be relatively difficult
to find cases where we incur a big loss. If it's easy to find a big
loss, then I think it's better to stick with the current behavior,
even if it's also easy to find a big gain.

Again, this seems totally uncontroversial. It's just incredibly vague,
and not at all actionable.

Relatively difficult for Andres, or for somebody else? What are the
real parameters here? Obviously there are no clear answers available.

However, I'm also not
prepared to go all the way to the other end of the spectrum and say
that all of your ideas and everything in this patch are great. I don't
think either of those things, either.

It doesn't matter. I'm done with it. This is not a negotiation about
what gets in and what doesn't get in.

All that I aim to do now is to draw some kind of line under the basic
page-level freezing work, since of course I'm still responsible for
that. And perhaps to defend my personal reputation.

I certainly think that freezing more aggressively in some scenarios
could be a great idea, but it seems like the patch's theory is to be
very nearly maximally aggressive in every vacuum run if the table size
is greater than some threshold, and I don't think that's right at all.

We'll systematically avoid accumulating debt past a certain point --
that's its purpose. That is, we'll avoid accumulating all-visible
pages that eventually need to be frozen.

I'm not exactly sure what information we should use to decide how
aggressive to be, but I am pretty sure that the size of the table is
not it. It's true that, for a small table, the cost of having to
eventually vacuum the whole table at once isn't going to be very high,
whereas for a large table, it will be. That line of reasoning makes a
size threshold sound reasonable. However, the amount of extra work
that we can potentially do by vacuuming more aggressively *also*
increases with the table size, which to me means using that as a
criterion actually isn't sensible at all.

The overwhelming cost is usually FPIs in any case. If you're not
mostly focussing on that, you're focussing on the wrong thing. At
least with larger tables. You just have to focus on the picture over
time, across multiple VACUUM operations.

One idea that I've had about how to solve this problem is to try to
make vacuum try to aggressively freeze some portion of the table on
each pass, and to behave less aggressively on the rest of the table so
that, hopefully, no single vacuum does too much work. Unfortunately, I
don't really know how to do that effectively.

That has been proposed a couple of times in the context of this
thread. It won't work, because the way autovacuum works in general
(and likely always will work) doesn't allow it. With an append-only
table, each VACUUM will naturally have to scan significantly more
pages than the last one, forever (barring antiwraparound vacuums). Why
wouldn't it continue that way? I mean it might not (the table might
stop growing altogether), but then it doesn't matter much what we do.

If you're not behaving very proactively at the level of each VACUUM
operation, then the picture over time is that you're *already* falling
behind. At least with an append-only table. You have to think of the
sequence of operations, not just one.

In theory we could have some system that tracks how
recently each page range in a table has been modified, and direct our
freezing activity toward the ones less-recently modified on the theory
that they're not so likely to be modified again in the near future,
but in reality we have no such system. So I don't really feel like I
know what the right answer is here, yet.

So we need to come up with a way of getting reliable information from
the future, about an application that we have no particular
understanding of. As opposed to just eating the cost to some degree,
and making it configurable.

--
Peter Geoghegan

In reply to: Matthias van de Meent (#117)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 11:26 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Could someone explain to me why we don't currently (optionally)
include the functionality of page freezing in the PRUNE records? I
think they're quite closely related (in that they both execute in
VACUUM and are required for long-term system stability), and are even
more related now that we have opportunistic page-level freezing. I
think adding a "freeze this page as well"-flag in PRUNE records would
go a long way to reducing the WAL overhead of aggressive and more
opportunistic freezing.

Yeah, we've talked about doing that in the past year. It's quite
possible. It would make quite a lot of sense, because the actual
overhead of the WAL record for freezing tends to come from the generic
WAL record header stuff itself. If there was only one record for both,
then you'd only need to include the relfilenode and block number (and
so on) once.

It would be tricky to handle Multis, so what you'd probably do is just
freeze xmin, and possibly aborted and locker XIDs in xmax. So you
wouldn't completely get rid of the main freeze record, but would be
able to avoid it in many important cases.

--
Peter Geoghegan

#121Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#116)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 10:44:45 -0800, Peter Geoghegan wrote:

On Thu, Jan 26, 2023 at 9:53 AM Andres Freund <andres@anarazel.de> wrote:

That's going to be very significantly more aggressive. For example
it'll impact small tables very differently.

Maybe it would be too aggressive, not sure. The cost of a freeze WAL record is
relatively small, with one important exception below, if we are 99.99% sure
that it's not going to require an FPI and isn't going to dirty the page.

The exception is that a newer LSN on the page can cause the ringbuffer
replacement to trigger more aggressive WAL flushing. No meaningful
difference if we modified the page during pruning, or if the page was already
in s_b (since it likely won't be written out via the ringbuffer in that case),
but if checksums are off and we just hint-dirtied the page, it could be a
significant issue.

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean with "record header overhead"? Unless
that includes FPIs, I don't think that's that commonly true?

The problematic case I am talking about is when we do *not* emit a WAL record
during pruning (because there's nothing to prune), but want to freeze the
table. If you don't log an FPI, the remaining big overhead is that increasing
the LSN on the page will often cause an XLogFlush() when writing out the
buffer.

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.

Thus a modification of the above logic could be to opportunistically freeze if
a ) it won't cause an FPI and either
b1) the page was already dirty before pruning, as we'll not do a ringbuffer
replacement in that case
or
b2) We wrote a WAL record during pruning, as the difference in flush position
is marginal

An even more aggressive version would be to replace b1) with logic that'd
allow newly dirtying the page if it wasn't read through the ringbuffer. But
newly dirtying the page feels like it'd be more dangerous.

In many cases we'll have to dirty the page anyway, just to set
PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
triggered by an FPI or triggered by my now-reverted GUC) on being able
to set the whole page all-frozen in the VM.

IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
bits. But freezing does trigger one even without wal_log_hint_bits.

You're right, it makes sense to consider whether we'll emit a
XLOG_HEAP2_VISIBLE anyway.

A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the
page to a very recent LSN. We know the page is dirty. Thus we'll already
trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hint bits is on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Which part is the odd part? Is it odd that page-level freezing works
that way, or is it odd that page-level checksums work that way?

That page-level freezing works that way.

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj

That behaviour I criticized was causing a torrent of FPIs and additional
dirtying of pages. My proposed replacement for the current FPI check doesn't,
because a) it only triggers when we wrote a WAL record b) It doesn't trigger
if we would write an FPI.

Greetings,

Andres Freund

#122Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#119)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 2:57 PM Peter Geoghegan <pg@bowt.ie> wrote:

Relatively difficult for Andres, or for somebody else? What are the
real parameters here? Obviously there are no clear answers available.

Andres is certainly smarter than the average guy, but practically any
scenario that someone can create in a few lines of SQL is something that
code will be exposed to on some real-world system. If Andres
came along and said, hey, well I found a way to make this patch suck,
and proceeded to describe a scenario that involved a complex set of
tables and multiple workloads running simultaneously and using a
debugger to trigger some race condition and whatever, I'd be like "OK,
but is that really going to happen?". The actual scenario he came up
with is three lines of SQL, and it's nothing remotely obscure. That
kind of thing is going to happen *all the time*.

The overwhelming cost is usually FPIs in any case. If you're not
mostly focussing on that, you're focussing on the wrong thing. At
least with larger tables. You just have to focus on the picture over
time, across multiple VACUUM operations.

I think that's all mostly true, but the cases where being more
aggressive can cause *extra* FPIs are worthy of just as much attention
as the cases where we can reduce them.

--
Robert Haas
EDB: http://www.enterprisedb.com

#123Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#117)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 20:26:00 +0100, Matthias van de Meent wrote:

Could someone explain to me why we don't currently (optionally)
include the functionality of page freezing in the PRUNE records?

I think we definitely should (and have argued for it a couple times). It's not
just about reducing WAL overhead, it's also about reducing redundant
visibility checks - which are where a very significant portion of the CPU time
for VACUUMing goes.

Besides performance considerations, it's also just plain weird that
lazy_scan_prune() can end up with a different visibility than
heap_page_prune() (mostly due to concurrent aborts).

The number of WAL records we often end up emitting for processing a single
page in vacuum is just plain absurd:
- PRUNE
- FREEZE_PAGE
- VISIBLE

There's afaict no justification whatsoever for these to be separate records.

I think they're quite closely related (in that they both execute in VACUUM
and are required for long-term system stability), and are even more related
now that we have opportunistic page-level freezing. I think adding a "freeze
this page as well"-flag in PRUNE records would go a long way to reducing the
WAL overhead of aggressive and more opportunistic freezing.

Yep.

I think we should also seriously consider setting all-visible during on-access
pruning, and freezing rows during on-access pruning.

Greetings,

Andres Freund

In reply to: Robert Haas (#122)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:

The overwhelming cost is usually FPIs in any case. If you're not
mostly focussing on that, you're focussing on the wrong thing. At
least with larger tables. You just have to focus on the picture over
time, across multiple VACUUM operations.

I think that's all mostly true, but the cases where being more
aggressive can cause *extra* FPIs are worthy of just as much attention
as the cases where we can reduce them.

It's a question of our exposure to real problems, in no small part.
What can we afford to be wrong about? What problem can be fixed by the
user more or less as it emerges, and what problem doesn't have that
quality?

There is very good reason to believe that the large majority of all
data that people store in a system like Postgres is extremely cold
data:

https://www.microsoft.com/en-us/research/video/cost-performance-in-modern-data-stores-how-data-cashing-systems-succeed/
https://brandur.org/fragments/events

Having a separate aggressive step that rewrites an entire large table,
apparently at random, is just a huge burden to users. You've said that
you agree that it sucks, but somehow I still can't shake the feeling
that you don't fully understand just how much it sucks.

--
Peter Geoghegan

#125Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#124)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan <pg@bowt.ie> wrote:

There is very good reason to believe that the large majority of all
data that people store in a system like Postgres is extremely cold
data:

The systems where I end up troubleshooting problems seem to be, most
typically, busy OLTP systems. I'm not in a position to say whether
that's more or less common than systems with extremely cold data, but
I am in a position to say that my employer will have a lot fewer happy
customers if we regress that use case. Naturally I'm keen to avoid
that.

Having a separate aggressive step that rewrites an entire large table,
apparently at random, is just a huge burden to users. You've said that
you agree that it sucks, but somehow I still can't shake the feeling
that you don't fully understand just how much it sucks.

Ha!

Well, that's possible. But maybe you don't understand how much your
patch makes other things suck.

I don't think we can really get anywhere here by postulating that the
problem is the other person's lack of understanding, even if such a
postulate should happen to be correct.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#125)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 1:22 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 26, 2023 at 4:06 PM Peter Geoghegan <pg@bowt.ie> wrote:

There is very good reason to believe that the large majority of all
data that people store in a system like Postgres is extremely cold
data:

The systems where I end up troubleshooting problems seem to be, most
typically, busy OLTP systems. I'm not in a position to say whether
that's more or less common than systems with extremely cold data, but
I am in a position to say that my employer will have a lot fewer happy
customers if we regress that use case. Naturally I'm keen to avoid
that.

This is the kind of remark that makes me think that you don't get it.

The most influential OLTP benchmark of all time is TPC-C, which has
exactly this problem. In spades -- it's enormously disruptive. Which
is one reason why I used it as a showcase for a lot of this work. Plus
practical experience (like the Heroku database in the blog post I
linked to) fully agrees with that benchmark, as far as this stuff goes
-- that was also a busy OLTP database.

Online transaction processing involves transactions. Right? There is presumably
some kind of ledger, some kind of orders table. Naturally these have
entries that age out fairly predictably. After a while, almost all the
data is cold data. It is usually about that simple.

One of the key strengths of systems like Postgres is the ability to
inexpensively store a relatively large amount of data that has just
about zero chance of being read, let alone modified. While at the same
time having decent OLTP performance for the hot data. Not nearly as
good as an in-memory system, mind you -- and yet in-memory systems
remain largely a niche thing.

--
Peter Geoghegan

In reply to: Andres Freund (#121)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean with "record header overhead"? Unless
that includes FPIs, I don't think that's that commonly true?

Even if there are no directly observable FPIs, there is still extra
WAL, which can cause FPIs indirectly, just by making checkpoints more
frequent. I feel ridiculous even having to explain this to you.

The problematic case I am talking about is when we do *not* emit a WAL record
during pruning (because there's nothing to prune), but want to freeze the
table. If you don't log an FPI, the remaining big overhead is that increasing
the LSN on the page will often cause an XLogFlush() when writing out the
buffer.

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.

That's flagrantly intellectually dishonest. Sure, it made it easier to
reproduce. But that's not all it did!

You had *lots* of specific numbers and technical details in your first
email, such as "Time for vacuuming goes up to ~5x. WAL volume to
~9x.". But you did not feel that it was worth bothering with details
like having set checkpoint_timeout to 1 minute, which is a setting
that nobody uses, and obviously had a multiplicative effect. That
detail was unimportant. I had to drag it out of you!

You basically found a way to add WAL overhead to a system/workload
that is already in a write amplification vicious cycle, with latent
tipping point type behavior.

There is a practical point here, that is equally obvious, and yet
somehow still needs to be said: benchmarks like that one are basically
completely free of useful information. If we can't agree on how to
assess such things in general, then what can we agree on when it comes
to what should be done about it, what trade-off to make, when it comes
to any similar question?

In many cases we'll have to dirty the page anyway, just to set
PD_ALL_VISIBLE. The whole way the logic works is conditioned (whether
triggered by an FPI or triggered by my now-reverted GUC) on being able
to set the whole page all-frozen in the VM.

IIRC setting PD_ALL_VISIBLE doesn't trigger an FPI unless we need to log hint
bits. But freezing does trigger one even without wal_log_hint_bits.

That is correct.

You're right, it makes sense to consider whether we'll emit a
XLOG_HEAP2_VISIBLE anyway.

As written the page-level freezing FPI mechanism probably doesn't
really stand to benefit much from doing that. Either checksums are
disabled and it's just a hint, or they're enabled and there is a very
high chance that we'll get an FPI inside lazy_scan_prune rather than
right after it is called, when PD_ALL_VISIBLE is set.

That's not perfect, of course, but it doesn't have to be. Perhaps it
should still be improved, just on general principle.

A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the
page to a very recent LSN. We know the page is dirty. Thus we'll already
trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

You seem to be talking about this as if the only thing that could
matter is the immediate FPI -- the first order effects -- and not any
second order effects. You certainly didn't get to 9x extra WAL
overhead by controlling for that before. Should I take it that you've
decided to assess these things more sensibly now? Out of curiosity:
why the change of heart?

But to me it seems a bit odd that VACUUM now is more aggressive if checksums /
wal_log_hint bits is on, than without them. Which I think is how using either
of pgWalUsage.wal_fpi, pgWalUsage.wal_records ends up working?

Which part is the odd part? Is it odd that page-level freezing works
that way, or is it odd that page-level checksums work that way?

That page-level freezing works that way.

I think that it will probably cause a little confusion, and should be
specifically documented. But other than that, it seems reasonable
enough to me. I mean, should I not do something that's going to be of
significant help to users with checksums, just because it'll be
somewhat confusing to a small minority of them?

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj

That behaviour I criticized was causing a torrent of FPIs and additional
dirtying of pages. My proposed replacement for the current FPI check doesn't,
because a) it only triggers when we wrote a WAL record b) It doesn't trigger
if we would write an FPI.

It increases the WAL written in many important cases that
vacuum_freeze_strategy_threshold avoided. Sure, it did have some
problems, but the general idea of adding some high level
context/strategies seems essential to me.

You also seem to be suggesting that your proposed change to how basic
page-level freezing works will make freezing of pages on databases
with page-level checksums similar to an equivalent case without
checksums enabled. Even assuming that that's an important goal, you
won't be much closer to achieving it under your scheme, since hint
bits being set during VACUUM and requiring an FPI still make a huge
difference. Tables like pgbench_history have pages that generally
aren't pruned, that don't need to log an FPI just to set
PD_ALL_VISIBLE once checksums are disabled.

That's the difference that users are going to notice between checksums
enabled vs disabled, if they notice any -- it's the most important one
by far.

--
Peter Geoghegan

#128Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#118)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 14:27:53 -0500, Robert Haas wrote:

One idea that I've had about how to solve this problem is to try to
make vacuum try to aggressively freeze some portion of the table on
each pass, and to behave less aggressively on the rest of the table so
that, hopefully, no single vacuum does too much work.

I agree that this rough direction is worthwhile to pursue.

Unfortunately, I don't really know how to do that effectively. If we knew
that the table was going to see 10 vacuums before we hit
autovacuum_freeze_max_age, we could try to have each one do 10% of the
amount of freezing that was going to need to be done rather than letting any
single vacuum do all of it, but we don't have that sort of information.

I think, quite fundamentally, it's not possible to bound the amount of work an
anti-wraparound vacuum has to do if we don't have an age based autovacuum
trigger kicking in before autovacuum_freeze_max_age. After all, there might be
no autovacuum before autovacuum_freeze_max_age is reached.

But there's just no reason to not have a trigger below
autovacuum_freeze_max_age. That's why I think Peter's patch to split age and
anti-"auto-cancel" autovacuums is an strictly necessary change if we want to
make autovacuum fundamentally suck less. There's a few boring details to
figure out how to set/compute those limits, but I don't think there's anything
fundamentally hard.

I think we also need the number of all-frozen pages in pg_class if we want to
make better scheduling decision. As we already compute the number of
all-visible pages at the end of vacuuming, we can compute the number of
all-frozen pages as well. The space for another integer in pg_class doesn't
bother me one bit.

Let's say we had an autovacuum_vacuum_age trigger of 100m, and
autovacuum_freeze_max_age=500m. We know that we're roughly going to be
vacuuming 5 times before reaching autovacuum_freeze_max_age (very slow
autovacuums are an issue, but if one autovacuum takes 100m+ xids long, there's
not much we can do).

With that we could determine the eager percentage along the lines of:
frozen_target = Min(age(relfrozenxid), autovacuum_freeze_max_age)/autovacuum_freeze_max_age
eager_percentage = Max(0, frozen_target - pg_class.relallfrozen / relpages)
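
To make the arithmetic concrete, here's a small illustrative function
along the same lines (relallfrozen being the proposed new pg_class column,
nothing that exists today); it returns the number of additional pages an
eager pass would target, i.e. the eager percentage times relpages:

#include <stdint.h>

/*
 * Illustrative only: how many more pages should already be frozen by the
 * time relfrozenxid has aged this far, given a linear target that reaches
 * the whole table at autovacuum_freeze_max_age.
 */
static int64_t
eager_freeze_pages(uint32_t relfrozenxid_age,
                   uint32_t autovacuum_freeze_max_age,
                   int64_t relpages, int64_t relallfrozen)
{
    double  frozen_target;
    int64_t shortfall;

    if (relfrozenxid_age > autovacuum_freeze_max_age)
        relfrozenxid_age = autovacuum_freeze_max_age;
    frozen_target = (double) relfrozenxid_age / autovacuum_freeze_max_age;

    shortfall = (int64_t) (frozen_target * relpages) - relallfrozen;
    return shortfall > 0 ? shortfall : 0;
}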

One thing I don't know fully how to handle is how to ensure that we try to
freeze a different part of the table each vacuum. I guess we could store a
page number in pgstats?

This would help address the "cliff" issue of reaching
autovacuum_freeze_max_age. What it would *not* address, on its own, is the
number of times we rewrite pages.

I can guess at a few ways to heuristically identify when tables are "append
mostly" from vacuum's view (a table can be update heavy, but very localized to
recent rows, and still be append mostly from vacuum's view). There's obvious
cases, e.g. when there are way more inserts than dead rows. But other cases
are harder.

Also, even if we did have that sort of information, the idea only works if
the pages that we freeze sooner are ones that we're not about to update or
delete again, and we don't have any idea what is likely there.

Perhaps we could use something like
(age(relfrozenxid) - age(newest_xid_on_page)) / age(relfrozenxid)
as a heuristic?
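
Expressed as a tiny illustrative function (values near 0 mean the page's
newest XID is nearly as old as relfrozenxid, i.e. the page has been
untouched for a long time; values near 1 mean it was modified recently):

#include <stdint.h>

/*
 * Illustrative only: 0.0 means the page's newest XID is about as old as
 * relfrozenxid (long untouched); 1.0 means it was modified just now.
 */
static double
page_recency(uint32_t relfrozenxid_age, uint32_t newest_xid_on_page_age)
{
    if (relfrozenxid_age == 0)
        return 0.0;
    return (double) (relfrozenxid_age - newest_xid_on_page_age)
         / (double) relfrozenxid_age;
}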

I have a gut feeling that we should somehow collect/use statistics about the
number of frozen pages, marked as such by the last (or recent?) vacuum, that
had to be "defrosted" by backends. But I don't quite know how to yet. I think
we could collect statistics about that by storing the LSN of the last vacuum
in the shared stats, and incrementing that counter when defrosting.

A lot of things like that would work a whole lot better if we had statistics
that take older data into account, but weigh it less than more recent
data. But that's hard/expensive to collect.

Greetings,

Andres Freund

#129Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#127)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote:

On Thu, Jan 26, 2023 at 12:45 PM Andres Freund <andres@anarazel.de> wrote:

Most of the overhead of FREEZE WAL records (with freeze plan
deduplication and page-level freezing in) is generic WAL record header
overhead. Your recent adversarial test case is going to choke on that,
too. At least if you set checkpoint_timeout to 1 minute again.

I don't quite follow. What do you mean with "record header overhead"? Unless
that includes FPIs, I don't think that's that commonly true?

Even if there are no directly observable FPIs, there is still extra
WAL, which can cause FPIs indirectly, just by making checkpoints more
frequent. I feel ridiculous even having to explain this to you.

What does that have to do with "generic WAL record overhead"?

I also don't really see how that is responsive to anything else in my
email. That's just as true for the current gating condition (the issuance of
an FPI during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the
fpi_before != pgWalUsage.wal_fpi
with
records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)
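
As a self-contained model of that replacement (WouldIssueFpi is just the
placeholder named above, reduced here to a boolean input):

#include <stdbool.h>
#include <stdint.h>

/*
 * Freeze when required, when nothing needs freezing anyway, or when the
 * page can be set all-frozen, pruning wrote *some* WAL record, and
 * freezing would not force a new full-page image.
 */
static bool
freeze_page_now_alt(bool freeze_required, int tuples_frozen,
                    bool all_visible, bool all_frozen,
                    int64_t records_before, int64_t records_after,
                    bool would_issue_fpi)
{
    return freeze_required ||
           tuples_frozen == 0 ||
           (all_visible && all_frozen &&
            records_before != records_after && !would_issue_fpi);
}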

The problematic case I am talking about is when we do *not* emit a WAL record
during pruning (because there's nothing to prune), but want to freeze the
table. If you don't log an FPI, the remaining big overhead is that increasing
the LSN on the page will often cause an XLogFlush() when writing out the
buffer.

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.

That's flagrantly intellectually dishonest. Sure, it made it easier to
reproduce. But that's not all it did!

You had *lots* of specific numbers and technical details in your first
email, such as "Time for vacuuming goes up to ~5x. WAL volume to
~9x.". But you did not feel that it was worth bothering with details
like having set checkpoint_timeout to 1 minute, which is a setting
that nobody uses, and obviously had a multiplicative effect. That
detail was unimportant. I had to drag it out of you!

The multiples were for checkpoint_timeout=5min, with
WHERE ts < now() - '250s' instead of WHERE ts < now() - '120s'

I started out with checkpoint_timeout=1min, as I wanted to quickly test my
theory. Then I increased checkpoint_timeout back to 5min, adjusted the query
to some randomly guessed value. Happened to get nearly the same results.

I then experimented more with '1min', because it's less annoying to have to
wait for 120s until deletions start, than to wait for 250s. Because it's
quicker to run I thought I'd share the less resource intensive version. A
mistake as I now realize.

This wasn't intended as a carefully designed benchmark, or anything. It was a
quick proof for a problem that I found obvious. And it's not something worth
testing carefully - e.g. the constants in the test are actually quite hardware
specific, because the insert/seconds rate is very machine specific, and it's
completely unnecessarily hardware intensive due to the use of single-row
inserts, instead of batched operations. It's just a POC.

You basically found a way to add WAL overhead to a system/workload
that is already in a write amplification vicious cycle, with latent
tipping point type behavior.

There is a practical point here, that is equally obvious, and yet
somehow still needs to be said: benchmarks like that one are basically
completely free of useful information. If we can't agree on how to
assess such things in general, then what can we agree on when it comes
to what should be done about it, what trade-off to make, when it comes
to any similar question?

It's not at all free of useful information. It reproduces a problem I
predicted repeatedly, that others in the discussion also wondered about, that
you refused to acknowledge or address.

It's not a good benchmark - I completely agree with that much. It was not
designed to carefully benchmark different settings or such. It was designed to
show a problem. And it does that.

You're right, it makes sense to consider whether we'll emit a
XLOG_HEAP2_VISIBLE anyway.

As written the page-level freezing FPI mechanism probably doesn't
really stand to benefit much from doing that. Either checksums are
disabled and it's just a hint, or they're enabled and there is a very
high chance that we'll get an FPI inside lazy_scan_prune rather than
right after it is called, when PD_ALL_VISIBLE is set.

I think it might be useful with logged hint bits; consider cases where all the
tuples on the page were already fully hinted. That's not uncommon, I think?

A less aggressive version would be to check if any WAL records were emitted
during heap_page_prune() (instead of FPIs) and whether we'd emit an FPI if we
modified the page again. Similar to what we do now, except not requiring an
FPI to have been emitted.

Also way more aggressive. Not nearly enough on its own.

In which cases will it be problematically more aggressive?

If we emitted a WAL record during pruning we've already set the LSN of the
page to a very recent LSN. We know the page is dirty. Thus we'll already
trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.

You seem to be talking about this as if the only thing that could
matter is the immediate FPI -- the first order effects -- and not any
second order effects.

* Freeze the page when heap_prepare_freeze_tuple indicates that at least
* one XID/MXID from before FreezeLimit/MultiXactCutoff is present. Also
* freeze when pruning generated an FPI, if doing so means that we set the
* page all-frozen afterwards (might not happen until final heap pass).
*/
if (pagefrz.freeze_required || tuples_frozen == 0 ||
    (prunestate->all_visible && prunestate->all_frozen &&
     fpi_before != pgWalUsage.wal_fpi))

That's just as true for this.

What I'd like to know is why the second order effects of the above are lesser
than for
if (pagefrz.freeze_required || tuples_frozen == 0 ||
    (prunestate->all_visible && prunestate->all_frozen &&
     records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)))

You certainly didn't get to 9x extra WAL
overhead by controlling for that before. Should I take it that you've
decided to assess these things more sensibly now? Out of curiosity:
why the change of heart?

Dude.

What would the point have been to invest a lot of time in a repro for a
predicted problem? It's a problem repro, not a carefully designed benchmark.

In any case this seems like an odd thing for you to say, having
eviscerated a patch that really just made the same behavior trigger
independently of FPIs in some tables, controlled via a GUC.

The behaviour I criticized was causing a torrent of FPIs and additional
dirtying of pages. My proposed replacement for the current FPI check doesn't,
because a) it only triggers when we wrote a WAL record, and b) it doesn't
trigger if we would write an FPI.

It increases the WAL written in many important cases that
vacuum_freeze_strategy_threshold avoided. Sure, it did have some
problems, but the general idea of adding some high level
context/strategies seems essential to me.

I was discussing changing the conditions for the "opportunistic pruning"
logic, not a replacement for the eager freezing strategy.

Greetings,

Andres Freund

In reply to: Andres Freund (#129)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 6:37 PM Andres Freund <andres@anarazel.de> wrote:

I also don't really see how that is responsive to anything else in my
email. That's just as true for the current gating condition (the issuance of
an FPI during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the
fpi_before != pgWalUsage.wal_fpi
with
records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)

I understand that. What I'm saying is that that's going to create a
huge problem of its own, unless you separately account for that
problem.

The simplest and obvious example is something like a pgbench_tellers
table. VACUUM will generally run fast enough relative to the workload
that it will set some of those pages all-visible. Now it's going to
freeze them, too. Arguably it shouldn't even be setting the pages
all-visible, but now you make that existing problem much worse.

The important point is that there doesn't seem to be any good way
around thinking about the table as a whole if you're going to freeze
speculatively. This is not the same dynamic as we see with the FPI
thing IMV -- that's not nearly so speculative as what you're talking
about, since it is speculative in roughly the same sense that eager
freezing was speculative (hence the suggestion that something like
vacuum_freeze_strategy_threshold could have a role to play).

The FPI thing is mostly about the cost now versus the cost later on.
You're gambling that you won't get another FPI later on if you freeze
now. But the cost of a second FPI later on is so much higher than the
added cost of freezing now that that's a very favorable bet, that we
can afford to "lose" many times while still coming out ahead overall.
And even when we lose, you generally still won't have been completely
wrong -- even then there generally will indeed be a second FPI later
on for the same page, to go with everything else. This makes the
wasted freezing even less significant, on a comparative basis!

It's also likely true that an FPI in lazy_scan_prune is a much
stronger signal, but I think that the important dynamic is that we're
reasoning about "costs now vs costs later on". The asymmetry is really
important.

--
Peter Geoghegan

#131Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#130)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 19:01:03 -0800, Peter Geoghegan wrote:

On Thu, Jan 26, 2023 at 6:37 PM Andres Freund <andres@anarazel.de> wrote:

I also don't really see how that is responsive to anything else in my
email. That's just as true for the current gating condition (the issuance of
an FPI during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the
fpi_before != pgWalUsage.wal_fpi
with
records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)

I understand that. What I'm saying is that that's going to create a
huge problem of its own, unless you separately account for that
problem.

The simplest and obvious example is something like a pgbench_tellers
table. VACUUM will generally run fast enough relative to the workload
that it will set some of those pages all-visible. Now it's going to
freeze them, too. Arguably it shouldn't even be setting the pages
all-visible, but now you make that existing problem much worse.

So the benefit of the FPI condition is that it indicates that the page hasn't
been updated all that recently, because, after all, a checkpoint has happened
since? If that's the intention, it needs a huge honking comment - at least I
can't read that out of:

Also freeze when pruning generated an FPI, if doing so means that we set the
page all-frozen afterwards (might not happen until final heap pass).

It doesn't seem like a great proxy to me. ISTM that this means that how
aggressive vacuum is about opportunistically freezing pages depends on config
variables like checkpoint_timeout & max_wal_size (less common opportunistic
freezing), full_page_writes & use of unlogged tables (no opportunistic
freezing), and the largely random scheduling of autovac workers.

I can see it making a difference for pgbench_tellers, but it's a pretty small
difference in overall WAL volume. I can think of more adverse workloads though
- but even there the difference seems not huge, and not predictably
reached. Due to the freeze plan stuff you added, the amount of WAL for
freezing a page is pretty darn small compared to the amount of WAL needed to
fill a page with non-frozen tuples.

That's not to say we shouldn't reduce the risk - I agree that both the "any
fpi" and the "any record" condition can have adverse effects!

However, an already dirty page getting frozen is also the one case where
freezing won't have a meaningful write amplification effect. So I think it's
worth spending effort figuring out how we can make freezing in that situation
have only unlikely and small downsides.

The cases with downsides are tables that are very heavily updated throughout,
where the page is going to be defrosted again almost immediately. As you say,
the all-visible marking has a similar problem.

Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
    FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.
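
Spelled out a bit more concretely - purely a sketch, where lsn_of_last_vacuum
is hypothetical (nothing tracks it today) and the 10% factor is just a
placeholder:

static bool
page_is_cold_enough_to_freeze(Page page, XLogRecPtr lsn_of_last_vacuum)
{
    XLogRecPtr  insert_lsn = GetXLogInsertRecPtr();
    uint64      distance = insert_lsn - lsn_of_last_vacuum;

    /*
     * Freeze an already-dirty page needing no FPI only if it was last
     * modified more than 10% of the inter-VACUUM LSN distance ago.
     */
    return PageGetLSN(page) <= insert_lsn - distance / 10;
}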

The important point is that there doesn't seem to be any good way
around thinking about the table as a whole if you're going to freeze
speculatively. This is not the same dynamic as we see with the FPI
thing IMV -- that's not nearly so speculative as what you're talking
about, since it is speculative in roughly the same sense that eager
freezing was speculative (hence the suggestion that something like
vacuum_freeze_strategy_threshold could have a role to play).

I don't think the speculation is that fundamentally different - a heavily
updated table with a bit of a historic, non-changing portion, makes
vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
record" or "any fpi".

The FPI thing is mostly about the cost now versus the cost later on.
You're gambling that you won't get another FPI later on if you freeze
now. But the cost of a second FPI later on is so much higher than the
added cost of freezing now that that's a very favorable bet, that we
can afford to "lose" many times while still coming out ahead overall.

Agreed. And not just avoiding FPIs, avoiding another dirtying of the page! The
latter part is especially huge IMO. Depending on s_b size it can also avoid
another *read* of the page...

And even when we lose, you generally still won't have been completely
wrong -- even then there generally will indeed be a second FPI later
on for the same page, to go with everything else. This makes the
wasted freezing even less significant, on a comparative basis!

This is precisely why I think that we can afford to be quite aggressive about
freezing already dirty pages...

Greetings,

Andres Freund

In reply to: Andres Freund (#131)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 9:58 PM Andres Freund <andres@anarazel.de> wrote:

It doesn't seem like a great proxy to me. ISTM that this means that how
aggressive vacuum is about opportunistically freezing pages depends on config
variables like checkpoint_timeout & max_wal_size (less common opportunistic
freezing), full_page_writes & use of unlogged tables (no opportunistic
freezing), and the largely random scheduling of autovac workers.

The FPI thing was originally supposed to complement the freezing
strategies stuff, and possibly other rules that live in
lazy_scan_prune. Obviously you can freeze a page by following any rule
that you care to invent -- you can decide by calling random(). Two
rules can coexist during the same VACUUM (actually, they do already).

Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
    FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.

It seems to me that you're reinventing something akin to eager
freezing strategy here. At least that's how I define it, since now
you're bringing the high level context into it; what happens with the
table, with VACUUM operations, and so on. Obviously this requires
tracking the metadata that you suppose will be available in some way
or other, in particular things like lsn_of_last_vacuum.

What about unlogged/temporary tables? The obvious thing to do there is
what I did in the patch that was reverted (freeze whenever the page
will thereby become all-frozen), and forget about LSNs. But you have
already objected to that part, specifically.

BTW, you still haven't changed the fact that you get rather different
behavior with checksums/wal_log_hints. I think that that's good, but
you didn't seem to think so.

I don't think the speculation is that fundamentally different - a heavily
updated table with a bit of a historic, non-changing portion, makes
vacuum_freeze_strategy_threshold freeze way more aggressively than either "any
record" or "any fpi".

That's true. The point I was making is that both this proposal and
eager freezing are based on some kind of high level picture of the
needs of the table, based on high level metadata. To me that's the
defining characteristic.

And even when we lose, you generally still won't have been completely
wrong -- even then there generally will indeed be a second FPI later
on for the same page, to go with everything else. This makes the
wasted freezing even less significant, on a comparative basis!

This is precisely why I think that we can afford to be quite aggressive about
freezing already dirty pages...

I'm beginning to warm to this idea, now that I understand it a little better.

--
Peter Geoghegan

#133Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#132)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-26 23:11:41 -0800, Peter Geoghegan wrote:

Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
    FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.

It seems to me that you're reinventing something akin to eager
freezing strategy here. At least that's how I define it, since now
you're bringing the high level context into it; what happens with the
table, with VACUUM operations, and so on. Obviously this requires
tracking the metadata that you suppose will be available in some way
or other, in particular things like lsn_of_last_vacuum.

I agree with bringing high-level context into the decision about whether to
freeze aggressively - my problem with the eager freezing strategy patch isn't
that it did that too much, it's that it didn't do it enough.

But I also don't think what I describe above is really comparable to "table
level" eager freezing though - the potential worst case overhead is a small
fraction of the WAL volume, and there's zero increase in data write volume. I
suspect the absolute worst case of "always freeze dirty pages" is when a
single tuple on the page gets updated immediately after every time we freeze
the page - a single tuple is where the freeze record is the least space
efficient. The smallest update is about the same size as the smallest freeze
record. For that to amount to a large WAL increase you'd need a crazy rate of such
updates interspersed with vacuums. In slightly more realistic cases (i.e. not
column-less tuples that constantly get updated and freezing happening all the
time) you end up with a reasonably small WAL rate overhead.

That worst case of "freeze dirty" is bad enough to spend some brain and
compute cycles to prevent. But if we don't always get it right in some
workload, it's not *awful*.

The worst case of the "eager freeze strategy" is a lot larger - it's probably
something like updating one narrow tuple on every page, once per checkpoint, so
that each freeze generates an FPI. I think that results in a max overhead of
2x for data writes, and about 150x for WAL volume (roughly the ratio of an
~8kB FPI to a minimal update record of a few dozen bytes). Obviously that's a
pointless workload, but I do think that analyzing the "outer boundaries" of the
regression something can cause can be helpful.

I think one way forward with the eager strategy approach would be to have a
very narrow gating condition for now, and then incrementally expand it in
later releases.

One use-case where the eager strategy is particularly useful is
[nearly-]append-only tables - and it's also the one workload that's reasonably
easy to detect using stats. Maybe something like
(dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
or so.
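
As a sketch only - the counter names are illustrative, since a
dead-tuples-since-last-vacuum counter as such would have to be added to
pgstat:

static bool
table_looks_append_only(double dead_since_last_vacuum,
                        double inserts_since_last_vacuum)
{
    /* no inserts since the last VACUUM: nothing to base a decision on */
    if (inserts_since_last_vacuum <= 0)
        return false;

    /*
     * Treat the table as [nearly-]append-only - and so a candidate for
     * eager freezing - if fewer than 5% of the rows inserted since the
     * last VACUUM have since become dead.
     */
    return (dead_since_last_vacuum / inserts_since_last_vacuum) < 0.05;
}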

That'll definitely leave out loads of workloads where eager freezing would be
useful - but are there semi-reasonable workloads where it'll hurt badly? I
don't *think* so.

What about unlogged/temporary tables? The obvious thing to do there is
what I did in the patch that was reverted (freeze whenever the page
will thereby become all-frozen), and forget about LSNs. But you have
already objected to that part, specifically.

My main concern about that is the data write amplification it could cause when
the page is clean when we start freezing. But I can't see a large potential
downside to always freezing unlogged/temp tables when the page is already
dirty.

BTW, you still haven't changed the fact that you get rather different
behavior with checksums/wal_log_hints. I think that that's good, but
you didn't seem to think so.

I think that, if we had something like the recency test I was talking about,
we could afford to always freeze when the page is already dirty and not very
recently modified. I.e. not even insist on a WAL record having been generated
during pruning/HTSV. But I need to think through the dangers of that more.

Greetings,

Andres Freund

#134Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#133)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-27 00:51:59 -0800, Andres Freund wrote:

One use-case where the eager strategy is particularly useful is
[nearly-]append-only tables - and it's also the one workload that's reasonably
easy to detect using stats. Maybe something like
(dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
or so.

That'll definitely leave out loads of workloads where eager freezing would be
useful - but are there semi-reasonable workloads where it'll hurt badly? I
don't *think* so.

That 0.05 could be a GUC + relopt combo, which'd allow users to opt tables
with known usage patterns into always using eager freezing.

Greetings,

Andres Freund

#135Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#126)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 4:51 PM Peter Geoghegan <pg@bowt.ie> wrote:

This is the kind of remark that makes me think that you don't get it.

The most influential OLTP benchmark of all time is TPC-C, which has
exactly this problem. In spades -- it's enormously disruptive. Which
is one reason why I used it as a showcase for a lot of this work. Plus
practical experience (like the Heroku database in the blog post I
linked to) fully agrees with that benchmark, as far as this stuff goes
-- that was also a busy OLTP database.

Online transaction processing involves transactions. Right? There is presumably
some kind of ledger, some kind of orders table. Naturally these have
entries that age out fairly predictably. After a while, almost all the
data is cold data. It is usually about that simple.

One of the key strengths of systems like Postgres is the ability to
inexpensively store a relatively large amount of data that has just
about zero chance of being read, let alone modified. While at the same
time having decent OLTP performance for the hot data. Not nearly as
good as an in-memory system, mind you -- and yet in-memory systems
remain largely a niche thing.

I think it's interesting that TPC-C suffers from the kind of problem
that your patch was intended to address. I hadn't considered that. But
I do not think it detracts from the basic point I was making, which is
that you need to think about the downsides of your patch, not just the
upsides.

If you want to argue that there is *no* OLTP workload that will be
harmed by freezing as aggressively as possible, then that would be a
good argument in favor of your patch, because it would be arguing that
the downside simply doesn't exist, at least for OLTP workloads. The
fact that you can think of *one particular* OLTP workload that can
benefit from the patch is just doubling down on the "my patch has an
upside" argument, which literally no one is disputing.

I don't think you can make such an argument stick, though. OLTP
workloads come in all shapes and sizes. It's pretty common to have
tables where the application inserts a bunch of data, updates it over
and over again, truncates the table, and starts over. In such a
case, aggressive freezing has to be a loss, because no freezing is
ever needed. It's also surprisingly common to have tables where a
bunch of data is inserted and then, after a bit of processing, a bunch
of rows are updated exactly once, after which the data is not modified
any further. In those kinds of cases, aggressive freezing is a great
idea if it happens after that round of updates but a poor idea if it
happens before that round of updates.

It's also pretty common to have cases where portions of the table
become very hot, get a lot of updates for a while, and then that part
of the table becomes cool and some other part of the table becomes
very hot for a while. I think it's possible that aggressive freezing
might do OK in such environments, actually. It will be a negative if
we aggressively freeze the part of the table that's currently hot, but
I think typically tables that have this access pattern are quite big,
so VACUUM isn't going to sweep through the table all that often. It
will probably freeze a lot more data-that-was-hot-a-bit-ago than it
will freeze data-that-is-hot-this-very-minute. Then again, maybe that
would happen without the patch, too. Maybe this kind of case is a wash
for your patch? I don't know.

Whatever you think of these examples, I don't see how it can be right
to suppose that *in general* freezing very aggressively has no
downsides. If that were true, then we probably wouldn't have
vacuum_freeze_min_age at all. We would always just freeze everything
ASAP. I mean, you could theorize that whoever invented that GUC is an
idiot and that they had absolutely no good reason for introducing it,
but that seems pretty ridiculous. Someone put guards against
overly-aggressive freezing into the system *for a reason* and if you
just go rip them all out, you're going to reintroduce the problems
against which they were intended to guard.

--
Robert Haas
EDB: http://www.enterprisedb.com

#136Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#127)
Re: New strategies for freezing, advancing relfrozenxid early

On Thu, Jan 26, 2023 at 6:37 PM Peter Geoghegan <pg@bowt.ie> wrote:

I don't see what your reference to checkpoint timeout is about here?

Also, as I mentioned before, the problem isn't specific to checkpoint_timeout
= 1min. It just makes it cheaper to reproduce.

That's flagrantly intellectually dishonest.

This kind of ad hominem attack has no place on this mailing list, or
anywhere in the PostgreSQL community.

If you think there's a problem with Andres's test case, or his
analysis of it, you can talk about those problems without accusing him
of intellectual dishonesty.

I don't see anything to indicate that he was being intentionally
dishonest, either. At most he was mistaken. More than likely, not even
that.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#135)
Re: New strategies for freezing, advancing relfrozenxid early

On Fri, Jan 27, 2023 at 6:48 AM Robert Haas <robertmhaas@gmail.com> wrote:

One of the key strengths of systems like Postgres is the ability to
inexpensively store a relatively large amount of data that has just
about zero chance of being read, let alone modified. While at the same
time having decent OLTP performance for the hot data. Not nearly as
good as an in-memory system, mind you -- and yet in-memory systems
remain largely a niche thing.

I think it's interesting that TPC-C suffers from the kind of problem
that your patch was intended to address. I hadn't considered that. But
I do not think it detracts from the basic point I was making, which is
that you need to think about the downsides of your patch, not just the
upsides.

If you want to argue that there is *no* OLTP workload that will be
harmed by freezing as aggressively as possible, then that would be a
good argument in favor of your patch, because it would be arguing that
the downside simply doesn't exist, at least for OLTP workloads. The
fact that you can think of *one particular* OLTP workload that can
benefit from the patch is just doubling down on the "my patch has an
upside" argument, which literally no one is disputing.

You've treated me to another multi-paragraph talking-down, as if I was
still clinging to my original position, which is of course not the
case. I've literally said I'm done with VACUUM for good, and that I
just want to put a line under this. Yet you still persist in doing
this sort of thing. I'm not fighting you, I'm not fighting Andres.

I was making a point about the need to do something in this area in
general. That's all.

--
Peter Geoghegan

#138Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#131)
Re: New strategies for freezing, advancing relfrozenxid early

On Fri, Jan 27, 2023 at 12:58 AM Andres Freund <andres@anarazel.de> wrote:

Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
    FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.

I think this might not be quite the right idea for a couple of reasons.

First, suppose that the table is being processed just by autovacuum
(no manual VACUUM operations) and that the rate of WAL generation is
pretty even, so that LSN age is a good proxy for time. If autovacuum
processes the table once per hour, this will freeze if it hasn't been
updated in the last six minutes (10% of the hour between vacuums). That
sounds good. But if autovacuum
processes the table once per day, then this will freeze if it hasn't
been updated in 2.4 hours. That might be OK, but it sounds a little on
the long side. If autovacuum processes the table once per week, then
this will freeze if it hasn't been updated in 16.8 hours. That sounds
too conservative. Conversely, if autovacuum processes the table every
3 minutes, then this will freeze the data if it hasn't been updated in
the last 18 seconds, which sounds awfully aggressive. Maybe I'm wrong
here, but I feel like the absolute amount of wall-clock time we're
talking about here probably matters to some degree. I'm not sure
whether a strict time-based threshold like, say, 10 minutes would be a
good idea, leaving aside the difficulties of implementation. It might
be right to think that if the table is being vacuumed a lot, freezing
more aggressively is smart, and if it's being vacuumed infrequently,
freezing less aggressively is smart, because if the table has enough
activity that it's being vacuumed frequently, that might also be a
sign that we need to freeze more aggressively in order to avoid having
things go sideways. However, I'm not completely sure about that, and I
think it's possible that we need some guardrails to avoid going too
far in either direction.

Second, and more seriously, I think this would, in some circumstances,
lead to tremendously unstable behavior. Suppose somebody does a bunch
of work on a table and then they're like "oh, we should clean up,
VACUUM" and it completes quickly because it's been a while since the
last vacuum and so it doesn't freeze much. Then, for whatever reason,
they decide to run it one more time, and it goes bananas and starts
freezing all kinds of stuff because the LSN distance since the last
vacuum is basically zero. Or equally, you run a manual VACUUM, and you
get completely different behavior depending on how long it's been
since the last autovacuum ran.

In some ways, I think this proposal has many of the same problems as
vacuum_freeze_min_age. In both cases, the instinct is that we should
use something on the page to let us know how long it's been since the
page was modified, and proceed on the theory that if the page has not
been modified recently, it probably isn't about to be modified again.
That's a reasonable instinct, but the rate of XID advancement and the
rate of LSN advancement are both highly variable, even on a system
that's always under some load.

--
Robert Haas
EDB: http://www.enterprisedb.com

#139Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#138)
Re: New strategies for freezing, advancing relfrozenxid early

Hi,

On 2023-01-27 12:53:58 -0500, Robert Haas wrote:

On Fri, Jan 27, 2023 at 12:58 AM Andres Freund <andres@anarazel.de> wrote:

Essentially the "any fpi" logic is a very coarse grained way of using the page
LSN as a measurement. As I said, I don't think "has a checkpoint occurred
since the last write" is a good metric to avoid unnecessary freezing - it's
too coarse. But I think using the LSN is the right thought. What about
something like

lsn_threshold = insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1
if (/* other conds */ && PageGetLSN(page) <= lsn_threshold)
    FreezeMe();

I probably got some details wrong, what I am going for with lsn_threshold is
that we'd freeze an already dirty page if it's not been updated within 10% of
the LSN distance to the last VACUUM.

I think this might not be quite the right idea for a couple of reasons.

It's definitely not perfect. If we had an approximate LSN->time map as
general infrastructure, we could do a lot better. I think it'd be reasonably
easy to maintain that in the autovacuum launcher, for example.
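
To sketch what I mean - all hypothetical infrastructure, with the launcher
sampling the insert LSN once per naptime or so (in reality the array would
have to live in shared memory):

#define LSNTIME_SAMPLES 128

typedef struct LsnTimeSample
{
    XLogRecPtr  lsn;
    TimestampTz time;
} LsnTimeSample;

static LsnTimeSample lsn_time_map[LSNTIME_SAMPLES];
static int  lsn_time_next = 0;

/* autovacuum launcher: call periodically, e.g. once per naptime */
static void
lsn_time_sample(void)
{
    lsn_time_map[lsn_time_next].lsn = GetXLogInsertRecPtr();
    lsn_time_map[lsn_time_next].time = GetCurrentTimestamp();
    lsn_time_next = (lsn_time_next + 1) % LSNTIME_SAMPLES;
}

/*
 * Approximate "what was the insert LSN at time 'when'?" by returning the
 * smallest sampled LSN whose timestamp is >= when.  InvalidXLogRecPtr if
 * no sample is recent enough.
 */
static XLogRecPtr
lsn_time_lookup(TimestampTz when)
{
    XLogRecPtr  result = InvalidXLogRecPtr;

    for (int i = 0; i < LSNTIME_SAMPLES; i++)
    {
        if (lsn_time_map[i].time >= when &&
            (XLogRecPtrIsInvalid(result) || lsn_time_map[i].lsn < result))
            result = lsn_time_map[i].lsn;
    }

    return result;
}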

One thing worth calling out here, because it's not obvious from the code
quoted above in isolation, is that what I was trying to refine here was the
decision when to perform opportunistic freezing *of already dirty pages that
do not require an FPI*.

So all that we need to prevent here is freezing very hotly updated data, where
the WAL overhead of the freeze records would be noticeable, because we
constantly VACUUM, due to the high turnover.

First, suppose that the table is being processed just by autovacuum
(no manual VACUUM operations) and that the rate of WAL generation is
pretty even, so that LSN age is a good proxy for time. If autovacuum
processes the table once per hour, this will freeze if it hasn't been
updated in the last six minutes. That sounds good. But if autovacuum
processes the table once per day, then this will freeze if it hasn't
been updated in 2.4 hours. That might be OK, but it sounds a little on
the long side.

You're right. I was thinking of the "lsn_since_last_vacuum" because I was
postulating it being useful elsewhere in the thread (but for eager strategy
logic) - but here that's really not very relevant.

Given that we're dealing with already dirty pages not requiring an FPI, I
think a much better "reference LSN" would be the LSN of the last checkpoint
(LSN of the last checkpoint record, not the current REDO pointer).
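
I.e., roughly the following, where GetLastCheckpointRecPtr() is made up -
today only the redo pointer is conveniently exposed (via GetRedoRecPtr()),
which is precisely not what I want here:

static bool
page_unmodified_since_last_checkpoint(Page page)
{
    /* hypothetical accessor for the last checkpoint record's LSN */
    return PageGetLSN(page) <= GetLastCheckpointRecPtr();
}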

Second, and more seriously, I think this would, in some circumstances,
lead to tremendously unstable behavior. Suppose somebody does a bunch
of work on a table and then they're like "oh, we should clean up,
VACUUM" and it completes quickly because it's been a while since the
last vacuum and so it doesn't freeze much. Then, for whatever reason,
they decide to run it one more time, and it goes bananas and starts
freezing all kinds of stuff because the LSN distance since the last
vacuum is basically zero. Or equally, you run a manual VACUUM, and you
get completely different behavior depending on how long it's been
since the last autovacuum ran.

I don't think this quite applies to the scenario at hand, because it's
restricted to already dirty pages. And the max increased overhead is also
small due to that - so occasionally getting it wrong isn't that impactful.

Greetings,

Andres Freund

In reply to: Andres Freund (#133)
Re: New strategies for freezing, advancing relfrozenxid early

On Fri, Jan 27, 2023 at 12:52 AM Andres Freund <andres@anarazel.de> wrote:

I agree with bringing high-level context into the decision about whether to
freeze aggressively - my problem with the eager freezing strategy patch isn't
that it did that too much, it's that it didn't do it enough.

But I also don't think what I describe above is really comparable to "table
level" eager freezing though - the potential worst case overhead is a small
fraction of the WAL volume, and there's zero increase in data write volume.

All I meant was that I initially thought that you were trying to
replace the FPI thing with something at the same level of ambition,
that could work in a low context way. But I now see that you're
actually talking about something quite a bit more ambitious for
Postgres 16, which is structurally similar to a freezing strategy,
from a code point of view -- it relies on high-level context for the
VACUUM/table as a whole. I wasn't equating it with the eager freezing
strategy in any other way.

It might also be true that this other thing happens to render the FPI
mechanism redundant. I'm actually not completely sure that it will
just yet. Let me verify my understanding of your proposal:

You mean that we'd take the page LSN before doing anything with the
page, right at the top of lazy_scan_prune, at the same point that
"fpi_before" is initialized currently. Then, if we subsequently
dirtied the page (as determined by its LSN, so as to focus on "dirtied
via WAL logged operation") during pruning, *and* if the "lsn_before"
of the page was from before our cutoff (derived via " lsn_threshold =
insert_lsn - (insert_lsn - lsn_of_last_vacuum) * 0.1" or similar),
*and* if the page is eligible to become all-frozen, then we'd freeze
the page.

That's it, right? It's about pages that *we* (VACUUM) dirtied, and
wrote records and/or FPIs for already?
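
In code, my reading of it amounts to roughly the following (with
lsn_of_last_vacuum and the 10% factor as stand-ins, as before, and lsn_before
being the page LSN captured at the top of lazy_scan_prune):

static bool
would_freeze_under_proposal(Page page, XLogRecPtr lsn_before,
                            XLogRecPtr lsn_of_last_vacuum,
                            bool all_visible, bool all_frozen)
{
    XLogRecPtr  insert_lsn = GetXLogInsertRecPtr();
    XLogRecPtr  lsn_threshold = insert_lsn -
        (insert_lsn - lsn_of_last_vacuum) / 10;

    return all_visible && all_frozen &&
        PageGetLSN(page) != lsn_before &&   /* pruning WAL-logged the page */
        lsn_before <= lsn_threshold;        /* ...but it was cold before that */
}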

I suspect the absolute worst case of "always freeze dirty pages" is when a
single tuple on the page gets updated immediately after every time we freeze
the page - a single tuple is where the freeze record is the least space
efficient. The smallest update is about the same size as the smallest freeze
record. For that to amount to a large WAL increase you'd need a crazy rate of such
updates interspersed with vacuums. In slightly more realistic cases (i.e. not
column-less tuples that constantly get updated and freezing happening all the
time) you end up with a reasonably small WAL rate overhead.

The other thing is that we'd be doing this in situations where we already
know that a VISIBLE record is required, which is comparable in size to
a FREEZE_PAGE record with one tuple/plan (around 64 bytes). The
smallest WAL records are mostly just generic WAL record header
overhead.

Obviously that's a pointless workload, but I do think that
analyzing the "outer boundaries" of the regression something can cause can be
helpful.

I agree about the "outer boundaries" being a useful guide.

I think one way forward with the eager strategy approach would be to have a
very narrow gating condition for now, and then incrementally expand it in
later releases.

One use-case where the eager strategy is particularly useful is
[nearly-]append-only tables - and it's also the one workload that's reasonably
easy to detect using stats. Maybe something like
(dead_tuples_since_last_vacuum / inserts_since_last_vacuum) < 0.05
or so.

That'll definitely leave out loads of workloads where eager freezing would be
useful - but are there semi-reasonable workloads where it'll hurt badly? I
don't *think* so.

I have no further plans to work on eager freezing strategy, or
anything of the sort, in light of recent developments. My goal at this
point is very unambitious: to get the basic page-level freezing work
into a form that makes sense as a standalone thing for Postgres 16. To
put things on a good footing, so that I can permanently bow out of all
work on VACUUM having left everything in good order. That's all.

Now, that might still mean that I'd facilitate future work of this
sort, by getting the right basic structure in place. But my
involvement in any work on freezing or anything of the sort ends here,
both as a patch author and a committer of anybody else's work. I'm
proud of the work I've done on VACUUM, but I'm keen to move on from
it.

What about unlogged/temporary tables? The obvious thing to do there is
what I did in the patch that was reverted (freeze whenever the page
will thereby become all-frozen), and forget about LSNs. But you have
already objected to that part, specifically.

My main concern about that is the data write amplification it could cause when
the page is clean when we start freezing. But I can't see a large potential
downside to always freezing unlogged/temp tables when the page is already
dirty.

But we have to dirty the page anyway, just to set PD_ALL_VISIBLE. That
was always a gating condition. Actually, that may have depended on not
having SKIP_PAGES_THRESHOLD, which the vm snapshot infrastructure
would have removed. That's not happening now, so I may need to
reassess. But even with SKIP_PAGES_THRESHOLD, it should be fine.

BTW, you still haven't changed the fact that you get rather different
behavior with checksums/wal_log_hints. I think that that's good, but
you didn't seem to think so.

I think that, if we had something like the recency test I was talking about,
we could afford to always freeze when the page is already dirty and not very
recently modified. I.e. not even insist on a WAL record having been generated
during pruning/HTSV. But I need to think through the dangers of that more.

Now I'm confused. I thought that the recency test you talked about was
purely to be used to do something a bit like the FPI thing, but using
some high level context. Now I don't know what to think.

--
Peter Geoghegan