Removing more vacuumlazy.c special cases, relfrozenxid optimizations
Attached WIP patch series significantly simplifies the definition of
scanned_pages inside vacuumlazy.c. Apart from making several very
tricky things a lot simpler, and moving more complex code outside of
the big "blkno" loop inside lazy_scan_heap (building on the Postgres
14 work), this refactoring directly facilitates 2 new optimizations
(also in the patch):
1. We now collect LP_DEAD items into the dead_tuples array for all
scanned pages -- even when we cannot get a cleanup lock.
2. We now don't give up on advancing relfrozenxid during a
non-aggressive VACUUM when we happen to be unable to get a cleanup
lock on a heap page.
Both optimizations are much more natural with the refactoring in
place. Especially #2, which can be thought of as making aggressive and
non-aggressive VACUUM behave similarly. Sure, we shouldn't wait for a
cleanup lock in a non-aggressive VACUUM (by definition) -- and we
still don't in the patch (obviously). But why wouldn't we at least
*check* if the page has tuples that need to be frozen in order for us
to advance relfrozenxid? Why give up on advancing relfrozenxid in a
non-aggressive VACUUM when there's no good reason to?
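To make that concrete, here is a minimal sketch of the check I have in
mind (the real version is the new lazy_scan_noprune function in the 0001
patch; the helper name below is made up, and counters, error context and
the LP_DEAD collection are all omitted):

#include "postgres.h"

#include "access/heapam.h"
#include "access/htup_details.h"
#include "access/multixact.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/*
 * Sketch only: with just a pin and share lock on the buffer, determine
 * whether any tuple on the page actually needs freezing under the current
 * cutoffs.  If nothing does, failing to get a cleanup lock is no reason for
 * a non-aggressive VACUUM to give up on advancing relfrozenxid.
 */
static bool
page_blocks_relfrozenxid_advance(Page page, Buffer buf,
                                 TransactionId FreezeLimit,
                                 MultiXactId MultiXactCutoff)
{
    OffsetNumber offnum,
                maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        /* LP_UNUSED, LP_DEAD and LP_REDIRECT items never need freezing */
        if (!ItemIdIsNormal(itemid))
            continue;

        if (heap_tuple_needs_freeze((HeapTupleHeader) PageGetItem(page, itemid),
                                    FreezeLimit, MultiXactCutoff, buf))
            return true;    /* would have to wait for a cleanup lock */
    }

    return false;           /* page doesn't stand in relfrozenxid's way */
}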
See the draft commit messages from the patch series for many more
details on the simplifications I am proposing.
I'm not sure how much value the second optimization has on its own.
But I am sure that the general idea of teaching non-aggressive VACUUM
to be conscious of the value of advancing relfrozenxid is a good one
-- and so #2 is a good start on that work, at least. I've discussed
this idea with Andres (CC'd) a few times before now. Maybe we'll need
another patch that makes VACUUM avoid setting heap pages to
all-visible without also setting them to all-frozen (and freezing as
necessary) in order to really get a benefit. That's because a
non-aggressive VACUUM still won't be able to advance relfrozenxid when
it has skipped over all-visible pages that are not also known to be
all-frozen.
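To illustrate the direction (nothing like this is in the attached
patches -- it's just a sketch of the policy, and the helper name is made
up): whenever a page is found to be all-visible, an "eager freezing"
VACUUM would freeze whatever remains on the page, so that the two VM
bits can always be set together:

#include "postgres.h"

#include "access/visibilitymap.h"

/*
 * Hypothetical policy sketch: choose visibility map flags for a page that
 * VACUUM found to be all-visible.  With eager freezing the page's remaining
 * tuples get frozen first, so both bits can always be set at the same time,
 * keeping the page eligible for relfrozenxid advancement later on.
 */
static uint8
vm_flags_for_all_visible_page(bool all_frozen, bool eager_freezing)
{
    uint8       flags = VISIBILITYMAP_ALL_VISIBLE;

    if (all_frozen || eager_freezing)
        flags |= VISIBILITYMAP_ALL_FROZEN;  /* assumes we froze what was left */

    return flags;
}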
Masahiko (CC'd) has expressed interest in working on opportunistic
freezing. This refactoring patch seems related to that general area,
too. At a high level, to me, this seems like the tuple freezing
equivalent of the Postgres 14 work on bypassing index vacuuming when
there are very few LP_DEAD items (interpret that as 0 LP_DEAD items,
which is close to the truth anyway). There are probably quite a few
interesting opportunities to make VACUUM better by not having such a
sharp distinction between aggressive and non-aggressive VACUUM. Why
should they be so different? A good medium term goal might be to
completely eliminate aggressive VACUUMs.
I have heard many stories about anti-wraparound/aggressive VACUUMs
where the cure (which suddenly made autovacuum workers
non-cancellable) was worse than the disease (not actually much danger
of wraparound failure). For example:
https://www.joyent.com/blog/manta-postmortem-7-27-2015
Yes, this problem report is from 2015, which is before we even had the
freeze map stuff. I still think that the point about aggressive
VACUUMs blocking DDL (leading to chaos) remains valid.
There is another interesting area of future optimization within
VACUUM, that also seems relevant to this patch: the general idea of
*avoiding* pruning during VACUUM, when it just doesn't make sense to
do so -- better to avoid dirtying the page for now. Needlessly pruning
inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
with heap fill factor reduced to 95) will have autovacuums that
*constantly* do it (granted, it may not matter so much there because
VACUUM is unlikely to re-dirty the page anyway). This patch seems
relevant to that area because it recognizes that pruning during VACUUM
is not necessarily special -- a new function called lazy_scan_noprune
may be used instead of lazy_scan_prune (though only when a cleanup
lock cannot be acquired). These pages are nevertheless considered
fully processed by VACUUM (this is perhaps 99% true, so it seems
reasonable to round up to 100% true).
I find it easy to imagine generalizing the same basic idea --
recognizing more ways in which pruning by VACUUM isn't necessarily
better than opportunistic pruning, at the level of each heap page. Of
course we *need* to prune sometimes (e.g., might be necessary to do so
to set the page all-visible in the visibility map), but why bother
when we don't, and when there is no reason to think that it'll help
anyway? Something to think about, at least.
--
Peter Geoghegan
Attachments:
v1-0002-Improve-log_autovacuum_min_duration-output.patch (application/octet-stream)
From 9bd19c1e324c0a796091dce988831b1165f815e8 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 21 Nov 2021 14:47:11 -0800
Subject: [PATCH v1 2/2] Improve log_autovacuum_min_duration output.
Report on visibility map pages skipped by VACUUM, without regard for
whether the pages were all-frozen or just all-visible.
Also report when and how relfrozenxid is advanced by VACUUM, including
non-aggressive VACUUM. Apart from being useful on its own, this might
enable future work that teaches non-aggressive VACUUM to be more
concerned about advancing relfrozenxid sooner rather than later.
---
src/include/commands/vacuum.h | 2 ++
src/backend/access/heap/vacuumlazy.c | 41 ++++++++++++++++++++++------
src/backend/commands/analyze.c | 3 ++
src/backend/commands/vacuum.c | 9 ++++++
4 files changed, 47 insertions(+), 8 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4cfd52eaf..bc625463e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -263,6 +263,8 @@ extern void vac_update_relstats(Relation relation,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated,
+ bool *minmulti_updated,
bool in_outer_xact);
extern void vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 809f59c73..2d23f35a4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -498,6 +498,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
double read_rate,
write_rate;
bool aggressive;
+ bool frozenxid_updated,
+ minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
@@ -703,9 +705,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
+ frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId, false);
+ InvalidTransactionId, InvalidMultiXactId,
+ NULL, NULL, false);
}
else
{
@@ -714,7 +718,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff, false);
+ FreezeLimit, MultiXactCutoff,
+ &frozenxid_updated, &minmulti_updated, false);
}
/*
@@ -744,6 +749,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
+ int32 diff;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -790,16 +796,35 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped using visibility map (%.2f%% of total)\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->frozenskipped_pages);
+ orig_rel_pages - vacrel->scanned_pages,
+ orig_rel_pages > 0 ?
+ 100.0 * (orig_rel_pages - vacrel->scanned_pages) / orig_rel_pages : 0);
appendStringInfo(&buf,
- _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+ _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->new_dead_tuples,
- OldestXmin);
+ (long long) vacrel->new_dead_tuples);
+ diff = (int32) (ReadNextTransactionId() - OldestXmin);
+ appendStringInfo(&buf,
+ _("removal cutoff: oldest xmin was %u, which is now %d xact IDs behind\n"),
+ OldestXmin, diff);
+ if (frozenxid_updated)
+ {
+ diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ appendStringInfo(&buf,
+ _("relfrozenxid: advanced by %d xact IDs, new value: %u\n"),
+ diff, FreezeLimit);
+ }
+ if (minmulti_updated)
+ {
+ diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ appendStringInfo(&buf,
+ _("relminmxid: advanced by %d multixact IDs, new value: %u\n"),
+ diff, MultiXactCutoff);
+ }
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -4011,7 +4036,7 @@ update_index_statistics(LVRelState *vacrel)
false,
InvalidTransactionId,
InvalidMultiXactId,
- false);
+ NULL, NULL, false);
}
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4928702ae..719bf556a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -650,6 +650,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
/* Same for indexes */
@@ -666,6 +667,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
false,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
}
@@ -678,6 +680,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5c4bc15b4..8bd4bd12c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1308,6 +1308,7 @@ vac_update_relstats(Relation relation,
BlockNumber num_all_visible_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated, bool *minmulti_updated,
bool in_outer_xact)
{
Oid relid = RelationGetRelid(relation);
@@ -1383,22 +1384,30 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ if (frozenxid_updated)
+ *frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
/* Similarly for relminmxid */
+ if (minmulti_updated)
+ *minmulti_updated = false;
if (MultiXactIdIsValid(minmulti) &&
pgcform->relminmxid != minmulti &&
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
{
+ if (minmulti_updated)
+ *minmulti_updated = true;
pgcform->relminmxid = minmulti;
dirty = true;
}
--
2.30.2
v1-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (application/octet-stream)
From c5698dc01952e09bd922c120ec691e63f9b890c9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Nov 2021 21:27:06 -0800
Subject: [PATCH v1 1/2] Simplify lazy_scan_heap's handling of scanned pages.
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap. Pages counted by scanned_pages are
now the complement of the pages that are skipped over using the
visibility map. This new definition significantly simplifies quite a
few things.
Now heap relation truncation, visibility map bit setting, tuple counting
(e.g., for pg_class.reltuples), and tuple freezing all share a common
definition of scanned_pages. That makes it possible to remove certain
special cases that never made much sense. We no longer need to track
tupcount_pages separately (see bugfix commit 1914c5ea for details),
since we now always count tuples from pages that are scanned_pages. We
also no longer needlessly distinguish between aggressive and
non-aggressive VACUUM operations when we cannot immediately acquire a
cleanup lock.
Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
relfrozenxid, we now make non-aggressive VACUUMs work just a little
harder in order to make that desirable outcome more likely in practice.
Aggressive VACUUMs have long checked contended pages with only a shared
lock, to avoid needlessly waiting on a cleanup lock (in the common case
where the contended page has no tuples that need to be frozen anyway).
We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
course -- if we did that they'd no longer be non-aggressive. But we now
make the non-aggressive case notice that a failure to acquire a cleanup
lock on one particular heap page does not in itself make it unsafe to
advance relfrozenxid for the whole relation (which is what we usually
see in the aggressive case already).
This new relfrozenxid optimization might not be all that valuable on its
own, but it may still facilitate future work that makes non-aggressive
VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
rather than later. In general it would be useful for non-aggressive
VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
a cleanup lock once or twice if needed). It would also be generally
useful if aggressive VACUUMs were "less aggressive" opportunistically
(e.g. by being responsive to query cancellations when the risk of
wraparound failure is still very low).
We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
so we should not squander any opportunity to do that. Commit 8523492d4e
taught VACUUM to set LP_DEAD line pointers to LP_UNUSED while only
holding an exclusive lock (not a cleanup lock), so we can expect to set
existing LP_DEAD items to LP_UNUSED reliably, even when we cannot
acquire our own cleanup lock at either pass over the heap (unless we opt
to skip index vacuuming, which implies that there is no second pass over
the heap).
Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.
---
src/backend/access/heap/vacuumlazy.c | 792 +++++++++++-------
.../isolation/expected/vacuum-reltuples.out | 2 +-
.../isolation/specs/vacuum-reltuples.spec | 7 +-
3 files changed, 500 insertions(+), 301 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88b9d1f41..809f59c73 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -309,6 +309,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ /* Aggressive VACUUM (scan all unfrozen pages)? */
+ bool aggressive;
/* Wraparound failsafe has been triggered? */
bool failsafe_active;
/* Consider index vacuuming bypass optimization? */
@@ -333,6 +335,8 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
+ /* Are FreezeLimit/MultiXactCutoff still valid? */
+ bool freeze_cutoffs_valid;
/* Error reporting state */
char *relnamespace;
@@ -347,10 +351,8 @@ typedef struct LVRelState
*/
LVDeadTuples *dead_tuples; /* items to vacuum from indexes */
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* number of pages we examined */
- BlockNumber pinskipped_pages; /* # of pages skipped due to a pin */
- BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
- BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -363,6 +365,7 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
+ /* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
int64 new_dead_tuples; /* new estimated total # of dead items in
@@ -402,19 +405,22 @@ static int elevel = -1;
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
- bool aggressive);
+static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params);
+static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool sharelock, Buffer vmbuffer);
static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
GlobalVisState *vistest,
LVPagePruneState *prunestate);
+static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool *hastup);
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, int tupindex, Buffer *vmbuffer);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
- LVRelState *vacrel);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
@@ -491,16 +497,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive; /* should we scan all unfrozen pages? */
- bool scanned_all_unfrozen; /* actually scanned all such pages? */
+ bool aggressive;
+ BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
- TransactionId new_frozen_xid;
- MultiXactId new_min_multi;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -555,6 +559,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel = rel;
vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
&vacrel->indrels);
+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
@@ -599,6 +604,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
+ /* Track if cutoffs became invalid (possible in !aggressive case only) */
+ vacrel->freeze_cutoffs_valid = true;
vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
vacrel->relname = pstrdup(RelationGetRelationName(rel));
@@ -632,30 +639,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
error_context_stack = &errcallback;
/* Do the vacuuming */
- lazy_scan_heap(vacrel, params, aggressive);
+ lazy_scan_heap(vacrel, params);
/* Done with indexes */
vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
/*
- * Compute whether we actually scanned the all unfrozen pages. If we did,
- * we can adjust relfrozenxid and relminmxid.
- *
- * NB: We need to check this before truncating the relation, because that
- * will change ->rel_pages.
- */
- if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
- < vacrel->rel_pages)
- {
- Assert(!aggressive);
- scanned_all_unfrozen = false;
- }
- else
- scanned_all_unfrozen = true;
-
- /*
- * Optionally truncate the relation.
+ * Optionally truncate the relation. But remember the relation size used
+ * by lazy_scan_prune for later first.
*/
+ orig_rel_pages = vacrel->rel_pages;
if (should_attempt_truncation(vacrel))
{
/*
@@ -686,28 +679,43 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
- *
- * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
- * since then we don't know for certain that all tuples have a newer xmin.
*/
- new_rel_pages = vacrel->rel_pages;
+ new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
-
- vac_update_relstats(rel,
- new_rel_pages,
- new_live_tuples,
- new_rel_allvisible,
- vacrel->nindexes > 0,
- new_frozen_xid,
- new_min_multi,
- false);
+ /*
+ * Aggressive VACUUM (which is the same thing as anti-wraparound
+ * autovacuum for most practical purposes) exists so that we'll reliably
+ * advance relfrozenxid and relminmxid sooner or later. But we can often
+ * opportunistically advance them even in a non-aggressive VACUUM.
+ * Consider if that's possible now.
+ *
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_prune, from before a possible relation
+ * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+ */
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+ !vacrel->freeze_cutoffs_valid)
+ {
+ /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+ Assert(!aggressive);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+ /* Can safely advance relfrozen and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+ orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ FreezeLimit, MultiXactCutoff, false);
+ }
/*
* Report results to the stats collector, too.
@@ -736,7 +744,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
- BlockNumber orig_rel_pages;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -783,10 +790,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->pinskipped_pages,
vacrel->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
@@ -794,7 +800,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
(long long) vacrel->new_rel_tuples,
(long long) vacrel->new_dead_tuples,
OldestXmin);
- orig_rel_pages = vacrel->rel_pages + vacrel->pages_removed;
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -891,9 +896,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* reference them have been killed.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, VacuumParams *params)
{
LVDeadTuples *dead_tuples;
+ bool aggressive;
BlockNumber nblocks,
blkno,
next_unskippable_block,
@@ -913,26 +919,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pg_rusage_init(&ru0);
- if (aggressive)
- ereport(elevel,
- (errmsg("aggressively vacuuming \"%s.%s\"",
- vacrel->relnamespace,
- vacrel->relname)));
- else
- ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
- vacrel->relnamespace,
- vacrel->relname)));
-
+ aggressive = vacrel->aggressive;
nblocks = RelationGetNumberOfBlocks(vacrel->rel);
next_unskippable_block = 0;
next_failsafe_block = 0;
next_fsm_block_to_vacuum = 0;
vacrel->rel_pages = nblocks;
vacrel->scanned_pages = 0;
- vacrel->pinskipped_pages = 0;
vacrel->frozenskipped_pages = 0;
- vacrel->tupcount_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
vacrel->nonempty_pages = 0;
@@ -950,6 +944,17 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
+ if (aggressive)
+ ereport(elevel,
+ (errmsg("aggressively vacuuming \"%s.%s\"",
+ vacrel->relnamespace,
+ vacrel->relname)));
+ else
+ ereport(elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ vacrel->relnamespace,
+ vacrel->relname)));
+
/*
* Before beginning scan, check if it's already necessary to apply
* failsafe
@@ -1004,15 +1009,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* just added to that page are necessarily newer than the GlobalXmin we
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- *
- * We will scan the table's last page, at least to the extent of
- * determining whether it has tuples or not, even if it should be skipped
- * according to the above rules; except when we've already determined that
- * it's not worth trying to truncate the table. This avoids having
- * lazy_truncate_heap() take access-exclusive lock on the table to attempt
- * a truncation that just fails immediately because there are tuples in
- * the last page. This is worth avoiding mainly because such a lock must
- * be replayed on any hot standby, where it can be disruptive.
*/
if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
{
@@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;
- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
+ /*
+ * Consider need to skip blocks
+ */
if (blkno == next_unskippable_block)
{
/* Time to advance next_unskippable_block */
@@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current block can be skipped if we've seen a long enough
+ * run of skippable blocks to justify skipping it.
+ *
+ * There is an exception: we will scan the table's last page to
+ * determine whether it has tuples or not, even if it would
+ * otherwise be skipped (unless it's clearly not worth trying to
+ * truncate the table). This avoids having lazy_truncate_heap()
+ * take access-exclusive lock on the table to attempt a truncation
+ * that just fails immediately because there are tuples in the
+ * last page.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks &&
+ !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -1126,12 +1124,22 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
* know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * in this case an approximate answer is still correct.
+ *
+ * (We really don't want to miss out on the opportunity to
+ * advance relfrozenxid in a non-aggressive vacuum, but this
+ * edge case shouldn't make that appreciably less likely in
+ * practice.)
*/
if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
+
+ /*
+ * Otherwise, the page must be at least all-visible if not
+ * all-frozen, so we can set all_visible_according_to_vm = true
+ */
all_visible_according_to_vm = true;
}
@@ -1156,7 +1164,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* Consider if we definitely have enough space to process TIDs on page
* already. If we are close to overrunning the available space for
* dead-tuple TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
+ * this page. Must do this before calling lazy_scan_prune (or before
+ * calling lazy_scan_noprune).
*/
if ((dead_tuples->max_tuples - dead_tuples->num_tuples) < MaxHeapTuplesPerPage &&
dead_tuples->num_tuples > 0)
@@ -1191,7 +1200,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
}
/*
- * Set up visibility map page as needed.
+ * Set up visibility map page as needed, and pin the heap page that
+ * we're going to scan.
*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
+ vacrel->scanned_pages++;
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing in lazy_scan_noprune.
*/
if (!ConditionalLockBufferForCleanup(buf))
{
bool hastup;
- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
+ /* Lock and pin released for us */
+ continue;
+ }
+
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+ {
+ /* No need to wait for cleanup lock for this page */
+ UnlockReleaseBuffer(buf);
+ if (hastup)
+ vacrel->nonempty_pages = blkno + 1;
continue;
}
/*
- * Read the page with share lock to see if any xids on it need to
- * be frozen. If not we just skip the page, after updating our
- * scan statistics. If there are some, we wait for cleanup lock.
- *
- * We could defer the lock request further by remembering the page
- * and coming back to it later, or we could even register
- * ourselves for multiple buffers and then service whichever one
- * is received first. For now, this seems good enough.
- *
- * If we get here with aggressive false, then we're just forcibly
- * checking the page, and so we don't want to insist on getting
- * the lock; we only need to know if the page contains tuples, so
- * that we can update nonempty_pages correctly. It's convenient
- * to use lazy_check_needs_freeze() for both situations, though.
+ * lazy_scan_noprune could not do all required processing without
+ * a cleanup lock. Wait for a cleanup lock, and then proceed to
+ * lazy_scan_prune to perform ordinary pruning and freezing.
*/
- LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
- {
- UnlockReleaseBuffer(buf);
- vacrel->scanned_pages++;
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
- continue;
- }
- if (!aggressive)
- {
- /*
- * Here, we must not advance scanned_pages; that would amount
- * to claiming that the page contains no freezable tuples.
- */
- UnlockReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
- continue;
- }
+ Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /* drop through to normal processing */
}
- /*
- * By here we definitely have enough dead_tuples space for whatever
- * LP_DEAD tids are on this page, we have the visibility map page set
- * up in case we need to set this page's all_visible/all_frozen bit,
- * and we have a super-exclusive lock. Any tuples on this page are
- * now sure to be "counted" by this VACUUM.
- *
- * One last piece of preamble needs to take place before we can prune:
- * we need to consider new and empty pages.
- */
- vacrel->scanned_pages++;
- vacrel->tupcount_pages++;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
- /*
- * All-zeroes pages can be left over if either a backend extends
- * the relation by a single page, but crashes before the newly
- * initialized page has been written out, or when bulk-extending
- * the relation (which creates a number of empty pages at the tail
- * end of the relation, but enters them into the FSM).
- *
- * Note we do not enter the page into the visibilitymap. That has
- * the downside that we repeatedly visit this page in subsequent
- * vacuums, but otherwise we'll never not discover the space on a
- * promoted standby. The harm of repeated checking ought to
- * normally not be too bad - the space usually should be used at
- * some point, otherwise there wouldn't be any regular vacuums.
- *
- * Make sure these pages are in the FSM, to ensure they can be
- * reused. Do that by testing if there's any space recorded for
- * the page. If not, enter it. We do so after releasing the lock
- * on the heap page, the FSM is approximate, after all.
- */
- UnlockReleaseBuffer(buf);
-
- if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
- {
- Size freespace = BLCKSZ - SizeOfPageHeaderData;
-
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- }
- continue;
- }
-
- if (PageIsEmpty(page))
- {
- Size freespace = PageGetHeapFreeSpace(page);
-
- /*
- * Empty pages are always all-visible and all-frozen (note that
- * the same is currently not true for new pages, see above).
- */
- if (!PageIsAllVisible(page))
- {
- START_CRIT_SECTION();
-
- /* mark buffer dirty before writing a WAL record */
- MarkBufferDirty(buf);
-
- /*
- * It's possible that another backend has extended the heap,
- * initialized the page, and then failed to WAL-log the page
- * due to an ERROR. Since heap extension is not WAL-logged,
- * recovery might try to replay our record setting the page
- * all-visible and find that the page isn't initialized, which
- * will cause a PANIC. To prevent that, check whether the
- * page has been previously WAL-logged, and if not, do that
- * now.
- */
- if (RelationNeedsWAL(vacrel->rel) &&
- PageGetLSN(page) == InvalidXLogRecPtr)
- log_newpage_buffer(buf, true);
-
- PageSetAllVisible(page);
- visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
- END_CRIT_SECTION();
- }
-
- UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ /* Lock and pin released for us */
continue;
}
@@ -1566,7 +1472,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
- vacrel->tupcount_pages,
+ vacrel->scanned_pages,
vacrel->live_tuples);
/*
@@ -1640,14 +1546,10 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
(long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
- appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
- "Skipped %u pages due to buffer pins, ",
- vacrel->pinskipped_pages),
- vacrel->pinskipped_pages);
- appendStringInfo(&buf, ngettext("%u frozen page.\n",
- "%u frozen pages.\n",
- vacrel->frozenskipped_pages),
- vacrel->frozenskipped_pages);
+ appendStringInfo(&buf, ngettext("%u page skipped using visibility map.\n",
+ "%u pages skipped using visibility map.\n",
+ vacrel->rel_pages - vacrel->scanned_pages),
+ vacrel->rel_pages - vacrel->scanned_pages);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1661,6 +1563,132 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pfree(buf.data);
}
+/*
+ * lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
+ *
+ * Must call here to handle both new and empty pages before calling
+ * lazy_scan_prune or lazy_scan_noprune, since they're not prepared to deal
+ * with new or empty pages.
+ *
+ * It's necessary to consider new pages as a special case, since the rules for
+ * maintaining the visibility map and FSM with empty pages are a little
+ * different (though new pages can be truncated based on the usual rules).
+ *
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
+ *
+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
+{
+ Size freespace;
+
+ if (PageIsNew(page))
+ {
+ /*
+ * All-zeroes pages can be left over if either a backend extends the
+ * relation by a single page, but crashes before the newly initialized
+ * page has been written out, or when bulk-extending the relation
+ * (which creates a number of empty pages at the tail end of the
+ * relation), and then enters them into the FSM.
+ *
+ * Note we do not enter the page into the visibilitymap. That has the
+ * downside that we repeatedly visit this page in subsequent vacuums,
+ * but otherwise we'll never not discover the space on a promoted
+ * standby. The harm of repeated checking ought to normally not be too
+ * bad - the space usually should be used at some point, otherwise
+ * there wouldn't be any regular vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused.
+ * Do that by testing if there's any space recorded for the page. If
+ * not, enter it. We do so after releasing the lock on the heap page,
+ * the FSM is approximate, after all.
+ */
+ UnlockReleaseBuffer(buf);
+
+ if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
+ {
+ freespace = BLCKSZ - SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ }
+
+ return true;
+ }
+
+ if (PageIsEmpty(page))
+ {
+ /*
+ * It seems likely that caller will always be able to get a cleanup
+ * lock on an empty page. But don't take any chances -- escalate to
+ * an exclusive lock (still don't need a cleanup lock, though).
+ */
+ if (sharelock)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!PageIsEmpty(page))
+ {
+ /* page isn't new or empty -- keep lock and pin for now */
+ return false;
+ }
+ }
+ else
+ {
+ /* Already have a full cleanup lock (which is more than enough) */
+ }
+
+ freespace = PageGetHeapFreeSpace(page);
+
+ /*
+ * Unlike new pages, empty pages are always set all-visible and
+ * all-frozen.
+ */
+ if (!PageIsAllVisible(page))
+ {
+ START_CRIT_SECTION();
+
+ /* mark buffer dirty before writing a WAL record */
+ MarkBufferDirty(buf);
+
+ /*
+ * It's possible that another backend has extended the heap,
+ * initialized the page, and then failed to WAL-log the page due
+ * to an ERROR. Since heap extension is not WAL-logged, recovery
+ * might try to replay our record setting the page all-visible and
+ * find that the page isn't initialized, which will cause a PANIC.
+ * To prevent that, check whether the page has been previously
+ * WAL-logged, and if not, do that now.
+ */
+ if (RelationNeedsWAL(vacrel->rel) &&
+ PageGetLSN(page) == InvalidXLogRecPtr)
+ log_newpage_buffer(buf, true);
+
+ PageSetAllVisible(page);
+ visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+ END_CRIT_SECTION();
+ }
+
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ return true;
+ }
+
+ /* page isn't new or empty -- keep lock and pin */
+ return false;
+}
+
/*
* lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
*
@@ -1767,10 +1795,9 @@ retry:
* LP_DEAD items are processed outside of the loop.
*
* Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
- * count_nondeletable_pages() do it -- they only consider pages empty
- * when they only have LP_UNUSED items, which is important for
- * correctness.
+ * LP_DEAD item here, which is not how count_nondeletable_pages() does
+ * it -- it only considers pages empty/truncatable when they have no
+ * items at all (except LP_UNUSED items).
*
* Our assumption is that any LP_DEAD items we encounter here will
* become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
@@ -2057,6 +2084,236 @@ retry:
vacrel->live_tuples += live_tuples;
}
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
+ *
+ * While pruning isn't performed here, we can at least collect existing
+ * LP_DEAD items into the dead_tuples array for removal from indexes. It's
+ * quite possible that earlier opportunistic pruning left LP_DEAD items
+ * behind, and we shouldn't miss out on an opportunity to make them reusable
+ * (VACUUM alone is capable of cleaning up line pointer bloat like this).
+ * Note that we'll only require an exclusive lock (not a cleanup lock) later
+ * on when we set these LP_DEAD items to LP_UNUSED in lazy_vacuum_heap_page.
+ *
+ * Freezing isn't performed here either. For aggressive VACUUM callers, we
+ * may return false to indicate that a full cleanup lock is required. This is
+ * necessary because pruning requires a cleanup lock, and because VACUUM
+ * cannot freeze a page's tuples until after pruning takes place (freezing
+ * tuples effectively requires a cleanup lock, though we don't need a cleanup
+ * lock in lazy_vacuum_heap_page or in lazy_scan_new_or_empty to set a heap
+ * page all-frozen in the visibility map).
+ *
+ * We'll always return true for a non-aggressive VACUUM, even when we know
+ * that this will cause them to miss out on freezing tuples from before
+ * vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
+ * lock. This does mean that they definitely won't be able to advance
+ * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
+ * relminmxid).
+ *
+ * See lazy_scan_prune for an explanation of hastup return flag.
+ */
+static bool
+lazy_scan_noprune(LVRelState *vacrel,
+ Buffer buf,
+ BlockNumber blkno,
+ Page page,
+ bool *hastup)
+{
+ OffsetNumber offnum,
+ maxoff;
+ bool has_tuple_needs_freeze = false;
+ int lpdead_items,
+ num_tuples,
+ live_tuples,
+ new_dead_tuples;
+ HeapTupleHeader tupleheader;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+
+ *hastup = false; /* for now */
+
+ lpdead_items = 0;
+ num_tuples = 0;
+ live_tuples = 0;
+ new_dead_tuples = 0;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+ HeapTupleData tuple;
+
+ vacrel->offnum = offnum;
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true; /* page won't be truncatable */
+ continue;
+ }
+
+ if (ItemIdIsDead(itemid))
+ {
+ /*
+ * Deliberately don't set hastup=true here. See same point in
+ * lazy_scan_prune for an explanation.
+ */
+ deadoffsets[lpdead_items++] = offnum;
+ continue;
+ }
+
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (!has_tuple_needs_freeze &&
+ heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
+ vacrel->MultiXactCutoff, buf))
+ {
+ if (vacrel->aggressive)
+ {
+ /* Going to have to get cleanup lock for lazy_scan_prune */
+ vacrel->offnum = InvalidOffsetNumber;
+ return false;
+ }
+
+ has_tuple_needs_freeze = true;
+ }
+
+ num_tuples++;
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(vacrel->rel);
+
+ switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+ {
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ case HEAPTUPLE_LIVE:
+
+ /*
+ * Count both cases as live, just like lazy_scan_prune
+ */
+ live_tuples++;
+
+ break;
+ case HEAPTUPLE_DEAD:
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples.
+ *
+ * lazy_scan_prune only does this for RECENTLY_DEAD tuples,
+ * and never has to deal with DEAD tuples directly (they
+ * reliably become LP_DEAD items through pruning). Our
+ * approach to DEAD tuples is a bit arbitrary, but it seems
+ * better than totally ignoring them.
+ */
+ new_dead_tuples++;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+ /*
+ * Do not count these rows as live, just like lazy_scan_prune
+ */
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+
+ }
+
+ vacrel->offnum = InvalidOffsetNumber;
+
+ if (has_tuple_needs_freeze)
+ {
+ /*
+ * Current non-aggressive VACUUM operation definitely won't be able to
+ * advance relfrozenxid or relminmxid
+ */
+ Assert(!vacrel->aggressive);
+ vacrel->freeze_cutoffs_valid = false;
+ }
+
+ /*
+ * Now save details of the LP_DEAD items from the page in the dead_tuples
+ * array iff VACUUM uses two-pass strategy case
+ */
+ if (vacrel->nindexes == 0)
+ {
+ /*
+ * We are not prepared to handle the corner case where a single pass
+ * strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
+ * items. Repeat the same trick that we use for DEAD tuples: pretend
+ * that they're RECENTLY_DEAD tuples.
+ *
+ * There is no fundamental reason why we must take the easy way out
+ * like this. Finding a way to make these LP_DEAD items get set to
+ * LP_UNUSED would be less valuable and more complicated than it is in
+ * the two-pass strategy case, since it would necessitate that we
+ * repeat our lazy_scan_heap caller's page-at-a-time/one-pass-strategy
+ * heap vacuuming steps. Whereas in the two-pass strategy case,
+ * lazy_vacuum_heap_rel will set the LP_DEAD items to LP_UNUSED. It
+ * must always deal with things like remaining DEAD tuples with
+ * storage, new LP_DEAD items that we didn't see earlier on, etc.
+ */
+ if (lpdead_items > 0)
+ *hastup = true; /* page won't be truncatable */
+ num_tuples += lpdead_items;
+ new_dead_tuples += lpdead_items;
+ }
+ else if (lpdead_items > 0)
+ {
+ LVDeadTuples *dead_tuples = vacrel->dead_tuples;
+ ItemPointerData tmp;
+
+ vacrel->lpdead_item_pages++;
+
+ ItemPointerSetBlockNumber(&tmp, blkno);
+
+ for (int i = 0; i < lpdead_items; i++)
+ {
+ ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+ dead_tuples->itemptrs[dead_tuples->num_tuples++] = tmp;
+ }
+
+ Assert(dead_tuples->num_tuples <= dead_tuples->max_tuples);
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ dead_tuples->num_tuples);
+
+ vacrel->lpdead_items += lpdead_items;
+ }
+ else
+ {
+ /*
+ * We opt to skip FSM processing for the page on the grounds that it
+ * is probably being modified by concurrent DML operations. Seems
+ * best to assume that the space is best left behind for future
+ * updates of existing tuples. This matches what opportunistic
+ * pruning does.
+ *
+ * It's theoretically possible for us to set VM bits here too, but we
+ * don't try that either. It is highly unlikely to be possible, much
+ * less useful.
+ */
+ }
+
+ /*
+ * Finally, add relevant page-local counts to whole-VACUUM counts
+ */
+ vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->num_tuples += num_tuples;
+ vacrel->live_tuples += live_tuples;
+
+ /* Caller won't need to call lazy_scan_prune with same page */
+ return true;
+}
+
/*
* Remove the collected garbage tuples from the table and its indexes.
*
@@ -2504,67 +2761,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
return tupindex;
}
-/*
- * lazy_check_needs_freeze() -- scan page to see if any tuples
- * need to be cleaned to avoid wraparound
- *
- * Returns true if the page needs to be vacuumed using cleanup lock.
- * Also returns a flag indicating whether page contains any tuples at all.
- */
-static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
-{
- Page page = BufferGetPage(buf);
- OffsetNumber offnum,
- maxoff;
- HeapTupleHeader tupleheader;
-
- *hastup = false;
-
- /*
- * New and empty pages, obviously, don't contain tuples. We could make
- * sure that the page is registered in the FSM, but it doesn't seem worth
- * waiting for a cleanup lock just for that, especially because it's
- * likely that the pin holder will do so.
- */
- if (PageIsNew(page) || PageIsEmpty(page))
- return false;
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = FirstOffsetNumber;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid;
-
- /*
- * Set the offset number so that we can display it along with any
- * error that occurred while processing this tuple.
- */
- vacrel->offnum = offnum;
- itemid = PageGetItemId(page, offnum);
-
- /* this should match hastup test in count_nondeletable_pages() */
- if (ItemIdIsUsed(itemid))
- *hastup = true;
-
- /* dead and redirect items never need freezing */
- if (!ItemIdIsNormal(itemid))
- continue;
-
- tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-
- if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
- break;
- } /* scan along page */
-
- /* Clear the offset information once we have processed the given page. */
- vacrel->offnum = InvalidOffsetNumber;
-
- return (offnum <= maxoff);
-}
-
/*
* Trigger the failsafe to avoid wraparound failure when vacrel table has a
* relfrozenxid and/or relminmxid that is dangerously far in the past.
@@ -2659,7 +2855,7 @@ do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel)
*/
vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
vacrel->lps->lvshared->estimated_count =
- (vacrel->tupcount_pages < vacrel->rel_pages);
+ (vacrel->scanned_pages < vacrel->rel_pages);
/* Determine the number of parallel workers to launch */
if (vacrel->lps->lvshared->first_time)
@@ -2976,7 +3172,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
bool estimated_count =
- vacrel->tupcount_pages < vacrel->rel_pages;
+ vacrel->scanned_pages < vacrel->rel_pages;
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -3124,7 +3320,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
* should_attempt_truncation - should we attempt to truncate the heap?
*
* Don't even think about it unless we have a shot at releasing a goodly
- * number of pages. Otherwise, the time taken isn't worth it.
+ * number of pages. Otherwise, the time taken isn't worth it, mainly because
+ * an AccessExclusive lock must be replayed on any hot standby, where it can
+ * be particularly disruptive.
*
* Also don't attempt it if wraparound failsafe is in effect. It's hard to
* predict how long lazy_truncate_heap will take. Don't take any chances.
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
index cdbe7f3a6..ce55376e7 100644
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ b/src/test/isolation/expected/vacuum-reltuples.out
@@ -45,7 +45,7 @@ step stats:
relpages|reltuples
--------+---------
- 1| 20
+ 1| 21
(1 row)
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
index ae2f79b8f..a2a461f2f 100644
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ b/src/test/isolation/specs/vacuum-reltuples.spec
@@ -2,9 +2,10 @@
# to page pins. We absolutely need to avoid setting reltuples=0 in
# such cases, since that interferes badly with planning.
#
-# Expected result in second permutation is 20 tuples rather than 21 as
-# for the others, because vacuum should leave the previous result
-# (from before the insert) in place.
+# Expected result for all three permutation is 21 tuples, including
+# the second permutation. VACUUM is able to count the concurrently
+# inserted tuple in its final reltuples, even when a cleanup lock
+# cannot be acquired on the affected heap page.
setup {
create table smalltbl
--
2.30.2
Hi,
On 2021-11-21 18:13:51 -0800, Peter Geoghegan wrote:
> I have heard many stories about anti-wraparound/aggressive VACUUMs
> where the cure (which suddenly made autovacuum workers
> non-cancellable) was worse than the disease (not actually much danger
> of wraparound failure). For example:
> https://www.joyent.com/blog/manta-postmortem-7-27-2015
> Yes, this problem report is from 2015, which is before we even had the
> freeze map stuff. I still think that the point about aggressive
> VACUUMs blocking DDL (leading to chaos) remains valid.
As I noted below, I think this is a bit of a separate issue from what your
changes address in this patch.
> There is another interesting area of future optimization within
> VACUUM, that also seems relevant to this patch: the general idea of
> *avoiding* pruning during VACUUM, when it just doesn't make sense to
> do so -- better to avoid dirtying the page for now. Needlessly pruning
> inside lazy_scan_prune is hardly rare -- standard pgbench (maybe only
> with heap fill factor reduced to 95) will have autovacuums that
> *constantly* do it (granted, it may not matter so much there because
> VACUUM is unlikely to re-dirty the page anyway).
Hm. I'm a bit doubtful that there are all that many cases where it's worth not
pruning during vacuum. However, such cases seem much more common for
opportunistic pruning during non-write accesses.
Perhaps checking whether we'd log an FPW would be a better criterion for
deciding whether to prune or not, compared to whether we're dirtying the page?
IME the WAL volume impact of FPWs is a considerably bigger deal than
unnecessarily dirtying a page that has previously been dirtied in the same
checkpoint "cycle".
> This patch seems relevant to that area because it recognizes that pruning
> during VACUUM is not necessarily special -- a new function called
> lazy_scan_noprune may be used instead of lazy_scan_prune (though only when a
> cleanup lock cannot be acquired). These pages are nevertheless considered
> fully processed by VACUUM (this is perhaps 99% true, so it seems reasonable
> to round up to 100% true).
IDK, the potential of not having usable space on an overly fragmented page
doesn't seem that low. We can't just mark such pages as all-visible because
then we'll potentially never reclaim that space.
> Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
> relfrozenxid, we now make non-aggressive VACUUMs work just a little
> harder in order to make that desirable outcome more likely in practice.
> Aggressive VACUUMs have long checked contended pages with only a shared
> lock, to avoid needlessly waiting on a cleanup lock (in the common case
> where the contended page has no tuples that need to be frozen anyway).
> We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
> course -- if we did that they'd no longer be non-aggressive.
IMO the big difference between aggressive / non-aggressive isn't whether we
wait for a cleanup lock, but that we don't skip all-visible pages...
But we now make the non-aggressive case notice that a failure to acquire a
cleanup lock on one particular heap page does not in itself make it unsafe
to advance relfrozenxid for the whole relation (which is what we usually see
in the aggressive case already).
This new relfrozenxid optimization might not be all that valuable on its
own, but it may still facilitate future work that makes non-aggressive
VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
rather than later. In general it would be useful for non-aggressive
VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
a cleanup lock once or twice if needed).
What do you mean by "waiting once or twice"? A single wait may simply never
end on a busy page that's constantly pinned by a lot of backends...
It would also be generally useful if aggressive VACUUMs were "less
aggressive" opportunistically (e.g. by being responsive to query
cancellations when the risk of wraparound failure is still very low).
Being cancellable is already a different concept than anti-wraparound
vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
anti-wrap only at autovacuum_freeze_max_age. The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.
This is one of the most embarrassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.
We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.
I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line items in a number of realistic /
common scenarios would be quite the win.
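(For context, the cap in question -- paraphrased from access/htup_details.h --
is sized as though every line pointer had a full tuple header behind it, which
is why a page can run out of permitted itemids long before it runs out of
bytes once many items are just 4-byte LP_DEAD/LP_REDIRECT stubs:)
/* Paraphrased from access/htup_details.h */
#define MaxHeapTuplesPerPage \
    ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
/* With 8192-byte pages: (8192 - 24) / (24 + 4) = 291 line pointers at most */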
Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.
-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.
I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.
Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.
+ /*
+ * Aggressive VACUUM (which is the same thing as anti-wraparound
+ * autovacuum for most practical purposes) exists so that we'll reliably
+ * advance relfrozenxid and relminmxid sooner or later. But we can often
+ * opportunistically advance them even in a non-aggressive VACUUM.
+ * Consider if that's possible now.
I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_prune, from before a possible relation
+ * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+ */
I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+     !vacrel->freeze_cutoffs_valid)
+ {
+     /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+     Assert(!aggressive);
+     vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                         new_rel_allvisible, vacrel->nindexes > 0,
+                         InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+     /* Can safely advance relfrozen and relminmxid, too */
+     Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+            orig_rel_pages);
+     vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                         new_rel_allvisible, vacrel->nindexes > 0,
+                         FreezeLimit, MultiXactCutoff, false);
+ }
I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.
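(For illustration, a minimal sketch of that "observed horizon" idea, using
hypothetical LVRelState fields -- this is not the patch's actual code:)
/* initialization, before the scan: */
vacrel->observed_frozenxid = vacrel->OldestXmin;   /* hypothetical field */
vacrel->observed_valid = true;                     /* hypothetical field */

/* for every unfrozen xmin/xmax encountered on a scanned page: */
if (TransactionIdPrecedes(xid, vacrel->observed_frozenxid))
    vacrel->observed_frozenxid = xid;

/* whenever a block is skipped based on the visibility map: */
vacrel->observed_valid = false;

/* at the end -- no page counting needed: */
if (vacrel->observed_valid)
    new_frozen_xid = vacrel->observed_frozenxid;
else
    new_frozen_xid = InvalidTransactionId;         /* don't advance */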
@@ -1050,18 +1046,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;

- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
+ /*
+ * Consider need to skip blocks
+ */
if (blkno == next_unskippable_block)
{
/* Time to advance next_unskippable_block */
@@ -1110,13 +1102,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current block can be skipped if we've seen a long enough
+ * run of skippable blocks to justify skipping it.
+ *
+ * There is an exception: we will scan the table's last page to
+ * determine whether it has tuples or not, even if it would
+ * otherwise be skipped (unless it's clearly not worth trying to
+ * truncate the table). This avoids having lazy_truncate_heap()
+ * take access-exclusive lock on the table to attempt a truncation
+ * that just fails immediately because there are tuples in the
+ * last page.
 */
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks &&
+ !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.
@@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
+ vacrel->scanned_pages++;
I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing in lazy_scan_noprune.
 */
s/in lazy_scan_noprune/via lazy_scan_noprune/?
if (!ConditionalLockBufferForCleanup(buf))
{
bool hastup;

- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
+ /* Lock and pin released for us */
+ continue;
+ }
Why isn't this done in lazy_scan_noprune()?
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+ {
+ /* No need to wait for cleanup lock for this page */
+ UnlockReleaseBuffer(buf);
+ if (hastup)
+ vacrel->nonempty_pages = blkno + 1;
continue;
}
Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes, if so, could we add an assertion that
BufferGetBlockNumber(buf) == blkno?
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?
+/*
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
I don't think it's actually that rare - the window for this is huge. You just
need to crash / immediate shutdown at any time between the relation having
been extended and the new page contents being written out (checkpoint or
buffer replacement / ring writeout). That's often many minutes.
I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.
+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */
Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
I'd add something like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true; /* page won't be truncatable */
+ continue;
+ }
It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().
+ case HEAPTUPLE_DEAD:
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples.
+ *
+ * lazy_scan_prune only does this for RECENTLY_DEAD tuples,
+ * and never has to deal with DEAD tuples directly (they
+ * reliably become LP_DEAD items through pruning). Our
+ * approach to DEAD tuples is a bit arbitrary, but it seems
+ * better than totally ignoring them.
+ */
+ new_dead_tuples++;
+ break;
Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead bot not yet removable".
+ /*
+ * Now save details of the LP_DEAD items from the page in the dead_tuples
+ * array iff VACUUM uses two-pass strategy case
+ */
Do we really need to have separate code for this in lazy_scan_prune() and
lazy_scan_noprune()?
+ }
+ else
+ {
+ /*
+ * We opt to skip FSM processing for the page on the grounds that it
+ * is probably being modified by concurrent DML operations. Seems
+ * best to assume that the space is best left behind for future
+ * updates of existing tuples. This matches what opportunistic
+ * pruning does.
Why can we assume that there is concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)
I think this is a change mostly in the right direction. But as formulated this
commit does *WAY* too much at once.
Greetings,
Andres Freund
On Mon, Nov 22, 2021 at 11:29 AM Andres Freund <andres@anarazel.de> wrote:
Hm. I'm a bit doubtful that there's all that many cases where it's worth not
pruning during vacuum. However, it seems much more common for opportunistic
pruning during non-write accesses.
Fair enough. I just wanted to suggest an exploratory conversation
about pruning (among several other things). I'm mostly saying: hey,
pruning during VACUUM isn't actually that special, at least not with
this refactoring patch in place. So maybe it makes sense to go
further, in light of that general observation about pruning in VACUUM.
Maybe it wasn't useful to even mention this aspect now. I would rather
focus on freezing optimizations for now -- that's much more promising.
Perhaps checking whether we'd log an FPW would be a better criteria for
deciding whether to prune or not compared to whether we're dirtying the page?
IME the WAL volume impact of FPWs is a considerably bigger deal than
unnecessarily dirtying a page that has previously been dirtied in the same
checkpoint "cycle".
Agreed. (I tend to say the former when I really mean the latter, which
I should try to avoid.)
IDK, the potential of not having usable space on an overly fragmented page
doesn't seem that low. We can't just mark such pages as all-visible because
then we'll potentially never reclaim that space.
Don't get me started on this - because I'll never stop.
It makes zero sense that we don't think about free space holistically,
using the whole context of what changed in the recent past. As I think
you know already, a higher level concept (like open and closed pages)
seems like the right direction to me -- because it isn't sensible to
treat X bytes of free space in one heap page as essentially
interchangeable with any other space on any other heap page. That
misses an enormous amount of things that matter. The all-visible
status of a page is just one such thing.
IMO the big difference between aggressive / non-aggressive isn't whether we
wait for a cleanup lock, but that we don't skip all-visible pages...
I know what you mean by that, of course. But FWIW that definition
seems too focused on what actually happens today, rather than what is
essential given the invariants we have for VACUUM. And so I personally
prefer to define it as "a VACUUM that *reliably* advances
relfrozenxid". This looser definition will probably "age" well (ahem).
This new relfrozenxid optimization might not be all that valuable on its
own, but it may still facilitate future work that makes non-aggressive
VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
rather than later. In general it would be useful for non-aggressive
VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
a cleanup lock once or twice if needed).
What do you mean by "waiting once or twice"? A single wait may simply never
end on a busy page that's constantly pinned by a lot of backends...
I was speculating about future work again. I think that you've taken
my words too literally. This is just a draft commit message, just a
way of framing what I'm really trying to do.
Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.
For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once, especially if we're already pretty
sure that that's all the extra work that it would take to ultimately
be able to advance relfrozenxid in the ongoing (non-aggressive) VACUUM
-- it's easy to make that case. Once you agree that it makes sense
under these favorable circumstances, you've already made
"aggressiveness" a continuous thing conceptually, at a high level.
The current binary definition of "aggressive" is needlessly
restrictive -- that much seems clear to me. I'm much less sure of what
specific alternative should replace it.
I've already prototyped advancing relfrozenxid using a dynamically
determined value, so that our final relfrozenxid is just about the
most recent safe value (not the original FreezeLimit). That's been
interesting. Consider this log output from an autovacuum with the
prototype patch (also uses my new instrumentation), based on standard
pgbench (just tuned heap fill factor a bit):
LOG: automatic vacuum of table "regression.public.pgbench_accounts":
index scans: 0
pages: 0 removed, 909091 remain, 33559 skipped using visibility map
(3.69% of total)
tuples: 297113 removed, 50090880 remain, 90880 are dead but not yet removable
removal cutoff: oldest xmin was 29296744, which is now 203341 xact IDs behind
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
I/O timings: read: 55.574 ms, write: 0.000 ms
avg read rate: 17.805 MB/s, avg write rate: 4.389 MB/s
buffer usage: 1728273 hits, 23150 misses, 5706 dirtied
WAL usage: 594211 records, 0 full page images, 35065032 bytes
system usage: CPU: user: 6.85 s, system: 0.08 s, elapsed: 10.15 s
All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. While the
accounts table doesn't generally need to have 100% of all pages
touched by VACUUM -- it's more like 95% there. Does that really make
sense, though?
I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.
Again, my point is that there is a good chance that redefining
aggressiveness in some way will be helpful. A more creative, flexible
definition might be just what we need. The details are very much up in
the air, though.
It would also be generally useful if aggressive VACUUMs were "less
aggressive" opportunistically (e.g. by being responsive to query
cancellations when the risk of wraparound failure is still very low).
Being cancellable is already a different concept than anti-wraparound
vacuums. We start aggressive autovacuums at vacuum_freeze_table_age, but
anti-wrap only at autovacuum_freeze_max_age.
You know what I meant. Also, did *you* mean "being cancellable is
already a different concept to *aggressive* vacuums"? :-)
The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.
Not sure what you mean about scheduling, since vacuum_freeze_table_age
is only in place to make overnight (off hours low activity scripted
VACUUMs) freeze tuples before any autovacuum worker gets the chance
(since the latter may run at a much less convenient time). Sure,
vacuum_freeze_table_age might also force a regular autovacuum worker
to do an aggressive VACUUM -- but I think it's mostly intended for a
manual overnight VACUUM. Not usually very helpful, but also not
harmful.
Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?
This is one of the most embarrassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.
What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).
I agree that anti-wraparound autovacuum is way too unfriendly, though.
We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.
I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.
What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.
If you can be more concrete about a problem you've seen, then I might
be able to help. It's not like there are no options in this already. I
already thought about introducing a small degree of randomness into
the process of deciding to skip or to not skip (in the
consider_bypass_optimization path of lazy_vacuum() on Postgres 14).
The optimization is mostly valuable because it allows us to do more
useful work in VACUUM -- not because it allows us to do less useless
work in VACUUM. In particular, it allows us to tune
autovacuum_vacuum_insert_scale_factor very aggressively with an
append-only table, without useless index vacuuming making it all but
impossible for autovacuum to get to the useful work.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.
Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.
I am not aware of a specific workload where the patch helps, but that
doesn't mean that there isn't one, or that it doesn't matter. It's
subtle enough that I might have just missed something. I *expect* the
true damage over time to be very hard to model or understand -- I
imagine the potential for weird feedback loops is there.
I think that'd be quite a win - we right now often "migrate" to other pages
for modifications not because we're out of space on a page, but because we run
out of itemids (for debatable reasons MaxHeapTuplesPerPage constrains the
number of line pointers, not just the number of actual tuples). Effectively
doubling the number of available line items in a number of realistic /
common scenarios would be quite the win.
I believe Masahiko is working on this in the current cycle. It would
be easier if we had a better sense of how increasing
MaxHeapTuplesPerPage will affect tidbitmap.c. But the idea of
increasing that seems sound to me.
Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.
-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.
Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.
Also, I would like to put more information (much more useful
information) in the same log output. Perhaps that will be less
controversial if I take something useless away first.
I wonder if we could have a more restrained version of heap_page_prune() that
doesn't require a cleanup lock? Obviously we couldn't defragment the page, but
it's not immediately obvious that we need it if we constrain ourselves to only
modify tuple versions that cannot be visible to anybody.
Random note: I really dislike that we talk about cleanup locks in some parts
of the code, and super-exclusive locks in others :(.
Somebody should normalize that.
+ /*
+ * Aggressive VACUUM (which is the same thing as anti-wraparound
+ * autovacuum for most practical purposes) exists so that we'll reliably
+ * advance relfrozenxid and relminmxid sooner or later. But we can often
+ * opportunistically advance them even in a non-aggressive VACUUM.
+ * Consider if that's possible now.
I don't agree with the "most practical purposes" bit. There's a huge
difference because manual VACUUMs end up aggressive but not anti-wrap once
older than vacuum_freeze_table_age.
Okay.
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_prune, from before a possible relation
+ * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+ */
I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...
I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.
+ {
+ /* Can safely advance relfrozen and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+        orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                     new_rel_allvisible, vacrel->nindexes > 0,
+                     FreezeLimit, MultiXactCutoff, false);
+ }
I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.
There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end). Fundamentally, this will
only work if we decide to only skip all-frozen pages, which (by
definition) only happens within aggressive VACUUMs. Isn't it that
simple?
You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated). It would certainly benefit
this dynamic relfrozenxid business if it was possible, though. If we
could somehow make that work, then almost every VACUUM would be able
to advance relfrozenxid, independently of aggressive-ness -- because
we wouldn't have any all-visible-but-not-all-frozen pages to skip
(that important detail wouldn't be left to chance).
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks &&
+ !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
I find the FORCE_CHECK_PAGE macro decidedly unhelpful. But I don't like
mixing such changes within a larger change doing many other things.
I got rid of FORCE_CHECK_PAGE() itself in this patch (not a later
patch) because the patch also removes the only other
FORCE_CHECK_PAGE() call -- and the latter change is very much in scope
for the big patch (can't be broken down into smaller changes, I
think). And so this felt natural to me. But if you prefer, I can break
it out into a separate commit.
@@ -1204,156 +1214,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
+ vacrel->scanned_pages++;
I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.
I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.
Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
+ /* Lock and pin released for us */
+ continue;
+ }
Why isn't this done in lazy_scan_noprune()?
No reason, really -- could be done that way (we'd then also give
lazy_scan_prune the same treatment). I thought that it made a certain
amount of sense to keep some of this in the main loop, but I can
change it if you want.
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+ {
+ /* No need to wait for cleanup lock for this page */
+ UnlockReleaseBuffer(buf);
+ if (hastup)
+ vacrel->nonempty_pages = blkno + 1;
continue;
}
Do we really need all of buf, blkno, page for both of these functions? Quite
possible that yes, if so, could we add an assertion that
BufferGetBlockNumber(buf) == blkno?
This just matches the existing lazy_scan_prune function (which doesn't
mean all that much, since it was only added in Postgres 14). Will add
the assertion to both.
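(That is, something along these lines at the top of both lazy_scan_prune and
lazy_scan_noprune:)
/* Cross-check that caller passed the right block number, per review */
Assert(BufferGetBlockNumber(buf) == blkno);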
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
Maybe worth a note mentioning that we need to redo this even in the aggressive
case, because we didn't continually hold a lock on the page?
Isn't that obvious? Either way it isn't the kind of thing that I'd try
to optimize away. It's such a narrow issue.
+/*
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
I don't think it's actually that rare - the window for this is huge.
I can just remove the comment, though it still makes sense to me.
I don't really see that as a realistic thing to ever reliably avoid, FWIW. I
think the overhead would be prohibitive. We'd need to do synchronous WAL
logging while holding the extension lock I think. Um, not fun.
My long term goal for the FSM (the lease based design I talked about
earlier this year) includes soft ownership of free space from
preallocated pages by individual xacts -- the smgr layer itself
becomes transactional and crash safe (at least to a limited degree).
This includes bulk extension of relations, to make up for the new
overhead implied by crash safe rel extension. I don't think that we
should require VACUUM (or anything else) to be cool with random
uninitialized pages -- to me that just seems backwards.
We can't do true bulk extension right now (just an inferior version
that doesn't give specific pages to specific backends) because the
risk of losing a bunch of empty pages for way too long is not
acceptable. But that doesn't seem fundamental to me -- that's one of
the things we'd be fixing at the same time (through what I call soft
ownership semantics). I think we'd come out ahead on performance, and
*also* have a more robust approach to relation extension.
+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */
Kinda sounds like it'd be incorrect to call this with an exclusive lock, which
made me wonder why that could be true. Perhaps just say that it needs to be
called with at least a shared lock?
Okay.
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
It'd be good to document the return value - for me it's not a case where it's
so obvious that it's not worth it.
Okay.
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
I'd add something like "returns whether a cleanup lock is required". Having to
read multiple paragraphs to understand the basic meaning of the return value
isn't great.
Will fix.
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true; /* page won't be truncatable */
+ continue;
+ }
It's not really new, but this comment is now a bit confusing, because it can
be understood to be about PageTruncateLinePointerArray().
I didn't think of that. Will address it in the next version.
Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead bot not yet removable".
Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?
I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.
Note that we don't remember LP_DEAD items here, either (not here, in
lazy_scan_noprune, and not in lazy_scan_prune on HEAD). Because we
pretty much interpret LP_DEAD items as "future LP_UNUSED items"
instead -- we make a soft assumption that we're going to go on to mark
the same items LP_UNUSED during a second pass over the heap. My point
is that there is no natural way to count "fully DEAD tuple that
autovacuum didn't deal with" -- and so I picked RECENTLY_DEAD.
+ /*
+ * Now save details of the LP_DEAD items from the page in the dead_tuples
+ * array iff VACUUM uses two-pass strategy case
+ */
Do we really need to have separate code for this in lazy_scan_prune() and
lazy_scan_noprune()?
There is hardly any repetition, though.
+ }
+ else
+ {
+ /*
+ * We opt to skip FSM processing for the page on the grounds that it
+ * is probably being modified by concurrent DML operations. Seems
+ * best to assume that the space is best left behind for future
+ * updates of existing tuples. This matches what opportunistic
+ * pruning does.
Why can we assume that there is concurrent DML rather than concurrent read-only
operations? IME it's much more common for read-only operations to block
cleanup locks than read-write ones (partially because the frequency makes it
easier, partially because cursors allow long-held pins, partially because the
EXCLUSIVE lock of a r/w operation wouldn't let us get here)
I actually agree. It still probably isn't worth dealing with the FSM
here, though. It's just too much mechanism for too little benefit in a
very rare case. What do you think?
--
Peter Geoghegan
Hi,
On 2021-11-22 17:07:46 -0800, Peter Geoghegan wrote:
Sure, it wouldn't be okay to wait *indefinitely* for any one pin in a
non-aggressive VACUUM -- so "at least waiting for one or two pins
during non-aggressive VACUUM" might not have been the best way of
expressing the idea that I wanted to express. The important point is
that _we can make a choice_ about stuff like this dynamically, based
on the observed characteristics of the table, and some general ideas
about the costs and benefits (of waiting or not waiting, or of how
long we want to wait in total, whatever might be important). This
probably just means adding some heuristics that are pretty sensitive
to any reason to not do more work in a non-aggressive VACUUM, without
*completely* balking at doing even a tiny bit more work.
For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once
We currently have no infrastructure to wait for an lwlock or pincount for a
limited time. And at least for the former it'd not be easy to add. It may be
worth adding that at some point, but I'm doubtful this is sufficient reason
for nontrivial new infrastructure in very performance sensitive areas.
All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't). Whereas for the smaller pgbench tables, every
single VACUUM successfully advances relfrozenxid to a fairly recent
XID (without there ever being an aggressive VACUUM) -- just because
VACUUM needs to visit every page for the smaller tables. While the
accounts table doesn't generally need to have 100% of all pages
touched by VACUUM -- it's more like 95% there. Does that really make
sense, though?
Does what really make sense?
I'm pretty sure that less aggressive VACUUMing (e.g. higher
scale_factor setting) would lead to more aggressive setting of
relfrozenxid here. I'm always suspicious when I see insignificant
differences that lead to significant behavioral differences. Am I
worried over nothing here? Perhaps -- we don't really need to advance
relfrozenxid early with this table/workload anyway. But I'm not so
sure.
I think pgbench_accounts is just a really poor showcase. Most importantly
there's no even slightly longer running transactions that hold down the xid
horizon. But in real workloads thats incredibly common IME. It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.
I think the most interesting cases where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables, including completely static ones, too. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.
This is becoming more common with the increased use of partitioning.
The problem is that the
autovacuum scheduling is way too naive for that to be a significant benefit -
nothing tries to schedule autovacuums so that they have a chance to complete
before anti-wrap autovacuums kick in. All that vacuum_freeze_table_age does is
to promote an otherwise-scheduled (auto-)vacuum to an aggressive vacuum.
Not sure what you mean about scheduling, since vacuum_freeze_table_age
is only in place to make overnight (off hours low activity scripted
VACUUMs) freeze tuples before any autovacuum worker gets the chance
(since the latter may run at a much less convenient time). Sure,
vacuum_freeze_table_age might also force a regular autovacuum worker
to do an aggressive VACUUM -- but I think it's mostly intended for a
manual overnight VACUUM. Not usually very helpful, but also not
harmful.
Oh, wait. I think that you're talking about how autovacuum workers in
particular tend to be affected by this. We launch an av worker that
wants to clean up bloat, but it ends up being aggressive (and maybe
taking way longer), perhaps quite randomly, only due to
vacuum_freeze_table_age (not due to autovacuum_freeze_max_age). Is
that it?
No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, not cancelling). But the only mechanism we have against
anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
not really a "real" mechanism, because it requires an "independent" reason to
vacuum a table.
I've seen cases where anti-wraparound vacuums weren't a problem / never
happend for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables started to be big enough that autovacuum didn't schedule vacuums that
got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums. Then things started to burn, because of the unpaced anti-wrap vacuums
clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
remember the details.
Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).
This is one of the most embarassing issues around the whole anti-wrap
topic. We kind of define it as an emergency that there's an anti-wraparound
vacuum. But we have *absolutely no mechanism* to prevent them from occurring.
What do you mean? Only an autovacuum worker can do an anti-wraparound
VACUUM (which is not quite the same thing as an aggressive VACUUM).
Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.
Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.
We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
This has become *much* more important with the changes around deciding when to
index vacuum. It's not just that opportunistic pruning could have left LP_DEAD
items, it's that a previous vacuum is quite likely to have left them there,
because the previous vacuum decided not to perform index cleanup.
I haven't seen any evidence of that myself (with the optimization
added to Postgres 14 by commit 5100010ee4). I still don't understand
why you doubted that work so much. I'm not saying that you're wrong
to; I'm saying that I don't think that I understand your perspective
on it.
I didn't (nor do) doubt that it can be useful - to the contrary, I think the
unconditional index pass was a huge practical issue. I do however think that
there are cases where it can cause trouble. The comment above wasn't meant as
a criticism - just that it seems worth pointing out that one reason we might
encounter a lot of LP_DEAD items is previous vacuums that didn't perform index
cleanup.
What I have seen in my own tests (particularly with BenchmarkSQL) is
that most individual tables either never apply the optimization even
once (because the table reliably has heap pages with many more LP_DEAD
items than the 2%-of-relpages threshold), or will never need to
(because there are precisely zero LP_DEAD items anyway). Remaining
tables that *might* use the optimization tend to not go very long
without actually getting a round of index vacuuming. It's just too
easy for updates (and even aborted xact inserts) to introduce new
LP_DEAD items for us to go long without doing index vacuuming.
I think real workloads are a bit more varied than a realistic set of benchmarks
that one person can run themselves.
I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents IOS scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time. Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
One thing we could do around this, btw, would be to aggressively replace
LP_REDIRECT items with their target item. We can't do that in all situations
(somebody might be following a ctid chain), but I think we have all the
information needed to do so. Probably would require a new HTSV RECENTLY_LIVE
state or something like that.
Another idea is to truncate the line pointer during pruning (including
opportunistic pruning). Matthias van de Meent has a patch for that.
I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.
Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.
-0.5. I think it provides some value, and I don't see why the removal of the
information should be tied to this change. It's hard to diagnose why some dead
tuples aren't cleaned up - a common cause for that on smaller tables is that
nearly all pages are pinned nearly all the time.
Is that still true, though? If it turns out that we need to leave it
in, then I can do that. But I'd prefer to wait until we have more
information before making a final decision. Remember, the high level
idea of this whole patch is that we do as much work as possible for
any scanned_pages, which now includes pages that we never successfully
acquired a cleanup lock on. And so we're justified in assuming that
they're exactly equivalent to pages that we did get a cleanup on --
that's now the working assumption. I know that that's not literally
true, but that doesn't mean it's not a useful fiction -- it should be
very close to the truth.
IDK, it seems misleading to me. Small tables with a lot of churn - quite
common - are highly reliant on LP_DEAD entries getting removed or the tiny
table suddenly isn't so tiny anymore. And it's harder to diagnose why the
cleanup isn't happening without knowledge that pages needing cleanup couldn't
be cleaned up due to pins.
If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_prune, from before a possible relation
+ * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+ */
I think it should be doable to add an isolation test for this path. There have
been quite a few bugs around the wider topic...
I would argue that we already have one -- vacuum-reltuples.spec. I had
to update its expected output in the patch. I would argue that the
behavioral change (count tuples on a pinned-by-cursor heap page) that
necessitated updating the expected output for the test is an
improvement overall.
I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.
+ {
+ /* Can safely advance relfrozen and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+        orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+                     new_rel_allvisible, vacrel->nindexes > 0,
+                     FreezeLimit, MultiXactCutoff, false);
+ }
I wonder if this whole logic wouldn't become easier and less fragile if we
just went for maintaining the "actually observed" horizon while scanning the
relation. If we skip a page via VM set the horizon to invalid. Otherwise we
can keep track of the accurate horizon and use that. No need to count pages
and stuff.
There is no question that that makes sense as an optimization -- my
prototype convinced me of that already. But I don't think that it can
simplify anything (not even the call to vac_update_relstats itself, to
actually update relfrozenxid at the end).
Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to wrong determination
about whether we scanned everything or not.
Fundamentally, this will only work if we decide to only skip all-frozen
pages, which (by definition) only happens within aggressive VACUUMs.
Hm? Or if there's just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.
You recently said (on the heap-pruning-14-bug thread) that you don't
think it would be practical to always set a page all-frozen when we
see that we're going to set it all-visible -- apparently you feel that
we could never opportunistically freeze early such that all-visible
but not all-frozen pages practically cease to exist. I'm still not
sure why you believe that (though you may be right, or I might have
misunderstood, since it's complicated).
Yes, I think it may not work out to do that. But it's not a very strongly held
opinion.
On reason for my doubt is the following:
We can set all-visible on a page without a FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL
log FPIs for every heap page or not, and it's not that rare for data to live
shorter than autovacuum_freeze_max_age or that limit never being reached.
On a table with 40 million individually inserted rows, fully hintbitted via
reads, I see a first VACUUM taking 1.6s and generating 11MB of WAL. A
subsequent VACUUM FREEZE takes 5s and generates 500MB of WAL. That's a quite
large multiplier...
If we ever managed to not have a per-page all-visible flag this'd get even
more extreme, because we'd then not even need to dirty the page for
insert-only pages. But if we want to freeze, we'd need to (unless we just got
rid of freezing).
It would certainly benefit this dynamic relfrozenxid business if it was
possible, though. If we could somehow make that work, then almost every
VACUUM would be able to advance relfrozenxid, independently of
aggressive-ness -- because we wouldn't have any
all-visible-but-not-all-frozen pages to skip (that important detail wouldn't
be left to chance).
Perhaps we can have most of the benefit even without that. If we were to
freeze whenever it didn't cause an additional FPW, and perhaps didn't skip
all-visible but not all-frozen pages if they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a
lot of cases?
I don't particularly like doing BufferGetPage() before holding a lock on the
page. Perhaps I'm too influenced by rust etc, but ISTM that at some point it'd
be good to have a crosscheck that BufferGetPage() is only allowed when holding
a page level lock.
I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.
With 'reading' do you mean reads-from-os, or just references to buffer
contents?
Since it is often legit to read a heap page without a buffer lock
(only a pin), I can't see why BufferGetPage() without a buffer lock
shouldn't also be okay -- if anything it seems safer. I think that I
would agree with you if it wasn't for that inconsistency (which is
rather a big "if", to be sure -- even for me).
At least for heap it's rarely legit to read buffer contents via
BufferGetPage() without a lock. It's legit to read data at already-determined
offsets, but you can't look at much other than the tuple contents.
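(For anyone following along, the conventional pattern being described looks
roughly like this -- a generic sketch, not code from the patch:)
buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
LockBuffer(buf, BUFFER_LOCK_SHARE);   /* at least a share lock... */
page = BufferGetPage(buf);            /* ...before examining page contents */
/* inspect line pointers / tuple headers here */
UnlockReleaseBuffer(buf);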
Why does it make sense to track DEAD tuples this way? Isn't that going to lead
to counting them over-and-over again? I think it's quite misleading to include
them in "dead bot not yet removable".Compared to what? Do we really want to invent a new kind of DEAD tuple
(e.g., to report on), just to handle this rare case?
When looking at logs I use the
"tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"
line to see whether the user is likely to have issues around an old
transaction / slot / prepared xact preventing cleanup. If new_dead_tuples no
longer identifies those cases, that line is no longer reliable.
I accept that this code is lying about the tuples being RECENTLY_DEAD,
kind of. But isn't it still strictly closer to the truth, compared to
HEAD? Counting it as RECENTLY_DEAD is far closer to the truth than not
counting it at all.
I don't see how it's closer at all. There's imo a significant difference
between not being able to remove tuples because of the xmin horizon, and not
being able to remove them because we couldn't get a cleanup lock.
Greetings,
Andres Freund
On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <andres@anarazel.de> wrote:
For example, we can definitely afford to wait a few more milliseconds
to get a cleanup lock just once
We currently have no infrastructure to wait for an lwlock or pincount for a
limited time. And at least for the former it'd not be easy to add. It may be
worth adding that at some point, but I'm doubtful this is sufficient reason
for nontrivial new infrastructure in very performance sensitive areas.
It was a hypothetical example. To be more practical about it: it seems
likely that we won't really benefit from waiting some amount of time
(not forever) for a cleanup lock in non-aggressive VACUUM, once we
have some of the relfrozenxid stuff we've talked about in place. In a
world where we're smarter about advancing relfrozenxid in
non-aggressive VACUUMs, the choice between waiting for a cleanup lock,
and not waiting (but also not advancing relfrozenxid at all) matters
less -- it's no longer a binary choice.
It's no longer a binary choice because we will have done away with the
current rigid way in which our new relfrozenxid for the relation is
either FreezeLimit, or nothing at all. So far we've only talked about
the case where we can update relfrozenxid with a value that happens to
be much newer than FreezeLimit. If we can do that, that's great. But
what about setting relfrozenxid to an *older* value than FreezeLimit
instead (in a non-aggressive VACUUM)? That's also pretty good! There
is a decent chance that the final "suboptimal" relfrozenxid that
we determine can be safely set in pg_class at the end of our VACUUM
will still be far more recent than the preexisting relfrozenxid.
Especially with larger tables.
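(A loose sketch of the bookkeeping that implies -- every name here is a
placeholder, and the interactions with pruning/freezing are waved away. The
point is just to carry the oldest XID that the scan leaves behind, instead of
treating FreezeLimit as the only possible new relfrozenxid.)

    /* Hypothetical "measured" relfrozenxid tracking for one VACUUM */
    TransactionId observed_frozenxid = vacrel->OldestXmin;  /* upper bound */

    /* ... for every unfrozen XID we leave behind on a scanned page ... */
    if (TransactionIdIsNormal(xid) &&
        TransactionIdPrecedes(xid, observed_frozenxid))
        observed_frozenxid = xid;

    /*
     * At the end, provided every page was either scanned or known to be
     * all-frozen, observed_frozenxid is a safe new relfrozenxid -- even when
     * it is older than FreezeLimit.  (relminmxid ignored here for brevity.)
     */
    vac_update_relstats(rel, new_rel_pages, new_live_tuples,
                        new_rel_allvisible, vacrel->nindexes > 0,
                        observed_frozenxid, InvalidMultiXactId, false);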
Advancing relfrozenxid should be thought of as totally independent of
freezing tuples, at least in vacuumlazy.c itself. That's
kinda the case today, even, but *explicitly* decoupling advancing
relfrozenxid from actually freezing tuples seems like a good high
level goal for this project.
Remember, FreezeLimit is derived from vacuum_freeze_min_age in the
obvious way: OldestXmin for the VACUUM, minus vacuum_freeze_min_age
GUC/reloption setting. I'm pretty sure that this means that making
autovacuum freeze tuples more aggressively (by reducing
vacuum_freeze_min_age) could have the perverse effect of making
non-aggressive VACUUMs less likely to advance relfrozenxid -- which is
exactly backwards. This effect could easily be missed, even by expert
users, since there is no convenient instrumentation that shows how and
when relfrozenxid is advanced.
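(Spelling that out with made-up numbers, and leaving aside the clamping that
vacuum_set_xid_limits() also does:)

    /*
     *     FreezeLimit = OldestXmin - vacuum_freeze_min_age
     *
     * e.g. with OldestXmin = 10,000,000:
     *
     *     vacuum_freeze_min_age = 5,000,000  ->  FreezeLimit = 5,000,000
     *     vacuum_freeze_min_age =   500,000  ->  FreezeLimit = 9,500,000
     *
     * The lower setting puts far more unfrozen tuples under the cutoff, so
     * any one page that we fail to cleanup-lock is far more likely to
     * contain such a tuple -- which is all it takes for a non-aggressive
     * VACUUM to have to give up on advancing relfrozenxid.
     */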
All of the autovacuums against the accounts table look similar to this
one -- you don't see anything about relfrozenxid being advanced
(because it isn't).
Does that really make sense, though?
Does what really make sense?
Well, my accounts table example wasn't a particularly good one (it was
a conveniently available example). I am now sure that you got the
point I was trying to make here already, based on what you go on to
say about non-aggressive VACUUMs optionally *not* skipping
all-visible-not-all-frozen heap pages in the hopes of advancing
relfrozenxid earlier (more on that idea below, in my response).
On reflection, the simplest way of expressing the same idea is what I
just said about decoupling (decoupling advancing relfrozenxid from
freezing).
I think pgbench_accounts is just a really poor showcase. Most importantly,
there are no even slightly longer-running transactions that hold down the xid
horizon. But in real workloads that's incredibly common IME. It's also quite
uncommon in real workloads to have huge tables in which all records are
updated. It's more common to have value ranges that are nearly static, and a
more heavily changing range.
I agree.
I think the most interesting cases where using the "measured" horizon will be
advantageous is anti-wrap vacuums. Those obviously have to happen for rarely
modified tables, including completely static ones, too. Using the "measured"
horizon will allow us to reduce the frequency of anti-wrap autovacuums on old
tables, because we'll be able to set a much more recent relfrozenxid.
That's probably true in practice -- but who knows these days, with the
autovacuum_vacuum_insert_scale_factor stuff? Either way I see no
reason to emphasize that case in the design itself. The "decoupling"
concept now seems like the key design-level concept -- everything else
follows naturally from that.
This is becoming more common with the increased use of partitioning.
Also with bulk loading. There could easily be a tiny number of
distinct XIDs that are close together in time, for many many rows --
practically one XID, or even exactly one XID.
No, not quite. We treat anti-wraparound vacuums as an emergency (including
logging messages, not cancelling). But the only mechanism we have against
anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say, that's
not really a "real" mechanism, because it requires an "independent" reason to
vacuum a table.
Got it.
I've seen cases where anti-wraparound vacuums weren't a problem / never
happened for important tables for a long time, because there always was an
"independent" reason for autovacuum to start doing its thing before the table
got to be autovacuum_freeze_max_age old. But at some point the important
tables started to be big enough that autovacuum didn't schedule vacuums that
got promoted to aggressive via vacuum_freeze_table_age before the anti-wrap
vacuums.
Right. Not just because they were big; also because autovacuum runs at
geometric intervals -- the final reltuples from last time is used to
determine the point at which av runs this time. This might make sense,
or it might not make any sense -- it all depends (mostly on index
stuff).
Then things started to burn, because of the unpaced anti-wrap vacuums
clogging up all IO, or maybe it was the vacuums not cancelling - I don't quite
remember the details.
Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause
chaos because they interact badly with automated DDL is one I've seen
several times -- I'm sure you have too. That was what the Manta/Joyent
blogpost I referenced upthread went into.
Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).
These are among our most important challenges IMV.
Just that autovacuum should have a mechanism to trigger aggressive vacuums
(i.e. ones that are guaranteed to be able to increase relfrozenxid unless
cancelled) before getting to the "emergency"-ish anti-wraparound state.
Maybe, but that runs into the problem of needing another GUC that
nobody will ever be able to remember the name of. I consider the idea
of adding a variety of measures that make non-aggressive VACUUM much
more likely to advance relfrozenxid in practice to be far more
promising.
Or alternatively that we should have a separate threshold for the "harsher"
anti-wraparound measures.
Or maybe just raise the default of autovacuum_freeze_max_age, which
many people don't change? That might be a lot safer than it once was.
Or will be, once we manage to teach VACUUM to advance relfrozenxid
more often in non-aggressive VACUUMs on Postgres 15. Imagine a world
in which we have that stuff in place, as well as related enhancements
added in earlier releases: autovacuum_vacuum_insert_scale_factor, the
freezemap, and the wraparound failsafe.
These add up to a lot; with all of that in place, the risk we'd be
introducing by increasing the default value of
autovacuum_freeze_max_age would be *far* lower than the risk of making
the same change back in 2006. I bring up 2006 because it was the year
that commit 48188e1621 added autovacuum_freeze_max_age -- the default
hasn't changed since that time.
I think workloads are a bit more varied than a realistic set of benchmarks
that one person can run themselves.
No question. I absolutely accept that I only have to miss one
important detail with something like this -- that just goes with the
territory. Just saying that I have yet to see any evidence that the
bypass-indexes behavior really hurt anything. I do take the idea that
I might have missed something very seriously, despite all this.
I gave you examples of cases that I see as likely being bitten by this,
e.g. when the skipped index cleanup prevents IOS scans. When both the
likely-to-be-modified and likely-to-be-queried value ranges are a small subset
of the entire data, the 2% threshold can prevent vacuum from cleaning up
LP_DEAD entries for a long time. Or when all index scans are bitmap index
scans, and nothing ends up cleaning up the dead index entries in certain
ranges, and even an explicit vacuum doesn't fix the issue. Even a relatively
small rollback / non-HOT update rate can start to be really painful.
That does seem possible. But I consider it very unlikely to appear as
a regression caused by the bypass mechanism itself -- not in any way
that was consistent over time. As far as I can tell, autovacuum
scheduling just doesn't operate at that level of precision, and never
has.
I have personally observed that ANALYZE does a very bad job at
noticing LP_DEAD items in tables/workloads where LP_DEAD items (not
DEAD tuples) tend to concentrate [1]. The whole idea that ANALYZE
should count these items as if they were normal tuples seems pretty
bad to me.
Put it this way: imagine you run into trouble with the bypass thing,
and then you opt to disable it on that table (using the INDEX_CLEANUP
reloption). Why should this step solve the problem on its own? In
order for that to work, VACUUM would have to know to be very
aggressive about these LP_DEAD items. But there is good reason to
believe that it just won't ever notice them, as long as ANALYZE is
expected to provide reliable statistics that drive autovacuum --
they're just too concentrated for the block-based approach to truly
work.
I'm not minimizing the risk. Just telling you my thoughts on this.
I'm a bit doubtful that's as important (which is not to say that it's not
worth doing). For a heavily updated table the max space usage of the line
pointer array just isn't as big a factor as ending up with only half the
usable line pointers.
Agreed; by far the best chance we have of improving the line pointer
bloat situation is preventing it in the first place, by increasing
MaxHeapTuplesPerPage. Once we actually do that, our remaining options
are going to be much less helpful -- then it really is mostly just up
to VACUUM.
And it's harder to diagnose why the
cleanup isn't happening without knowledge that pages needing cleanup couldn't
be cleaned up due to pins.
If you want to improve the logic so that we only count pages that would have
something to clean up, I'd be happy as well. It doesn't have to mean exactly
what it means today.
It seems like what you really care about here are remaining cases
where our inability to acquire a cleanup lock has real consequences --
you want to hear about it when it happens, however unlikely it may be.
In other words, you want to keep something in log_autovacuum_* that
indicates that "less than the expected amount of work was completed"
due to an inability to acquire a cleanup lock. And so for you, this is
a question of keeping instrumentation that might still be useful, not
a question of how we define things fundamentally, at the design level.
Sound right?
If so, then this proposal might be acceptable to you:
* Remaining DEAD tuples with storage (though not LP_DEAD items from
previous opportunistic pruning) will get counted separately in the
lazy_scan_noprune (no cleanup lock) path. Also count the total number
of distinct pages that were found to contain one or more such DEAD
tuples.
* These two new counters will be reported on their own line in the log
output, though only in the cases where we actually have any such
tuples -- which will presumably be much rarer than simply failing to
get a cleanup lock (that's now no big deal at all, because we now
consistently do certain cleanup steps, and because FreezeLimit isn't
the only viable thing that we can set relfrozenxid to, at least in the
non-aggressive case).
* There is still a limited sense in which the same items get counted
as RECENTLY_DEAD -- though just those aspects that make the overall
design simpler. So the helpful aspects of this are still preserved.
We only need to tell pgstat_report_vacuum() that these items are
"deadtuples" (remaining dead tuples). That can work by having its
caller add a new int64 counter (same new tuple-based counter used for
the new log line) to vacrel->new_dead_tuples. We'd also add the same
new tuple counter in about the same way at the point where we
determine a final vacrel->new_rel_tuples.
So we wouldn't really be treating anything as RECENTLY_DEAD anymore --
pgstat_report_vacuum() and vacrel->new_dead_tuples don't specifically
expect anything about RECENTLY_DEAD-ness already.
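(As an untested sketch of that plumbing -- the field and variable names below
are placeholders, not necessarily what a real patch would use:)

    /* in LVRelState */
    int64       missed_dead_tuples; /* DEAD tuples w/ storage, left behind */
    BlockNumber missed_dead_pages;  /* pages with one or more of the above */

    /* in lazy_scan_noprune(), alongside the existing per-page tallies */
    if (dead_with_storage > 0)
    {
        vacrel->missed_dead_tuples += dead_with_storage;
        vacrel->missed_dead_pages++;
    }

    /* extra log_autovacuum line, only when there is something to report */
    if (vacrel->missed_dead_tuples > 0)
        appendStringInfo(&buf,
                         _("tuples missed: %lld dead in %u pages due to cleanup lock contention\n"),
                         (long long) vacrel->missed_dead_tuples,
                         vacrel->missed_dead_pages);

    /* still count them as remaining dead tuples for the stats collector */
    pgstat_report_vacuum(RelationGetRelid(rel), rel->rd_rel->relisshared,
                         Max(vacrel->new_live_tuples, 0),
                         vacrel->new_dead_tuples + vacrel->missed_dead_tuples);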
I was thinking of truncations, which I don't think vacuum-reltuples.spec
tests.
Got it. I'll look into that for v2.
Maybe. But we've had quite a few bugs because we ended up changing some detail
of what is excluded in one of the counters, leading to a wrong determination
about whether we scanned everything or not.
Right. But let me just point out that my whole approach is to make
that impossible, by not needing to count pages, except in
scanned_pages (and in frozenskipped_pages + rel_pages). The processing
performed for any page that we actually read during VACUUM should be
uniform (or practically uniform), by definition. With minimal fudging
in the cleanup lock case (because we mostly do the same work there
too).
There should be no reason for any more page counters now, except for
non-critical instrumentation. For example, if you want to get the
total number of pages skipped via the visibility map (not just
all-frozen pages), then you simply subtract scanned_pages from
rel_pages.
Fundamentally, this will only work if we decide to only skip all-frozen
pages, which (by definition) only happens within aggressive VACUUMs.
Hm? Or if there are just no runs of all-visible pages of sufficient length, so
we don't end up skipping at all.
Of course. But my point was: who knows when that'll happen?
One reason for my doubt is the following:
We can set all-visible on a page without an FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL-log
FPIs for every heap page and not, and it's not that rare for data to live
shorter than autovacuum_freeze_max_age, or for that limit never to be reached.
This sounds like an objection to one specific heuristic, and not an
objection to the general idea. The only essential part is
"opportunistic freezing during vacuum, when the cost is clearly very
low, and the benefit is probably high". And so it now seems you were
making a far more limited statement than I first believed.
Obviously many variations are possible -- there is a spectrum.
Example: a heuristic that makes VACUUM notice when it is going to
freeze at least one tuple on a page, iff the page will be marked
all-visible in any case -- in that case we should instead freeze every
tuple on the page, and mark the page all-frozen, batching the work (we
could account for LP_DEAD items here too, not counting them on the
assumption that they'll become LP_UNUSED during the second heap pass
later on).
If we see these conditions, then the likely explanation is that the
tuples on the heap page happen to have XIDs that are "split" by the
not-actually-important FreezeLimit cutoff, despite being essentially the
same in every way that matters.
If you want to make the same heuristic more conservative: only do this
when no existing tuples are frozen, since that could be taken as a
sign of the original heuristic not quite working on the same heap page
at an earlier stage.
I suspect that even very conservative versions of the same basic idea
would still help a lot.
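(In code terms the heuristic amounts to something like the check below,
sketched against lazy_scan_prune's existing page-level state. The
prunestate->all_visible/all_frozen flags are real; the tuple counter and the
batch-freeze helper are stand-ins.)

    /*
     * Hypothetical: if we're going to set the page all-visible anyway, and
     * at least one tuple already crossed FreezeLimit, freeze everything on
     * the page and mark it all-frozen in the same pass.
     */
    if (prunestate->all_visible && !prunestate->all_frozen &&
        ntuples_forced_to_freeze > 0)
    {
        /*
         * A more conservative variant would also require that no tuple on
         * the page is frozen already.
         */
        freeze_remaining_tuples(vacrel, buf, page, prunestate);
        prunestate->all_frozen = true;
    }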
Perhaps we can have most of the benefit even without that. If we were to
freeze whenever it didn't cause an additional FPW, and perhaps didn't skip
all-visible-but-not-all-frozen pages if they were less than x% of the
to-be-scanned data, we should be able to still increase relfrozenxid in a
lot of cases?
I bet that's true. I like that idea.
If we had this policy, then the number of "extra"
visited-in-non-aggressive-vacuum pages (all-visible but not yet
all-frozen pages) could be managed over time through more
opportunistic freezing. This might make it work even better.
These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever? There won't be
so many of the pages when we do this anyway, by definition -- since
we'd have a heuristic that limited the total number (say to no more
than 10% of the total relation size, something like that).
We're smoothing out the work that currently takes place all together
during an aggressive VACUUM this way.
Moreover, there is perhaps a good chance that the total number of
all-visible-but-not-all-frozen heap pages will *stay* low over time, as a
result of this policy actually working -- there may be a virtuous
cycle that totally prevents us from getting an aggressive VACUUM even
once.
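(Again only as a sketch, with made-up names: the cap could be as simple as
comparing the visibility map's all-visible and all-frozen counts before
deciding to visit these "tenured" pages at all.)

    /* Hypothetical gate for catch-up freezing of "tenured" pages */
    BlockNumber n_all_visible,
                n_all_frozen;

    visibilitymap_count(vacrel->rel, &n_all_visible, &n_all_frozen);

    /* only bother while the backlog is small, say <= 10% of the table */
    if (n_all_visible - n_all_frozen <= vacrel->rel_pages / 10)
        freeze_tenured_pages(vacrel);   /* made-up helper */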
I have occasionally wondered if the whole idea of reading heap pages
with only a pin (and having cleanup locks in VACUUM) is really worth
it -- alternative designs seem possible. Obviously that's a BIG
discussion, and not one to have right now. But it seems kind of
relevant.
With 'reading' do you mean reads-from-os, or just references to buffer
contents?
The latter.
[1]: /messages/by-id/CAH2-Wz=9R83wcwZcPUH4FVPeDM4znzbzMvp3rt21+XhQWMU8+g@mail.gmail.com
--
Peter Geoghegan
Hi,
On 2021-11-23 17:01:20 -0800, Peter Geoghegan wrote:
One reason for my doubt is the following:
We can set all-visible on a page without an FPW image (well, as long as hint
bits aren't logged). There's a significant difference between needing to WAL-log
FPIs for every heap page and not, and it's not that rare for data to live
shorter than autovacuum_freeze_max_age, or for that limit never to be reached.
This sounds like an objection to one specific heuristic, and not an
objection to the general idea.
I understood you to propose that we do not have separate frozen and
all-visible states. Which I think will be problematic, because of scenarios
like the above.
The only essential part is "opportunistic freezing during vacuum, when the
cost is clearly very low, and the benefit is probably high". And so it now
seems you were making a far more limited statement than I first believed.
I'm on board with freezing when we already dirty the page, and when doing
so doesn't cause an additional FPI. And I don't think I've argued against that
in the past.
These all-visible (but not all-frozen) heap pages could be considered
"tenured", since they have survived at least one full VACUUM cycle
without being unset. So why not also freeze them based on the
assumption that they'll probably stay that way forever?
Because it's a potentially massive increase in write volume? E.g. if you have
an insert-only workload, and you discard old data by dropping old partitions,
this will often add yet another rewrite, despite your data likely never
getting old enough to need to be frozen.
Given that we often immediately need to start another vacuum just when one
finished, because the vacuum took long enough to reach thresholds of vacuuming
again, I don't think the (auto-)vacuum count is a good proxy.
Maybe you meant this as a more limited concept, i.e. only doing so when the
percentage of all-visible but not all-frozen pages is small?
We could perhaps do better with if we had information about the system-wide
rate of xid throughput and how often / how long past vacuums of a table took.
Greetings,
Andres Freund
On Tue, Nov 23, 2021 at 5:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
Behaviour that leads to a "sudden" falling over, rather than getting gradually
worse, is bad - it somehow tends to happen on Friday evenings :).
These are among our most important challenges IMV.
I haven't had time to work through any of your feedback just yet --
though it's certainly a priority for me. I won't get to it until I return
home from PGConf NYC next week.
Even still, here is a rebased v2, just to fix the bitrot. This is just
a courtesy to anybody interested in the patch.
--
Peter Geoghegan
Attachments:
v2-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch
From 2cc761a55b6f727b44a32b03e8393ffd3f61fb2c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Nov 2021 21:27:06 -0800
Subject: [PATCH v2] Simplify lazy_scan_heap's handling of scanned pages.
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap. Pages counted by scanned_pages are
now the complement of the pages that are skipped over using the
visibility map. This new definition significantly simplifies quite a
few things.
Now heap relation truncation, visibility map bit setting, tuple counting
(e.g., for pg_class.reltuples), and tuple freezing all share a common
definition of scanned_pages. That makes it possible to remove certain
special cases that never made much sense. We no longer need to track
tupcount_pages separately (see bugfix commit 1914c5ea for details),
since we now always count tuples from pages that are scanned_pages. We
also don't need to needlessly distinguish between aggressive and
non-aggressive VACUUM operations when we cannot immediately acquire a
cleanup lock.
Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
relfrozenxid, we now make non-aggressive VACUUMs work just a little
harder in order to make that desirable outcome more likely in practice.
Aggressive VACUUMs have long checked contended pages with only a shared
lock, to avoid needlessly waiting on a cleanup lock (in the common case
where the contended page has no tuples that need to be frozen anyway).
We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
course -- if we did that they'd no longer be non-aggressive. But we now
make the non-aggressive case notice that a failure to acquire a cleanup
lock on one particular heap page does not in itself make it unsafe to
advance relfrozenxid for the whole relation (which is what we usually
see in the aggressive case already).
This new relfrozenxid optimization might not be all that valuable on its
own, but it may still facilitate future work that makes non-aggressive
VACUUMs more conscious of the benefit of advancing relfrozenxid sooner
rather than later. In general it would be useful for non-aggressive
VACUUMs to be "more aggressive" opportunistically (e.g., by waiting for
a cleanup lock once or twice if needed). It would also be generally
useful if aggressive VACUUMs were "less aggressive" opportunistically
(e.g. by being responsive to query cancellations when the risk of
wraparound failure is still very low).
We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
so we should not squander any opportunity to do that. Commit 8523492d4e
taught VACUUM to set LP_DEAD line pointers to LP_UNUSED while only
holding an exclusive lock (not a cleanup lock), so we can expect to set
existing LP_DEAD items to LP_UNUSED reliably, even when we cannot
acquire our own cleanup lock at either pass over the heap (unless we opt
to skip index vacuuming, which implies that there is no second pass over
the heap).
Note that we no longer report on "pin skipped pages" in VACUUM VERBOSE,
since there is barely any real practical sense in which we actually
miss doing useful work for these pages. Besides, this information
always seemed to have little practical value, even to Postgres hackers.
---
src/backend/access/heap/vacuumlazy.c | 792 +++++++++++-------
.../isolation/expected/vacuum-reltuples.out | 2 +-
.../isolation/specs/vacuum-reltuples.spec | 7 +-
3 files changed, 500 insertions(+), 301 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 282b44f87..39a7fb39e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -284,6 +284,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ /* Aggressive VACUUM (scan all unfrozen pages)? */
+ bool aggressive;
/* Wraparound failsafe has been triggered? */
bool failsafe_active;
/* Consider index vacuuming bypass optimization? */
@@ -308,6 +310,8 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
+ /* Are FreezeLimit/MultiXactCutoff still valid? */
+ bool freeze_cutoffs_valid;
/* Error reporting state */
char *relnamespace;
@@ -322,10 +326,8 @@ typedef struct LVRelState
*/
LVDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* number of pages we examined */
- BlockNumber pinskipped_pages; /* # of pages skipped due to a pin */
- BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
- BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -338,6 +340,7 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
+ /* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
int64 new_dead_tuples; /* new estimated total # of dead items in
@@ -377,19 +380,22 @@ static int elevel = -1;
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
- bool aggressive);
+static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params);
+static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool sharelock, Buffer vmbuffer);
static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
GlobalVisState *vistest,
LVPagePruneState *prunestate);
+static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool *hastup);
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, int index, Buffer *vmbuffer);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
- LVRelState *vacrel);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
@@ -465,16 +471,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive; /* should we scan all unfrozen pages? */
- bool scanned_all_unfrozen; /* actually scanned all such pages? */
+ bool aggressive;
+ BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
- TransactionId new_frozen_xid;
- MultiXactId new_min_multi;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -529,6 +533,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel = rel;
vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
&vacrel->indrels);
+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
@@ -573,6 +578,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
+ /* Track if cutoffs became invalid (possible in !aggressive case only) */
+ vacrel->freeze_cutoffs_valid = true;
vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
vacrel->relname = pstrdup(RelationGetRelationName(rel));
@@ -609,30 +616,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params, aggressive);
+ lazy_scan_heap(vacrel, params);
/* Done with indexes */
vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
/*
- * Compute whether we actually scanned the all unfrozen pages. If we did,
- * we can adjust relfrozenxid and relminmxid.
- *
- * NB: We need to check this before truncating the relation, because that
- * will change ->rel_pages.
- */
- if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
- < vacrel->rel_pages)
- {
- Assert(!aggressive);
- scanned_all_unfrozen = false;
- }
- else
- scanned_all_unfrozen = true;
-
- /*
- * Optionally truncate the relation.
+ * Optionally truncate the relation. But remember the relation size used
+ * by lazy_scan_prune for later first.
*/
+ orig_rel_pages = vacrel->rel_pages;
if (should_attempt_truncation(vacrel))
{
/*
@@ -663,28 +656,43 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
- *
- * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
- * since then we don't know for certain that all tuples have a newer xmin.
*/
- new_rel_pages = vacrel->rel_pages;
+ new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
-
- vac_update_relstats(rel,
- new_rel_pages,
- new_live_tuples,
- new_rel_allvisible,
- vacrel->nindexes > 0,
- new_frozen_xid,
- new_min_multi,
- false);
+ /*
+ * Aggressive VACUUM (which is the same thing as anti-wraparound
+ * autovacuum for most practical purposes) exists so that we'll reliably
+ * advance relfrozenxid and relminmxid sooner or later. But we can often
+ * opportunistically advance them even in a non-aggressive VACUUM.
+ * Consider if that's possible now.
+ *
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_prune, from before a possible relation
+ * truncation took place. (vacrel->rel_pages is now new_rel_pages.)
+ */
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+ !vacrel->freeze_cutoffs_valid)
+ {
+ /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+ Assert(!aggressive);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+ /* Can safely advance relfrozen and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+ orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ FreezeLimit, MultiXactCutoff, false);
+ }
/*
* Report results to the stats collector, too.
@@ -713,7 +721,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
- BlockNumber orig_rel_pages;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -760,10 +767,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->pinskipped_pages,
vacrel->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
@@ -771,7 +777,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
(long long) vacrel->new_rel_tuples,
(long long) vacrel->new_dead_tuples,
OldestXmin);
- orig_rel_pages = vacrel->rel_pages + vacrel->pages_removed;
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -888,9 +893,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, VacuumParams *params)
{
LVDeadItems *dead_items;
+ bool aggressive;
BlockNumber nblocks,
blkno,
next_unskippable_block,
@@ -910,26 +916,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pg_rusage_init(&ru0);
- if (aggressive)
- ereport(elevel,
- (errmsg("aggressively vacuuming \"%s.%s\"",
- vacrel->relnamespace,
- vacrel->relname)));
- else
- ereport(elevel,
- (errmsg("vacuuming \"%s.%s\"",
- vacrel->relnamespace,
- vacrel->relname)));
-
+ aggressive = vacrel->aggressive;
nblocks = RelationGetNumberOfBlocks(vacrel->rel);
next_unskippable_block = 0;
next_failsafe_block = 0;
next_fsm_block_to_vacuum = 0;
vacrel->rel_pages = nblocks;
vacrel->scanned_pages = 0;
- vacrel->pinskipped_pages = 0;
vacrel->frozenskipped_pages = 0;
- vacrel->tupcount_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
vacrel->nonempty_pages = 0;
@@ -947,6 +941,17 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vacrel->indstats = (IndexBulkDeleteResult **)
palloc0(vacrel->nindexes * sizeof(IndexBulkDeleteResult *));
+ if (aggressive)
+ ereport(elevel,
+ (errmsg("aggressively vacuuming \"%s.%s\"",
+ vacrel->relnamespace,
+ vacrel->relname)));
+ else
+ ereport(elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ vacrel->relnamespace,
+ vacrel->relname)));
+
/*
* Do failsafe precheck before calling dead_items_alloc. This ensures
* that parallel VACUUM won't be attempted when relfrozenxid is already
@@ -1002,15 +1007,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* just added to that page are necessarily newer than the GlobalXmin we
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- *
- * We will scan the table's last page, at least to the extent of
- * determining whether it has tuples or not, even if it should be skipped
- * according to the above rules; except when we've already determined that
- * it's not worth trying to truncate the table. This avoids having
- * lazy_truncate_heap() take access-exclusive lock on the table to attempt
- * a truncation that just fails immediately because there are tuples in
- * the last page. This is worth avoiding mainly because such a lock must
- * be replayed on any hot standby, where it can be disruptive.
*/
if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
{
@@ -1048,18 +1044,14 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;
- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
+ /*
+ * Consider need to skip blocks
+ */
if (blkno == next_unskippable_block)
{
/* Time to advance next_unskippable_block */
@@ -1108,13 +1100,19 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current block can be skipped if we've seen a long enough
+ * run of skippable blocks to justify skipping it.
+ *
+ * There is an exception: we will scan the table's last page to
+ * determine whether it has tuples or not, even if it would
+ * otherwise be skipped (unless it's clearly not worth trying to
+ * truncate the table). This avoids having lazy_truncate_heap()
+ * take access-exclusive lock on the table to attempt a truncation
+ * that just fails immediately because there are tuples in the
+ * last page.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks &&
+ !(blkno == nblocks - 1 && should_attempt_truncation(vacrel)))
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -1124,12 +1122,22 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
* know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * in this case an approximate answer is still correct.
+ *
+ * (We really don't want to miss out on the opportunity to
+ * advance relfrozenxid in a non-aggressive vacuum, but this
+ * edge case shouldn't make that appreciably less likely in
+ * practice.)
*/
if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
+
+ /*
+ * Otherwise, the page must be at least all-visible if not
+ * all-frozen, so we can set all_visible_according_to_vm = true
+ */
all_visible_according_to_vm = true;
}
@@ -1154,7 +1162,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* Consider if we definitely have enough space to process TIDs on page
* already. If we are close to overrunning the available space for
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
- * this page.
+ * this page. Must do this before calling lazy_scan_prune (or before
+ * calling lazy_scan_noprune).
*/
Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
@@ -1189,7 +1198,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
}
/*
- * Set up visibility map page as needed.
+ * Set up visibility map page as needed, and pin the heap page that
+ * we're going to scan.
*
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
@@ -1202,156 +1212,52 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
+ vacrel->scanned_pages++;
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing in lazy_scan_noprune.
*/
if (!ConditionalLockBufferForCleanup(buf))
{
bool hastup;
- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
+ /* Lock and pin released for us */
+ continue;
+ }
+
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
+ {
+ /* No need to wait for cleanup lock for this page */
+ UnlockReleaseBuffer(buf);
+ if (hastup)
+ vacrel->nonempty_pages = blkno + 1;
continue;
}
/*
- * Read the page with share lock to see if any xids on it need to
- * be frozen. If not we just skip the page, after updating our
- * scan statistics. If there are some, we wait for cleanup lock.
- *
- * We could defer the lock request further by remembering the page
- * and coming back to it later, or we could even register
- * ourselves for multiple buffers and then service whichever one
- * is received first. For now, this seems good enough.
- *
- * If we get here with aggressive false, then we're just forcibly
- * checking the page, and so we don't want to insist on getting
- * the lock; we only need to know if the page contains tuples, so
- * that we can update nonempty_pages correctly. It's convenient
- * to use lazy_check_needs_freeze() for both situations, though.
+ * lazy_scan_noprune could not do all required processing without
+ * a cleanup lock. Wait for a cleanup lock, and then proceed to
+ * lazy_scan_prune to perform ordinary pruning and freezing.
*/
- LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
- {
- UnlockReleaseBuffer(buf);
- vacrel->scanned_pages++;
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
- continue;
- }
- if (!aggressive)
- {
- /*
- * Here, we must not advance scanned_pages; that would amount
- * to claiming that the page contains no freezable tuples.
- */
- UnlockReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
- continue;
- }
+ Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /* drop through to normal processing */
}
- /*
- * By here we definitely have enough dead_items space for whatever
- * LP_DEAD tids are on this page, we have the visibility map page set
- * up in case we need to set this page's all_visible/all_frozen bit,
- * and we have a cleanup lock. Any tuples on this page are now sure
- * to be "counted" by this VACUUM.
- *
- * One last piece of preamble needs to take place before we can prune:
- * we need to consider new and empty pages.
- */
- vacrel->scanned_pages++;
- vacrel->tupcount_pages++;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
- /*
- * All-zeroes pages can be left over if either a backend extends
- * the relation by a single page, but crashes before the newly
- * initialized page has been written out, or when bulk-extending
- * the relation (which creates a number of empty pages at the tail
- * end of the relation, but enters them into the FSM).
- *
- * Note we do not enter the page into the visibilitymap. That has
- * the downside that we repeatedly visit this page in subsequent
- * vacuums, but otherwise we'll never not discover the space on a
- * promoted standby. The harm of repeated checking ought to
- * normally not be too bad - the space usually should be used at
- * some point, otherwise there wouldn't be any regular vacuums.
- *
- * Make sure these pages are in the FSM, to ensure they can be
- * reused. Do that by testing if there's any space recorded for
- * the page. If not, enter it. We do so after releasing the lock
- * on the heap page, the FSM is approximate, after all.
- */
- UnlockReleaseBuffer(buf);
-
- if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
- {
- Size freespace = BLCKSZ - SizeOfPageHeaderData;
-
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- }
- continue;
- }
-
- if (PageIsEmpty(page))
- {
- Size freespace = PageGetHeapFreeSpace(page);
-
- /*
- * Empty pages are always all-visible and all-frozen (note that
- * the same is currently not true for new pages, see above).
- */
- if (!PageIsAllVisible(page))
- {
- START_CRIT_SECTION();
-
- /* mark buffer dirty before writing a WAL record */
- MarkBufferDirty(buf);
-
- /*
- * It's possible that another backend has extended the heap,
- * initialized the page, and then failed to WAL-log the page
- * due to an ERROR. Since heap extension is not WAL-logged,
- * recovery might try to replay our record setting the page
- * all-visible and find that the page isn't initialized, which
- * will cause a PANIC. To prevent that, check whether the
- * page has been previously WAL-logged, and if not, do that
- * now.
- */
- if (RelationNeedsWAL(vacrel->rel) &&
- PageGetLSN(page) == InvalidXLogRecPtr)
- log_newpage_buffer(buf, true);
-
- PageSetAllVisible(page);
- visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
- END_CRIT_SECTION();
- }
-
- UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ /* Lock and pin released for us */
continue;
}
@@ -1564,7 +1470,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
- vacrel->tupcount_pages,
+ vacrel->scanned_pages,
vacrel->live_tuples);
/*
@@ -1637,14 +1543,10 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
(long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
- appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
- "Skipped %u pages due to buffer pins, ",
- vacrel->pinskipped_pages),
- vacrel->pinskipped_pages);
- appendStringInfo(&buf, ngettext("%u frozen page.\n",
- "%u frozen pages.\n",
- vacrel->frozenskipped_pages),
- vacrel->frozenskipped_pages);
+ appendStringInfo(&buf, ngettext("%u page skipped using visibility map.\n",
+ "%u pages skipped using visibility map.\n",
+ vacrel->rel_pages - vacrel->scanned_pages),
+ vacrel->rel_pages - vacrel->scanned_pages);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1658,6 +1560,132 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pfree(buf.data);
}
+/*
+ * lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
+ *
+ * Must call here to handle both new and empty pages before calling
+ * lazy_scan_prune or lazy_scan_noprune, since they're not prepared to deal
+ * with new or empty pages.
+ *
+ * It's necessary to consider new pages as a special case, since the rules for
+ * maintaining the visibility map and FSM with empty pages are a little
+ * different (though new pages can be truncated based on the usual rules).
+ *
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a rare corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
+ *
+ * Caller can either hold a buffer cleanup lock on the buffer, or a simple
+ * shared lock.
+ */
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
+{
+ Size freespace;
+
+ if (PageIsNew(page))
+ {
+ /*
+ * All-zeroes pages can be left over if either a backend extends the
+ * relation by a single page, but crashes before the newly initialized
+ * page has been written out, or when bulk-extending the relation
+ * (which creates a number of empty pages at the tail end of the
+ * relation), and then enters them into the FSM.
+ *
+ * Note we do not enter the page into the visibilitymap. That has the
+ * downside that we repeatedly visit this page in subsequent vacuums,
+ * but otherwise we'll never not discover the space on a promoted
+ * standby. The harm of repeated checking ought to normally not be too
+ * bad - the space usually should be used at some point, otherwise
+ * there wouldn't be any regular vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused.
+ * Do that by testing if there's any space recorded for the page. If
+ * not, enter it. We do so after releasing the lock on the heap page,
+ * the FSM is approximate, after all.
+ */
+ UnlockReleaseBuffer(buf);
+
+ if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
+ {
+ freespace = BLCKSZ - SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ }
+
+ return true;
+ }
+
+ if (PageIsEmpty(page))
+ {
+ /*
+ * It seems likely that caller will always be able to get a cleanup
+ * lock on an empty page. But don't take any chances -- escalate to
+ * an exclusive lock (still don't need a cleanup lock, though).
+ */
+ if (sharelock)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!PageIsEmpty(page))
+ {
+ /* page isn't new or empty -- keep lock and pin for now */
+ return false;
+ }
+ }
+ else
+ {
+ /* Already have a full cleanup lock (which is more than enough) */
+ }
+
+ freespace = PageGetHeapFreeSpace(page);
+
+ /*
+ * Unlike new pages, empty pages are always set all-visible and
+ * all-frozen.
+ */
+ if (!PageIsAllVisible(page))
+ {
+ START_CRIT_SECTION();
+
+ /* mark buffer dirty before writing a WAL record */
+ MarkBufferDirty(buf);
+
+ /*
+ * It's possible that another backend has extended the heap,
+ * initialized the page, and then failed to WAL-log the page due
+ * to an ERROR. Since heap extension is not WAL-logged, recovery
+ * might try to replay our record setting the page all-visible and
+ * find that the page isn't initialized, which will cause a PANIC.
+ * To prevent that, check whether the page has been previously
+ * WAL-logged, and if not, do that now.
+ */
+ if (RelationNeedsWAL(vacrel->rel) &&
+ PageGetLSN(page) == InvalidXLogRecPtr)
+ log_newpage_buffer(buf, true);
+
+ PageSetAllVisible(page);
+ visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+ END_CRIT_SECTION();
+ }
+
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ return true;
+ }
+
+ /* page isn't new or empty -- keep lock and pin */
+ return false;
+}
+
/*
* lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
*
@@ -1764,10 +1792,9 @@ retry:
* LP_DEAD items are processed outside of the loop.
*
* Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
- * count_nondeletable_pages() do it -- they only consider pages empty
- * when they only have LP_UNUSED items, which is important for
- * correctness.
+ * LP_DEAD item here, which is not how count_nondeletable_pages() does
+ * it -- it only considers pages empty/truncatable when they have no
+ * items at all (except LP_UNUSED items).
*
* Our assumption is that any LP_DEAD items we encounter here will
* become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
@@ -2054,6 +2081,236 @@ retry:
vacrel->live_tuples += live_tuples;
}
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
+ *
+ * While pruning isn't performed here, we can at least collect existing
+ * LP_DEAD items into the dead_items array for removal from indexes. It's
+ * quite possible that earlier opportunistic pruning left LP_DEAD items
+ * behind, and we shouldn't miss out on an opportunity to make them reusable
+ * (VACUUM alone is capable of cleaning up line pointer bloat like this).
+ * Note that we'll only require an exclusive lock (not a cleanup lock) later
+ * on when we set these LP_DEAD items to LP_UNUSED in lazy_vacuum_heap_page.
+ *
+ * Freezing isn't performed here either. For aggressive VACUUM callers, we
+ * may return false to indicate that a full cleanup lock is required. This is
+ * necessary because pruning requires a cleanup lock, and because VACUUM
+ * cannot freeze a page's tuples until after pruning takes place (freezing
+ * tuples effectively requires a cleanup lock, though we don't need a cleanup
+ * lock in lazy_vacuum_heap_page or in lazy_scan_new_or_empty to set a heap
+ * page all-frozen in the visibility map).
+ *
+ * We'll always return true for a non-aggressive VACUUM, even when we know
+ * that this will cause them to miss out on freezing tuples from before
+ * vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
+ * lock. This does mean that they definitely won't be able to advance
+ * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
+ * relminmxid).
+ *
+ * See lazy_scan_prune for an explanation of hastup return flag.
+ */
+static bool
+lazy_scan_noprune(LVRelState *vacrel,
+ Buffer buf,
+ BlockNumber blkno,
+ Page page,
+ bool *hastup)
+{
+ OffsetNumber offnum,
+ maxoff;
+ bool has_tuple_needs_freeze = false;
+ int lpdead_items,
+ num_tuples,
+ live_tuples,
+ new_dead_tuples;
+ HeapTupleHeader tupleheader;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+
+ *hastup = false; /* for now */
+
+ lpdead_items = 0;
+ num_tuples = 0;
+ live_tuples = 0;
+ new_dead_tuples = 0;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+ HeapTupleData tuple;
+
+ vacrel->offnum = offnum;
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true; /* page won't be truncatable */
+ continue;
+ }
+
+ if (ItemIdIsDead(itemid))
+ {
+ /*
+ * Deliberately don't set hastup=true here. See same point in
+ * lazy_scan_prune for an explanation.
+ */
+ deadoffsets[lpdead_items++] = offnum;
+ continue;
+ }
+
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (!has_tuple_needs_freeze &&
+ heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
+ vacrel->MultiXactCutoff, buf))
+ {
+ if (vacrel->aggressive)
+ {
+ /* Going to have to get cleanup lock for lazy_scan_prune */
+ vacrel->offnum = InvalidOffsetNumber;
+ return false;
+ }
+
+ has_tuple_needs_freeze = true;
+ }
+
+ num_tuples++;
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(vacrel->rel);
+
+ switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+ {
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ case HEAPTUPLE_LIVE:
+
+ /*
+ * Count both cases as live, just like lazy_scan_prune
+ */
+ live_tuples++;
+
+ break;
+ case HEAPTUPLE_DEAD:
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * We count DEAD and RECENTLY_DEAD tuples in new_dead_tuples.
+ *
+ * lazy_scan_prune only does this for RECENTLY_DEAD tuples,
+ * and never has to deal with DEAD tuples directly (they
+ * reliably become LP_DEAD items through pruning). Our
+ * approach to DEAD tuples is a bit arbitrary, but it seems
+ * better than totally ignoring them.
+ */
+ new_dead_tuples++;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+ /*
+ * Do not count these rows as live, just like lazy_scan_prune
+ */
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+
+ }
+
+ vacrel->offnum = InvalidOffsetNumber;
+
+ if (has_tuple_needs_freeze)
+ {
+ /*
+ * Current non-aggressive VACUUM operation definitely won't be able to
+ * advance relfrozenxid or relminmxid
+ */
+ Assert(!vacrel->aggressive);
+ vacrel->freeze_cutoffs_valid = false;
+ }
+
+ /*
+ * Now save details of the LP_DEAD items from the page in the dead_items
+ * array -- but only when VACUUM uses the two-pass strategy
+ */
+ if (vacrel->nindexes == 0)
+ {
+ /*
+ * We are not prepared to handle the corner case where a single pass
+ * strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
+ * items. Repeat the same trick that we use for DEAD tuples: pretend
+ * that they're RECENTLY_DEAD tuples.
+ *
+ * There is no fundamental reason why we must take the easy way out
+ * like this. Finding a way to make these LP_DEAD items get set to
+ * LP_UNUSED would be less valuable and more complicated than it is in
+ * the two-pass strategy case, since it would necessitate that we
+ * repeat our lazy_scan_heap caller's page-at-a-time/one-pass-strategy
+ * heap vacuuming steps. Whereas in the two-pass strategy case,
+ * lazy_vacuum_heap_rel will set the LP_DEAD items to LP_UNUSED. It
+ * must always deal with things like remaining DEAD tuples with
+ * storage, new LP_DEAD items that we didn't see earlier on, etc.
+ */
+ if (lpdead_items > 0)
+ *hastup = true; /* page won't be truncatable */
+ num_tuples += lpdead_items;
+ new_dead_tuples += lpdead_items;
+ }
+ else if (lpdead_items > 0)
+ {
+ LVDeadItems *dead_items = vacrel->dead_items;
+ ItemPointerData tmp;
+
+ vacrel->lpdead_item_pages++;
+
+ ItemPointerSetBlockNumber(&tmp, blkno);
+
+ for (int i = 0; i < lpdead_items; i++)
+ {
+ ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+ dead_items->items[dead_items->num_items++] = tmp;
+ }
+
+ Assert(dead_items->num_items <= dead_items->max_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ dead_items->num_items);
+
+ vacrel->lpdead_items += lpdead_items;
+ }
+ else
+ {
+ /*
+ * We opt to skip FSM processing for the page on the grounds that it
+ * is probably being modified by concurrent DML operations. Seems
+ * best to assume that the space is best left behind for future
+ * updates of existing tuples. This matches what opportunistic
+ * pruning does.
+ *
+ * It's theoretically possible for us to set VM bits here too, but we
+ * don't try that either. It is highly unlikely to be possible, much
+ * less useful.
+ */
+ }
+
+ /*
+ * Finally, add relevant page-local counts to whole-VACUUM counts
+ */
+ vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->num_tuples += num_tuples;
+ vacrel->live_tuples += live_tuples;
+
+ /* Caller won't need to call lazy_scan_prune with same page */
+ return true;
+}
+
/*
* Remove the collected garbage tuples from the table and its indexes.
*
@@ -2500,67 +2757,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
return index;
}
-/*
- * lazy_check_needs_freeze() -- scan page to see if any tuples
- * need to be cleaned to avoid wraparound
- *
- * Returns true if the page needs to be vacuumed using cleanup lock.
- * Also returns a flag indicating whether page contains any tuples at all.
- */
-static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
-{
- Page page = BufferGetPage(buf);
- OffsetNumber offnum,
- maxoff;
- HeapTupleHeader tupleheader;
-
- *hastup = false;
-
- /*
- * New and empty pages, obviously, don't contain tuples. We could make
- * sure that the page is registered in the FSM, but it doesn't seem worth
- * waiting for a cleanup lock just for that, especially because it's
- * likely that the pin holder will do so.
- */
- if (PageIsNew(page) || PageIsEmpty(page))
- return false;
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = FirstOffsetNumber;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid;
-
- /*
- * Set the offset number so that we can display it along with any
- * error that occurred while processing this tuple.
- */
- vacrel->offnum = offnum;
- itemid = PageGetItemId(page, offnum);
-
- /* this should match hastup test in count_nondeletable_pages() */
- if (ItemIdIsUsed(itemid))
- *hastup = true;
-
- /* dead and redirect items never need freezing */
- if (!ItemIdIsNormal(itemid))
- continue;
-
- tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-
- if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
- break;
- } /* scan along page */
-
- /* Clear the offset information once we have processed the given page. */
- vacrel->offnum = InvalidOffsetNumber;
-
- return (offnum <= maxoff);
-}
-
/*
* Trigger the failsafe to avoid wraparound failure when vacrel table has a
* relfrozenxid and/or relminmxid that is dangerously far in the past.
@@ -2655,7 +2851,7 @@ do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel)
*/
vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
vacrel->lps->lvshared->estimated_count =
- (vacrel->tupcount_pages < vacrel->rel_pages);
+ (vacrel->scanned_pages < vacrel->rel_pages);
/* Determine the number of parallel workers to launch */
if (vacrel->lps->lvshared->first_time)
@@ -2972,7 +3168,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
bool estimated_count =
- vacrel->tupcount_pages < vacrel->rel_pages;
+ vacrel->scanned_pages < vacrel->rel_pages;
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -3123,7 +3319,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
* should_attempt_truncation - should we attempt to truncate the heap?
*
* Don't even think about it unless we have a shot at releasing a goodly
- * number of pages. Otherwise, the time taken isn't worth it.
+ * number of pages. Otherwise, the time taken isn't worth it, mainly because
+ * an AccessExclusive lock must be replayed on any hot standby, where it can
+ * be particularly disruptive.
*
* Also don't attempt it if wraparound failsafe is in effect. It's hard to
* predict how long lazy_truncate_heap will take. Don't take any chances.
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
index cdbe7f3a6..ce55376e7 100644
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ b/src/test/isolation/expected/vacuum-reltuples.out
@@ -45,7 +45,7 @@ step stats:
relpages|reltuples
--------+---------
- 1| 20
+ 1| 21
(1 row)
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
index ae2f79b8f..a2a461f2f 100644
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ b/src/test/isolation/specs/vacuum-reltuples.spec
@@ -2,9 +2,10 @@
# to page pins. We absolutely need to avoid setting reltuples=0 in
# such cases, since that interferes badly with planning.
#
-# Expected result in second permutation is 20 tuples rather than 21 as
-# for the others, because vacuum should leave the previous result
-# (from before the insert) in place.
+# Expected result for all three permutations is 21 tuples, including
+# the second permutation. VACUUM is able to count the concurrently
+# inserted tuple in its final reltuples, even when a cleanup lock
+# cannot be acquired on the affected heap page.
setup {
create table smalltbl
--
2.30.2
On Tue, Nov 30, 2021 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
I haven't had time to work through any of your feedback just yet --
though it's certainly a priority for me. I won't get to it until I return
home from PGConf NYC next week.
Attached is v3, which works through most of your (Andres') feedback.
Changes in v3:
* While the first patch still gets rid of the "pinskipped_pages"
instrumentation, the second patch adds back a replacement that's
better targeted: it tracks and reports "missed_dead_tuples". This
means that log output will show the number of fully DEAD tuples with
storage that could not be pruned away because pruning them would have
required waiting for a cleanup lock. But we *don't* generally
report the number of pages that we couldn't get a cleanup lock on,
because that in itself doesn't mean that we skipped any useful work
(which is very much the point of all of the refactoring in the first
patch).
* We now have FSM processing in the lazy_scan_noprune case, which more
or less matches the standard lazy_scan_prune case.
* Many small tweaks, based on suggestions from Andres, and other
things that I noticed.
* Further simplification of the "consider skipping pages using
visibility map" logic -- we now never skip the last block in the
relation, and we no longer call should_attempt_truncation() to make
sure we have a reason to read it (a simplified standalone sketch of
the skip decision follows below).
Note that this means that we'll always read the final page during
VACUUM, even when doing so is provably unhelpful. I'd prefer to keep
the code that deals with skipping pages using the visibility map as
simple as possible. There isn't much downside to always doing that
once my refactoring is in place: there is no risk that we'll wait for
a cleanup lock (on the final page in the rel) for no good reason.
We're only wasting one page access, at most.
(I'm not 100% sure that this is the right trade-off, actually, but
it's at least worth considering.)
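To make the skip rule easier to see without wading through the diff,
here is a minimal standalone C sketch of just the skip decision. To be
clear, this is not code from the patch: the VM flag constants, the
block_is_skippable() helper, and the 95-blocks-all-frozen test data are
made up for illustration; only SKIP_PAGES_THRESHOLD reuses a name from
vacuumlazy.c.
/*
 * Toy model of the simplified skipping rule: skip a run of VM-skippable
 * blocks only when the run is at least SKIP_PAGES_THRESHOLD blocks long,
 * and never skip the final block of the relation.
 */
#include <stdbool.h>
#include <stdio.h>
#define SKIP_PAGES_THRESHOLD 32
#define VM_ALL_VISIBLE 0x01
#define VM_ALL_FROZEN  0x02
static bool
block_is_skippable(unsigned char vmflags, bool aggressive)
{
    /* aggressive VACUUM may only skip all-frozen pages */
    if (aggressive)
        return (vmflags & VM_ALL_FROZEN) != 0;
    return (vmflags & VM_ALL_VISIBLE) != 0;
}
int
main(void)
{
    unsigned char vm[100];
    int         nblocks = 100;
    int         scanned = 0;
    /* pretend the first 95 blocks are all-visible and all-frozen */
    for (int i = 0; i < nblocks; i++)
        vm[i] = (i < 95) ? (VM_ALL_VISIBLE | VM_ALL_FROZEN) : 0;
    for (int blkno = 0; blkno < nblocks; blkno++)
    {
        int         run = 0;
        /* measure the run of skippable blocks starting here */
        while (blkno + run < nblocks &&
               block_is_skippable(vm[blkno + run], false))
            run++;
        if (run >= SKIP_PAGES_THRESHOLD)
        {
            int         skip_to = blkno + run;
            /* never skip the final block, whatever the VM says */
            if (skip_to >= nblocks)
                skip_to = nblocks - 1;
            blkno = skip_to - 1;    /* loop increment lands on skip_to */
            continue;
        }
        scanned++;                  /* this block gets read and processed */
    }
    printf("would scan %d of %d blocks\n", scanned, nblocks);
    return 0;
}
The only two rules that matter here are the ones discussed above: a run
of skippable blocks must be at least SKIP_PAGES_THRESHOLD long before we
skip any of it, and the final block is always scanned, even when it
falls inside such a run.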
Not included in v3:
* Still haven't added the isolation test for rel truncation, though
it's on my TODO list.
* I'm still working on the optimization that we discussed on this
thread: allowing the final relfrozenxid (that we set in pg_class) to
be determined dynamically, based on the actual XIDs we observed in the
table, rather than naively using FreezeLimit (a toy illustration of
the idea appears below). I'm not ready to post that today, but it
shouldn't take too much longer to be good enough to review.
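Purely to illustrate the idea (this is a toy, not the pending patch,
and every name in it is made up), the dynamic approach boils down to
remembering the oldest XID actually observed during the scan, using the
usual modulo-2^32 comparison rule, and then using that value rather
than the conservative FreezeLimit when updating pg_class.relfrozenxid:
#include <stdint.h>
#include <stdio.h>
typedef uint32_t ToyTransactionId;
/* circular comparison, in the style of TransactionIdPrecedes() */
static int
toy_xid_precedes(ToyTransactionId a, ToyTransactionId b)
{
    int32_t     diff = (int32_t) (a - b);
    return diff < 0;
}
int
main(void)
{
    /* XIDs we pretend to have seen in tuple headers while scanning */
    ToyTransactionId observed[] = {4000, 3500, 4200, 3650};
    ToyTransactionId freeze_limit = 3000;   /* conservative cutoff */
    ToyTransactionId oldest_seen = observed[0];
    for (int i = 1; i < 4; i++)
    {
        if (toy_xid_precedes(observed[i], oldest_seen))
            oldest_seen = observed[i];
    }
    /*
     * Today we'd report FreezeLimit (3000); tracking the oldest XID we
     * actually saw (3500) would allow a newer final relfrozenxid.
     */
    printf("FreezeLimit: %u, oldest observed: %u\n",
           (unsigned) freeze_limit, (unsigned) oldest_seen);
    return 0;
}
The real mechanism would also have to deal with MultiXacts and with
pages skipped using the visibility map; the sketch only shows the
running-minimum bookkeeping at its core.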
Thanks
--
Peter Geoghegan
Attachments:
v3-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (application/octet-stream)
From e867662d06d8db72b7be4e26f509c57029d867f9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Nov 2021 21:27:06 -0800
Subject: [PATCH v3 1/2] Simplify lazy_scan_heap's handling of scanned pages.
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap. Pages counted by scanned_pages are
now the complement of the pages that are skipped over using the
visibility map. This new definition significantly simplifies quite a
few things.
Now heap relation truncation, visibility map bit setting, tuple counting
(e.g., for pg_class.reltuples), and tuple freezing all share a common
definition of scanned_pages. That makes it possible to remove certain
special cases that never made much sense. We no longer need to track
tupcount_pages separately (see bugfix commit 1914c5ea for details),
since we now always count tuples from pages that are scanned_pages. We
also no longer needlessly distinguish between aggressive and
non-aggressive VACUUM operations when we cannot immediately acquire a
cleanup lock.
Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
relfrozenxid, we now make non-aggressive VACUUMs work just a little
harder in order to make that desirable outcome more likely in practice.
Aggressive VACUUMs have long checked contended pages with only a shared
lock, to avoid needlessly waiting on a cleanup lock (in the common case
where the contended page has no tuples that need to be frozen anyway).
We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
course -- if we did that they'd no longer be non-aggressive. But we now
make the non-aggressive case notice that a failure to acquire a cleanup
lock on one particular heap page does not in itself make it unsafe to
advance relfrozenxid for the whole relation (which is what we usually
see in the aggressive case already).
We now also collect LP_DEAD items in the dead_tuples array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
so we should not squander any opportunity to do that. Commit 8523492d4e
taught VACUUM to set LP_DEAD line pointers to LP_UNUSED while only
holding an exclusive lock (not a cleanup lock), so we can expect to set
existing LP_DEAD items to LP_UNUSED reliably, even when we cannot
acquire our own cleanup lock at either pass over the heap (unless we opt
to skip index vacuuming, which implies that there is no second pass over
the heap).
We no longer report on "pin skipped pages" in log output. A later patch
will add back an improved version of the same instrumentation. We don't
want to show any information about any failures to acquire cleanup locks
unless we actually failed to do useful work as a consequence. A page
that we could not acquire a cleanup lock on is now treated as equivalent
to any other scanned page in most cases.
---
src/backend/access/heap/vacuumlazy.c | 856 +++++++++++-------
.../isolation/expected/vacuum-reltuples.out | 2 +-
.../isolation/specs/vacuum-reltuples.spec | 7 +-
3 files changed, 541 insertions(+), 324 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 282b44f87..1fb8735a2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -151,7 +151,7 @@ typedef enum
/*
* LVDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
* Each TID points to an LP_DEAD line pointer from a heap page that has been
- * processed by lazy_scan_prune.
+ * processed by lazy_scan_prune (or by lazy_scan_noprune, perhaps).
*
* Also needed by lazy_vacuum_heap_rel, which marks the same LP_DEAD line
* pointers as LP_UNUSED during second heap pass.
@@ -284,6 +284,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ /* Aggressive VACUUM (scan all unfrozen pages)? */
+ bool aggressive;
/* Wraparound failsafe has been triggered? */
bool failsafe_active;
/* Consider index vacuuming bypass optimization? */
@@ -308,6 +310,8 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
+ /* Are FreezeLimit/MultiXactCutoff still valid? */
+ bool freeze_cutoffs_valid;
/* Error reporting state */
char *relnamespace;
@@ -322,10 +326,8 @@ typedef struct LVRelState
*/
LVDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* number of pages we examined */
- BlockNumber pinskipped_pages; /* # of pages skipped due to a pin */
- BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
- BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -338,6 +340,7 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
+ /* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
int64 new_dead_tuples; /* new estimated total # of dead items in
@@ -377,19 +380,22 @@ static int elevel = -1;
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
- bool aggressive);
+static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params);
+static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool sharelock, Buffer vmbuffer);
static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
GlobalVisState *vistest,
LVPagePruneState *prunestate);
+static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool *hastup, bool *hasfreespace);
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, int index, Buffer *vmbuffer);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
- LVRelState *vacrel);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void do_parallel_lazy_vacuum_all_indexes(LVRelState *vacrel);
static void do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel);
@@ -465,16 +471,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive; /* should we scan all unfrozen pages? */
- bool scanned_all_unfrozen; /* actually scanned all such pages? */
+ bool aggressive;
+ BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
- TransactionId new_frozen_xid;
- MultiXactId new_min_multi;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -529,6 +533,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel = rel;
vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
&vacrel->indrels);
+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
@@ -573,6 +578,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
+ /* Track if cutoffs became invalid (possible in !aggressive case only) */
+ vacrel->freeze_cutoffs_valid = true;
vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
vacrel->relname = pstrdup(RelationGetRelationName(rel));
@@ -609,30 +616,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params, aggressive);
+ lazy_scan_heap(vacrel, params);
/* Done with indexes */
vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
/*
- * Compute whether we actually scanned the all unfrozen pages. If we did,
- * we can adjust relfrozenxid and relminmxid.
- *
- * NB: We need to check this before truncating the relation, because that
- * will change ->rel_pages.
- */
- if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
- < vacrel->rel_pages)
- {
- Assert(!aggressive);
- scanned_all_unfrozen = false;
- }
- else
- scanned_all_unfrozen = true;
-
- /*
- * Optionally truncate the relation.
+ * Optionally truncate the relation. But remember the relation size used
+ * by lazy_scan_prune for later first.
*/
+ orig_rel_pages = vacrel->rel_pages;
if (should_attempt_truncation(vacrel))
{
/*
@@ -663,28 +656,44 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
- *
- * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
- * since then we don't know for certain that all tuples have a newer xmin.
*/
- new_rel_pages = vacrel->rel_pages;
+ new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
-
- vac_update_relstats(rel,
- new_rel_pages,
- new_live_tuples,
- new_rel_allvisible,
- vacrel->nindexes > 0,
- new_frozen_xid,
- new_min_multi,
- false);
+ /*
+ * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
+ * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
+ * provided we didn't skip any all-visible (not all-frozen) pages using
+ * the visibility map, and assuming that we didn't fail to get a cleanup
+ * lock that made it unsafe with respect to FreezeLimit (or perhaps our
+ * MultiXactCutoff) established for VACUUM operation.
+ *
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_heap, which won't match when we
+ * happened to truncate the relation afterwards.
+ */
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+ !vacrel->freeze_cutoffs_valid)
+ {
+ /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+ Assert(!aggressive);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+ /* Can safely advance relfrozen and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+ orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ FreezeLimit, MultiXactCutoff, false);
+ }
/*
* Report results to the stats collector, too.
@@ -713,7 +722,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
- BlockNumber orig_rel_pages;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -760,10 +768,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->pinskipped_pages,
vacrel->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
@@ -771,7 +778,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
(long long) vacrel->new_rel_tuples,
(long long) vacrel->new_dead_tuples,
OldestXmin);
- orig_rel_pages = vacrel->rel_pages + vacrel->pages_removed;
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -888,7 +894,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, VacuumParams *params)
{
LVDeadItems *dead_items;
BlockNumber nblocks,
@@ -910,7 +916,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pg_rusage_init(&ru0);
- if (aggressive)
+ if (vacrel->aggressive)
ereport(elevel,
(errmsg("aggressively vacuuming \"%s.%s\"",
vacrel->relnamespace,
@@ -922,14 +928,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vacrel->relname)));
nblocks = RelationGetNumberOfBlocks(vacrel->rel);
- next_unskippable_block = 0;
next_failsafe_block = 0;
next_fsm_block_to_vacuum = 0;
vacrel->rel_pages = nblocks;
vacrel->scanned_pages = 0;
- vacrel->pinskipped_pages = 0;
vacrel->frozenskipped_pages = 0;
- vacrel->tupcount_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
vacrel->nonempty_pages = 0;
@@ -969,7 +972,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/*
- * Except when aggressive is set, we want to skip pages that are
+ * Set things up for skipping blocks using visibility map.
+ *
+ * Except when vacrel->aggressive is set, we want to skip pages that are
* all-visible according to the visibility map, but only when we can skip
* at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
* sequentially, the OS should be doing readahead for us, so there's no
@@ -978,8 +983,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* page means that we can't update relfrozenxid, so we only want to do it
* if we can skip a goodly number of pages.
*
- * When aggressive is set, we can't skip pages just because they are
- * all-visible, but we can still skip pages that are all-frozen, since
+ * When vacrel->aggressive is set, we can't skip pages just because they
+ * are all-visible, but we can still skip pages that are all-frozen, since
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
@@ -1002,18 +1007,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* just added to that page are necessarily newer than the GlobalXmin we
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- *
- * We will scan the table's last page, at least to the extent of
- * determining whether it has tuples or not, even if it should be skipped
- * according to the above rules; except when we've already determined that
- * it's not worth trying to truncate the table. This avoids having
- * lazy_truncate_heap() take access-exclusive lock on the table to attempt
- * a truncation that just fails immediately because there are tuples in
- * the last page. This is worth avoiding mainly because such a lock must
- * be replayed on any hot standby, where it can be disruptive.
*/
+ skipping_blocks = false;
if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
{
+ next_unskippable_block = 0;
while (next_unskippable_block < nblocks)
{
uint8 vmstatus;
@@ -1021,7 +1019,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmstatus = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -1034,12 +1032,11 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vacuum_delay_point();
next_unskippable_block++;
}
+ if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
+ skipping_blocks = true;
}
-
- if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
else
- skipping_blocks = false;
+ next_unskippable_block = InvalidBlockNumber;
for (blkno = 0; blkno < nblocks; blkno++)
{
@@ -1048,44 +1045,38 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;
- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
blkno, InvalidOffsetNumber);
+ /*
+ * Consider need to skip blocks using visibility map
+ */
if (blkno == next_unskippable_block)
{
/* Time to advance next_unskippable_block */
+ Assert((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0);
next_unskippable_block++;
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ while (next_unskippable_block < nblocks)
{
- while (next_unskippable_block < nblocks)
- {
- uint8 vmskipflags;
+ uint8 vmskipflags;
- vmskipflags = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (aggressive)
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
+ vmskipflags = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ &vmbuffer);
+ if (vacrel->aggressive)
+ {
+ if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
+ break;
}
+ else
+ {
+ if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ break;
+ }
+ vacuum_delay_point();
+ next_unskippable_block++;
}
/*
@@ -1102,19 +1093,24 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* it's not all-visible. But in an aggressive vacuum we know only
* that it's not all-frozen, so it might still be all-visible.
*/
- if (aggressive && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive &&
+ VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
all_visible_according_to_vm = true;
}
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current page can be skipped if we've seen a long enough run
+ * of skippable blocks to justify skipping it -- provided it's not
+ * the last page in the relation (according to rel_pages/nblocks).
+ *
+ * We always scan the table's last page to determine whether it
+ * has tuples or not, even if it would otherwise be skipped. This
+ * avoids having lazy_truncate_heap() take access-exclusive lock
+ * on the table to attempt a truncation that just fails
+ * immediately because there are tuples on the last page.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -1123,18 +1119,31 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* careful to count it as a skipped all-frozen page in that
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * know whether it was initially all-frozen, so we have to
+ * recheck.
*/
- if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive ||
+ VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
+
+ /*
+ * Otherwise we scan the page. It must be at least all-visible,
+ * if not all-frozen.
+ */
all_visible_according_to_vm = true;
}
vacuum_delay_point();
+ /*
+ * We're not skipping this page using the visibility map, and so it is
+ * (by definition) a scanned page. Any tuples from this page are now
+ * guaranteed to be counted below, after some preparatory checks.
+ */
+ vacrel->scanned_pages++;
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1189,174 +1198,78 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
}
/*
- * Set up visibility map page as needed.
- *
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
- * already have the correct page pinned anyway. However, it's
- * possible that (a) next_unskippable_block is covered by a different
- * VM page than the current block or (b) we released our pin and did a
- * cycle of index vacuuming.
+ * already have the correct page pinned anyway.
*/
visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+ /* Finished preparatory checks. Actually scan the page. */
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing using lazy_scan_noprune.
*/
if (!ConditionalLockBufferForCleanup(buf))
{
- bool hastup;
+ bool hastup,
+ hasfreespace;
- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
- {
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
- continue;
- }
-
- /*
- * Read the page with share lock to see if any xids on it need to
- * be frozen. If not we just skip the page, after updating our
- * scan statistics. If there are some, we wait for cleanup lock.
- *
- * We could defer the lock request further by remembering the page
- * and coming back to it later, or we could even register
- * ourselves for multiple buffers and then service whichever one
- * is received first. For now, this seems good enough.
- *
- * If we get here with aggressive false, then we're just forcibly
- * checking the page, and so we don't want to insist on getting
- * the lock; we only need to know if the page contains tuples, so
- * that we can update nonempty_pages correctly. It's convenient
- * to use lazy_check_needs_freeze() for both situations, though.
- */
LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- UnlockReleaseBuffer(buf);
- vacrel->scanned_pages++;
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
- if (!aggressive)
+
+ /* Collect LP_DEAD items in dead_items array, count tuples */
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
+ &hasfreespace))
{
+ Size freespace;
+
/*
- * Here, we must not advance scanned_pages; that would amount
- * to claiming that the page contains no freezable tuples.
+ * Processed page successfully (without cleanup lock) -- just
+ * need to perform rel truncation and FSM steps, much like the
+ * lazy_scan_prune case. Don't bother trying to match its
+ * visibility map setting steps, though.
*/
- UnlockReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
if (hastup)
vacrel->nonempty_pages = blkno + 1;
+ if (hasfreespace)
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ if (hasfreespace)
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
continue;
}
+
+ /*
+ * lazy_scan_noprune could not do all required processing. Wait
+ * for a cleanup lock, and call lazy_scan_prune in the usual way.
+ */
+ Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /* drop through to normal processing */
}
- /*
- * By here we definitely have enough dead_items space for whatever
- * LP_DEAD tids are on this page, we have the visibility map page set
- * up in case we need to set this page's all_visible/all_frozen bit,
- * and we have a cleanup lock. Any tuples on this page are now sure
- * to be "counted" by this VACUUM.
- *
- * One last piece of preamble needs to take place before we can prune:
- * we need to consider new and empty pages.
- */
- vacrel->scanned_pages++;
- vacrel->tupcount_pages++;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
- /*
- * All-zeroes pages can be left over if either a backend extends
- * the relation by a single page, but crashes before the newly
- * initialized page has been written out, or when bulk-extending
- * the relation (which creates a number of empty pages at the tail
- * end of the relation, but enters them into the FSM).
- *
- * Note we do not enter the page into the visibilitymap. That has
- * the downside that we repeatedly visit this page in subsequent
- * vacuums, but otherwise we'll never not discover the space on a
- * promoted standby. The harm of repeated checking ought to
- * normally not be too bad - the space usually should be used at
- * some point, otherwise there wouldn't be any regular vacuums.
- *
- * Make sure these pages are in the FSM, to ensure they can be
- * reused. Do that by testing if there's any space recorded for
- * the page. If not, enter it. We do so after releasing the lock
- * on the heap page, the FSM is approximate, after all.
- */
- UnlockReleaseBuffer(buf);
-
- if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
- {
- Size freespace = BLCKSZ - SizeOfPageHeaderData;
-
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- }
- continue;
- }
-
- if (PageIsEmpty(page))
- {
- Size freespace = PageGetHeapFreeSpace(page);
-
- /*
- * Empty pages are always all-visible and all-frozen (note that
- * the same is currently not true for new pages, see above).
- */
- if (!PageIsAllVisible(page))
- {
- START_CRIT_SECTION();
-
- /* mark buffer dirty before writing a WAL record */
- MarkBufferDirty(buf);
-
- /*
- * It's possible that another backend has extended the heap,
- * initialized the page, and then failed to WAL-log the page
- * due to an ERROR. Since heap extension is not WAL-logged,
- * recovery might try to replay our record setting the page
- * all-visible and find that the page isn't initialized, which
- * will cause a PANIC. To prevent that, check whether the
- * page has been previously WAL-logged, and if not, do that
- * now.
- */
- if (RelationNeedsWAL(vacrel->rel) &&
- PageGetLSN(page) == InvalidXLogRecPtr)
- log_newpage_buffer(buf, true);
-
- PageSetAllVisible(page);
- visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
- END_CRIT_SECTION();
- }
-
- UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
/*
- * Prune and freeze tuples.
+ * Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
* dead_items array. This includes LP_DEAD line pointers that we
@@ -1564,7 +1477,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
- vacrel->tupcount_pages,
+ vacrel->scanned_pages,
vacrel->live_tuples);
/*
@@ -1637,14 +1550,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
(long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
- appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
- "Skipped %u pages due to buffer pins, ",
- vacrel->pinskipped_pages),
- vacrel->pinskipped_pages);
- appendStringInfo(&buf, ngettext("%u frozen page.\n",
- "%u frozen pages.\n",
- vacrel->frozenskipped_pages),
- vacrel->frozenskipped_pages);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1658,6 +1563,138 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pfree(buf.data);
}
+/*
+ * lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
+ *
+ * Must call here to handle both new and empty pages before calling
+ * lazy_scan_prune or lazy_scan_noprune, since they're not prepared to deal
+ * with new or empty pages.
+ *
+ * It's necessary to consider new pages as a special case, since the rules for
+ * maintaining the visibility map and FSM with empty pages are a little
+ * different (though new pages can be truncated based on the usual rules).
+ *
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
+ *
+ * Caller must hold at least a shared lock. We might need to escalate the
+ * lock in that case, so the type of lock caller holds needs to be specified
+ * using 'sharelock' argument.
+ *
+ * Returns false in common case where caller should go on to call
+ * lazy_scan_prune (or lazy_scan_noprune). Otherwise returns true, indicating
+ * that lazy_scan_heap is done processing the page, releasing lock on caller's
+ * behalf.
+ */
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
+{
+ Size freespace;
+
+ if (PageIsNew(page))
+ {
+ /*
+ * All-zeroes pages can be left over if either a backend extends the
+ * relation by a single page, but crashes before the newly initialized
+ * page has been written out, or when bulk-extending the relation
+ * (which creates a number of empty pages at the tail end of the
+ * relation, but enters them into the FSM).
+ *
+ * Note we do not enter the page into the visibilitymap. That has the
+ * downside that we repeatedly visit this page in subsequent vacuums,
+ * but otherwise we'll never discover the space on a promoted standby.
+ * The harm of repeated checking ought to normally not be too bad.
+ * The space usually should be used at some point, otherwise there
+ * wouldn't be any regular vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused.
+ * Do that by testing if there's any space recorded for the page. If
+ * not, enter it. We do so after releasing the lock on the heap page,
+ * the FSM is approximate, after all.
+ */
+ UnlockReleaseBuffer(buf);
+
+ if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
+ {
+ freespace = BLCKSZ - SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ }
+
+ return true;
+ }
+
+ if (PageIsEmpty(page))
+ {
+ /*
+ * It seems likely that caller will always be able to get a cleanup
+ * lock on an empty page. But don't take any chances -- escalate to
+ * an exclusive lock (still don't need a cleanup lock, though).
+ */
+ if (sharelock)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!PageIsEmpty(page))
+ {
+ /* page isn't new or empty -- keep lock and pin for now */
+ return false;
+ }
+ }
+ else
+ {
+ /* Already have a full cleanup lock (which is more than enough) */
+ }
+
+ freespace = PageGetHeapFreeSpace(page);
+
+ /*
+ * Unlike new pages, empty pages are always set all-visible and
+ * all-frozen.
+ */
+ if (!PageIsAllVisible(page))
+ {
+ START_CRIT_SECTION();
+
+ /* mark buffer dirty before writing a WAL record */
+ MarkBufferDirty(buf);
+
+ /*
+ * It's possible that another backend has extended the heap,
+ * initialized the page, and then failed to WAL-log the page due
+ * to an ERROR. Since heap extension is not WAL-logged, recovery
+ * might try to replay our record setting the page all-visible and
+ * find that the page isn't initialized, which will cause a PANIC.
+ * To prevent that, check whether the page has been previously
+ * WAL-logged, and if not, do that now.
+ */
+ if (RelationNeedsWAL(vacrel->rel) &&
+ PageGetLSN(page) == InvalidXLogRecPtr)
+ log_newpage_buffer(buf, true);
+
+ PageSetAllVisible(page);
+ visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+ END_CRIT_SECTION();
+ }
+
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ return true;
+ }
+
+ /* page isn't new or empty -- keep lock and pin */
+ return false;
+}
+
/*
* lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
*
@@ -1702,6 +1739,8 @@ lazy_scan_prune(LVRelState *vacrel,
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
maxoff = PageGetMaxOffsetNumber(page);
retry:
@@ -1764,10 +1803,9 @@ retry:
* LP_DEAD items are processed outside of the loop.
*
* Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
- * count_nondeletable_pages() do it -- they only consider pages empty
- * when they only have LP_UNUSED items, which is important for
- * correctness.
+ * LP_DEAD item here, which is not how count_nondeletable_pages() does
+ * it -- it only considers pages empty/truncatable when they have no
+ * items at all (except LP_UNUSED items).
*
* Our assumption is that any LP_DEAD items we encounter here will
* become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
@@ -2054,6 +2092,243 @@ retry:
vacrel->live_tuples += live_tuples;
}
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
+ *
+ * While pruning isn't performed here, we can at least collect existing
+ * LP_DEAD items into the dead_items array for removal from indexes. It's
+ * quite possible that earlier opportunistic pruning left LP_DEAD items
+ * behind, and we shouldn't miss out on an opportunity to make them reusable
+ * (VACUUM alone is capable of cleaning up line pointer bloat like this).
+ * Note that we'll only require an exclusive lock (not a cleanup lock) later
+ * on when we set these LP_DEAD items to LP_UNUSED in lazy_vacuum_heap_page.
+ *
+ * Freezing isn't performed here either. For aggressive VACUUM callers, we
+ * may return false to indicate that a full cleanup lock is required. This is
+ * necessary because pruning requires a cleanup lock, and because VACUUM
+ * cannot freeze a page's tuples until after pruning takes place (freezing
+ * tuples effectively requires a cleanup lock, though we don't need a cleanup
+ * lock in lazy_vacuum_heap_page or in lazy_scan_new_or_empty to set a heap
+ * page all-frozen in the visibility map).
+ *
+ * Returns true to indicate that all required processing has been performed.
+ * We'll always return true for a non-aggressive VACUUM, even when we know
+ * that this will cause it to miss out on freezing tuples from before the
+ * vacrel->FreezeLimit cutoff -- a non-aggressive VACUUM should never have to
+ * wait for a cleanup lock.  This does mean that it definitely won't be able
+ * to advance relfrozenxid opportunistically (the same applies to
+ * vacrel->MultiXactCutoff and relminmxid).  Caller waits for a full cleanup
+ * lock when we return false.
+ *
+ * See lazy_scan_prune for an explanation of the hastup return flag.  The
+ * hasfreespace flag tells the caller whether it should do generic FSM
+ * processing for the page, which is determined using almost the same
+ * criteria as in the lazy_scan_prune case.
+ */
+static bool
+lazy_scan_noprune(LVRelState *vacrel,
+ Buffer buf,
+ BlockNumber blkno,
+ Page page,
+ bool *hastup,
+ bool *hasfreespace)
+{
+ OffsetNumber offnum,
+ maxoff;
+ bool has_tuple_needs_freeze = false;
+ int lpdead_items,
+ num_tuples,
+ live_tuples,
+ new_dead_tuples;
+ HeapTupleHeader tupleheader;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
+ *hastup = false; /* for now */
+ *hasfreespace = false; /* for now */
+
+ lpdead_items = 0;
+ num_tuples = 0;
+ live_tuples = 0;
+ new_dead_tuples = 0;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+ HeapTupleData tuple;
+
+ vacrel->offnum = offnum;
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true; /* page prevents rel truncation */
+ continue;
+ }
+
+ if (ItemIdIsDead(itemid))
+ {
+ /*
+ * Deliberately don't set hastup=true here. See same point in
+ * lazy_scan_prune for an explanation.
+ */
+ deadoffsets[lpdead_items++] = offnum;
+ continue;
+ }
+
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (!has_tuple_needs_freeze &&
+ heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
+ vacrel->MultiXactCutoff, buf))
+ {
+ if (vacrel->aggressive)
+ {
+ /* Going to have to get cleanup lock for lazy_scan_prune */
+ vacrel->offnum = InvalidOffsetNumber;
+ return false;
+ }
+
+ has_tuple_needs_freeze = true;
+ }
+
+ num_tuples++;
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(vacrel->rel);
+
+ switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+ {
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ case HEAPTUPLE_LIVE:
+
+ /*
+ * Count both cases as live, just like lazy_scan_prune
+ */
+ live_tuples++;
+
+ break;
+ case HEAPTUPLE_DEAD:
+
+ /*
+ * There is some useful work for pruning to do here, which won't
+ * be done due to our failure to get a cleanup lock.
+ *
+ * TODO Add dedicated instrumentation for this case
+ */
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * Count in new_dead_tuples, just like lazy_scan_prune
+ */
+ new_dead_tuples++;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+ /*
+ * Do not count these rows as live, just like lazy_scan_prune
+ */
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+
+ }
+
+ vacrel->offnum = InvalidOffsetNumber;
+
+ if (has_tuple_needs_freeze)
+ {
+ /*
+ * Current non-aggressive VACUUM operation definitely won't be able to
+ * advance relfrozenxid or relminmxid
+ */
+ Assert(!vacrel->aggressive);
+ vacrel->freeze_cutoffs_valid = false;
+ }
+
+ /*
+ * Now save details of the LP_DEAD items from the page in the dead_items
+ * array -- but only when VACUUM uses the two-pass strategy
+ */
+ if (vacrel->nindexes == 0)
+ {
+ /*
+ * We are not prepared to handle the corner case where a single pass
+ * strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
+ * items. Count the LP_DEAD items as if they were DEAD tuples with
+ * storage that we cannot prune away. This is slightly inaccurate,
+ * but it hardly seems worth having dedicated handling just for this
+ * case.
+ *
+ * There is no fundamental reason why we must take the easy way out
+ * like this. Finding a way to make these LP_DEAD items get set to
+ * LP_UNUSED would be less valuable and more complicated than it is in
+ * the two-pass strategy case, since it would necessitate that we
+ * repeat our lazy_scan_heap caller's page-at-a-time/one-pass-strategy
+ * heap vacuuming steps. Whereas in the two-pass strategy case,
+ * lazy_vacuum_heap_rel will set the LP_DEAD items to LP_UNUSED. It
+ * must always deal with things like remaining DEAD tuples with
+ * storage, new LP_DEAD items that we didn't see earlier on, etc.
+ */
+ if (lpdead_items > 0)
+ *hastup = true;
+ *hasfreespace = true;
+ num_tuples += lpdead_items;
+ /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ }
+ else if (lpdead_items > 0)
+ {
+ LVDeadItems *dead_items = vacrel->dead_items;
+ ItemPointerData tmp;
+
+ vacrel->lpdead_item_pages++;
+
+ ItemPointerSetBlockNumber(&tmp, blkno);
+
+ for (int i = 0; i < lpdead_items; i++)
+ {
+ ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+ dead_items->items[dead_items->num_items++] = tmp;
+ }
+
+ Assert(dead_items->num_items <= dead_items->max_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ dead_items->num_items);
+
+ vacrel->lpdead_items += lpdead_items;
+ }
+ else
+ {
+ /*
+ * Caller won't be vacuuming this page later, so tell it to record
+ * page's freespace in the FSM now
+ */
+ *hasfreespace = true;
+ }
+
+ /*
+ * Finally, add relevant page-local counts to whole-VACUUM counts
+ */
+ vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->num_tuples += num_tuples;
+ vacrel->live_tuples += live_tuples;
+
+ /* Caller won't need to call lazy_scan_prune with same page */
+ return true;
+}
+
/*
* Remove the collected garbage tuples from the table and its indexes.
*
@@ -2500,67 +2775,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
return index;
}
-/*
- * lazy_check_needs_freeze() -- scan page to see if any tuples
- * need to be cleaned to avoid wraparound
- *
- * Returns true if the page needs to be vacuumed using cleanup lock.
- * Also returns a flag indicating whether page contains any tuples at all.
- */
-static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
-{
- Page page = BufferGetPage(buf);
- OffsetNumber offnum,
- maxoff;
- HeapTupleHeader tupleheader;
-
- *hastup = false;
-
- /*
- * New and empty pages, obviously, don't contain tuples. We could make
- * sure that the page is registered in the FSM, but it doesn't seem worth
- * waiting for a cleanup lock just for that, especially because it's
- * likely that the pin holder will do so.
- */
- if (PageIsNew(page) || PageIsEmpty(page))
- return false;
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = FirstOffsetNumber;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid;
-
- /*
- * Set the offset number so that we can display it along with any
- * error that occurred while processing this tuple.
- */
- vacrel->offnum = offnum;
- itemid = PageGetItemId(page, offnum);
-
- /* this should match hastup test in count_nondeletable_pages() */
- if (ItemIdIsUsed(itemid))
- *hastup = true;
-
- /* dead and redirect items never need freezing */
- if (!ItemIdIsNormal(itemid))
- continue;
-
- tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-
- if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
- break;
- } /* scan along page */
-
- /* Clear the offset information once we have processed the given page. */
- vacrel->offnum = InvalidOffsetNumber;
-
- return (offnum <= maxoff);
-}
-
/*
* Trigger the failsafe to avoid wraparound failure when vacrel table has a
* relfrozenxid and/or relminmxid that is dangerously far in the past.
@@ -2655,7 +2869,7 @@ do_parallel_lazy_cleanup_all_indexes(LVRelState *vacrel)
*/
vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
vacrel->lps->lvshared->estimated_count =
- (vacrel->tupcount_pages < vacrel->rel_pages);
+ (vacrel->scanned_pages < vacrel->rel_pages);
/* Determine the number of parallel workers to launch */
if (vacrel->lps->lvshared->first_time)
@@ -2972,7 +3186,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
bool estimated_count =
- vacrel->tupcount_pages < vacrel->rel_pages;
+ vacrel->scanned_pages < vacrel->rel_pages;
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -3123,7 +3337,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
* should_attempt_truncation - should we attempt to truncate the heap?
*
* Don't even think about it unless we have a shot at releasing a goodly
- * number of pages. Otherwise, the time taken isn't worth it.
+ * number of pages. Otherwise, the time taken isn't worth it, mainly because
+ * an AccessExclusive lock must be replayed on any hot standby, where it can
+ * be particularly disruptive.
*
* Also don't attempt it if wraparound failsafe is in effect. It's hard to
* predict how long lazy_truncate_heap will take. Don't take any chances.
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
index cdbe7f3a6..ce55376e7 100644
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ b/src/test/isolation/expected/vacuum-reltuples.out
@@ -45,7 +45,7 @@ step stats:
relpages|reltuples
--------+---------
- 1| 20
+ 1| 21
(1 row)
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
index ae2f79b8f..a2a461f2f 100644
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ b/src/test/isolation/specs/vacuum-reltuples.spec
@@ -2,9 +2,10 @@
# to page pins. We absolutely need to avoid setting reltuples=0 in
# such cases, since that interferes badly with planning.
#
-# Expected result in second permutation is 20 tuples rather than 21 as
-# for the others, because vacuum should leave the previous result
-# (from before the insert) in place.
+# Expected result for all three permutations is 21 tuples, including
+# the second permutation. VACUUM is able to count the concurrently
+# inserted tuple in its final reltuples, even when a cleanup lock
+# cannot be acquired on the affected heap page.
setup {
create table smalltbl
--
2.30.2
v3-0002-Improve-log_autovacuum_min_duration-output.patch (application/octet-stream)
From 4affd72e80405f6593866c21391789a5afe471f0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 21 Nov 2021 14:47:11 -0800
Subject: [PATCH v3 2/2] Improve log_autovacuum_min_duration output.
Add instrumentation of "missed dead tuples", and the number of pages
that had at least one such tuple. These are fully DEAD (not just
RECENTLY_DEAD) tuples with storage that could not be pruned due to an
inability to acquire a cleanup lock. This is a replacement for the
"skipped due to pin" instrumentation removed by the previous commit.
Note that the new instrumentation doesn't say anything about pages that
we failed to acquire a cleanup lock on when we see that there were no
missed dead tuples on the page.
Also report on visibility map pages skipped by VACUUM, without regard
for whether the pages were all-frozen or just all-visible.
Also report when and how relfrozenxid is advanced by VACUUM, including
non-aggressive VACUUM. Apart from being useful on its own, this might
enable future work that teaches non-aggressive VACUUM to be more
concerned about advancing relfrozenxid sooner rather than later.
Also enhance how we report OldestXmin cutoff by putting it in context:
show how far behind it is at the _end_ of the VACUUM operation.
Deliberately don't do anything with VACUUM VERBOSE in this commit, since
a pending patch will generalize the log_autovacuum_min_duration code to
produce VACUUM VERBOSE output as well [1]. That'll get committed first.
[1] https://commitfest.postgresql.org/36/3431/
---
src/include/commands/vacuum.h | 2 +
src/backend/access/heap/vacuumlazy.c | 100 +++++++++++++++++++--------
src/backend/commands/analyze.c | 3 +
src/backend/commands/vacuum.c | 9 +++
4 files changed, 84 insertions(+), 30 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4cfd52eaf..bc625463e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -263,6 +263,8 @@ extern void vac_update_relstats(Relation relation,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated,
+ bool *minmulti_updated,
bool in_outer_xact);
extern void vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1fb8735a2..56da89f04 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -330,6 +330,7 @@ typedef struct LVRelState
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
/* Statistics output by us, for table */
@@ -343,8 +344,8 @@ typedef struct LVRelState
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
- int64 new_dead_tuples; /* new estimated total # of dead items in
- * table */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
int64 num_tuples; /* total number of nonremovable tuples */
int64 live_tuples; /* live tuples (reltuples estimate) */
} LVRelState;
@@ -472,6 +473,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
double read_rate,
write_rate;
bool aggressive;
+ bool frozenxid_updated,
+ minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
@@ -681,9 +684,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
+ frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId, false);
+ InvalidTransactionId, InvalidMultiXactId,
+ NULL, NULL, false);
}
else
{
@@ -692,7 +697,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff, false);
+ FreezeLimit, MultiXactCutoff,
+ &frozenxid_updated, &minmulti_updated, false);
}
/*
@@ -708,7 +714,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(new_live_tuples, 0),
- vacrel->new_dead_tuples);
+ vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples);
pgstat_progress_end_command();
/* and log the action if appropriate */
@@ -722,6 +729,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
+ int32 diff;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -768,16 +776,40 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped using visibility map (%.2f%% of total)\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->frozenskipped_pages);
+ orig_rel_pages - vacrel->scanned_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * (orig_rel_pages - vacrel->scanned_pages) / orig_rel_pages);
appendStringInfo(&buf,
- _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+ _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->new_dead_tuples,
- OldestXmin);
+ (long long) vacrel->recently_dead_tuples);
+ if (vacrel->missed_dead_tuples > 0)
+ appendStringInfo(&buf,
+ _("tuples missed: %lld dead from %u contended pages\n"),
+ (long long) vacrel->missed_dead_tuples,
+ vacrel->missed_dead_pages);
+ diff = (int32) (ReadNextTransactionId() - OldestXmin);
+ appendStringInfo(&buf,
+ _("removal cutoff: oldest xmin was %u, which is now %d xact IDs behind\n"),
+ OldestXmin, diff);
+ if (frozenxid_updated)
+ {
+ diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ appendStringInfo(&buf,
+ _("relfrozenxid: advanced by %d xact IDs, new value: %u\n"),
+ diff, FreezeLimit);
+ }
+ if (minmulti_updated)
+ {
+ diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ appendStringInfo(&buf,
+ _("relminmxid: advanced by %d multixact IDs, new value: %u\n"),
+ diff, MultiXactCutoff);
+ }
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -935,13 +967,15 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params)
vacrel->frozenskipped_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
+ vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
/* Initialize instrumentation counters */
vacrel->num_index_scans = 0;
vacrel->tuples_deleted = 0;
vacrel->lpdead_items = 0;
- vacrel->new_dead_tuples = 0;
+ vacrel->recently_dead_tuples = 0;
+ vacrel->missed_dead_tuples = 0;
vacrel->num_tuples = 0;
vacrel->live_tuples = 0;
@@ -1485,7 +1519,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params)
* (unlikely) scenario that new_live_tuples is -1, take it as zero.
*/
vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
+ Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples;
/*
* Release any remaining pin on visibility map page.
@@ -1549,7 +1584,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params)
initStringInfo(&buf);
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
- (long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
+ (long long) vacrel->recently_dead_tuples,
+ vacrel->OldestXmin);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1731,7 +1767,7 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
lpdead_items,
- new_dead_tuples,
+ recently_dead_tuples,
num_tuples,
live_tuples;
int nnewlpdead;
@@ -1748,7 +1784,7 @@ retry:
/* Initialize (or reset) page-level counters */
tuples_deleted = 0;
lpdead_items = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
num_tuples = 0;
live_tuples = 0;
@@ -1907,11 +1943,11 @@ retry:
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * If tuple is recently deleted then we must not remove it
- * from relation. (We only remove items that are LP_DEAD from
+ * If tuple is recently dead then we must not remove it from
+ * the relation. (We only remove items that are LP_DEAD from
* pruning.)
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
prunestate->all_visible = false;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -2087,7 +2123,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->lpdead_items += lpdead_items;
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
}
@@ -2141,7 +2177,8 @@ lazy_scan_noprune(LVRelState *vacrel,
int lpdead_items,
num_tuples,
live_tuples,
- new_dead_tuples;
+ recently_dead_tuples,
+ missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
@@ -2153,7 +2190,8 @@ lazy_scan_noprune(LVRelState *vacrel,
lpdead_items = 0;
num_tuples = 0;
live_tuples = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
+ missed_dead_tuples = 0;
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
@@ -2222,16 +2260,15 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* There is some useful work for pruning to do, that won't be
* done due to failure to get a cleanup lock.
- *
- * TODO Add dedicated instrumentation for this case
*/
+ missed_dead_tuples++;
break;
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * Count in new_dead_tuples, just like lazy_scan_prune
+ * Count in recently_dead_tuples, just like lazy_scan_prune
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -2286,7 +2323,7 @@ lazy_scan_noprune(LVRelState *vacrel,
*hastup = true;
*hasfreespace = true;
num_tuples += lpdead_items;
- /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ missed_dead_tuples += lpdead_items;
}
else if (lpdead_items > 0)
{
@@ -2321,9 +2358,12 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
+ vacrel->missed_dead_tuples += missed_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
+ if (missed_dead_tuples > 0)
+ vacrel->missed_dead_pages++;
/* Caller won't need to call lazy_scan_prune with same page */
return true;
@@ -2397,8 +2437,8 @@ lazy_vacuum(LVRelState *vacrel)
* dead_items space is not CPU cache resident.
*
* We don't take any special steps to remember the LP_DEAD items (such
- * as counting them in new_dead_tuples report to the stats collector)
- * when the optimization is applied. Though the accounting used in
+ * as counting them in our final report to the stats collector) when
+ * the optimization is applied. Though the accounting used in
* analyze.c's acquire_sample_rows() will recognize the same LP_DEAD
* items as dead rows in its own stats collector report, that's okay.
* The discrepancy should be negligible. If this optimization is ever
@@ -4058,7 +4098,7 @@ update_index_statistics(LVRelState *vacrel)
false,
InvalidTransactionId,
InvalidMultiXactId,
- false);
+ NULL, NULL, false);
}
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cd77907fc..afd1cb8f5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -651,6 +651,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
/* Same for indexes */
@@ -667,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
false,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
}
@@ -679,6 +681,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5c4bc15b4..8bd4bd12c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1308,6 +1308,7 @@ vac_update_relstats(Relation relation,
BlockNumber num_all_visible_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated, bool *minmulti_updated,
bool in_outer_xact)
{
Oid relid = RelationGetRelid(relation);
@@ -1383,22 +1384,30 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ if (frozenxid_updated)
+ *frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
/* Similarly for relminmxid */
+ if (minmulti_updated)
+ *minmulti_updated = false;
if (MultiXactIdIsValid(minmulti) &&
pgcform->relminmxid != minmulti &&
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
{
+ if (minmulti_updated)
+ *minmulti_updated = true;
pgcform->relminmxid = minmulti;
dirty = true;
}
--
2.30.2
On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> * I'm still working on the optimization that we discussed on this
> thread: the optimization that allows the final relfrozenxid (that we
> set in pg_class) to be determined dynamically, based on the actual
> XIDs we observed in the table (we don't just naively use FreezeLimit).
Attached is v4 of the patch series, which now includes this
optimization, broken out into its own patch. In addition, it includes
a prototype of opportunistic freezing.
My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to an automated script that
performs DDL). That has taken priority over other goals, for now.
There is a kind of virtuous circle here, where successive
non-aggressive autovacuums never fall behind on freezing, and so never
fail to advance relfrozenxid (there are never any
all_visible-but-not-all_frozen pages, and we can cope with not
acquiring a cleanup lock quite well). When VACUUM chooses to freeze a
tuple opportunistically, the frozen XIDs naturally cannot hold back
the final safe relfrozenxid for the relation. Opportunistic freezing
avoids setting all_visible (without setting all_frozen) in the
visibility map. It's impossible for VACUUM to just set a page to
all_visible now, which seems like an essential part of making a decent
amount of relfrozenxid advancement take place in almost every VACUUM
operation.
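To make that concrete, here is a minimal sketch of the ratchet rule the
patch series applies (the helper name is made up for illustration -- the
real logic is inlined in heap_prepare_freeze_tuple and
heap_tuple_needs_freeze): the target value starts out at OldestXmin, and
any XID that will remain unfrozen in the table pulls it back, so the
final value is always a safe lower bound for whatever is left behind.

static inline void
ratchet_back_new_relfrozenxid(TransactionId xid,
                              TransactionId *NewRelfrozenxid)
{
    /* xid will remain in the table unfrozen -- keep the target safe */
    if (TransactionIdIsNormal(xid) &&
        TransactionIdPrecedes(xid, *NewRelfrozenxid))
        *NewRelfrozenxid = xid;
}

Frozen XIDs never pass through this path, which is why freezing a tuple
early can only help the final relfrozenxid, never hold it back.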
Here is an example of what I'm calling a virtuous circle -- all
pgbench_history autovacuums look like this with the patch applied:
LOG: automatic vacuum of table "regression.public.pgbench_history":
index scans: 0
pages: 0 removed, 35503 remain, 31930 skipped using visibility map
(89.94% of total)
tuples: 0 removed, 5568687 remain (547976 newly frozen), 0 are
dead but not yet removable
removal cutoff: oldest xmin was 5570281, which is now 1177 xact IDs behind
relfrozenxid: advanced by 546618 xact IDs, new value: 5565226
index scan not needed: 0 pages from table (0.00% of total) had 0
dead item identifiers removed
I/O timings: read: 0.003 ms, write: 0.000 ms
avg read rate: 0.068 MB/s, avg write rate: 0.068 MB/s
buffer usage: 7169 hits, 1 misses, 1 dirtied
WAL usage: 7043 records, 1 full page images, 6974928 bytes
system usage: CPU: user: 0.10 s, system: 0.00 s, elapsed: 0.11 s
Note that relfrozenxid is almost the same as oldest xmin here. Note also
that the log output shows the number of tuples newly frozen. I see the
same general trends with *every* pgbench_history autovacuum. Actually,
with every autovacuum. The history table tends to have ultra-recent
relfrozenxid values; other tables aren't always quite as recent, but
that difference may not matter. As far as I can tell, we can expect
practically every table to have a relfrozenxid that would (at least
traditionally) be considered very safe/recent. Barring weird
application issues that make it totally impossible to advance
relfrozenxid (e.g., idle cursors that hold onto a buffer pin forever),
it seems as if relfrozenxid will now steadily march forward. Sure,
relfrozenxid advancement might be held back by the occasional inability to
acquire a cleanup lock, but the effect isn't noticeable over time;
what are the chances that a cleanup lock won't be available on the
same page (with the same old XID) more than once or twice? The odds of
that happening become astronomically tiny, long before there is any
real danger (barring pathological cases).
In the past, we've always talked about opportunistic freezing as a way
of avoiding re-dirtying heap pages during successive VACUUM operations
-- especially as a way of lowering the total volume of WAL. While I
agree that that's important, I have deliberately ignored it for now,
preferring to focus on the relfrozenxid stuff, and smoothing out the
cost of freezing (avoiding big shocks from aggressive/anti-wraparound
autovacuums). I care more about stable performance than absolute
throughput, but even still I believe that the approach I've taken to
opportunistic freezing is probably too aggressive. But it's dead
simple, which will make it easier to understand and discuss the issue
of central importance. It may be possible to optimize the WAL-logging
used during freezing, getting the cost down to the point where
freezing early just isn't a concern. The current prototype adds extra
WAL overhead, to be sure, but even that's not wildly unreasonable (you
make some of it back on FPIs, depending on the workload -- especially
with tables like pgbench_history, where delaying freezing is a total loss).
--
Peter Geoghegan
Attachments:
v4-0002-Improve-log_autovacuum_min_duration-output.patch
From 276122392aefcf0d7a079d7c7dc532228d495bd7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 21 Nov 2021 14:47:11 -0800
Subject: [PATCH v4 2/5] Improve log_autovacuum_min_duration output.
Add instrumentation of "missed dead tuples", and the number of pages
that had at least one such tuple. These are fully DEAD (not just
RECENTLY_DEAD) tuples with storage that could not be pruned due to an
inability to acquire a cleanup lock. This is a replacement for the
"skipped due to pin" instrumentation removed by the previous commit.
Note that the new instrumentation doesn't say anything about pages that
we failed to acquire a cleanup lock on when we see that there were no
missed dead tuples on the page.
Also report on visibility map pages skipped by VACUUM, without regard
for whether the pages were all-frozen or just all-visible.
Also report when and how relfrozenxid is advanced by VACUUM, including
non-aggressive VACUUM. Apart from being useful on its own, this might
enable future work that teaches non-aggressive VACUUM to be more
concerned about advancing relfrozenxid sooner rather than later.
Also report number of tuples frozen. This will become more important
when the later patch to perform opportunistic tuple freezing is
committed.
Also enhance how we report OldestXmin cutoff by putting it in context:
show how far behind it is at the _end_ of the VACUUM operation.
Deliberately don't do anything with VACUUM VERBOSE in this commit, since
a pending patch will generalize the log_autovacuum_min_duration code to
produce VACUUM VERBOSE output as well [1]. That'll get committed first.
[1] https://commitfest.postgresql.org/36/3431/
---
src/include/commands/vacuum.h | 2 +
src/backend/access/heap/vacuumlazy.c | 108 +++++++++++++++++++--------
src/backend/commands/analyze.c | 3 +
src/backend/commands/vacuum.c | 9 +++
4 files changed, 91 insertions(+), 31 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4cfd52eaf..bc625463e 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -263,6 +263,8 @@ extern void vac_update_relstats(Relation relation,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated,
+ bool *minmulti_updated,
bool in_outer_xact);
extern void vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c6d3a483f..238e07a78 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -351,6 +351,7 @@ typedef struct LVRelState
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
/* Statistics output by us, for table */
@@ -363,9 +364,10 @@ typedef struct LVRelState
int num_index_scans;
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # frozen by us */
int64 lpdead_items; /* # deleted from indexes */
- int64 new_dead_tuples; /* new estimated total # of dead items in
- * table */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
int64 num_tuples; /* total number of nonremovable tuples */
int64 live_tuples; /* live tuples (reltuples estimate) */
} LVRelState;
@@ -488,6 +490,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
write_rate;
bool aggressive,
skipwithvm;
+ bool frozenxid_updated,
+ minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
@@ -705,9 +709,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
+ frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId, false);
+ InvalidTransactionId, InvalidMultiXactId,
+ NULL, NULL, false);
}
else
{
@@ -716,7 +722,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff, false);
+ FreezeLimit, MultiXactCutoff,
+ &frozenxid_updated, &minmulti_updated, false);
}
/*
@@ -732,7 +739,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(new_live_tuples, 0),
- vacrel->new_dead_tuples);
+ vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples);
pgstat_progress_end_command();
/* and log the action if appropriate */
@@ -746,6 +754,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
+ int32 diff;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -792,16 +801,41 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped using visibility map (%.2f%% of total)\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->frozenskipped_pages);
+ orig_rel_pages - vacrel->scanned_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * (orig_rel_pages - vacrel->scanned_pages) / orig_rel_pages);
appendStringInfo(&buf,
- _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+ _("tuples: %lld removed, %lld remain (%lld newly frozen), %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->new_dead_tuples,
- OldestXmin);
+ (long long) vacrel->tuples_frozen,
+ (long long) vacrel->recently_dead_tuples);
+ if (vacrel->missed_dead_tuples > 0)
+ appendStringInfo(&buf,
+ _("tuples missed: %lld dead from %u contended pages\n"),
+ (long long) vacrel->missed_dead_tuples,
+ vacrel->missed_dead_pages);
+ diff = (int32) (ReadNextTransactionId() - OldestXmin);
+ appendStringInfo(&buf,
+ _("removal cutoff: oldest xmin was %u, which is now %d xact IDs behind\n"),
+ OldestXmin, diff);
+ if (frozenxid_updated)
+ {
+ diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ appendStringInfo(&buf,
+ _("relfrozenxid: advanced by %d xact IDs, new value: %u\n"),
+ diff, FreezeLimit);
+ }
+ if (minmulti_updated)
+ {
+ diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ appendStringInfo(&buf,
+ _("relminmxid: advanced by %d multixact IDs, new value: %u\n"),
+ diff, MultiXactCutoff);
+ }
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -957,13 +991,16 @@ lazy_scan_heap(LVRelState *vacrel, bool skipwithvm, int nworkers)
vacrel->frozenskipped_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
+ vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
/* Initialize instrumentation counters */
vacrel->num_index_scans = 0;
vacrel->tuples_deleted = 0;
+ vacrel->tuples_frozen = 0;
vacrel->lpdead_items = 0;
- vacrel->new_dead_tuples = 0;
+ vacrel->recently_dead_tuples = 0;
+ vacrel->missed_dead_tuples = 0;
vacrel->num_tuples = 0;
vacrel->live_tuples = 0;
@@ -1510,7 +1547,8 @@ lazy_scan_heap(LVRelState *vacrel, bool skipwithvm, int nworkers)
* (unlikely) scenario that new_live_tuples is -1, take it as zero.
*/
vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
+ Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples;
/*
* Release any remaining pin on visibility map page.
@@ -1574,7 +1612,8 @@ lazy_scan_heap(LVRelState *vacrel, bool skipwithvm, int nworkers)
initStringInfo(&buf);
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
- (long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
+ (long long) vacrel->recently_dead_tuples,
+ vacrel->OldestXmin);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1756,7 +1795,7 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
lpdead_items,
- new_dead_tuples,
+ recently_dead_tuples,
num_tuples,
live_tuples;
int nnewlpdead;
@@ -1773,7 +1812,7 @@ retry:
/* Initialize (or reset) page-level counters */
tuples_deleted = 0;
lpdead_items = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
num_tuples = 0;
live_tuples = 0;
@@ -1932,11 +1971,11 @@ retry:
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * If tuple is recently deleted then we must not remove it
- * from relation. (We only remove items that are LP_DEAD from
+ * If tuple is recently dead then we must not remove it from
+ * the relation. (We only remove items that are LP_DEAD from
* pruning.)
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
prunestate->all_visible = false;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -2111,8 +2150,9 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
+ vacrel->tuples_frozen += nfrozen;
vacrel->lpdead_items += lpdead_items;
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
}
@@ -2165,7 +2205,8 @@ lazy_scan_noprune(LVRelState *vacrel,
int lpdead_items,
num_tuples,
live_tuples,
- new_dead_tuples;
+ recently_dead_tuples,
+ missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
@@ -2177,7 +2218,8 @@ lazy_scan_noprune(LVRelState *vacrel,
lpdead_items = 0;
num_tuples = 0;
live_tuples = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
+ missed_dead_tuples = 0;
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
@@ -2250,16 +2292,15 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* There is some useful work for pruning to do, that won't be
* done due to failure to get a cleanup lock.
- *
- * TODO Add dedicated instrumentation for this case
*/
+ missed_dead_tuples++;
break;
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * Count in new_dead_tuples, just like lazy_scan_prune
+ * Count in recently_dead_tuples, just like lazy_scan_prune
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -2287,13 +2328,15 @@ lazy_scan_noprune(LVRelState *vacrel,
*
* We are not prepared to handle the corner case where a single pass
* strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
- * items.
+ * items. Count the LP_DEAD items as missed_dead_tuples instead. This
+ * is slightly dishonest, but it's better than maintaining code to do
+ * heap vacuuming for this one narrow corner case.
*/
if (lpdead_items > 0)
*hastup = true;
*hasfreespace = true;
num_tuples += lpdead_items;
- /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ missed_dead_tuples += lpdead_items;
}
else if (lpdead_items > 0)
{
@@ -2328,9 +2371,12 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
+ vacrel->missed_dead_tuples += missed_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
+ if (missed_dead_tuples > 0)
+ vacrel->missed_dead_pages++;
/* Caller won't need to call lazy_scan_prune with same page */
return true;
@@ -2404,8 +2450,8 @@ lazy_vacuum(LVRelState *vacrel)
* dead_items space is not CPU cache resident.
*
* We don't take any special steps to remember the LP_DEAD items (such
- * as counting them in new_dead_tuples report to the stats collector)
- * when the optimization is applied. Though the accounting used in
+ * as counting them in our final report to the stats collector) when
+ * the optimization is applied. Though the accounting used in
* analyze.c's acquire_sample_rows() will recognize the same LP_DEAD
* items as dead rows in its own stats collector report, that's okay.
* The discrepancy should be negligible. If this optimization is ever
@@ -4061,7 +4107,7 @@ update_index_statistics(LVRelState *vacrel)
false,
InvalidTransactionId,
InvalidMultiXactId,
- false);
+ NULL, NULL, false);
}
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cd77907fc..afd1cb8f5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -651,6 +651,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
/* Same for indexes */
@@ -667,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
false,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
}
@@ -679,6 +681,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5c4bc15b4..8bd4bd12c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1308,6 +1308,7 @@ vac_update_relstats(Relation relation,
BlockNumber num_all_visible_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated, bool *minmulti_updated,
bool in_outer_xact)
{
Oid relid = RelationGetRelid(relation);
@@ -1383,22 +1384,30 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ if (frozenxid_updated)
+ *frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
/* Similarly for relminmxid */
+ if (minmulti_updated)
+ *minmulti_updated = false;
if (MultiXactIdIsValid(minmulti) &&
pgcform->relminmxid != minmulti &&
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
{
+ if (minmulti_updated)
+ *minmulti_updated = true;
pgcform->relminmxid = minmulti;
dirty = true;
}
--
2.30.2
v4-0004-Decouple-advancing-relfrozenxid-from-freezing.patch
From 6465bbf843dc8ee549cea5cc6a15c68784098f53 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Nov 2021 10:02:30 -0800
Subject: [PATCH v4 4/5] Decouple advancing relfrozenxid from freezing.
Stop using tuple freezing (and MultiXact freezing) tuple header cutoffs
to determine the final relfrozenxid (and relminmxid) values that we set
for heap relations in pg_class. Use "optimal" values instead.
Optimal values are the most recent values that are less than or equal to
any remaining XID/MultiXact in a tuple header (not counting frozen
xmin/xmax values). This is now kept track of by VACUUM. "Optimal"
values are always >= the tuple header FreezeLimit in an aggressive
VACUUM. For a non-aggressive VACUUM, they can be less than or greater
than the tuple header FreezeLimit cutoff (though we still often pass
invalid values to indicate that we cannot advance relfrozenxid during
the VACUUM).
---
src/include/access/heapam.h | 4 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 186 ++++++++++++++++++++-------
src/backend/access/heap/vacuumlazy.c | 76 +++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 30 ++++-
7 files changed, 228 insertions(+), 78 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 417dd288e..0eb5c36a2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -168,7 +168,9 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf);
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index ab9e873bc..b0ede623e 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 6eefe8129..114d6da89 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -271,6 +271,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0b4a46b31..d296a79ea 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6078,12 +6078,24 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "NewRelfrozenxid" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain NewRelfrozenxid. We need to
+ * push maintenance of NewRelfrozenxid down this far, since in general xmin
+ * might have been frozen by an earlier VACUUM operation, in which case our
+ * caller will not have factored-in xmin when maintaining NewRelfrozenxid.
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *NewRelfrozenxid)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6095,6 +6107,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId tempNewRelfrozenxid;
*flags = 0;
@@ -6189,13 +6202,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ tempNewRelfrozenxid = *NewRelfrozenxid;
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
}
/*
@@ -6204,6 +6217,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *NewRelfrozenxid = tempNewRelfrozenxid;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6213,6 +6227,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ tempNewRelfrozenxid = *NewRelfrozenxid;
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6294,7 +6309,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
+ }
}
else
{
@@ -6304,6 +6323,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6332,6 +6352,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Caller manages NewRelfrozenxid directly when we return an XID */
}
else
{
@@ -6341,6 +6362,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *NewRelfrozenxid = tempNewRelfrozenxid;
}
pfree(newmembers);
@@ -6359,6 +6381,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will actually go on to freeze as indicated by our *frz output, so
+ * any (xmin, xmax, xvac) XIDs that we indicate need to be frozen won't need
+ * to be counted here. Values are valid lower bounds at the point that the
+ * ongoing VACUUM finishes.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6383,7 +6412,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6427,6 +6458,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
}
/*
@@ -6444,10 +6480,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId temp = *NewRelfrozenxid;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi, &flags, &temp);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6465,6 +6502,24 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+
+ if (TransactionIdPrecedes(newxmax, *NewRelfrozenxid))
+ {
+ /* New xmax is an XID older than new NewRelfrozenxid */
+ *NewRelfrozenxid = newxmax;
+ }
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * Changing nothing, so might have to ratchet back NewRelminmxid,
+ * NewRelfrozenxid, or both together
+ */
+ if (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, *NewRelminmxid))
+ *NewRelminmxid = xid;
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6486,6 +6541,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+
+ /*
+ * New multixact might have remaining XID older than
+ * NewRelfrozenxid
+ */
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6513,7 +6575,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6560,6 +6629,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * Since we always freeze here, NewRelfrozenxid doesn't need to be
+ * maintained.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6637,11 +6709,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId NewRelfrozenxid = FirstNormalTransactionId;
+ MultiXactId NewRelminmxid = FirstMultiXactId;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &NewRelfrozenxid, &NewRelminmxid);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7071,6 +7146,15 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should. If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
@@ -7079,74 +7163,86 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
*/
bool
heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf)
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf)
{
TransactionId xid;
+ bool needs_freeze = false;
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
/*
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
+ *
+ * (Actually, we maintain NewRelminmxid differently here, because we
+ * assume that XIDs that should be frozen according to cutoff_xid won't
+ * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
*/
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ if (MultiXactIdIsValid(multi) &&
+ MultiXactIdPrecedes(multi, *NewRelminmxid))
+ *NewRelminmxid = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
return true;
else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+ needs_freeze = true;
+
+ /* need to check whether any member of the mxact is too old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = members[i].xid;
}
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index da5b3f79a..c6facc9eb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -331,8 +331,10 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+
+ /* Track new pg_class.relfrozenxid/pg_class.relminmxid values */
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
/* Error reporting state */
char *relnamespace;
@@ -501,6 +503,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -537,8 +540,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -602,8 +605,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+
+ /* Initialize values used to advance relfrozenxid/relminmxid at the end */
+ vacrel->NewRelfrozenxid = OldestXmin;
+ vacrel->NewRelminmxid = OldestMxact;
vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
vacrel->relname = pstrdup(RelationGetRelationName(rel));
@@ -692,16 +697,18 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might only be able to
+ * advance relfrozenxid to an XID from before FreezeLimit (or a relminmxid
+ * from before MultiXactCutoff) when it wasn't possible to freeze some
+ * tuples due to our inability to acquire a cleanup lock, but the effect
+ * is usually insignificant -- NewRelfrozenxid value still has a decent
+ * chance of being much more recent that the existing relfrozenxid.
*
* NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
* the rel_pages used by lazy_scan_heap, which won't match when we
* happened to truncate the relation afterwards.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
@@ -718,7 +725,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenxid, vacrel->NewRelminmxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -820,14 +827,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenxid - vacrel->relfrozenxid);
appendStringInfo(&buf,
_("relfrozenxid: advanced by %d xact IDs, new value: %u\n"),
- diff, FreezeLimit);
+ diff, vacrel->NewRelfrozenxid);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminmxid - vacrel->relminmxid);
appendStringInfo(&buf,
_("relminmxid: advanced by %d multixact IDs, new value: %u\n"),
diff, MultiXactCutoff);
@@ -1798,6 +1805,8 @@ lazy_scan_prune(LVRelState *vacrel,
int nfrozen;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1806,6 +1815,8 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
/* Initialize (or reset) page-level counters */
+ NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ NewRelminmxid = vacrel->NewRelminmxid;
tuples_deleted = 0;
lpdead_items = 0;
recently_dead_tuples = 0;
@@ -2015,7 +2026,9 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenxid,
+ &NewRelminmxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -2029,13 +2042,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -2179,9 +2195,9 @@ retry:
* We'll always return true for a non-aggressive VACUUM, even when we know
* that this will cause them to miss out on freezing tuples from before
* vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
- * lock. This does mean that they definitely won't be able to advance
- * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
- * relminmxid). Caller waits for full cleanup lock when we return false.
+ * lock. This does mean that they will have NewRelfrozenxid ratcheting back
+ * to a known-safe value (same applies to NewRelminmxid). Caller waits for
+ * full cleanup lock when we return false.
*
* See lazy_scan_prune for an explanation of hastup return flag. The
* hasfreespace flag instructs caller on whether or not it should do generic
@@ -2205,6 +2221,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ MultiXactId NewRelminmxid = vacrel->NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2250,7 +2268,8 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenxid, &NewRelminmxid, buf))
{
if (vacrel->aggressive)
{
@@ -2260,10 +2279,11 @@ lazy_scan_noprune(LVRelState *vacrel,
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * A non-aggressive VACUUM doesn't have to wait on a cleanup lock
+ * to ensure that it advances relfrozenxid to a sufficiently
+ * recent XID that happens to be present on this page. It can
+ * just accept an older New/final relfrozenxid instead.
*/
- vacrel->freeze_cutoffs_valid = false;
}
num_tuples++;
@@ -2313,6 +2333,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * We have committed to not freezing the tuples on this page (always
+ * happens with a non-aggressive VACUUM), so make sure that the target
+ * relfrozenxid/relminmxid values reflect the XIDs/MXIDs we encountered
+ */
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy).
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 66b87347d..6bd6688ae 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 6db7b8156..cce290c78 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -943,10 +943,28 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
* - freezeLimit is the Xid below which all Xids are replaced by
* FrozenTransactionId during vacuum.
* - multiXactCutoff is the value below which all MultiXactIds are removed
* from Xmax.
+ *
+ * oldestXmin and oldestMxact can be thought of as the most recent values that
+ * can ever be passed to vac_update_relstats() as frozenxid and minmulti
+ * arguments. These exact values will be used when no newer XIDs or
+ * MultiXacts remain in the heap relation (e.g., with an empty table). It's
+ * typical for the vacuumlazy.c caller to notice that older XIDs/MultiXacts
+ * remain in the table, which will force it to use an older value. These older
+ * final values may not be any newer than the preexisting frozenxid/minmulti
+ * values from pg_class in extreme cases. The final values are frequently
+ * fairly close to the optimal values that we give to vacuumlazy.c, though.
+ *
+ * An aggressive VACUUM always provides vac_update_relstats() arguments that
+ * are >= freezeLimit and >= multiXactCutoff. A non-aggressive VACUUM may
+ * provide arguments that are either newer or older than freezeLimit and
+ * multiXactCutoff, or invalid values (indicating that pg_class level
+ * cutoffs cannot be advanced at all).
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -955,6 +973,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -963,7 +982,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1059,9 +1077,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1076,8 +1096,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
--
2.30.2
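To make the "ratcheting" behavior described in the new comments a bit more
concrete, here is a small standalone sketch (not part of the patch; plain
uint32 values and an illustrative xid_precedes() stand in for TransactionId
and the wraparound-aware TransactionIdPrecedes()). VACUUM starts from the
most optimistic candidate, OldestXmin, and pulls that candidate back toward
any older XID that remains unfrozen on the pages it scans; the same idea
applies to NewRelminmxid and MultiXactIds:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;           /* stand-in for TransactionId */

/* wraparound-aware comparison: true when a logically precedes b */
static int
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Start from the most optimistic candidate (OldestXmin) and ratchet it
 * back for every older XID that VACUUM leaves unfrozen in the heap.
 */
static Xid
ratchet_new_relfrozenxid(Xid candidate, const Xid *unfrozen_xids, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (xid_precedes(unfrozen_xids[i], candidate))
            candidate = unfrozen_xids[i];
    }
    return candidate;
}

int
main(void)
{
    Xid oldest_xmin = 1000;             /* best case: nothing older remains */
    Xid unfrozen[] = {980, 995, 1002};  /* XIDs left unfrozen by this VACUUM */

    /* prints 980: the oldest remaining XID caps the final relfrozenxid */
    printf("NewRelfrozenxid candidate: %u\n",
           (unsigned) ratchet_new_relfrozenxid(oldest_xmin, unfrozen, 3));
    return 0;
}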
v4-0005-Prototype-of-opportunistic-freezing.patch
From 919bd29902da15a5d43166dabf379cdc60d7dacb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v4 5/5] Prototype of opportunistic freezing.
Freeze whenever pruning modified the page, or whenever we see that we're
going to mark the page all-visible without also marking it all-frozen.
There has been plenty of discussion of opportunistic freezing in the
past. It is generally considered important as a way of minimizing
repeated dirtying of heap pages (or the total volume of FPIs in the WAL
stream) over time. While that goal is certainly very important, this
patch has another priority: making VACUUM advance relfrozenxid sooner
and more frequently.
The overall effect is that tables like pgbench's history table can be
vacuumed very frequently, and have most individual vacuum operations
generate 0 FPIs in WAL -- they will never need an aggressive VACUUM.
The old SKIP_PAGES_THRESHOLD heuristic was designed to make it more
likely that we'll be able to advance relfrozenxid. It works well when
combined with the additions from earlier patches in the series, and with
opportunistic freezing.
GUCs like vacuum_freeze_min_age never made much sense after the freeze
map work in PostgreSQL 9.6. The default is 50 million transactions,
which currently tends to result in our being unable to freeze tuples
before the page is marked all-visible (but not all-frozen). This
creates a huge performance cliff later on, during the first aggressive
VACUUM. And so an important goal of opportunistic freezing is to not
allow the system to get into too much "debt" from very old unfrozen
tuples. That might actually be more important than minimizing the
absolute cost of freezing.
There is probably a small regression caused by opportunistic freezing
with workloads like pgbench, since we're freezing many more tuples than
we need to now -- while we do have fewer FPIs (even earlier on), that
may not be enough to make up for the increase in WAL records. This
problem can be addressed in a later revision, when the general picture
for this patch (especially how it affects our ability to advance
relfrozenxid early) becomes clearer.
---
src/include/access/heapam.h | 1 +
src/backend/access/heap/pruneheap.c | 8 ++-
src/backend/access/heap/vacuumlazy.c | 78 +++++++++++++++++++++++++---
3 files changed, 79 insertions(+), 8 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0eb5c36a2..5e1f24e5c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
struct GlobalVisState *vistest,
TransactionId old_snap_xmin,
TimestampTz old_snap_ts_ts,
+ bool *modified,
int *nnewlpdead,
OffsetNumber *off_loc);
extern void heap_page_prune_execute(Buffer buffer,
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 522a00af6..e95dea38d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,11 +182,12 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
*/
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
+ bool modified;
int ndeleted,
nnewlpdead;
ndeleted = heap_page_prune(relation, buffer, vistest, limited_xmin,
- limited_ts, &nnewlpdead, NULL);
+ limited_ts, &modified, &nnewlpdead, NULL);
/*
* Report the number of tuples reclaimed to pgstats. This is
@@ -244,6 +245,7 @@ heap_page_prune(Relation relation, Buffer buffer,
GlobalVisState *vistest,
TransactionId old_snap_xmin,
TimestampTz old_snap_ts,
+ bool *modified,
int *nnewlpdead,
OffsetNumber *off_loc)
{
@@ -375,6 +377,8 @@ heap_page_prune(Relation relation, Buffer buffer,
PageSetLSN(BufferGetPage(buffer), recptr);
}
+
+ *modified = true;
}
else
{
@@ -387,12 +391,14 @@ heap_page_prune(Relation relation, Buffer buffer,
* point in repeating the prune/defrag process until something else
* happens to the page.
*/
+ *modified = false;
if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
PageIsFull(page))
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
MarkBufferDirtyHint(buffer, true);
+ *modified = true;
}
}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c6facc9eb..a710c6cf8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -328,6 +328,7 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for pruning */
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -529,11 +530,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Get cutoffs that determine which tuples we need to freeze during the
- * VACUUM operation.
+ * VACUUM operation. This includes information that is used during
+ * opportunistic freezing, where the most aggressive possible cutoffs
+ * (OldestXmin and OldestMxact) are used for some heap pages, based on
+ * considerations about cost.
*
* Also determines if this is to be an aggressive VACUUM. This will
* eventually be required for any table where (for whatever reason) no
* non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
+ * This used to be much more common, but we now work hard to advance
+ * relfrozenxid in non-aggressive VACUUMs.
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
@@ -603,6 +609,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Set cutoffs for entire VACUUM */
vacrel->OldestXmin = OldestXmin;
+ vacrel->OldestMxact = OldestMxact;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
@@ -1807,6 +1814,10 @@ lazy_scan_prune(LVRelState *vacrel,
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
TransactionId NewRelfrozenxid;
MultiXactId NewRelminmxid;
+ bool modified;
+ TransactionId FreezeLimit = vacrel->FreezeLimit;
+ MultiXactId MultiXactCutoff = vacrel->MultiXactCutoff;
+ bool earlyfreezing = false;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1833,8 +1844,19 @@ retry:
* that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vistest,
- InvalidTransactionId, 0, &nnewlpdead,
- &vacrel->offnum);
+ InvalidTransactionId, 0, &modified,
+ &nnewlpdead, &vacrel->offnum);
+
+ /*
+ * If page was modified during pruning, then perform early freezing
+ * opportunistically
+ */
+ if (!earlyfreezing && modified)
+ {
+ earlyfreezing = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ }
/*
* Now scan the page to collect LP_DEAD items and check for tuples
@@ -1889,7 +1911,7 @@ retry:
if (ItemIdIsDead(itemid))
{
deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
+ /* Don't set all_visible to false just yet */
prunestate->has_lpdead_items = true;
continue;
}
@@ -2023,8 +2045,8 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
- vacrel->FreezeLimit,
- vacrel->MultiXactCutoff,
+ FreezeLimit,
+ MultiXactCutoff,
&frozen[nfrozen],
&tuple_totally_frozen,
&NewRelfrozenxid,
@@ -2044,6 +2066,48 @@ retry:
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * If page is going to become all_visible (excluding any LP_DEAD items),
+ * but won't also become all_frozen (either in ongoing first heap pass, or
+ * in second heap pass after LP_DEAD items get set LP_UNUSED) then repeat
+ * our pass over the heap, using more aggressive (opportunistic) freeze
+ * limits. This policy isn't guaranteed to be cheaper in the long run,
+ * but it often is. And it makes it far more likely that non-aggressive
+ * VACUUMs will end up advancing relfrozenxid to a reasonably recent XID;
+ * an XID that we opt to freeze won't hold back NewRelfrozenxid.
+ *
+ * We deliberately track all_visible in a way that excludes LP_DEAD items
+ * here. Our assumption is that any page that is "all_visible for tuples
+ * with storage" will be safe to mark all_visible in the visibility map
+ * during VACUUM's second heap pass, right after LP_DEAD items are set
+ * LP_UNUSED. Either way (with or without LP_DEAD items), our goal is to
+ * ensure that a page that _would have_ been marked all_visible in the
+ * visibility map gets marked all_frozen instead.
+ */
+ if (!earlyfreezing && prunestate->all_visible && !prunestate->all_frozen)
+ {
+ /*
+ * XXX Need to worry about leaking MultiXacts in FreezeMultiXactId()
+ * now (via heap_prepare_freeze_tuple calls)? That was already
+ * possible, but presumably this makes it much more likely.
+ *
+ * On the other hand, that's only possible when we need to replace an
+ * existing MultiXact with a new one. Even then, we won't have
+ * preallocated a new MultiXact (which we now risk leaking) if there
+ * was only one remaining XID, and the XID is for an updater (we'll
+ * only prepare to replace xmax with the XID directly). So maybe it's
+ * still a narrow enough problem to be ignored.
+ */
+ earlyfreezing = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ goto retry;
+ }
+
+ /* Time to define all_visible in a way that accounts for LP_DEAD items */
+ if (lpdead_items > 0)
+ prunestate->all_visible = false;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
@@ -2089,7 +2153,7 @@ retry:
{
XLogRecPtr recptr;
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+ recptr = log_heap_freeze(vacrel->rel, buf, FreezeLimit,
frozen, nfrozen);
PageSetLSN(page, recptr);
}
--
2.30.2
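For reviewers who want the gist without reading the diff: the per-page policy
in the prototype boils down to the following standalone sketch (simplified;
the stand-in types and the choose_freeze_cutoffs() name are mine, and the
real code in lazy_scan_prune also retries its scan of the page after
switching to the more aggressive cutoffs):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;           /* stand-in for TransactionId */
typedef uint32_t Mxid;          /* stand-in for MultiXactId */

typedef struct FreezeCutoffs
{
    Xid     freeze_limit;       /* normally vacrel->FreezeLimit */
    Mxid    multixact_cutoff;   /* normally vacrel->MultiXactCutoff */
} FreezeCutoffs;

/*
 * Freeze "early" (with the most aggressive safe cutoffs) when pruning
 * already dirtied the page, or when the page would otherwise be marked
 * all-visible without also being marked all-frozen.
 */
static FreezeCutoffs
choose_freeze_cutoffs(bool modified_by_pruning,
                      bool all_visible, bool all_frozen,
                      Xid oldest_xmin, Mxid oldest_mxact,
                      Xid freeze_limit, Mxid multixact_cutoff)
{
    FreezeCutoffs cutoffs = {freeze_limit, multixact_cutoff};

    if (modified_by_pruning || (all_visible && !all_frozen))
    {
        cutoffs.freeze_limit = oldest_xmin;
        cutoffs.multixact_cutoff = oldest_mxact;
    }

    return cutoffs;
}

int
main(void)
{
    /* page was dirtied by pruning, so freeze using OldestXmin/OldestMxact */
    FreezeCutoffs c = choose_freeze_cutoffs(true, false, false,
                                            2000, 55, 1500, 40);

    printf("FreezeLimit used: %u, MultiXactCutoff used: %u\n",
           (unsigned) c.freeze_limit, (unsigned) c.multixact_cutoff);
    return 0;
}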
v4-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch
From 6b4b69741461e60e66a5fcf6673ffb3e87aed1ae Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Nov 2021 21:27:06 -0800
Subject: [PATCH v4 1/5] Simplify lazy_scan_heap's handling of scanned pages.
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap. Pages counted by scanned_pages are
now the complement of the pages that are skipped over using the
visibility map. This new definition significantly simplifies quite a
few things.
Now heap relation truncation, visibility map bit setting, tuple counting
(e.g., for pg_class.reltuples), and tuple freezing all share a common
definition of scanned_pages. That makes it possible to remove certain
special cases that never made much sense. We no longer need to track
tupcount_pages separately (see bugfix commit 1914c5ea for details),
since we now always count tuples from pages that are scanned_pages. We
also no longer needlessly distinguish between aggressive and
non-aggressive VACUUM operations when we cannot immediately acquire a
cleanup lock.
Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
relfrozenxid, we now make non-aggressive VACUUMs work just a little
harder in order to make that desirable outcome more likely in practice.
Aggressive VACUUMs have long checked contended pages with only a shared
lock, to avoid needlessly waiting on a cleanup lock (in the common case
where the contended page has no tuples that need to be frozen anyway).
We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
course -- if we did that they'd no longer be non-aggressive. But we now
make the non-aggressive case notice that a failure to acquire a cleanup
lock on one particular heap page does not in itself make it unsafe to
advance relfrozenxid for the whole relation (which is what we usually
see in the aggressive case already).
We now also collect LP_DEAD items in the dead_items array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
so we should not squander any opportunity to do that. Commit 8523492d4e
taught VACUUM to set LP_DEAD line pointers to LP_UNUSED while only
holding an exclusive lock (not a cleanup lock), so we can expect to set
existing LP_DEAD items to LP_UNUSED reliably, even when we cannot
acquire our own cleanup lock at either pass over the heap (unless we opt
to skip index vacuuming, which implies that there is no second pass over
the heap).
We no longer report on "pin skipped pages" in log output. A later patch
will add back an improved version of the same instrumentation. We don't
want to show any information about any failures to acquire cleanup locks
unless we actually failed to do useful work as a consequence. A page
that we could not acquire a cleanup lock on is now treated as equivalent
to any other scanned page in most cases.
---
src/backend/access/heap/vacuumlazy.c | 814 +++++++++++-------
.../isolation/expected/vacuum-reltuples.out | 2 +-
.../isolation/specs/vacuum-reltuples.spec | 7 +-
3 files changed, 516 insertions(+), 307 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index db6becfed..c6d3a483f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -152,7 +152,7 @@ typedef enum
/*
* LVDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
* Each TID points to an LP_DEAD line pointer from a heap page that has been
- * processed by lazy_scan_prune.
+ * processed by lazy_scan_prune (or by lazy_scan_noprune, perhaps).
*
* Also needed by lazy_vacuum_heap_rel, which marks the same LP_DEAD line
* pointers as LP_UNUSED during second heap pass.
@@ -305,6 +305,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ /* Aggressive VACUUM (scan all unfrozen pages)? */
+ bool aggressive;
/* Wraparound failsafe has been triggered? */
bool failsafe_active;
/* Consider index vacuuming bypass optimization? */
@@ -329,6 +331,8 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
+ /* Are FreezeLimit/MultiXactCutoff still valid? */
+ bool freeze_cutoffs_valid;
/* Error reporting state */
char *relnamespace;
@@ -343,10 +347,8 @@ typedef struct LVRelState
*/
LVDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* number of pages we examined */
- BlockNumber pinskipped_pages; /* # of pages skipped due to a pin */
- BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
- BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -359,6 +361,7 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
+ /* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
int64 new_dead_tuples; /* new estimated total # of dead items in
@@ -398,19 +401,22 @@ static int elevel = -1;
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
- bool aggressive);
+static void lazy_scan_heap(LVRelState *vacrel, bool skipwithvm, int nworkers);
+static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool sharelock, Buffer vmbuffer);
static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
GlobalVisState *vistest,
LVPagePruneState *prunestate);
+static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool *hastup, bool *hasfreespace);
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, int index, Buffer *vmbuffer);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
- LVRelState *vacrel);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static void parallel_vacuum_process_all_indexes(LVRelState *vacrel, bool vacuum);
@@ -480,16 +486,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive; /* should we scan all unfrozen pages? */
- bool scanned_all_unfrozen; /* actually scanned all such pages? */
+ bool aggressive,
+ skipwithvm;
+ BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
- TransactionId new_frozen_xid;
- MultiXactId new_min_multi;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -535,8 +540,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
xidFullScanLimit);
aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
mxactFullScanLimit);
+ skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+ {
+ /*
+ * Force aggressive mode, and disable skipping blocks using the
+ * visibility map (even those set all-frozen)
+ */
aggressive = true;
+ skipwithvm = false;
+ }
vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel = rel;
vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
&vacrel->indrels);
+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
@@ -588,6 +602,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
+ /* Track if cutoffs became invalid (possible in !aggressive case only) */
+ vacrel->freeze_cutoffs_valid = true;
vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
vacrel->relname = pstrdup(RelationGetRelationName(rel));
@@ -624,30 +640,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params, aggressive);
+ lazy_scan_heap(vacrel, skipwithvm, params->nworkers);
/* Done with indexes */
vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
/*
- * Compute whether we actually scanned the all unfrozen pages. If we did,
- * we can adjust relfrozenxid and relminmxid.
- *
- * NB: We need to check this before truncating the relation, because that
- * will change ->rel_pages.
- */
- if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
- < vacrel->rel_pages)
- {
- Assert(!aggressive);
- scanned_all_unfrozen = false;
- }
- else
- scanned_all_unfrozen = true;
-
- /*
- * Optionally truncate the relation.
+ * Optionally truncate the relation. But remember the relation size used
+ * by lazy_scan_prune for later first.
*/
+ orig_rel_pages = vacrel->rel_pages;
if (should_attempt_truncation(vacrel))
{
/*
@@ -678,28 +680,44 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
- *
- * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
- * since then we don't know for certain that all tuples have a newer xmin.
*/
- new_rel_pages = vacrel->rel_pages;
+ new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
-
- vac_update_relstats(rel,
- new_rel_pages,
- new_live_tuples,
- new_rel_allvisible,
- vacrel->nindexes > 0,
- new_frozen_xid,
- new_min_multi,
- false);
+ /*
+ * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
+ * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
+ * provided we didn't skip any all-visible (not all-frozen) pages using
+ * the visibility map, and provided no cleanup lock failure forced us to
+ * leave behind a tuple with an XID/MXID older than the FreezeLimit (or
+ * MultiXactCutoff) established for this VACUUM operation.
+ *
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_heap, which won't match when we
+ * happened to truncate the relation afterwards.
+ */
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+ !vacrel->freeze_cutoffs_valid)
+ {
+ /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+ Assert(!aggressive);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+ /* Can safely advance relfrozenxid and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+ orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ FreezeLimit, MultiXactCutoff, false);
+ }
/*
* Report results to the stats collector, too.
@@ -728,7 +746,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
- BlockNumber orig_rel_pages;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -775,10 +792,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->pinskipped_pages,
vacrel->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
@@ -786,7 +802,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
(long long) vacrel->new_rel_tuples,
(long long) vacrel->new_dead_tuples,
OldestXmin);
- orig_rel_pages = vacrel->rel_pages + vacrel->pages_removed;
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -903,7 +918,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, bool skipwithvm, int nworkers)
{
LVDeadItems *dead_items;
BlockNumber nblocks,
@@ -925,7 +940,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pg_rusage_init(&ru0);
- if (aggressive)
+ if (vacrel->aggressive)
ereport(elevel,
(errmsg("aggressively vacuuming \"%s.%s\"",
vacrel->relnamespace,
@@ -937,14 +952,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vacrel->relname)));
nblocks = RelationGetNumberOfBlocks(vacrel->rel);
- next_unskippable_block = 0;
- next_failsafe_block = 0;
- next_fsm_block_to_vacuum = 0;
vacrel->rel_pages = nblocks;
vacrel->scanned_pages = 0;
- vacrel->pinskipped_pages = 0;
vacrel->frozenskipped_pages = 0;
- vacrel->tupcount_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
vacrel->nonempty_pages = 0;
@@ -968,14 +978,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* dangerously old.
*/
lazy_check_wraparound_failsafe(vacrel);
+ next_failsafe_block = 0;
/*
* Allocate the space for dead_items. Note that this handles parallel
* VACUUM initialization as part of allocating shared memory space used
* for dead_items.
*/
- dead_items_alloc(vacrel, params->nworkers);
+ dead_items_alloc(vacrel, nworkers);
dead_items = vacrel->dead_items;
+ next_fsm_block_to_vacuum = 0;
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -984,7 +996,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/*
- * Except when aggressive is set, we want to skip pages that are
+ * Set things up for skipping blocks using visibility map.
+ *
+ * Except when vacrel->aggressive is set, we want to skip pages that are
* all-visible according to the visibility map, but only when we can skip
* at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
* sequentially, the OS should be doing readahead for us, so there's no
@@ -993,8 +1007,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* page means that we can't update relfrozenxid, so we only want to do it
* if we can skip a goodly number of pages.
*
- * When aggressive is set, we can't skip pages just because they are
- * all-visible, but we can still skip pages that are all-frozen, since
+ * When vacrel->aggressive is set, we can't skip pages just because they
+ * are all-visible, but we can still skip pages that are all-frozen, since
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
@@ -1017,17 +1031,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* just added to that page are necessarily newer than the GlobalXmin we
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- *
- * We will scan the table's last page, at least to the extent of
- * determining whether it has tuples or not, even if it should be skipped
- * according to the above rules; except when we've already determined that
- * it's not worth trying to truncate the table. This avoids having
- * lazy_truncate_heap() take access-exclusive lock on the table to attempt
- * a truncation that just fails immediately because there are tuples in
- * the last page. This is worth avoiding mainly because such a lock must
- * be replayed on any hot standby, where it can be disruptive.
*/
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ next_unskippable_block = 0;
+ if (skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -1036,7 +1042,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmstatus = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -1063,13 +1069,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;
- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
@@ -1079,7 +1078,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
{
/* Time to advance next_unskippable_block */
next_unskippable_block++;
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ if (skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -1088,7 +1087,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmskipflags = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -1117,19 +1116,25 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* it's not all-visible. But in an aggressive vacuum we know only
* that it's not all-frozen, so it might still be all-visible.
*/
- if (aggressive && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive &&
+ VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
all_visible_according_to_vm = true;
}
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current page can be skipped if we've seen a long enough run
+ * of skippable blocks to justify skipping it -- provided it's not
+ * the last page in the relation (according to rel_pages/nblocks).
+ *
+ * We always scan the table's last page to determine whether it
+ * has tuples or not, even if it would otherwise be skipped
+ * (unless we're skipping every single page in the relation). This
+ * avoids having lazy_truncate_heap() take access-exclusive lock
+ * on the table to attempt a truncation that just fails
+ * immediately because there are tuples on the last page.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -1138,18 +1143,32 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* careful to count it as a skipped all-frozen page in that
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * know whether it was initially all-frozen, so we have to
+ * recheck.
*/
- if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive ||
+ VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
+
+ /*
+ * Otherwise it must be an all-visible (and possibly even
+ * all-frozen) page that we decided to process regardless
+ * (SKIP_PAGES_THRESHOLD must not have been crossed).
+ */
all_visible_according_to_vm = true;
}
vacuum_delay_point();
+ /*
+ * We're not skipping this page using the visibility map, and so it is
+ * (by definition) a scanned page. Any tuples from this page are now
+ * guaranteed to be counted below, after some preparatory checks.
+ */
+ vacrel->scanned_pages++;
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1204,174 +1223,78 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
}
/*
- * Set up visibility map page as needed.
- *
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
- * already have the correct page pinned anyway. However, it's
- * possible that (a) next_unskippable_block is covered by a different
- * VM page than the current block or (b) we released our pin and did a
- * cycle of index vacuuming.
+ * already have the correct page pinned anyway.
*/
visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+ /* Finished preparatory checks. Actually scan the page. */
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing using lazy_scan_noprune.
*/
if (!ConditionalLockBufferForCleanup(buf))
{
- bool hastup;
+ bool hastup,
+ hasfreespace;
- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
- {
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
- continue;
- }
-
- /*
- * Read the page with share lock to see if any xids on it need to
- * be frozen. If not we just skip the page, after updating our
- * scan statistics. If there are some, we wait for cleanup lock.
- *
- * We could defer the lock request further by remembering the page
- * and coming back to it later, or we could even register
- * ourselves for multiple buffers and then service whichever one
- * is received first. For now, this seems good enough.
- *
- * If we get here with aggressive false, then we're just forcibly
- * checking the page, and so we don't want to insist on getting
- * the lock; we only need to know if the page contains tuples, so
- * that we can update nonempty_pages correctly. It's convenient
- * to use lazy_check_needs_freeze() for both situations, though.
- */
LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- UnlockReleaseBuffer(buf);
- vacrel->scanned_pages++;
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
- if (!aggressive)
+
+ /* Collect LP_DEAD items in dead_items array, count tuples */
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
+ &hasfreespace))
{
+ Size freespace;
+
/*
- * Here, we must not advance scanned_pages; that would amount
- * to claiming that the page contains no freezable tuples.
+ * Processed page successfully (without cleanup lock) -- just
+ * need to perform rel truncation and FSM steps, much like the
+ * lazy_scan_prune case. Don't bother trying to match its
+ * visibility map setting steps, though.
*/
- UnlockReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
if (hastup)
vacrel->nonempty_pages = blkno + 1;
+ if (hasfreespace)
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ if (hasfreespace)
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
continue;
}
+
+ /*
+ * lazy_scan_noprune could not do all required processing. Wait
+ * for a cleanup lock, and call lazy_scan_prune in the usual way.
+ */
+ Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /* drop through to normal processing */
}
- /*
- * By here we definitely have enough dead_items space for whatever
- * LP_DEAD tids are on this page, we have the visibility map page set
- * up in case we need to set this page's all_visible/all_frozen bit,
- * and we have a cleanup lock. Any tuples on this page are now sure
- * to be "counted" by this VACUUM.
- *
- * One last piece of preamble needs to take place before we can prune:
- * we need to consider new and empty pages.
- */
- vacrel->scanned_pages++;
- vacrel->tupcount_pages++;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
- /*
- * All-zeroes pages can be left over if either a backend extends
- * the relation by a single page, but crashes before the newly
- * initialized page has been written out, or when bulk-extending
- * the relation (which creates a number of empty pages at the tail
- * end of the relation, but enters them into the FSM).
- *
- * Note we do not enter the page into the visibilitymap. That has
- * the downside that we repeatedly visit this page in subsequent
- * vacuums, but otherwise we'll never not discover the space on a
- * promoted standby. The harm of repeated checking ought to
- * normally not be too bad - the space usually should be used at
- * some point, otherwise there wouldn't be any regular vacuums.
- *
- * Make sure these pages are in the FSM, to ensure they can be
- * reused. Do that by testing if there's any space recorded for
- * the page. If not, enter it. We do so after releasing the lock
- * on the heap page, the FSM is approximate, after all.
- */
- UnlockReleaseBuffer(buf);
-
- if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
- {
- Size freespace = BLCKSZ - SizeOfPageHeaderData;
-
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- }
- continue;
- }
-
- if (PageIsEmpty(page))
- {
- Size freespace = PageGetHeapFreeSpace(page);
-
- /*
- * Empty pages are always all-visible and all-frozen (note that
- * the same is currently not true for new pages, see above).
- */
- if (!PageIsAllVisible(page))
- {
- START_CRIT_SECTION();
-
- /* mark buffer dirty before writing a WAL record */
- MarkBufferDirty(buf);
-
- /*
- * It's possible that another backend has extended the heap,
- * initialized the page, and then failed to WAL-log the page
- * due to an ERROR. Since heap extension is not WAL-logged,
- * recovery might try to replay our record setting the page
- * all-visible and find that the page isn't initialized, which
- * will cause a PANIC. To prevent that, check whether the
- * page has been previously WAL-logged, and if not, do that
- * now.
- */
- if (RelationNeedsWAL(vacrel->rel) &&
- PageGetLSN(page) == InvalidXLogRecPtr)
- log_newpage_buffer(buf, true);
-
- PageSetAllVisible(page);
- visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
- END_CRIT_SECTION();
- }
-
- UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
/*
- * Prune and freeze tuples.
+ * Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
* dead_items array. This includes LP_DEAD line pointers that we
@@ -1579,7 +1502,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
- vacrel->tupcount_pages,
+ vacrel->scanned_pages,
vacrel->live_tuples);
/*
@@ -1652,14 +1575,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
(long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
- appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
- "Skipped %u pages due to buffer pins, ",
- vacrel->pinskipped_pages),
- vacrel->pinskipped_pages);
- appendStringInfo(&buf, ngettext("%u frozen page.\n",
- "%u frozen pages.\n",
- vacrel->frozenskipped_pages),
- vacrel->frozenskipped_pages);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1673,6 +1588,138 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pfree(buf.data);
}
+/*
+ * lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
+ *
+ * Must call here to handle both new and empty pages before calling
+ * lazy_scan_prune or lazy_scan_noprune, since they're not prepared to deal
+ * with new or empty pages.
+ *
+ * It's necessary to consider new pages as a special case, since the rules for
+ * maintaining the visibility map and FSM with empty pages are a little
+ * different (though new pages can be truncated based on the usual rules).
+ *
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
+ *
+ * Caller must hold at least a shared lock. We might need to escalate that
+ * lock to an exclusive lock, so the type of lock the caller holds must be
+ * specified using the 'sharelock' argument.
+ *
+ * Returns false in the common case where caller should go on to call
+ * lazy_scan_prune (or lazy_scan_noprune). Otherwise returns true, indicating
+ * that lazy_scan_heap is done processing the page, releasing lock on caller's
+ * behalf.
+ */
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
+{
+ Size freespace;
+
+ if (PageIsNew(page))
+ {
+ /*
+ * All-zeroes pages can be left over if either a backend extends the
+ * relation by a single page, but crashes before the newly initialized
+ * page has been written out, or when bulk-extending the relation
+ * (which creates a number of empty pages at the tail end of the
+ * relation), and then enters them into the FSM.
+ *
+ * Note we do not enter the page into the visibilitymap. That has the
+ * downside that we repeatedly visit this page in subsequent vacuums,
+ * but otherwise we'll never discover the space on a promoted standby.
+ * The harm of repeated checking ought to normally not be too bad. The
+ * space usually should be used at some point, otherwise there
+ * wouldn't be any regular vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused.
+ * Do that by testing if there's any space recorded for the page. If
+ * not, enter it. We do so after releasing the lock on the heap page,
+ * the FSM is approximate, after all.
+ */
+ UnlockReleaseBuffer(buf);
+
+ if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
+ {
+ freespace = BLCKSZ - SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ }
+
+ return true;
+ }
+
+ if (PageIsEmpty(page))
+ {
+ /*
+ * It seems likely that caller will always be able to get a cleanup
+ * lock on an empty page. But don't take any chances -- escalate to
+ * an exclusive lock (still don't need a cleanup lock, though).
+ */
+ if (sharelock)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!PageIsEmpty(page))
+ {
+ /* page isn't new or empty -- keep lock and pin for now */
+ return false;
+ }
+ }
+ else
+ {
+ /* Already have a full cleanup lock (which is more than enough) */
+ }
+
+ freespace = PageGetHeapFreeSpace(page);
+
+ /*
+ * Unlike new pages, empty pages are always set all-visible and
+ * all-frozen.
+ */
+ if (!PageIsAllVisible(page))
+ {
+ START_CRIT_SECTION();
+
+ /* mark buffer dirty before writing a WAL record */
+ MarkBufferDirty(buf);
+
+ /*
+ * It's possible that another backend has extended the heap,
+ * initialized the page, and then failed to WAL-log the page due
+ * to an ERROR. Since heap extension is not WAL-logged, recovery
+ * might try to replay our record setting the page all-visible and
+ * find that the page isn't initialized, which will cause a PANIC.
+ * To prevent that, check whether the page has been previously
+ * WAL-logged, and if not, do that now.
+ */
+ if (RelationNeedsWAL(vacrel->rel) &&
+ PageGetLSN(page) == InvalidXLogRecPtr)
+ log_newpage_buffer(buf, true);
+
+ PageSetAllVisible(page);
+ visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+ END_CRIT_SECTION();
+ }
+
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ return true;
+ }
+
+ /* page isn't new or empty -- keep lock and pin */
+ return false;
+}
+
/*
* lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
*
@@ -1717,6 +1764,8 @@ lazy_scan_prune(LVRelState *vacrel,
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
maxoff = PageGetMaxOffsetNumber(page);
retry:
@@ -1779,10 +1828,9 @@ retry:
* LP_DEAD items are processed outside of the loop.
*
* Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
- * count_nondeletable_pages() do it -- they only consider pages empty
- * when they only have LP_UNUSED items, which is important for
- * correctness.
+ * LP_DEAD item here, which is not how count_nondeletable_pages() does
+ * it -- it only considers pages empty/truncatable when they have no
+ * items at all (except LP_UNUSED items).
*
* Our assumption is that any LP_DEAD items we encounter here will
* become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
@@ -2069,6 +2117,225 @@ retry:
vacrel->live_tuples += live_tuples;
}
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
+ *
+ * While pruning isn't performed here, we can at least collect existing
+ * LP_DEAD items into the dead_items array for removal from indexes. It's
+ * quite possible that earlier opportunistic pruning left LP_DEAD items
+ * behind, and we shouldn't miss out on an opportunity to make them reusable
+ * (VACUUM alone is capable of cleaning up line pointer bloat like this).
+ * Note that we'll only require an exclusive lock (not a cleanup lock) later
+ * on when we set these LP_DEAD items to LP_UNUSED in lazy_vacuum_heap_page.
+ *
+ * Freezing isn't performed here either. For aggressive VACUUM callers, we
+ * may return false to indicate that a full cleanup lock is required. This is
+ * necessary because pruning requires a cleanup lock, and because VACUUM
+ * cannot freeze a page's tuples until after pruning takes place (freezing
+ * tuples effectively requires a cleanup lock, though we don't need a cleanup
+ * lock in lazy_vacuum_heap_page or in lazy_scan_new_or_empty to set a heap
+ * page all-frozen in the visibility map).
+ *
+ * Returns true to indicate that all required processing has been performed.
+ * We'll always return true for a non-aggressive VACUUM, even when we know
+ * that this will cause them to miss out on freezing tuples from before
+ * vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
+ * lock. This does mean that they definitely won't be able to advance
+ * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
+ * relminmxid). Caller waits for full cleanup lock when we return false.
+ *
+ * See lazy_scan_prune for an explanation of hastup return flag. The
+ * hasfreespace flag instructs caller on whether or not it should do generic
+ * FSM processing for page, which is determined based on almost the same
+ * criteria as the lazy_scan_prune case.
+ */
+static bool
+lazy_scan_noprune(LVRelState *vacrel,
+ Buffer buf,
+ BlockNumber blkno,
+ Page page,
+ bool *hastup,
+ bool *hasfreespace)
+{
+ OffsetNumber offnum,
+ maxoff;
+ int lpdead_items,
+ num_tuples,
+ live_tuples,
+ new_dead_tuples;
+ HeapTupleHeader tupleheader;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
+ *hastup = false; /* for now */
+ *hasfreespace = false; /* for now */
+
+ lpdead_items = 0;
+ num_tuples = 0;
+ live_tuples = 0;
+ new_dead_tuples = 0;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+ HeapTupleData tuple;
+
+ vacrel->offnum = offnum;
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true; /* page prevents rel truncation */
+ continue;
+ }
+
+ if (ItemIdIsDead(itemid))
+ {
+ /*
+ * Deliberately don't set hastup=true here. See same point in
+ * lazy_scan_prune for an explanation.
+ */
+ deadoffsets[lpdead_items++] = offnum;
+ continue;
+ }
+
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (heap_tuple_needs_freeze(tupleheader,
+ vacrel->FreezeLimit,
+ vacrel->MultiXactCutoff, buf))
+ {
+ if (vacrel->aggressive)
+ {
+ /* Going to have to get cleanup lock for lazy_scan_prune */
+ vacrel->offnum = InvalidOffsetNumber;
+ return false;
+ }
+
+ /*
+ * Current non-aggressive VACUUM operation definitely won't be
+ * able to advance relfrozenxid or relminmxid
+ */
+ vacrel->freeze_cutoffs_valid = false;
+ }
+
+ num_tuples++;
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(vacrel->rel);
+
+ switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+ {
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ case HEAPTUPLE_LIVE:
+
+ /*
+ * Count both cases as live, just like lazy_scan_prune
+ */
+ live_tuples++;
+
+ break;
+ case HEAPTUPLE_DEAD:
+
+ /*
+ * There is some useful work for pruning to do here, but it won't
+ * get done due to our failure to get a cleanup lock.
+ *
+ * TODO Add dedicated instrumentation for this case
+ */
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * Count in new_dead_tuples, just like lazy_scan_prune
+ */
+ new_dead_tuples++;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+ /*
+ * Do not count these rows as live, just like lazy_scan_prune
+ */
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+
+ }
+
+ vacrel->offnum = InvalidOffsetNumber;
+
+ /*
+ * Now save details of the LP_DEAD items from the page in vacrel (though
+ * only when VACUUM uses two-pass strategy).
+ */
+ if (vacrel->nindexes == 0)
+ {
+ /*
+ * Using one-pass strategy.
+ *
+ * We are not prepared to handle the corner case where a single pass
+ * strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
+ * items.
+ */
+ if (lpdead_items > 0)
+ *hastup = true;
+ *hasfreespace = true;
+ num_tuples += lpdead_items;
+ /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ }
+ else if (lpdead_items > 0)
+ {
+ LVDeadItems *dead_items = vacrel->dead_items;
+ ItemPointerData tmp;
+
+ vacrel->lpdead_item_pages++;
+
+ ItemPointerSetBlockNumber(&tmp, blkno);
+
+ for (int i = 0; i < lpdead_items; i++)
+ {
+ ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+ dead_items->items[dead_items->num_items++] = tmp;
+ }
+
+ Assert(dead_items->num_items <= dead_items->max_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ dead_items->num_items);
+
+ vacrel->lpdead_items += lpdead_items;
+ }
+ else
+ {
+ /*
+ * Caller won't be vacuuming this page later, so tell it to record
+ * page's freespace in the FSM now
+ */
+ *hasfreespace = true;
+ }
+
+ /*
+ * Finally, add relevant page-local counts to whole-VACUUM counts
+ */
+ vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->num_tuples += num_tuples;
+ vacrel->live_tuples += live_tuples;
+
+ /* Caller won't need to call lazy_scan_prune with same page */
+ return true;
+}
+
/*
* Remove the collected garbage tuples from the table and its indexes.
*
@@ -2515,67 +2782,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
return index;
}
-/*
- * lazy_check_needs_freeze() -- scan page to see if any tuples
- * need to be cleaned to avoid wraparound
- *
- * Returns true if the page needs to be vacuumed using cleanup lock.
- * Also returns a flag indicating whether page contains any tuples at all.
- */
-static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
-{
- Page page = BufferGetPage(buf);
- OffsetNumber offnum,
- maxoff;
- HeapTupleHeader tupleheader;
-
- *hastup = false;
-
- /*
- * New and empty pages, obviously, don't contain tuples. We could make
- * sure that the page is registered in the FSM, but it doesn't seem worth
- * waiting for a cleanup lock just for that, especially because it's
- * likely that the pin holder will do so.
- */
- if (PageIsNew(page) || PageIsEmpty(page))
- return false;
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = FirstOffsetNumber;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid;
-
- /*
- * Set the offset number so that we can display it along with any
- * error that occurred while processing this tuple.
- */
- vacrel->offnum = offnum;
- itemid = PageGetItemId(page, offnum);
-
- /* this should match hastup test in count_nondeletable_pages() */
- if (ItemIdIsUsed(itemid))
- *hastup = true;
-
- /* dead and redirect items never need freezing */
- if (!ItemIdIsNormal(itemid))
- continue;
-
- tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-
- if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
- break;
- } /* scan along page */
-
- /* Clear the offset information once we have processed the given page. */
- vacrel->offnum = InvalidOffsetNumber;
-
- return (offnum <= maxoff);
-}
-
/*
* Trigger the failsafe to avoid wraparound failure when vacrel table has a
* relfrozenxid and/or relminmxid that is dangerously far in the past.
@@ -2663,7 +2869,7 @@ parallel_vacuum_process_all_indexes(LVRelState *vacrel, bool vacuum)
*/
vacrel->lps->lvshared->reltuples = vacrel->new_rel_tuples;
vacrel->lps->lvshared->estimated_count =
- (vacrel->tupcount_pages < vacrel->rel_pages);
+ (vacrel->scanned_pages < vacrel->rel_pages);
new_status = PARALLEL_INDVAC_STATUS_NEED_CLEANUP;
@@ -2982,7 +3188,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
bool estimated_count =
- vacrel->tupcount_pages < vacrel->rel_pages;
+ vacrel->scanned_pages < vacrel->rel_pages;
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -3133,7 +3339,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
* should_attempt_truncation - should we attempt to truncate the heap?
*
* Don't even think about it unless we have a shot at releasing a goodly
- * number of pages. Otherwise, the time taken isn't worth it.
+ * number of pages. Otherwise, the time taken isn't worth it, mainly because
+ * an AccessExclusive lock must be replayed on any hot standby, where it can
+ * be particularly disruptive.
*
* Also don't attempt it if wraparound failsafe is in effect. It's hard to
* predict how long lazy_truncate_heap will take. Don't take any chances.
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
index cdbe7f3a6..ce55376e7 100644
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ b/src/test/isolation/expected/vacuum-reltuples.out
@@ -45,7 +45,7 @@ step stats:
relpages|reltuples
--------+---------
- 1| 20
+ 1| 21
(1 row)
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
index ae2f79b8f..a2a461f2f 100644
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ b/src/test/isolation/specs/vacuum-reltuples.spec
@@ -2,9 +2,10 @@
# to page pins. We absolutely need to avoid setting reltuples=0 in
# such cases, since that interferes badly with planning.
#
-# Expected result in second permutation is 20 tuples rather than 21 as
-# for the others, because vacuum should leave the previous result
-# (from before the insert) in place.
+# Expected result for all three permutations is 21 tuples, including
+# the second permutation. VACUUM is able to count the concurrently
+# inserted tuple in its final reltuples, even when a cleanup lock
+# cannot be acquired on the affected heap page.
setup {
create table smalltbl
--
2.30.2
Attachment: v4-0003-Simplify-vacuum_set_xid_limits-signature.patch (application/x-patch)
From 09dff0737d045713cc536c736b8b05b578bfdb9d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 11 Dec 2021 17:39:45 -0800
Subject: [PATCH v4 3/5] Simplify vacuum_set_xid_limits signature.
Refactoring, making the return value of vacuum_set_xid_limits()
determine whether or not this will be an aggressive VACUUM.
This will make it easier to set/return an oldestMxact value for
vacuumlazy.c caller in the next commit, which is an important detail
that enables advancing relminmxid opportunistically.
---
src/include/commands/vacuum.h | 6 +-
src/backend/access/heap/vacuumlazy.c | 32 +++----
src/backend/commands/cluster.c | 3 +-
src/backend/commands/vacuum.c | 134 +++++++++++++--------------
4 files changed, 79 insertions(+), 96 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index bc625463e..6eefe8129 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -266,15 +266,13 @@ extern void vac_update_relstats(Relation relation,
bool *frozenxid_updated,
bool *minmulti_updated,
bool in_outer_xact);
-extern void vacuum_set_xid_limits(Relation rel,
+extern bool vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit);
+ MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
MultiXactId relminmxid);
extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 238e07a78..da5b3f79a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -494,8 +494,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
- TransactionId xidFullScanLimit;
- MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
@@ -526,24 +524,22 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
RelationGetRelid(rel));
- vacuum_set_xid_limits(rel,
- params->freeze_min_age,
- params->freeze_table_age,
- params->multixact_freeze_min_age,
- params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit, &xidFullScanLimit,
- &MultiXactCutoff, &mxactFullScanLimit);
-
/*
- * We request an aggressive scan if the table's frozen Xid is now older
- * than or equal to the requested Xid full-table scan limit; or if the
- * table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
+ * Get cutoffs that determine which tuples we need to freeze during the
+ * VACUUM operation.
+ *
+ * Also determines if this is to be an aggressive VACUUM. This will
+ * eventually be required for any table where (for whatever reason) no
+ * non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
*/
- aggressive = TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
- xidFullScanLimit);
- aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
- mxactFullScanLimit);
+ aggressive = vacuum_set_xid_limits(rel,
+ params->freeze_min_age,
+ params->freeze_table_age,
+ params->multixact_freeze_min_age,
+ params->multixact_freeze_table_age,
+ &OldestXmin, &FreezeLimit,
+ &MultiXactCutoff);
+
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 9d22f648a..66b87347d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -857,8 +857,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* not to be aggressive about this.
*/
vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
- NULL);
+ &OldestXmin, &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 8bd4bd12c..6db7b8156 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -935,25 +935,20 @@ get_all_vacuum_rels(int options)
*
* Input parameters are the target relation, applicable freeze age settings.
*
+ * Return value indicates whether caller should do an aggressive VACUUM or
+ * not. This is a VACUUM that cannot skip any pages using the visibility map
+ * (except all-frozen pages), which is guaranteed to be able to advance
+ * relfrozenxid and relminmxid.
+ *
* The output parameters are:
- * - oldestXmin is the cutoff value used to distinguish whether tuples are
- * DEAD or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * - oldestXmin is the Xid below which tuples deleted by any xact (that
+ * committed) should be considered DEAD, not just RECENTLY_DEAD.
* - freezeLimit is the Xid below which all Xids are replaced by
* FrozenTransactionId during vacuum.
- * - xidFullScanLimit (computed from freeze_table_age parameter)
- * represents a minimum Xid value; a table whose relfrozenxid is older than
- * this will have a full-table vacuum applied to it, to freeze tuples across
- * the whole table. Vacuuming a table younger than this value can use a
- * partial scan.
- * - multiXactCutoff is the value below which all MultiXactIds are removed from
- * Xmax.
- * - mxactFullScanLimit is a value against which a table's relminmxid value is
- * compared to produce a full-table vacuum, as with xidFullScanLimit.
- *
- * xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is
- * not interested.
+ * - multiXactCutoff is the value below which all MultiXactIds are removed
+ * from Xmax.
*/
-void
+bool
vacuum_set_xid_limits(Relation rel,
int freeze_min_age,
int freeze_table_age,
@@ -961,9 +956,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit)
+ MultiXactId *multiXactCutoff)
{
int freezemin;
int mxid_freezemin;
@@ -973,6 +966,7 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
+ int freezetable;
/*
* We can always ignore processes running lazy vacuum. This is because we
@@ -1090,64 +1084,60 @@ vacuum_set_xid_limits(Relation rel,
*multiXactCutoff = mxactLimit;
- if (xidFullScanLimit != NULL)
- {
- int freezetable;
+ /*
+ * Done setting output parameters; just need to figure out if caller needs
+ * to do an aggressive VACUUM or not.
+ *
+ * Determine the table freeze age to use: as specified by the caller, or
+ * vacuum_freeze_table_age, but in any case not more than
+ * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
+ * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
+ * before anti-wraparound autovacuum is launched.
+ */
+ freezetable = freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_freeze_table_age;
+ freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- Assert(mxactFullScanLimit != NULL);
+ /*
+ * Compute XID limit causing an aggressive vacuum, being careful not to
+ * generate a "permanent" XID
+ */
+ limit = ReadNextTransactionId() - freezetable;
+ if (!TransactionIdIsNormal(limit))
+ limit = FirstNormalTransactionId;
+ if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
+ limit))
+ return true;
- /*
- * Determine the table freeze age to use: as specified by the caller,
- * or vacuum_freeze_table_age, but in any case not more than
- * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
- * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
- * before anti-wraparound autovacuum is launched.
- */
- freezetable = freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_freeze_table_age;
- freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
+ /*
+ * Similar to the above, determine the table freeze age to use for
+ * multixacts: as specified by the caller, or
+ * vacuum_multixact_freeze_table_age, but in any case not more than
+ * autovacuum_multixact_freeze_table_age * 0.95, so that if you have e.g.
+ * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
+ * multixacts before anti-wraparound autovacuum is launched.
+ */
+ freezetable = multixact_freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_multixact_freeze_table_age;
+ freezetable = Min(freezetable,
+ effective_multixact_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- /*
- * Compute XID limit causing a full-table vacuum, being careful not to
- * generate a "permanent" XID.
- */
- limit = ReadNextTransactionId() - freezetable;
- if (!TransactionIdIsNormal(limit))
- limit = FirstNormalTransactionId;
+ /*
+ * Compute MultiXact limit causing an aggressive vacuum, being careful to
+ * generate a valid MultiXact value
+ */
+ mxactLimit = ReadNextMultiXactId() - freezetable;
+ if (mxactLimit < FirstMultiXactId)
+ mxactLimit = FirstMultiXactId;
+ if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
+ mxactLimit))
+ return true;
- *xidFullScanLimit = limit;
-
- /*
- * Similar to the above, determine the table freeze age to use for
- * multixacts: as specified by the caller, or
- * vacuum_multixact_freeze_table_age, but in any case not more than
- * autovacuum_multixact_freeze_table_age * 0.95, so that if you have
- * e.g. nightly VACUUM schedule, the nightly VACUUM gets a chance to
- * freeze multixacts before anti-wraparound autovacuum is launched.
- */
- freezetable = multixact_freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_multixact_freeze_table_age;
- freezetable = Min(freezetable,
- effective_multixact_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
-
- /*
- * Compute MultiXact limit causing a full-table vacuum, being careful
- * to generate a valid MultiXact value.
- */
- mxactLimit = ReadNextMultiXactId() - freezetable;
- if (mxactLimit < FirstMultiXactId)
- mxactLimit = FirstMultiXactId;
-
- *mxactFullScanLimit = mxactLimit;
- }
- else
- {
- Assert(mxactFullScanLimit == NULL);
- }
+ return false;
}
/*
--
2.30.2
On Thu, Dec 16, 2021 at 5:27 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Dec 10, 2021 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
* I'm still working on the optimization that we discussed on this
thread: the optimization that allows the final relfrozenxid (that we
set in pg_class) to be determined dynamically, based on the actual
XIDs we observed in the table (we don't just naively use FreezeLimit).
Attached is v4 of the patch series, which now includes this
optimization, broken out into its own patch. In addition, it includes
a prototype of opportunistic freezing.
My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.
Great!
I've looked at 0001 patch and here are some comments:
@@ -535,8 +540,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
xidFullScanLimit);
aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
mxactFullScanLimit);
+ skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+ {
+ /*
+ * Force aggressive mode, and disable skipping blocks using the
+ * visibility map (even those set all-frozen)
+ */
aggressive = true;
+ skipwithvm = false;
+ }
vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
@@ -544,6 +557,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel = rel;
vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
&vacrel->indrels);
+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
How about adding skipwithvm to LVRelState too?
---
/*
- * The current block is potentially skippable;
if we've seen a
- * long enough run of skippable blocks to
justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least
all-visible if not
- * all-frozen, so we can set
all_visible_according_to_vm = true.
+ * The current page can be skipped if we've
seen a long enough run
+ * of skippable blocks to justify skipping it
-- provided it's not
+ * the last page in the relation (according to
rel_pages/nblocks).
+ *
+ * We always scan the table's last page to
determine whether it
+ * has tuples or not, even if it would
otherwise be skipped
+ * (unless we're skipping every single page in
the relation). This
+ * avoids having lazy_truncate_heap() take
access-exclusive lock
+ * on the table to attempt a truncation that just fails
+ * immediately because there are tuples on the
last page.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.
Great!
Maybe this is a good time to revisit basic questions about VACUUM. I
wonder if we can get rid of some of the GUCs for VACUUM now.
Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).
We probably shouldn't be using any units, but using XIDs "feels wrong"
to me. Even with my patch, it is theoretically possible that we won't
be able to advance relfrozenxid very much, because we cannot get a
cleanup lock on one single heap page with one old XID. But even in
this extreme case, how relevant is the "age" of this old XID, really?
What really matters is whether or not we can advance relfrozenxid in
time (with time to spare). And so the wraparound risk of the system is
not affected all that much by the age of the single oldest XID. The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?
On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)
Using XIDs makes sense for things like autovacuum_freeze_max_age,
because there we have to worry about wraparound and relfrozenxid
(whether or not we like it). But with this patch, and with everything
else (the failsafe, insert-driven autovacuums, everything we've done
over the last several years) I think that it might be time to increase
the autovacuum_freeze_max_age default. Maybe even to something as high
as 800 million transaction IDs, but certainly to 400 million. What do
you think? (Maybe don't answer just yet, something to think about.)
+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
How about adding skipwithvm to LVRelState too?
Agreed -- it's slightly better that way. Will change this.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?
My goal here was to keep the behavior from commit e8429082, "Avoid
useless truncation attempts during VACUUM", while simplifying things
around skipping heap pages via the visibility map (including removing
the FORCE_CHECK_PAGE() macro). Of course you're right that this
particular change that you have highlighted does change the behavior a
little -- now we will always treat the final page as a "scanned page",
except perhaps when 100% of all pages in the relation are skipped
using the visibility map.
This was a deliberate choice (and perhaps even a good choice!). I
think that avoiding accessing the last heap page like this isn't worth
the complexity. Note that we may already access heap pages (making
them "scanned pages") despite the fact that we know it's unnecessary:
the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
even try to avoid wasting CPU cycles on these
not-skipped-but-skippable pages). So I think that the performance cost
for the last page isn't going to be noticeable.
However, now that I think about it, I wonder...what do you think of
SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
after the original visibility map implementation was committed in
2009. The idea that it helps us to advance relfrozenxid outside of
aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
might no longer matter with the patch -- because now we won't ever set
a page all-visible but not all-frozen. Plus the idea that we need to
do all this work just to get readahead from the OS
seems...questionable.
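
To make the trade-off concrete, here is a small standalone toy (not the
real lazy_scan_heap code) that mimics the decision being discussed: runs
of skippable pages shorter than SKIP_PAGES_THRESHOLD get read anyway, and
the final heap page is read even when it falls inside a long skippable
run. The real code also has a carve-out for the case where every single
page in the relation is skippable, which the toy ignores.

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32     /* same value as vacuumlazy.c today */

/*
 * Toy model: given per-page "skippable according to the VM" flags, print
 * which pages a VACUUM would actually read.
 */
static void
simulate_scan(const bool *skippable, int nblocks)
{
    int     blkno = 0;

    while (blkno < nblocks)
    {
        int     run = 0;

        /* measure the run of consecutive skippable pages starting here */
        while (blkno + run < nblocks && skippable[blkno + run])
            run++;

        if (run >= SKIP_PAGES_THRESHOLD)
        {
            int     stop = blkno + run;

            /* still read the final page, so truncation can be judged */
            if (stop == nblocks)
                stop = nblocks - 1;
            printf("skip pages %d..%d\n", blkno, stop - 1);
            blkno = stop;
        }
        else
        {
            /* short run (or unskippable page): read it anyway */
            printf("scan page %d\n", blkno);
            blkno++;
        }
    }
}

int
main(void)
{
    bool    skippable[100];

    /* one long all-visible run in the middle, short runs elsewhere */
    for (int i = 0; i < 100; i++)
        skippable[i] = (i >= 10 && i < 90) || (i % 7 == 0);
    simulate_scan(skippable, 100);
    return 0;
}
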
--
Peter Geoghegan
On Sat, Dec 18, 2021 at 11:29 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Dec 16, 2021 at 10:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.
Great!
Maybe this is a good time to revisit basic questions about VACUUM. I
wonder if we can get rid of some of the GUCs for VACUUM now.
Can we fully get rid of vacuum_freeze_table_age?
Does it mean that a vacuum always is an aggressive vacuum? If
opportunistic freezing works well on all tables, we might no longer
need vacuum_freeze_table_age. But I’m not sure that’s true since the
cost of freezing tuples is not 0.
We probably shouldn't be using any units, but using XIDs "feels wrong"
to me. Even with my patch, it is theoretically possible that we won't
be able to advance relfrozenxid very much, because we cannot get a
cleanup lock on one single heap page with one old XID. But even in
this extreme case, how relevant is the "age" of this old XID, really?
What really matters is whether or not we can advance relfrozenxid in
time (with time to spare). And so the wraparound risk of the system is
not affected all that much by the age of the single oldest XID. The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?
I think that that's true for (mostly) static tables. But consider
constantly-updated tables: autovacuum runs based on the number of
garbage tuples (or inserted tuples) and on how old the relfrozenxid
is, so if an autovacuum could not advance the relfrozenxid because it
could not get a cleanup lock on the page that has the single oldest
XID, the next autovacuum will likely have to process other pages too,
since enough of them will have been dirtied by then.
It might be a good idea to remember the pages where we could not get a
cleanup lock and revisit them after index cleanup. While revisiting
those pages, we wouldn't prune them, but only freeze tuples.
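
Purely to illustrate the idea (nothing below exists in vacuumlazy.c; the
struct, the function, and freeze_tuples_only() are hypothetical names):
the point is that revisiting only needs an ordinary exclusive buffer
lock, because freezing does not require waiting for other backends to
drop their pins the way pruning's cleanup lock does.

/* Hypothetical sketch of "remember, then revisit and freeze-only" */
#include "postgres.h"

#include "access/heapam.h"
#include "storage/bufmgr.h"

typedef struct DeferredFreeze
{
    BlockNumber *blocks;        /* pages we failed to cleanup-lock */
    int          nblocks;
} DeferredFreeze;

/* hypothetical helper: freeze qualifying tuples on an exclusively-locked page */
static void freeze_tuples_only(Relation rel, Buffer buf);

static void
revisit_deferred_pages(Relation rel, BufferAccessStrategy bstrategy,
                       DeferredFreeze *deferred)
{
    for (int i = 0; i < deferred->nblocks; i++)
    {
        Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM,
                                             deferred->blocks[i], RBM_NORMAL,
                                             bstrategy);

        /* exclusive lock is enough here -- we freeze, but never prune */
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        freeze_tuples_only(rel, buf);
        UnlockReleaseBuffer(buf);
    }
}
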
On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)
Why is it obsolete now? I guess that it's still valid depending on the
cases, for example, heavily updated tables.
Using XIDs makes sense for things like autovacuum_freeze_max_age,
because there we have to worry about wraparound and relfrozenxid
(whether or not we like it). But with this patch, and with everything
else (the failsafe, insert-driven autovacuums, everything we've done
over the last several years) I think that it might be time to increase
the autovacuum_freeze_max_age default. Maybe even to something as high
as 800 million transaction IDs, but certainly to 400 million. What do
you think? (Maybe don't answer just yet, something to think about.)
I don’t have an objection to increasing autovacuum_freeze_max_age for
now. One of my concerns with anti-wraparound vacuums is that too many
tables (or several large tables) will reach autovacuum_freeze_max_age
at once, using up autovacuum slots and preventing autovacuums from
being launched on tables that are heavily being updated. Given this
work, expanding the gap between vacuum_freeze_table_age and
autovacuum_freeze_max_age would give tables a better chance of
advancing their relfrozenxid via an aggressive vacuum instead of an
anti-wraparound-aggressive vacuum. 400 million seems to be a good
start.
+ vacrel->aggressive = aggressive;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
How about adding skipwithvm to LVRelState too?
Agreed -- it's slightly better that way. Will change this.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
Why do we always need to scan the last page even if heap truncation is
disabled (or in the failsafe mode)?
My goal here was to keep the behavior from commit e8429082, "Avoid
useless truncation attempts during VACUUM", while simplifying things
around skipping heap pages via the visibility map (including removing
the FORCE_CHECK_PAGE() macro). Of course you're right that this
particular change that you have highlighted does change the behavior a
little -- now we will always treat the final page as a "scanned page",
except perhaps when 100% of all pages in the relation are skipped
using the visibility map.
This was a deliberate choice (and perhaps even a good choice!). I
think that avoiding accessing the last heap page like this isn't worth
the complexity. Note that we may already access heap pages (making
them "scanned pages") despite the fact that we know it's unnecessary:
the SKIP_PAGES_THRESHOLD test leads to this behavior (and we don't
even try to avoid wasting CPU cycles on these
not-skipped-but-skippable pages). So I think that the performance cost
for the last page isn't going to be noticeable.
Agreed.
However, now that I think about it, I wonder...what do you think of
SKIP_PAGES_THRESHOLD, in general? Is the optimal value still 32 today?
SKIP_PAGES_THRESHOLD hasn't changed since commit bf136cf6e3, shortly
after the original visibility map implementation was committed in
2009. The idea that it helps us to advance relfrozenxid outside of
aggressive VACUUMs (per commit message from bf136cf6e3) seems like it
might no longer matter with the patch -- because now we won't ever set
a page all-visible but not all-frozen. Plus the idea that we need to
do all this work just to get readahead from the OS
seems...questionable.
Given the opportunistic freezing, that's true but I'm concerned
whether opportunistic freezing always works well on all tables since
freezing tuples is not 0 cost.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Dec 20, 2021 at 8:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Can we fully get rid of vacuum_freeze_table_age?
Does it mean that a vacuum always is an aggressive vacuum?
No. Just somewhat more like one. Still no waiting for cleanup locks,
though. Also, autovacuum is still cancelable (that's technically from
anti-wraparound VACUUM, but you know what I mean). And there shouldn't
be a noticeable difference in terms of how many blocks can be skipped
using the VM.
If opportunistic freezing works well on all tables, we might no longer
need vacuum_freeze_table_age. But I’m not sure that’s true since the
cost of freezing tuples is not 0.
That's true, of course, but right now the only goal of opportunistic
freezing is to advance relfrozenxid in every VACUUM. It needs to be
shown to be worth it, of course. But let's assume that it is worth it,
for a moment (perhaps only because we optimize freezing itself in
passing) -- then there is little use for vacuum_freeze_table_age, that
I can see.
I think that that's true for (mostly) static tables. But consider
constantly-updated tables: autovacuum runs based on the number of
garbage tuples (or inserted tuples) and on how old the relfrozenxid
is, so if an autovacuum could not advance the relfrozenxid because it
could not get a cleanup lock on the page that has the single oldest
XID, the next autovacuum will likely have to process other pages too,
since enough of them will have been dirtied by then.
I'm not arguing that the age of the single oldest XID is *totally*
irrelevant. Just that it's typically much less important than the
total amount of work we'd have to do (freezing) to be able to advance
relfrozenxid.
In any case, the extreme case where we just cannot get a cleanup lock
on one particular page with an old XID is probably very rare.
It might be a good idea to remember the pages where we could not get a
cleanup lock and revisit them after index cleanup. While revisiting
those pages, we wouldn't prune them, but only freeze tuples.
Maybe, but I think that it would make more sense to not use
FreezeLimit for that at all. In an aggressive VACUUM (where we might
actually have to wait for a cleanup lock), why should we wait once the
age is over vacuum_freeze_min_age (usually 50 million XIDs)? The
official answer is "because we need to advance relfrozenxid". But why
not accept a much older relfrozenxid that is still sufficiently
young/safe, in order to avoid waiting for a cleanup lock?
In other words, what if our approach of "being diligent about
advancing relfrozenxid" makes the relfrozenxid problem worse, not
better? The problem with "being diligent" is that it is defined by
FreezeLimit (which is more or less the same thing as
vacuum_freeze_min_age), which is supposed to be about which tuples we
will freeze. That's a very different thing to how old relfrozenxid
should be or can be (after an aggressive VACUUM finishes).
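
As a rough illustration of that alternative policy (hedged -- this is
not code from the patch series, and for brevity it only looks at xmin,
ignoring xmax and MultiXacts): instead of waiting for a cleanup lock
whenever some XID on the page is older than FreezeLimit, only wait when
an XID is older than a much more conservative "must freeze now" cutoff,
and otherwise just carry the page's oldest remaining XID forward as the
relfrozenxid candidate.

/*
 * Sketch only: decide whether an aggressive VACUUM really has to wait for
 * a cleanup lock on this page, or whether it can settle for an older (but
 * still safe) final relfrozenxid instead.
 */
static bool
can_skip_cleanup_lock_wait(Page page, TransactionId must_freeze_limit,
                           TransactionId *NewRelfrozenxid)
{
    OffsetNumber offnum,
                maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId          itemid = PageGetItemId(page, offnum);
        HeapTupleHeader htup;
        TransactionId   xmin;

        if (!ItemIdIsNormal(itemid))
            continue;           /* unused, dead, or redirect item */

        htup = (HeapTupleHeader) PageGetItem(page, itemid);
        xmin = HeapTupleHeaderGetXmin(htup);
        if (!TransactionIdIsNormal(xmin))
            continue;           /* already frozen (or bootstrap/invalid) */

        if (TransactionIdPrecedes(xmin, must_freeze_limit))
            return false;       /* genuinely too old: wait after all */

        /* accept an older final relfrozenxid rather than waiting */
        if (TransactionIdPrecedes(xmin, *NewRelfrozenxid))
            *NewRelfrozenxid = xmin;
    }

    return true;
}
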
On the other hand, the risk may be far greater if we have *many*
tuples that are still unfrozen, whose XIDs are only "middle aged"
right now. The idea behind vacuum_freeze_min_age seems to be to be
lazy about work (tuple freezing) in the hope that we'll never need to
do it, but that seems obsolete now. (It probably made a little more
sense before the visibility map.)Why is it obsolete now? I guess that it's still valid depending on the
cases, for example, heavily updated tables.
Because after the 9.6 freezemap work we'll often set the all-visible
bit in the VM, but not the all-frozen bit (unless we have the
opportunistic freezing patch applied, which specifically avoids that).
When that happens, affected heap pages will still have
older-than-vacuum_freeze_min_age-XIDs after VACUUM runs, until we get
to an aggressive VACUUM. There could be many VACUUMs before the
aggressive VACUUM.
This "freezing cliff" seems like it might be a big problem, in
general. That's what I'm trying to address here.
Either way, the system doesn't really respect vacuum_freeze_min_age in
the way that it did before 9.6 -- which is what I meant by "obsolete".
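
A condensed sketch of the skip rule that creates that cliff
(VM_ALL_VISIBLE() and VM_ALL_FROZEN() are the real visibilitymap.h
macros; the surrounding function is just an illustration, not the
actual lazy_scan_heap logic):

static bool
page_can_be_skipped(Relation rel, BlockNumber blkno, Buffer *vmbuffer,
                    bool aggressive)
{
    if (aggressive)
    {
        /* aggressive VACUUM may only skip pages with no unfrozen XIDs */
        return VM_ALL_FROZEN(rel, blkno, vmbuffer);
    }

    /*
     * A non-aggressive VACUUM also skips pages that are merely all-visible.
     * Any older-than-vacuum_freeze_min_age XIDs on those pages are left
     * behind until some eventual aggressive VACUUM has to freeze them all
     * at once -- the "freezing cliff".
     */
    return VM_ALL_VISIBLE(rel, blkno, vmbuffer);
}
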
I don’t have an objection to increasing autovacuum_freeze_max_age for
now. One of my concerns with anti-wraparound vacuums is that too many
tables (or several large tables) will reach autovacuum_freeze_max_age
at once, using up autovacuum slots and preventing autovacuums from
being launched on tables that are heavily being updated.
I think that the patch helps with that, actually -- there tends to be
"natural variation" in the relfrozenxid age of each table, which comes
from per-table workload characteristics.
Given this
work, expanding the gap between vacuum_freeze_table_age and
autovacuum_freeze_max_age would give tables a better chance of
advancing their relfrozenxid via an aggressive vacuum instead of an
anti-wraparound-aggressive vacuum. 400 million seems to be a good
start.
The idea behind getting rid of vacuum_freeze_table_age (not to be
confused by the other idea about getting rid of vacuum_freeze_min_age)
is this: with the patch series, we only tend to get an anti-wraparound
VACUUM in extreme and relatively rare cases. For example, we will get
aggressive anti-wraparound VACUUMs on tables that *never* grow, but
constantly get HOT updates (e.g. the pgbench_accounts table with heap
fill factor reduced to 90). We won't really be able to use the VM when
this happens, either.
With tables like this -- tables that still get aggressive VACUUMs --
maybe the patch doesn't make a huge difference. But that's truly the
extreme case -- that is true only because there is already zero chance
of there being a non-aggressive VACUUM. We'll get aggressive
anti-wraparound VACUUMs every time we reach autovacuum_freeze_max_age,
again and again -- no change, really.
But since it's only these extreme cases that continue to get
aggressive VACUUMs, why do we still need vacuum_freeze_table_age? It
helps right now (without the patch) by "escalating" a regular VACUUM
to an aggressive one. But the cases that we still expect an aggressive
VACUUM (with the patch) are the cases where there is zero chance of
that happening. Almost by definition.
Given the opportunistic freezing, that's true but I'm concerned
whether opportunistic freezing always works well on all tables since
freezing tuples is not 0 cost.
That is the big question for this patch.
--
Peter Geoghegan
On Mon, Dec 20, 2021 at 9:35 PM Peter Geoghegan <pg@bowt.ie> wrote:
Given the opportunistic freezing, that's true but I'm concerned
whether opportunistic freezing always works well on all tables since
freezing tuples is not 0 cost.
That is the big question for this patch.
Attached is a mechanical rebase of the patch series. This new version
just fixes bitrot, caused by Masahiko's recent vacuumlazy.c
refactoring work. In other words, this revision has no significant
changes compared to the v4 that I posted back in late December -- just
want to keep CFTester green.
I still have plenty of work to do here. Especially with the final
patch (the v5-0005-* "freeze early" patch), which is generally more
speculative than the other patches. I'm playing catch-up now, since I
just returned from vacation.
--
Peter Geoghegan
Attachments:
v5-0002-Improve-log_autovacuum_min_duration-output.patch (application/x-patch)
From ee88cec87714420f195bc05d3a1e154c1aa759d6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 21 Nov 2021 14:47:11 -0800
Subject: [PATCH v5 2/5] Improve log_autovacuum_min_duration output.
Add instrumentation of "missed dead tuples", and the number of pages
that had at least one such tuple. These are fully DEAD (not just
RECENTLY_DEAD) tuples with storage that could not be pruned due to an
inability to acquire a cleanup lock. This is a replacement for the
"skipped due to pin" instrumentation removed by the previous commit.
Note that the new instrumentation doesn't say anything about pages that
we failed to acquire a cleanup lock on when we see that there were no
missed dead tuples on the page.
Also report on visibility map pages skipped by VACUUM, without regard
for whether the pages were all-frozen or just all-visible.
Also report when and how relfrozenxid is advanced by VACUUM, including
non-aggressive VACUUM. Apart from being useful on its own, this might
enable future work that teaches non-aggressive VACUUM to be more
concerned about advancing relfrozenxid sooner rather than later.
Also report number of tuples frozen. This will become more important
when the later patch to perform opportunistic tuple freezing is
committed.
Also enhance how we report OldestXmin cutoff by putting it in context:
show how far behind it is at the _end_ of the VACUUM operation.
Deliberately don't do anything with VACUUM VERBOSE in this commit, since
a pending patch will generalize the log_autovacuum_min_duration code to
produce VACUUM VERBOSE output as well [1]. That'll get committed first.
[1] https://commitfest.postgresql.org/36/3431/
---
src/include/commands/vacuum.h | 2 +
src/backend/access/heap/vacuumlazy.c | 108 +++++++++++++++++++--------
src/backend/commands/analyze.c | 3 +
src/backend/commands/vacuum.c | 9 +++
4 files changed, 91 insertions(+), 31 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5a36049be..772a257fc 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -283,6 +283,8 @@ extern void vac_update_relstats(Relation relation,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated,
+ bool *minmulti_updated,
bool in_outer_xact);
extern void vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 148129e59..2950df1ce 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -197,6 +197,7 @@ typedef struct LVRelState
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
/* Statistics output by us, for table */
@@ -209,9 +210,10 @@ typedef struct LVRelState
int num_index_scans;
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # frozen by us */
int64 lpdead_items; /* # deleted from indexes */
- int64 new_dead_tuples; /* new estimated total # of dead items in
- * table */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
int64 num_tuples; /* total number of nonremovable tuples */
int64 live_tuples; /* live tuples (reltuples estimate) */
} LVRelState;
@@ -317,6 +319,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
write_rate;
bool aggressive,
skipwithvm;
+ bool frozenxid_updated,
+ minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
@@ -535,9 +539,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
+ frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId, false);
+ InvalidTransactionId, InvalidMultiXactId,
+ NULL, NULL, false);
}
else
{
@@ -546,7 +552,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff, false);
+ FreezeLimit, MultiXactCutoff,
+ &frozenxid_updated, &minmulti_updated, false);
}
/*
@@ -562,7 +569,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(new_live_tuples, 0),
- vacrel->new_dead_tuples);
+ vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples);
pgstat_progress_end_command();
/* and log the action if appropriate */
@@ -576,6 +584,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
+ int32 diff;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -622,16 +631,41 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped using visibility map (%.2f%% of total)\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->frozenskipped_pages);
+ orig_rel_pages - vacrel->scanned_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * (orig_rel_pages - vacrel->scanned_pages) / orig_rel_pages);
appendStringInfo(&buf,
- _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+ _("tuples: %lld removed, %lld remain (%lld newly frozen), %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->new_dead_tuples,
- OldestXmin);
+ (long long) vacrel->tuples_frozen,
+ (long long) vacrel->recently_dead_tuples);
+ if (vacrel->missed_dead_tuples > 0)
+ appendStringInfo(&buf,
+ _("tuples missed: %lld dead from %u contended pages\n"),
+ (long long) vacrel->missed_dead_tuples,
+ vacrel->missed_dead_pages);
+ diff = (int32) (ReadNextTransactionId() - OldestXmin);
+ appendStringInfo(&buf,
+ _("removable cutoff: %u, which is %d xids behind next\n"),
+ OldestXmin, diff);
+ if (frozenxid_updated)
+ {
+ diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ appendStringInfo(&buf,
+ _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
+ FreezeLimit, diff);
+ }
+ if (minmulti_updated)
+ {
+ diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ appendStringInfo(&buf,
+ _("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
+ MultiXactCutoff, diff);
+ }
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -787,13 +821,16 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->frozenskipped_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
+ vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
/* Initialize instrumentation counters */
vacrel->num_index_scans = 0;
vacrel->tuples_deleted = 0;
+ vacrel->tuples_frozen = 0;
vacrel->lpdead_items = 0;
- vacrel->new_dead_tuples = 0;
+ vacrel->recently_dead_tuples = 0;
+ vacrel->missed_dead_tuples = 0;
vacrel->num_tuples = 0;
vacrel->live_tuples = 0;
@@ -1340,7 +1377,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* (unlikely) scenario that new_live_tuples is -1, take it as zero.
*/
vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
+ Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples;
/*
* Release any remaining pin on visibility map page.
@@ -1405,7 +1443,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
initStringInfo(&buf);
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
- (long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
+ (long long) vacrel->recently_dead_tuples,
+ vacrel->OldestXmin);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1586,7 +1625,7 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
lpdead_items,
- new_dead_tuples,
+ recently_dead_tuples,
num_tuples,
live_tuples;
int nnewlpdead;
@@ -1603,7 +1642,7 @@ retry:
/* Initialize (or reset) page-level counters */
tuples_deleted = 0;
lpdead_items = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
num_tuples = 0;
live_tuples = 0;
@@ -1762,11 +1801,11 @@ retry:
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * If tuple is recently deleted then we must not remove it
- * from relation. (We only remove items that are LP_DEAD from
+ * If tuple is recently dead then we must not remove it from
+ * the relation. (We only remove items that are LP_DEAD from
* pruning.)
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
prunestate->all_visible = false;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1941,8 +1980,9 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
+ vacrel->tuples_frozen += nfrozen;
vacrel->lpdead_items += lpdead_items;
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
}
@@ -1995,7 +2035,8 @@ lazy_scan_noprune(LVRelState *vacrel,
int lpdead_items,
num_tuples,
live_tuples,
- new_dead_tuples;
+ recently_dead_tuples,
+ missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
@@ -2007,7 +2048,8 @@ lazy_scan_noprune(LVRelState *vacrel,
lpdead_items = 0;
num_tuples = 0;
live_tuples = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
+ missed_dead_tuples = 0;
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
@@ -2081,16 +2123,15 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* There is some useful work for pruning to do, that won't be
* done due to failure to get a cleanup lock.
- *
- * TODO Add dedicated instrumentation for this case
*/
+ missed_dead_tuples++;
break;
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * Count in new_dead_tuples, just like lazy_scan_prune
+ * Count in recently_dead_tuples, just like lazy_scan_prune
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -2118,13 +2159,15 @@ lazy_scan_noprune(LVRelState *vacrel,
*
* We are not prepared to handle the corner case where a single pass
* strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
- * items.
+ * items. Count the LP_DEAD items as missed_dead_tuples instead. This
+ * is slightly dishonest, but it's better than maintaining code to do
+ * heap vacuuming for this one narrow corner case.
*/
if (lpdead_items > 0)
*hastup = true;
*hasfreespace = true;
num_tuples += lpdead_items;
- /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ missed_dead_tuples += lpdead_items;
}
else if (lpdead_items > 0)
{
@@ -2159,9 +2202,12 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
+ vacrel->missed_dead_tuples += missed_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
+ if (missed_dead_tuples > 0)
+ vacrel->missed_dead_pages++;
/* Caller won't need to call lazy_scan_prune with same page */
return true;
@@ -2234,8 +2280,8 @@ lazy_vacuum(LVRelState *vacrel)
* dead_items space is not CPU cache resident.
*
* We don't take any special steps to remember the LP_DEAD items (such
- * as counting them in new_dead_tuples report to the stats collector)
- * when the optimization is applied. Though the accounting used in
+ * as counting them in our final report to the stats collector) when
+ * the optimization is applied. Though the accounting used in
* analyze.c's acquire_sample_rows() will recognize the same LP_DEAD
* items as dead rows in its own stats collector report, that's okay.
* The discrepancy should be negligible. If this optimization is ever
@@ -3375,7 +3421,7 @@ update_index_statistics(LVRelState *vacrel)
false,
InvalidTransactionId,
InvalidMultiXactId,
- false);
+ NULL, NULL, false);
}
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cd77907fc..afd1cb8f5 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -651,6 +651,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
/* Same for indexes */
@@ -667,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
false,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
}
@@ -679,6 +681,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c94c187d3..d1d38d509 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1315,6 +1315,7 @@ vac_update_relstats(Relation relation,
BlockNumber num_all_visible_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated, bool *minmulti_updated,
bool in_outer_xact)
{
Oid relid = RelationGetRelid(relation);
@@ -1390,22 +1391,30 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ if (frozenxid_updated)
+ *frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
/* Similarly for relminmxid */
+ if (minmulti_updated)
+ *minmulti_updated = false;
if (MultiXactIdIsValid(minmulti) &&
pgcform->relminmxid != minmulti &&
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
{
+ if (minmulti_updated)
+ *minmulti_updated = true;
pgcform->relminmxid = minmulti;
dirty = true;
}
--
2.30.2
v5-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch (application/x-patch)
From be9b33339f0494ffd3b2327322433bacf72a5428 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Nov 2021 10:02:30 -0800
Subject: [PATCH v5 4/5] Loosen coupling between relfrozenxid and tuple
freezing.
Stop using tuple freezing (and MultiXact freezing) tuple header cutoffs
to determine the final relfrozenxid (and relminmxid) values that we set
for heap relations in pg_class. Use "optimal" values instead.
Optimal values are the most recent values that are less than or equal to
any remaining XID/MultiXact in a tuple header (not counting frozen
xmin/xmax values). This is now kept track of by VACUUM. "Optimal"
values are always >= the tuple header FreezeLimit in an aggressive
VACUUM. For a non-aggressive VACUUM, they can be less than or greater
than the tuple header FreezeLimit cutoff (though we still often pass
invalid values to indicate that we cannot advance relfrozenxid during
the VACUUM).
---
src/include/access/heapam.h | 4 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 186 ++++++++++++++++++++-------
src/backend/access/heap/vacuumlazy.c | 78 +++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 34 ++++-
7 files changed, 231 insertions(+), 81 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f3fb1e93a..bc5a96796 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -168,7 +168,9 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf);
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index ab9e873bc..b0ede623e 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 1848a65df..db813af99 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a1bacb0eb..2f3399265 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6087,12 +6087,24 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "NewRelfrozenxid" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain NewRelfrozenxid. We need to
+ * push maintenance of NewRelfrozenxid down this far, since in general xmin
+ * might have been frozen by an earlier VACUUM operation, in which case our
+ * caller will not have factored-in xmin when maintaining NewRelfrozenxid.
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *NewRelfrozenxid)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6104,6 +6116,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId tempNewRelfrozenxid;
*flags = 0;
@@ -6198,13 +6211,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ tempNewRelfrozenxid = *NewRelfrozenxid;
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
}
/*
@@ -6213,6 +6226,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *NewRelfrozenxid = tempNewRelfrozenxid;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6222,6 +6236,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ tempNewRelfrozenxid = *NewRelfrozenxid;
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6303,7 +6318,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
+ }
}
else
{
@@ -6313,6 +6332,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6341,6 +6361,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Caller manages NewRelfrozenxid directly when we return an XID */
}
else
{
@@ -6350,6 +6371,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *NewRelfrozenxid = tempNewRelfrozenxid;
}
pfree(newmembers);
@@ -6368,6 +6390,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. The assumption is
+ * that the caller will actually go on to freeze as indicated by our *frz
+ * output, so any (xmin, xmax, xvac) XIDs that we indicate need to be frozen
+ * won't need to be counted here. The values are valid lower bounds at the
+ * point that the ongoing VACUUM finishes.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6392,7 +6421,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6436,6 +6467,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
}
/*
@@ -6453,10 +6489,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId temp = *NewRelfrozenxid;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi, &flags, &temp);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6474,6 +6511,24 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+
+ if (TransactionIdPrecedes(newxmax, *NewRelfrozenxid))
+ {
+ /* New xmax is an XID older than the current NewRelfrozenxid */
+ *NewRelfrozenxid = newxmax;
+ }
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * Changing nothing, so might have to ratchet back NewRelminmxid,
+ * NewRelfrozenxid, or both together
+ */
+ if (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, *NewRelminmxid))
+ *NewRelminmxid = xid;
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6495,6 +6550,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+
+ /*
+ * New multixact might have a remaining XID older than
+ * NewRelfrozenxid
+ */
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6522,7 +6584,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6569,6 +6638,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make sure to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * Since we always freeze here, NewRelfrozenxid doesn't need to be
+ * maintained.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6646,11 +6718,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId NewRelfrozenxid = FirstNormalTransactionId;
+ MultiXactId NewRelminmxid = FirstMultiXactId;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &NewRelfrozenxid, &NewRelminmxid);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7080,6 +7155,15 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. The assumption is
+ * that the caller will never freeze any of the XIDs from the tuple, even
+ * when we indicate that they should be frozen. If the caller opts to go
+ * with our recommendation to freeze, then it must account for the fact that
+ * it shouldn't trust how we've set NewRelfrozenxid/NewRelminmxid. (In
+ * practice, aggressive VACUUMs always take our recommendation because they
+ * must, and non-aggressive VACUUMs always opt not to freeze, preferring to
+ * ratchet back NewRelfrozenxid instead.)
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
@@ -7088,74 +7172,86 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
*/
bool
heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf)
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf)
{
TransactionId xid;
+ bool needs_freeze = false;
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
/*
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
+ *
+ * (Actually, we maintain NewRelminmxid differently here, because we
+ * assume that XIDs that should be frozen according to cutoff_xid won't
+ * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
*/
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ if (MultiXactIdIsValid(multi) &&
+ MultiXactIdPrecedes(multi, *NewRelminmxid))
+ *NewRelminmxid = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
return true;
else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+ needs_freeze = true;
+
+ /* need to check whether any member of the mxact is too old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = members[i].xid;
}
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 7614d6108..eade44ed0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -171,8 +171,10 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+
+ /* Track new pg_class.relfrozenxid/pg_class.relminmxid values */
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
/* Error reporting state */
char *relnamespace;
@@ -330,6 +332,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -366,8 +369,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -432,8 +435,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+
+ /* Initialize values used to advance relfrozenxid/relminmxid at the end */
+ vacrel->NewRelfrozenxid = OldestXmin;
+ vacrel->NewRelminmxid = OldestMxact;
vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
vacrel->relname = pstrdup(RelationGetRelationName(rel));
@@ -522,16 +527,18 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might only be able to
+ * advance relfrozenxid to an XID from before FreezeLimit (or a relminmxid
+ * from before MultiXactCutoff) when it wasn't possible to freeze some
+ * tuples due to our inability to acquire a cleanup lock, but the effect
+ * is usually insignificant -- the NewRelfrozenxid value still has a decent
+ * chance of being much more recent than the existing relfrozenxid.
*
* NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
* the rel_pages used by lazy_scan_heap, which won't match when we
* happened to truncate the relation afterwards.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
@@ -548,7 +555,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenxid, vacrel->NewRelminmxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -650,17 +657,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenxid - vacrel->relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenxid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminmxid - vacrel->relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminmxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1628,6 +1635,8 @@ lazy_scan_prune(LVRelState *vacrel,
int nfrozen;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1636,6 +1645,8 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
/* Initialize (or reset) page-level counters */
+ NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ NewRelminmxid = vacrel->NewRelminmxid;
tuples_deleted = 0;
lpdead_items = 0;
recently_dead_tuples = 0;
@@ -1845,7 +1856,9 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenxid,
+ &NewRelminmxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1859,13 +1872,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -2009,9 +2025,9 @@ retry:
* We'll always return true for a non-aggressive VACUUM, even when we know
* that this will cause them to miss out on freezing tuples from before
* vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
- * lock. This does mean that they definitely won't be able to advance
- * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
- * relminmxid). Caller waits for full cleanup lock when we return false.
+ * lock. This does mean that NewRelfrozenxid may be ratcheted back to a
+ * known-safe (but older) value (the same applies to NewRelminmxid). Caller
+ * waits for a full cleanup lock when we return false.
*
* See lazy_scan_prune for an explanation of hastup return flag. The
* hasfreespace flag instructs caller on whether or not it should do generic
@@ -2035,6 +2051,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ MultiXactId NewRelminmxid = vacrel->NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2081,7 +2099,8 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenxid, &NewRelminmxid, buf))
{
if (vacrel->aggressive)
{
@@ -2091,10 +2110,11 @@ lazy_scan_noprune(LVRelState *vacrel,
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * A non-aggressive VACUUM doesn't have to wait on a cleanup lock
+ * to ensure that it advances relfrozenxid to a sufficiently
+ * recent XID that happens to be present on this page. It can
+ * just accept an older final NewRelfrozenxid value instead.
*/
- vacrel->freeze_cutoffs_valid = false;
}
num_tuples++;
@@ -2144,6 +2164,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * We have committed to not freezing the tuples on this page (always
+ * happens with a non-aggressive VACUUM), so make sure that the target
+ * relfrozenxid/relminmxid values reflect the XIDs/MXIDs we encountered
+ */
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy).
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 66b87347d..6bd6688ae 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddf6279c7..c39e8088a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -950,10 +950,28 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
* - multiXactCutoff is the value below which all MultiXactIds are removed
* from Xmax.
+ *
+ * oldestXmin and oldestMxact can be thought of as the most recent values that
+ * can ever be passed to vac_update_relstats() as frozenxid and minmulti
+ * arguments. These exact values will be used when no XIDs or MultiXacts
+ * older than them remain in the heap relation (e.g., with an empty table).
+ * It's typical for the vacuumlazy.c caller to notice that older
+ * XIDs/MultiXacts remain in the table, which will force it to use an older
+ * value. These older final values may not be any newer than the preexisting
+ * frozenxid/minmulti values from pg_class in extreme cases. The final
+ * values are frequently fairly close to the optimal values that we give to
+ * vacuumlazy.c, though.
+ *
+ * An aggressive VACUUM always provides vac_update_relstats() arguments that
+ * are >= freezeLimit and >= multiXactCutoff. A non-aggressive VACUUM may
+ * provide arguments that are either newer or older than freezeLimit and
+ * multiXactCutoff, or invalid values (indicating that the pg_class-level
+ * cutoffs cannot be advanced at all).
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -962,6 +980,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -970,7 +989,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1066,9 +1084,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1083,8 +1103,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
--
2.30.2
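To make the tracking convention in the relfrozenxid patch above easier to
follow, here is a rough standalone sketch (not PostgreSQL code -- simplified
stand-in types, 64-bit XIDs, wraparound ignored) of how NewRelfrozenxid is
maintained: it starts out at OldestXmin, the most aggressive value we could
possibly use, and is ratcheted back by every XID that will remain unfrozen in
the table once the VACUUM finishes.
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
typedef uint64_t XidType;       /* stand-in for TransactionId; no wraparound */
/* Ratchet the tracked value back if an unfrozen XID is older than it */
static void
maintain_new_relfrozenxid(XidType unfrozen_xid, XidType *new_relfrozenxid)
{
    if (unfrozen_xid < *new_relfrozenxid)
        *new_relfrozenxid = unfrozen_xid;
}
int
main(void)
{
    XidType     OldestXmin = 1000;  /* best case: nothing older remains */
    XidType     new_relfrozenxid = OldestXmin;
    XidType     remaining_xids[] = {950, 990, 975};     /* XIDs left unfrozen */
    for (size_t i = 0; i < sizeof(remaining_xids) / sizeof(remaining_xids[0]); i++)
        maintain_new_relfrozenxid(remaining_xids[i], &new_relfrozenxid);
    /* The oldest remaining XID (950) bounds the final value */
    printf("final NewRelfrozenxid: %llu\n",
           (unsigned long long) new_relfrozenxid);
    return 0;
}
The same running-minimum idea is applied to MultiXact members (for
relminmxid) in the patch; the only subtlety is deciding which XIDs count as
"remaining", which is what the FreezeMultiXactId and
heap_prepare_freeze_tuple changes above sort out.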
Attachment: v5-0005-Freeze-tuples-early-to-advance-relfrozenxid.patch (application/x-patch)
From 6bbaf3b1a874e84a66b6dfa806c57a30f70f054a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v5 5/5] Freeze tuples early to advance relfrozenxid.
Freeze whenever pruning modified the page, or whenever we see that we're
going to mark the page all-visible without also marking it all-frozen.
There has been plenty of discussion of early/opportunistic freezing in
the past. It is generally considered important as a way of minimizing
repeated dirtying of heap pages (or the total volume of FPIs in the WAL
stream) over time. While that goal is certainly very important, this
patch has another priority: making VACUUM advance relfrozenxid sooner
and more frequently.
The overall effect is that tables like pgbench's history table can be
vacuumed very frequently, and have most individual vacuum operations
generate 0 FPIs in WAL -- they will never need an aggressive VACUUM.
GUCs like vacuum_freeze_min_age never made much sense after the freeze
map work in PostgreSQL 9.6. The default is 50 million transactions,
which currently tends to result in our being unable to freeze tuples
before the page is marked all-visible (but not all-frozen). This
creates a huge performance cliff later on, during the first aggressive
VACUUM. Freezing early effectively avoids accumulating "debt" from very
old unfrozen tuples.
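To make the trigger conditions concrete, here is a minimal standalone sketch
(not the patch's actual code -- illustrative names and stand-in types only)
of the rule for switching from the normal FreezeLimit cutoff to the
aggressive OldestXmin cutoff on a given page:
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>
typedef uint32_t XidType;       /* stand-in for TransactionId */
typedef struct FreezeCutoffs
{
    XidType     FreezeLimit;    /* normal freeze_min_age-based cutoff */
    XidType     OldestXmin;     /* most aggressive cutoff possible */
} FreezeCutoffs;
/*
 * Early (aggressive) freezing is chosen when pruning already dirtied the
 * page, when some tuples will be frozen regardless, or when the page would
 * become all-visible without also becoming all-frozen.
 */
static XidType
choose_freeze_cutoff(const FreezeCutoffs *cutoffs,
                     bool page_modified_by_pruning,
                     int nfrozen, int num_tuples,
                     bool all_visible, bool all_frozen)
{
    bool        freeze_early = page_modified_by_pruning ||
        (nfrozen > 0 && nfrozen < num_tuples) ||
        (all_visible && !all_frozen);
    return freeze_early ? cutoffs->OldestXmin : cutoffs->FreezeLimit;
}
int
main(void)
{
    FreezeCutoffs cutoffs = {.FreezeLimit = 1000, .OldestXmin = 5000};
    /* Page dirtied by pruning: use the aggressive cutoff */
    printf("%u\n", (unsigned) choose_freeze_cutoff(&cutoffs, true, 0, 10, false, false));
    /* Untouched page, nothing forcing freezing: use the normal cutoff */
    printf("%u\n", (unsigned) choose_freeze_cutoff(&cutoffs, false, 0, 10, false, false));
    return 0;
}
In the patch itself the decision is made in two steps (right after
heap_page_prune, and again once the page's items have been scanned, with a
goto retry), but the overall predicate boils down to the above.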
---
src/include/access/heapam.h | 1 +
src/backend/access/heap/pruneheap.c | 8 ++-
src/backend/access/heap/vacuumlazy.c | 87 +++++++++++++++++++++++++---
3 files changed, 88 insertions(+), 8 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc5a96796..9eaa365df 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
struct GlobalVisState *vistest,
TransactionId old_snap_xmin,
TimestampTz old_snap_ts_ts,
+ bool *modified,
int *nnewlpdead,
OffsetNumber *off_loc);
extern void heap_page_prune_execute(Buffer buffer,
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 522a00af6..e95dea38d 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -182,11 +182,12 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
*/
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
+ bool modified;
int ndeleted,
nnewlpdead;
ndeleted = heap_page_prune(relation, buffer, vistest, limited_xmin,
- limited_ts, &nnewlpdead, NULL);
+ limited_ts, &modified, &nnewlpdead, NULL);
/*
* Report the number of tuples reclaimed to pgstats. This is
@@ -244,6 +245,7 @@ heap_page_prune(Relation relation, Buffer buffer,
GlobalVisState *vistest,
TransactionId old_snap_xmin,
TimestampTz old_snap_ts,
+ bool *modified,
int *nnewlpdead,
OffsetNumber *off_loc)
{
@@ -375,6 +377,8 @@ heap_page_prune(Relation relation, Buffer buffer,
PageSetLSN(BufferGetPage(buffer), recptr);
}
+
+ *modified = true;
}
else
{
@@ -387,12 +391,14 @@ heap_page_prune(Relation relation, Buffer buffer,
* point in repeating the prune/defrag process until something else
* happens to the page.
*/
+ *modified = false;
if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
PageIsFull(page))
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
MarkBufferDirtyHint(buffer, true);
+ *modified = true;
}
}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index eade44ed0..a10faea6e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -168,6 +168,7 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for pruning */
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -358,11 +359,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Get cutoffs that determine which tuples we need to freeze during the
- * VACUUM operation.
+ * VACUUM operation. This includes information that is used during
+ * opportunistic freezing, where the most aggressive possible cutoffs
+ * (OldestXmin and OldestMxact) are used for some heap pages, based on
+ * considerations about cost.
*
* Also determines if this is to be an aggressive VACUUM. This will
* eventually be required for any table where (for whatever reason) no
* non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
+ * This used to be much more common, but we now work hard to advance
+ * relfrozenxid in non-aggressive VACUUMs.
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
@@ -433,6 +439,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Set cutoffs for entire VACUUM */
vacrel->OldestXmin = OldestXmin;
+ vacrel->OldestMxact = OldestMxact;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
@@ -1637,6 +1644,10 @@ lazy_scan_prune(LVRelState *vacrel,
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
TransactionId NewRelfrozenxid;
MultiXactId NewRelminmxid;
+ bool modified;
+ TransactionId FreezeLimit = vacrel->FreezeLimit;
+ MultiXactId MultiXactCutoff = vacrel->MultiXactCutoff;
+ bool earlyfreezing = false;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1663,8 +1674,19 @@ retry:
* that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vistest,
- InvalidTransactionId, 0, &nnewlpdead,
- &vacrel->offnum);
+ InvalidTransactionId, 0, &modified,
+ &nnewlpdead, &vacrel->offnum);
+
+ /*
+ * If the page was modified during pruning, then perform early freezing
+ * opportunistically.
+ */
+ if (!earlyfreezing && modified)
+ {
+ earlyfreezing = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ }
/*
* Now scan the page to collect LP_DEAD items and check for tuples
@@ -1719,7 +1741,7 @@ retry:
if (ItemIdIsDead(itemid))
{
deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
+ /* Don't set all_visible to false just yet */
prunestate->has_lpdead_items = true;
continue;
}
@@ -1853,8 +1875,8 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
- vacrel->FreezeLimit,
- vacrel->MultiXactCutoff,
+ FreezeLimit,
+ MultiXactCutoff,
&frozen[nfrozen],
&tuple_totally_frozen,
&NewRelfrozenxid,
@@ -1874,6 +1896,57 @@ retry:
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * Reconsider applying early freezing before committing to processing the
+ * page as currently planned. There are 2 reasons to change our mind:
+ *
+ * 1. The standard FreezeLimit cutoff generally indicates that we should
+ * freeze XIDs that are more than freeze_min_age XIDs in the past
+ * (relative to OldestXmin). But that should only be treated as a rough
+ * guideline; it makes sense to freeze all eligible tuples on pages where
+ * we're going to freeze at least one in any case.
+ *
+ * 2. If the page is now eligible to be marked all_visible, but is not
+ * also eligible to be marked all_frozen, then we freeze early to make
+ * sure that the page becomes all_frozen. We should avoid building up
+ * "freeze debt" that can only be paid off by an aggressive VACUUM, later
+ * on. This makes it much less likely that an aggressive VACUUM will ever
+ * be required.
+ *
+ * Note: We deliberately track all_visible in a way that excludes LP_DEAD
+ * items here. Any page that is "all_visible for tuples with storage"
+ * will be eligible to have its visibility map bit set during the ongoing
+ * VACUUM, one way or another. LP_DEAD items only make it unsafe to set
+ * the page all_visible during the first heap pass, but the second heap
+ * pass should be able to perform equivalent processing. (The second heap
+ * pass cannot freeze tuples, though.)
+ */
+ if (!earlyfreezing &&
+ ((nfrozen > 0 && nfrozen < num_tuples) ||
+ (prunestate->all_visible && !prunestate->all_frozen)))
+ {
+ /*
+ * XXX Need to worry about leaking MultiXacts in FreezeMultiXactId()
+ * now (via heap_prepare_freeze_tuple calls)? That was already
+ * possible, but presumably this makes it much more likely.
+ *
+ * On the other hand, that's only possible when we need to replace an
+ * existing MultiXact with a new one. Even then, we won't have
+ * preallocated a new MultiXact (which we now risk leaking) if there
+ * was only one remaining XID, and the XID is for an updater (we'll
+ * only prepare to replace xmax with the XID directly). So maybe it's
+ * still a narrow enough problem to be ignored.
+ */
+ earlyfreezing = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ goto retry;
+ }
+
+ /* Time to define all_visible in a way that accounts for LP_DEAD items */
+ if (lpdead_items > 0)
+ prunestate->all_visible = false;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
@@ -1919,7 +1992,7 @@ retry:
{
XLogRecPtr recptr;
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+ recptr = log_heap_freeze(vacrel->rel, buf, FreezeLimit,
frozen, nfrozen);
PageSetLSN(page, recptr);
}
--
2.30.2
Attachment: v5-0003-Simplify-vacuum_set_xid_limits-signature.patch (application/x-patch)
From 13980cd0db4278d4d7c5ce9f9b048dff57019589 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 11 Dec 2021 17:39:45 -0800
Subject: [PATCH v5 3/5] Simplify vacuum_set_xid_limits signature.
Refactoring, making the return value of vacuum_set_xid_limits()
determine whether or not this will be an aggressive VACUUM.
This will make it easier to set/return an oldestMxact value for
the vacuumlazy.c caller in the next commit, which is an important detail
that enables advancing relminmxid opportunistically.
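As a rough sketch of the new calling convention (a standalone model, not the
patch's code -- simplified types, no wraparound handling, no clamping against
autovacuum_freeze_max_age), the limits function now reports aggressiveness
directly instead of exporting separate full-table-scan limits:
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>
typedef uint64_t XidType;       /* stand-in for TransactionId; no wraparound */
/*
 * Set the freeze cutoff and report whether the caller must run an
 * aggressive VACUUM (relfrozenxid has crossed the table-age limit).
 */
static bool
set_xid_limits(XidType next_xid, XidType relfrozenxid,
               XidType freeze_min_age, XidType freeze_table_age,
               XidType *freeze_limit)
{
    *freeze_limit = next_xid - freeze_min_age;
    return relfrozenxid <= next_xid - freeze_table_age;
}
int
main(void)
{
    XidType     freeze_limit;
    bool        aggressive = set_xid_limits(200000000, 1000,
                                            50000000, 150000000, &freeze_limit);
    printf("aggressive=%d freeze_limit=%llu\n",
           aggressive, (unsigned long long) freeze_limit);
    return 0;
}
With this shape the vacuumlazy.c call site collapses to a single
"aggressive = vacuum_set_xid_limits(...)" assignment, as the hunk below
shows.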
---
src/include/commands/vacuum.h | 6 +-
src/backend/access/heap/vacuumlazy.c | 32 +++----
src/backend/commands/cluster.c | 3 +-
src/backend/commands/vacuum.c | 134 +++++++++++++--------------
4 files changed, 79 insertions(+), 96 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 772a257fc..1848a65df 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -286,15 +286,13 @@ extern void vac_update_relstats(Relation relation,
bool *frozenxid_updated,
bool *minmulti_updated,
bool in_outer_xact);
-extern void vacuum_set_xid_limits(Relation rel,
+extern bool vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit);
+ MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
MultiXactId relminmxid);
extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2950df1ce..7614d6108 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -323,8 +323,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
- TransactionId xidFullScanLimit;
- MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
@@ -355,24 +353,22 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
RelationGetRelid(rel));
- vacuum_set_xid_limits(rel,
- params->freeze_min_age,
- params->freeze_table_age,
- params->multixact_freeze_min_age,
- params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit, &xidFullScanLimit,
- &MultiXactCutoff, &mxactFullScanLimit);
-
/*
- * We request an aggressive scan if the table's frozen Xid is now older
- * than or equal to the requested Xid full-table scan limit; or if the
- * table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
+ * Get cutoffs that determine which tuples we need to freeze during the
+ * VACUUM operation.
+ *
+ * Also determines if this is to be an aggressive VACUUM. This will
+ * eventually be required for any table where (for whatever reason) no
+ * non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
*/
- aggressive = TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
- xidFullScanLimit);
- aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
- mxactFullScanLimit);
+ aggressive = vacuum_set_xid_limits(rel,
+ params->freeze_min_age,
+ params->freeze_table_age,
+ params->multixact_freeze_min_age,
+ params->multixact_freeze_table_age,
+ &OldestXmin, &FreezeLimit,
+ &MultiXactCutoff);
+
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 9d22f648a..66b87347d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -857,8 +857,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* not to be aggressive about this.
*/
vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
- NULL);
+ &OldestXmin, &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d1d38d509..ddf6279c7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -942,25 +942,20 @@ get_all_vacuum_rels(int options)
*
* Input parameters are the target relation, applicable freeze age settings.
*
+ * The return value indicates whether the caller should do an aggressive
+ * VACUUM or not. An aggressive VACUUM cannot skip any pages using the
+ * visibility map (except all-frozen pages), and so is guaranteed to be able
+ * to advance relfrozenxid and relminmxid.
+ *
* The output parameters are:
- * - oldestXmin is the cutoff value used to distinguish whether tuples are
- * DEAD or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * - oldestXmin is the Xid below which tuples deleted by any xact (that
+ * committed) should be considered DEAD, not just RECENTLY_DEAD.
* - freezeLimit is the Xid below which all Xids are replaced by
* FrozenTransactionId during vacuum.
- * - xidFullScanLimit (computed from freeze_table_age parameter)
- * represents a minimum Xid value; a table whose relfrozenxid is older than
- * this will have a full-table vacuum applied to it, to freeze tuples across
- * the whole table. Vacuuming a table younger than this value can use a
- * partial scan.
- * - multiXactCutoff is the value below which all MultiXactIds are removed from
- * Xmax.
- * - mxactFullScanLimit is a value against which a table's relminmxid value is
- * compared to produce a full-table vacuum, as with xidFullScanLimit.
- *
- * xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is
- * not interested.
+ * - multiXactCutoff is the value below which all MultiXactIds are removed
+ * from Xmax.
*/
-void
+bool
vacuum_set_xid_limits(Relation rel,
int freeze_min_age,
int freeze_table_age,
@@ -968,9 +963,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit)
+ MultiXactId *multiXactCutoff)
{
int freezemin;
int mxid_freezemin;
@@ -980,6 +973,7 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
+ int freezetable;
/*
* We can always ignore processes running lazy vacuum. This is because we
@@ -1097,64 +1091,60 @@ vacuum_set_xid_limits(Relation rel,
*multiXactCutoff = mxactLimit;
- if (xidFullScanLimit != NULL)
- {
- int freezetable;
+ /*
+ * Done setting output parameters; just need to figure out if caller needs
+ * to do an aggressive VACUUM or not.
+ *
+ * Determine the table freeze age to use: as specified by the caller, or
+ * vacuum_freeze_table_age, but in any case not more than
+ * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
+ * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
+ * before anti-wraparound autovacuum is launched.
+ */
+ freezetable = freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_freeze_table_age;
+ freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- Assert(mxactFullScanLimit != NULL);
+ /*
+ * Compute XID limit causing an aggressive vacuum, being careful not to
+ * generate a "permanent" XID
+ */
+ limit = ReadNextTransactionId() - freezetable;
+ if (!TransactionIdIsNormal(limit))
+ limit = FirstNormalTransactionId;
+ if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
+ limit))
+ return true;
- /*
- * Determine the table freeze age to use: as specified by the caller,
- * or vacuum_freeze_table_age, but in any case not more than
- * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
- * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
- * before anti-wraparound autovacuum is launched.
- */
- freezetable = freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_freeze_table_age;
- freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
+ /*
+ * Similar to the above, determine the table freeze age to use for
+ * multixacts: as specified by the caller, or
+ * vacuum_multixact_freeze_table_age, but in any case not more than
+ * autovacuum_multixact_freeze_table_age * 0.95, so that if you have e.g.
+ * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
+ * multixacts before anti-wraparound autovacuum is launched.
+ */
+ freezetable = multixact_freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_multixact_freeze_table_age;
+ freezetable = Min(freezetable,
+ effective_multixact_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- /*
- * Compute XID limit causing a full-table vacuum, being careful not to
- * generate a "permanent" XID.
- */
- limit = ReadNextTransactionId() - freezetable;
- if (!TransactionIdIsNormal(limit))
- limit = FirstNormalTransactionId;
+ /*
+ * Compute MultiXact limit causing an aggressive vacuum, being careful to
+ * generate a valid MultiXact value
+ */
+ mxactLimit = ReadNextMultiXactId() - freezetable;
+ if (mxactLimit < FirstMultiXactId)
+ mxactLimit = FirstMultiXactId;
+ if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
+ mxactLimit))
+ return true;
- *xidFullScanLimit = limit;
-
- /*
- * Similar to the above, determine the table freeze age to use for
- * multixacts: as specified by the caller, or
- * vacuum_multixact_freeze_table_age, but in any case not more than
- * autovacuum_multixact_freeze_table_age * 0.95, so that if you have
- * e.g. nightly VACUUM schedule, the nightly VACUUM gets a chance to
- * freeze multixacts before anti-wraparound autovacuum is launched.
- */
- freezetable = multixact_freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_multixact_freeze_table_age;
- freezetable = Min(freezetable,
- effective_multixact_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
-
- /*
- * Compute MultiXact limit causing a full-table vacuum, being careful
- * to generate a valid MultiXact value.
- */
- mxactLimit = ReadNextMultiXactId() - freezetable;
- if (mxactLimit < FirstMultiXactId)
- mxactLimit = FirstMultiXactId;
-
- *mxactFullScanLimit = mxactLimit;
- }
- else
- {
- Assert(mxactFullScanLimit == NULL);
- }
+ return false;
}
/*
--
2.30.2
Attachment: v5-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (application/x-patch)
From 6345b080a521277746a0088980a3a44dce66367d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Nov 2021 21:27:06 -0800
Subject: [PATCH v5 1/5] Simplify lazy_scan_heap's handling of scanned pages.
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap. Pages counted by scanned_pages are
now the complement of the pages that are skipped over using the
visibility map. This new definition significantly simplifies quite a
few things.
Now heap relation truncation, visibility map bit setting, tuple counting
(e.g., for pg_class.reltuples), and tuple freezing all share a common
definition of scanned_pages. That makes it possible to remove certain
special cases, that never made much sense. We no longer need to track
tupcount_pages separately (see bugfix commit 1914c5ea for details),
since we now always count tuples from pages that are scanned_pages. We
also don't need to needlessly distinguish between aggressive and
non-aggressive VACUUM operations when we cannot immediately acquire a
cleanup lock.
Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
relfrozenxid, we now make non-aggressive VACUUMs work just a little
harder in order to make that desirable outcome more likely in practice.
Aggressive VACUUMs have long checked contended pages with only a shared
lock, to avoid needlessly waiting on a cleanup lock (in the common case
where the contended page has no tuples that need to be frozen anyway).
We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
course -- if we did that they'd no longer be non-aggressive. But we now
make the non-aggressive case notice that a failure to acquire a cleanup
lock on one particular heap page does not in itself make it unsafe to
advance relfrozenxid for the whole relation (which is what we usually
see in the aggressive case already).
We now also collect LP_DEAD items in the dead_items array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
so we should not squander any opportunity to do that. Commit 8523492d4e
taught VACUUM to set LP_DEAD line pointers to LP_UNUSED while only
holding an exclusive lock (not a cleanup lock), so we can expect to set
existing LP_DEAD items to LP_UNUSED reliably, even when we cannot
acquire our own cleanup lock at either pass over the heap (unless we opt
to skip index vacuuming, which implies that there is no second pass over
the heap).
We no longer report on "pin skipped pages" in log output. A later patch
will add back an improved version of the same instrumentation. We don't
want to show any information about any failures to acquire cleanup locks
unless we actually failed to do useful work as a consequence. A page
that we could not acquire a cleanup lock on is now treated as equivalent
to any other scanned page in most cases.
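The relfrozenxid-advancement rule that falls out of the new scanned_pages
definition can be summarized with a tiny standalone model (not the patch's
code -- illustrative names only): every page must have been either scanned or
skipped because it was all-frozen.
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>
typedef uint32_t BlockNum;
/* Any page skipped as merely all-visible blocks advancement */
static bool
can_advance_relfrozenxid(BlockNum orig_rel_pages,
                         BlockNum scanned_pages,
                         BlockNum frozenskipped_pages)
{
    return scanned_pages + frozenskipped_pages >= orig_rel_pages;
}
int
main(void)
{
    /* 100-page table: 90 scanned, 10 skipped because they were all-frozen */
    printf("%d\n", can_advance_relfrozenxid(100, 90, 10));     /* prints 1 */
    /* 100-page table: 5 pages skipped as all-visible but not all-frozen */
    printf("%d\n", can_advance_relfrozenxid(100, 85, 10));     /* prints 0 */
    return 0;
}
This corresponds to the check in heap_vacuum_rel that compares
scanned_pages + frozenskipped_pages against orig_rel_pages before passing
FreezeLimit/MultiXactCutoff to vac_update_relstats().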
---
src/backend/access/heap/vacuumlazy.c | 815 +++++++++++-------
.../isolation/expected/vacuum-reltuples.out | 2 +-
.../isolation/specs/vacuum-reltuples.spec | 7 +-
3 files changed, 518 insertions(+), 306 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index cd603e6aa..148129e59 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -143,6 +143,10 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ /* Aggressive VACUUM (scan all unfrozen pages)? */
+ bool aggressive;
+ /* Use visibility map to skip? (disabled via reloption) */
+ bool skipwithvm;
/* Wraparound failsafe has been triggered? */
bool failsafe_active;
/* Consider index vacuuming bypass optimization? */
@@ -167,6 +171,8 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
+ /* Are FreezeLimit/MultiXactCutoff still valid? */
+ bool freeze_cutoffs_valid;
/* Error reporting state */
char *relnamespace;
@@ -187,10 +193,8 @@ typedef struct LVRelState
*/
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* number of pages we examined */
- BlockNumber pinskipped_pages; /* # of pages skipped due to a pin */
- BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
- BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages removed by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -203,6 +207,7 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
+ /* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
int64 new_dead_tuples; /* new estimated total # of dead items in
@@ -242,19 +247,22 @@ static int elevel = -1;
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
- bool aggressive);
+static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool sharelock, Buffer vmbuffer);
static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
GlobalVisState *vistest,
LVPagePruneState *prunestate);
+static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool *hastup, bool *hasfreespace);
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, int index, Buffer *vmbuffer);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
- LVRelState *vacrel);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -307,16 +315,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive; /* should we scan all unfrozen pages? */
- bool scanned_all_unfrozen; /* actually scanned all such pages? */
+ bool aggressive,
+ skipwithvm;
+ BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
- TransactionId new_frozen_xid;
- MultiXactId new_min_multi;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,8 +369,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
xidFullScanLimit);
aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
mxactFullScanLimit);
+ skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+ {
+ /*
+ * Force aggressive mode, and disable skipping blocks using the
+ * visibility map (even those set all-frozen)
+ */
aggressive = true;
+ skipwithvm = false;
+ }
vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
@@ -371,6 +386,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->rel = rel;
vac_open_indexes(vacrel->rel, RowExclusiveLock, &vacrel->nindexes,
&vacrel->indrels);
+ vacrel->aggressive = aggressive;
+ vacrel->skipwithvm = skipwithvm;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
@@ -415,6 +432,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
+ /* Track if cutoffs became invalid (possible in !aggressive case only) */
+ vacrel->freeze_cutoffs_valid = true;
vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
vacrel->relname = pstrdup(RelationGetRelationName(rel));
@@ -451,30 +470,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params, aggressive);
+ lazy_scan_heap(vacrel, params->nworkers);
/* Done with indexes */
vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
/*
- * Compute whether we actually scanned the all unfrozen pages. If we did,
- * we can adjust relfrozenxid and relminmxid.
- *
- * NB: We need to check this before truncating the relation, because that
- * will change ->rel_pages.
- */
- if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
- < vacrel->rel_pages)
- {
- Assert(!aggressive);
- scanned_all_unfrozen = false;
- }
- else
- scanned_all_unfrozen = true;
-
- /*
- * Optionally truncate the relation.
+ * Optionally truncate the relation. But first remember the relation size
+ * used by lazy_scan_prune, for later use.
*/
+ orig_rel_pages = vacrel->rel_pages;
if (should_attempt_truncation(vacrel))
{
/*
@@ -505,28 +510,44 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
- *
- * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
- * since then we don't know for certain that all tuples have a newer xmin.
*/
- new_rel_pages = vacrel->rel_pages;
+ new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
-
- vac_update_relstats(rel,
- new_rel_pages,
- new_live_tuples,
- new_rel_allvisible,
- vacrel->nindexes > 0,
- new_frozen_xid,
- new_min_multi,
- false);
+ /*
+ * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
+ * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
+ * provided we didn't skip any all-visible (not all-frozen) pages using
+ * the visibility map, and assuming that we didn't fail to get a cleanup
+ * lock that made it unsafe with respect to FreezeLimit (or perhaps our
+ * MultiXactCutoff) established for VACUUM operation.
+ *
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_heap, which won't match when we
+ * happened to truncate the relation afterwards.
+ */
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+ !vacrel->freeze_cutoffs_valid)
+ {
+ /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+ Assert(!aggressive);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+ /* Can safely advance relfrozenxid and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+ orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ FreezeLimit, MultiXactCutoff, false);
+ }
/*
* Report results to the stats collector, too.
@@ -555,7 +576,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
- BlockNumber orig_rel_pages;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -602,10 +622,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->pinskipped_pages,
vacrel->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
@@ -613,7 +632,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
(long long) vacrel->new_rel_tuples,
(long long) vacrel->new_dead_tuples,
OldestXmin);
- orig_rel_pages = vacrel->rel_pages + vacrel->pages_removed;
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -730,7 +748,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, int nworkers)
{
VacDeadItems *dead_items;
BlockNumber nblocks,
@@ -752,7 +770,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pg_rusage_init(&ru0);
- if (aggressive)
+ if (vacrel->aggressive)
ereport(elevel,
(errmsg("aggressively vacuuming \"%s.%s\"",
vacrel->relnamespace,
@@ -764,14 +782,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vacrel->relname)));
nblocks = RelationGetNumberOfBlocks(vacrel->rel);
- next_unskippable_block = 0;
- next_failsafe_block = 0;
- next_fsm_block_to_vacuum = 0;
vacrel->rel_pages = nblocks;
vacrel->scanned_pages = 0;
- vacrel->pinskipped_pages = 0;
vacrel->frozenskipped_pages = 0;
- vacrel->tupcount_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
vacrel->nonempty_pages = 0;
@@ -795,14 +808,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* dangerously old.
*/
lazy_check_wraparound_failsafe(vacrel);
+ next_failsafe_block = 0;
/*
* Allocate the space for dead_items. Note that this handles parallel
* VACUUM initialization as part of allocating shared memory space used
* for dead_items.
*/
- dead_items_alloc(vacrel, params->nworkers);
+ dead_items_alloc(vacrel, nworkers);
dead_items = vacrel->dead_items;
+ next_fsm_block_to_vacuum = 0;
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -811,7 +826,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/*
- * Except when aggressive is set, we want to skip pages that are
+ * Set things up for skipping blocks using visibility map.
+ *
+ * Except when vacrel->aggressive is set, we want to skip pages that are
* all-visible according to the visibility map, but only when we can skip
* at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
* sequentially, the OS should be doing readahead for us, so there's no
@@ -820,8 +837,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* page means that we can't update relfrozenxid, so we only want to do it
* if we can skip a goodly number of pages.
*
- * When aggressive is set, we can't skip pages just because they are
- * all-visible, but we can still skip pages that are all-frozen, since
+ * When vacrel->aggressive is set, we can't skip pages just because they
+ * are all-visible, but we can still skip pages that are all-frozen, since
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
@@ -844,17 +861,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* just added to that page are necessarily newer than the GlobalXmin we
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- *
- * We will scan the table's last page, at least to the extent of
- * determining whether it has tuples or not, even if it should be skipped
- * according to the above rules; except when we've already determined that
- * it's not worth trying to truncate the table. This avoids having
- * lazy_truncate_heap() take access-exclusive lock on the table to attempt
- * a truncation that just fails immediately because there are tuples in
- * the last page. This is worth avoiding mainly because such a lock must
- * be replayed on any hot standby, where it can be disruptive.
*/
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ next_unskippable_block = 0;
+ if (vacrel->skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -863,7 +872,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmstatus = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -890,13 +899,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;
- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
@@ -906,7 +908,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
{
/* Time to advance next_unskippable_block */
next_unskippable_block++;
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ if (vacrel->skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -915,7 +917,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmskipflags = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -944,19 +946,25 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* it's not all-visible. But in an aggressive vacuum we know only
* that it's not all-frozen, so it might still be all-visible.
*/
- if (aggressive && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive &&
+ VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
all_visible_according_to_vm = true;
}
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current page can be skipped if we've seen a long enough run
+ * of skippable blocks to justify skipping it -- provided it's not
+ * the last page in the relation (according to rel_pages/nblocks).
+ *
+ * We always scan the table's last page to determine whether it
+ * has tuples or not, even if it would otherwise be skipped
+ * (unless we're skipping every single page in the relation). This
+ * avoids having lazy_truncate_heap() take access-exclusive lock
+ * on the table to attempt a truncation that just fails
+ * immediately because there are tuples on the last page.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -965,18 +973,32 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* careful to count it as a skipped all-frozen page in that
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * know whether it was initially all-frozen, so we have to
+ * recheck.
*/
- if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive ||
+ VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
+
+ /*
+ * Otherwise it must be an all-visible (and possibly even
+ * all-frozen) page that we decided to process regardless
+ * (SKIP_PAGES_THRESHOLD must not have been crossed).
+ */
all_visible_according_to_vm = true;
}
vacuum_delay_point();
+ /*
+ * We're not skipping this page using the visibility map, and so it is
+ * (by definition) a scanned page. Any tuples from this page are now
+ * guaranteed to be counted below, after some preparatory checks.
+ */
+ vacrel->scanned_pages++;
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1031,174 +1053,78 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
}
/*
- * Set up visibility map page as needed.
- *
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
- * already have the correct page pinned anyway. However, it's
- * possible that (a) next_unskippable_block is covered by a different
- * VM page than the current block or (b) we released our pin and did a
- * cycle of index vacuuming.
+ * already have the correct page pinned anyway.
*/
visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+ /* Finished preparatory checks. Actually scan the page. */
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing using lazy_scan_noprune.
*/
if (!ConditionalLockBufferForCleanup(buf))
{
- bool hastup;
+ bool hastup,
+ hasfreespace;
- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
- {
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
- continue;
- }
-
- /*
- * Read the page with share lock to see if any xids on it need to
- * be frozen. If not we just skip the page, after updating our
- * scan statistics. If there are some, we wait for cleanup lock.
- *
- * We could defer the lock request further by remembering the page
- * and coming back to it later, or we could even register
- * ourselves for multiple buffers and then service whichever one
- * is received first. For now, this seems good enough.
- *
- * If we get here with aggressive false, then we're just forcibly
- * checking the page, and so we don't want to insist on getting
- * the lock; we only need to know if the page contains tuples, so
- * that we can update nonempty_pages correctly. It's convenient
- * to use lazy_check_needs_freeze() for both situations, though.
- */
LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- UnlockReleaseBuffer(buf);
- vacrel->scanned_pages++;
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
- if (!aggressive)
+
+ /* Collect LP_DEAD items in dead_items array, count tuples */
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
+ &hasfreespace))
{
+ Size freespace;
+
/*
- * Here, we must not advance scanned_pages; that would amount
- * to claiming that the page contains no freezable tuples.
+ * Processed page successfully (without cleanup lock) -- just
+ * need to perform rel truncation and FSM steps, much like the
+ * lazy_scan_prune case. Don't bother trying to match its
+ * visibility map setting steps, though.
*/
- UnlockReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
if (hastup)
vacrel->nonempty_pages = blkno + 1;
+ if (hasfreespace)
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ if (hasfreespace)
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
continue;
}
+
+ /*
+ * lazy_scan_noprune could not do all required processing. Wait
+ * for a cleanup lock, and call lazy_scan_prune in the usual way.
+ */
+ Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /* drop through to normal processing */
}
- /*
- * By here we definitely have enough dead_items space for whatever
- * LP_DEAD tids are on this page, we have the visibility map page set
- * up in case we need to set this page's all_visible/all_frozen bit,
- * and we have a cleanup lock. Any tuples on this page are now sure
- * to be "counted" by this VACUUM.
- *
- * One last piece of preamble needs to take place before we can prune:
- * we need to consider new and empty pages.
- */
- vacrel->scanned_pages++;
- vacrel->tupcount_pages++;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
- /*
- * All-zeroes pages can be left over if either a backend extends
- * the relation by a single page, but crashes before the newly
- * initialized page has been written out, or when bulk-extending
- * the relation (which creates a number of empty pages at the tail
- * end of the relation, but enters them into the FSM).
- *
- * Note we do not enter the page into the visibilitymap. That has
- * the downside that we repeatedly visit this page in subsequent
- * vacuums, but otherwise we'll never not discover the space on a
- * promoted standby. The harm of repeated checking ought to
- * normally not be too bad - the space usually should be used at
- * some point, otherwise there wouldn't be any regular vacuums.
- *
- * Make sure these pages are in the FSM, to ensure they can be
- * reused. Do that by testing if there's any space recorded for
- * the page. If not, enter it. We do so after releasing the lock
- * on the heap page, the FSM is approximate, after all.
- */
- UnlockReleaseBuffer(buf);
-
- if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
- {
- Size freespace = BLCKSZ - SizeOfPageHeaderData;
-
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- }
- continue;
- }
-
- if (PageIsEmpty(page))
- {
- Size freespace = PageGetHeapFreeSpace(page);
-
- /*
- * Empty pages are always all-visible and all-frozen (note that
- * the same is currently not true for new pages, see above).
- */
- if (!PageIsAllVisible(page))
- {
- START_CRIT_SECTION();
-
- /* mark buffer dirty before writing a WAL record */
- MarkBufferDirty(buf);
-
- /*
- * It's possible that another backend has extended the heap,
- * initialized the page, and then failed to WAL-log the page
- * due to an ERROR. Since heap extension is not WAL-logged,
- * recovery might try to replay our record setting the page
- * all-visible and find that the page isn't initialized, which
- * will cause a PANIC. To prevent that, check whether the
- * page has been previously WAL-logged, and if not, do that
- * now.
- */
- if (RelationNeedsWAL(vacrel->rel) &&
- PageGetLSN(page) == InvalidXLogRecPtr)
- log_newpage_buffer(buf, true);
-
- PageSetAllVisible(page);
- visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
- END_CRIT_SECTION();
- }
-
- UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
/*
- * Prune and freeze tuples.
+ * Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
* dead_items array. This includes LP_DEAD line pointers that we
@@ -1406,7 +1332,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
- vacrel->tupcount_pages,
+ vacrel->scanned_pages,
vacrel->live_tuples);
/*
@@ -1480,14 +1406,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
appendStringInfo(&buf,
_("%lld dead row versions cannot be removed yet, oldest xmin: %u\n"),
(long long) vacrel->new_dead_tuples, vacrel->OldestXmin);
- appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
- "Skipped %u pages due to buffer pins, ",
- vacrel->pinskipped_pages),
- vacrel->pinskipped_pages);
- appendStringInfo(&buf, ngettext("%u frozen page.\n",
- "%u frozen pages.\n",
- vacrel->frozenskipped_pages),
- vacrel->frozenskipped_pages);
appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));
ereport(elevel,
@@ -1501,6 +1419,137 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pfree(buf.data);
}
+/*
+ * lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
+ *
+ * Must call here to handle both new and empty pages before calling
+ * lazy_scan_prune or lazy_scan_noprune, since they're not prepared to deal
+ * with new or empty pages.
+ *
+ * It's necessary to consider new pages as a special case, since the rules for
+ * maintaining the visibility map and FSM with empty pages are a little
+ * different (though new pages can be truncated based on the usual rules).
+ *
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
+ *
+ * Caller must hold at least a shared lock. We might need to escalate the
+ * lock in that case, so the type of lock caller holds needs to be specified
+ * using 'sharelock' argument.
+ *
+ * Returns false in common case where caller should go on to call
+ * lazy_scan_prune (or lazy_scan_noprune). Otherwise returns true, indicating
+ * that lazy_scan_heap is done processing the page, releasing lock on caller's
+ * behalf.
+ */
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
+{
+ Size freespace;
+
+ if (PageIsNew(page))
+ {
+ /*
+ * All-zeroes pages can be left over if either a backend extends the
+ * relation by a single page, but crashes before the newly initialized
+ * page has been written out, or when bulk-extending the relation
+ * (which creates a number of empty pages at the tail end of the
+ * relation), and then enters them into the FSM.
+ *
+ * Note we do not enter the page into the visibilitymap. That has the
+ * downside that we repeatedly visit this page in subsequent vacuums,
+ * but otherwise we'll never discover the space on a promoted standby.
+ * The harm of repeated checking ought to normally not be too bad. The
+ * space usually should be used at some point, otherwise there
+ * wouldn't be any regular vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused.
+ * Do that by testing if there's any space recorded for the page. If
+ * not, enter it. We do so after releasing the lock on the heap page,
+ * the FSM is approximate, after all.
+ */
+ UnlockReleaseBuffer(buf);
+
+ if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
+ {
+ freespace = BLCKSZ - SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ }
+
+ return true;
+ }
+
+ if (PageIsEmpty(page))
+ {
+ /*
+ * It seems likely that caller will always be able to get a cleanup
+ * lock on an empty page. But don't take any chances -- escalate to
+ * an exclusive lock (still don't need a cleanup lock, though).
+ */
+ if (sharelock)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!PageIsEmpty(page))
+ {
+ /* page isn't new or empty -- keep lock and pin for now */
+ return false;
+ }
+ }
+ else
+ {
+ /* Already have a full cleanup lock (which is more than enough) */
+ }
+
+ /*
+ * Unlike new pages, empty pages are always set all-visible and
+ * all-frozen.
+ */
+ if (!PageIsAllVisible(page))
+ {
+ START_CRIT_SECTION();
+
+ /* mark buffer dirty before writing a WAL record */
+ MarkBufferDirty(buf);
+
+ /*
+ * It's possible that another backend has extended the heap,
+ * initialized the page, and then failed to WAL-log the page due
+ * to an ERROR. Since heap extension is not WAL-logged, recovery
+ * might try to replay our record setting the page all-visible and
+ * find that the page isn't initialized, which will cause a PANIC.
+ * To prevent that, check whether the page has been previously
+ * WAL-logged, and if not, do that now.
+ */
+ if (RelationNeedsWAL(vacrel->rel) &&
+ PageGetLSN(page) == InvalidXLogRecPtr)
+ log_newpage_buffer(buf, true);
+
+ PageSetAllVisible(page);
+ visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+ END_CRIT_SECTION();
+ }
+
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ return true;
+ }
+
+ /* page isn't new or empty -- keep lock and pin */
+ return false;
+}
+
/*
* lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
*
@@ -1545,6 +1594,8 @@ lazy_scan_prune(LVRelState *vacrel,
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
maxoff = PageGetMaxOffsetNumber(page);
retry:
@@ -1607,10 +1658,9 @@ retry:
* LP_DEAD items are processed outside of the loop.
*
* Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
- * count_nondeletable_pages() do it -- they only consider pages empty
- * when they only have LP_UNUSED items, which is important for
- * correctness.
+ * LP_DEAD item here, which is not how count_nondeletable_pages() does
+ * it -- it only considers pages empty/truncatable when they have no
+ * items at all (except LP_UNUSED items).
*
* Our assumption is that any LP_DEAD items we encounter here will
* become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
@@ -1897,6 +1947,226 @@ retry:
vacrel->live_tuples += live_tuples;
}
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
+ *
+ * While pruning isn't performed here, we can at least collect existing
+ * LP_DEAD items into the dead_items array for removal from indexes. It's
+ * quite possible that earlier opportunistic pruning left LP_DEAD items
+ * behind, and we shouldn't miss out on an opportunity to make them reusable
+ * (VACUUM alone is capable of cleaning up line pointer bloat like this).
+ * Note that we'll only require an exclusive lock (not a cleanup lock) later
+ * on when we set these LP_DEAD items to LP_UNUSED in lazy_vacuum_heap_page.
+ *
+ * Freezing isn't performed here either. For aggressive VACUUM callers, we
+ * may return false to indicate that a full cleanup lock is required. This is
+ * necessary because pruning requires a cleanup lock, and because VACUUM
+ * cannot freeze a page's tuples until after pruning takes place (freezing
+ * tuples effectively requires a cleanup lock, though we don't need a cleanup
+ * lock in lazy_vacuum_heap_page or in lazy_scan_new_or_empty to set a heap
+ * page all-frozen in the visibility map).
+ *
+ * Returns true to indicate that all required processing has been performed.
+ * We'll always return true for a non-aggressive VACUUM, even when we know
+ * that this will cause them to miss out on freezing tuples from before
+ * vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
+ * lock. This does mean that they definitely won't be able to advance
+ * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
+ * relminmxid). Caller waits for full cleanup lock when we return false.
+ *
+ * See lazy_scan_prune for an explanation of hastup return flag. The
+ * hasfreespace flag instructs caller on whether or not it should do generic
+ * FSM processing for page, which is determined based on almost the same
+ * criteria as the lazy_scan_prune case.
+ */
+static bool
+lazy_scan_noprune(LVRelState *vacrel,
+ Buffer buf,
+ BlockNumber blkno,
+ Page page,
+ bool *hastup,
+ bool *hasfreespace)
+{
+ OffsetNumber offnum,
+ maxoff;
+ int lpdead_items,
+ num_tuples,
+ live_tuples,
+ new_dead_tuples;
+ HeapTupleHeader tupleheader;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
+ *hastup = false; /* for now */
+ *hasfreespace = false; /* for now */
+
+ lpdead_items = 0;
+ num_tuples = 0;
+ live_tuples = 0;
+ new_dead_tuples = 0;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+ HeapTupleData tuple;
+
+ vacrel->offnum = offnum;
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true;
+ continue;
+ }
+
+ if (ItemIdIsDead(itemid))
+ {
+ /*
+ * Deliberately don't set hastup=true here. See same point in
+ * lazy_scan_prune for an explanation.
+ */
+ deadoffsets[lpdead_items++] = offnum;
+ continue;
+ }
+
+ *hastup = true; /* page prevents rel truncation */
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (heap_tuple_needs_freeze(tupleheader,
+ vacrel->FreezeLimit,
+ vacrel->MultiXactCutoff, buf))
+ {
+ if (vacrel->aggressive)
+ {
+ /* Going to have to get cleanup lock for lazy_scan_prune */
+ vacrel->offnum = InvalidOffsetNumber;
+ return false;
+ }
+
+ /*
+ * Current non-aggressive VACUUM operation definitely won't be
+ * able to advance relfrozenxid or relminmxid
+ */
+ vacrel->freeze_cutoffs_valid = false;
+ }
+
+ num_tuples++;
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(vacrel->rel);
+
+ switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+ {
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ case HEAPTUPLE_LIVE:
+
+ /*
+ * Count both cases as live, just like lazy_scan_prune
+ */
+ live_tuples++;
+
+ break;
+ case HEAPTUPLE_DEAD:
+
+ /*
+ * There is some useful work for pruning to do, that won't be
+ * done due to failure to get a cleanup lock.
+ *
+ * TODO Add dedicated instrumentation for this case
+ */
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * Count in new_dead_tuples, just like lazy_scan_prune
+ */
+ new_dead_tuples++;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+ /*
+ * Do not count these rows as live, just like lazy_scan_prune
+ */
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+
+ }
+
+ vacrel->offnum = InvalidOffsetNumber;
+
+ /*
+ * Now save details of the LP_DEAD items from the page in vacrel (though
+ * only when VACUUM uses two-pass strategy).
+ */
+ if (vacrel->nindexes == 0)
+ {
+ /*
+ * Using one-pass strategy.
+ *
+ * We are not prepared to handle the corner case where a single pass
+ * strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
+ * items.
+ */
+ if (lpdead_items > 0)
+ *hastup = true;
+ *hasfreespace = true;
+ num_tuples += lpdead_items;
+ /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ }
+ else if (lpdead_items > 0)
+ {
+ VacDeadItems *dead_items = vacrel->dead_items;
+ ItemPointerData tmp;
+
+ vacrel->lpdead_item_pages++;
+
+ ItemPointerSetBlockNumber(&tmp, blkno);
+
+ for (int i = 0; i < lpdead_items; i++)
+ {
+ ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+ dead_items->items[dead_items->num_items++] = tmp;
+ }
+
+ Assert(dead_items->num_items <= dead_items->max_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ dead_items->num_items);
+
+ vacrel->lpdead_items += lpdead_items;
+ }
+ else
+ {
+ /*
+ * Caller won't be vacuuming this page later, so tell it to record
+ * page's freespace in the FSM now
+ */
+ *hasfreespace = true;
+ }
+
+ /*
+ * Finally, add relevant page-local counts to whole-VACUUM counts
+ */
+ vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->num_tuples += num_tuples;
+ vacrel->live_tuples += live_tuples;
+
+ /* Caller won't need to call lazy_scan_prune with same page */
+ return true;
+}
+
/*
* Remove the collected garbage tuples from the table and its indexes.
*
@@ -2342,67 +2612,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
return index;
}
-/*
- * lazy_check_needs_freeze() -- scan page to see if any tuples
- * need to be cleaned to avoid wraparound
- *
- * Returns true if the page needs to be vacuumed using cleanup lock.
- * Also returns a flag indicating whether page contains any tuples at all.
- */
-static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
-{
- Page page = BufferGetPage(buf);
- OffsetNumber offnum,
- maxoff;
- HeapTupleHeader tupleheader;
-
- *hastup = false;
-
- /*
- * New and empty pages, obviously, don't contain tuples. We could make
- * sure that the page is registered in the FSM, but it doesn't seem worth
- * waiting for a cleanup lock just for that, especially because it's
- * likely that the pin holder will do so.
- */
- if (PageIsNew(page) || PageIsEmpty(page))
- return false;
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = FirstOffsetNumber;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid;
-
- /*
- * Set the offset number so that we can display it along with any
- * error that occurred while processing this tuple.
- */
- vacrel->offnum = offnum;
- itemid = PageGetItemId(page, offnum);
-
- /* this should match hastup test in count_nondeletable_pages() */
- if (ItemIdIsUsed(itemid))
- *hastup = true;
-
- /* dead and redirect items never need freezing */
- if (!ItemIdIsNormal(itemid))
- continue;
-
- tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-
- if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
- break;
- } /* scan along page */
-
- /* Clear the offset information once we have processed the given page. */
- vacrel->offnum = InvalidOffsetNumber;
-
- return (offnum <= maxoff);
-}
-
/*
* Trigger the failsafe to avoid wraparound failure when vacrel table has a
* relfrozenxid and/or relminmxid that is dangerously far in the past.
@@ -2468,7 +2677,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
bool estimated_count =
- vacrel->tupcount_pages < vacrel->rel_pages;
+ vacrel->scanned_pages < vacrel->rel_pages;
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2485,7 +2694,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
/* Outsource everything to parallel variant */
parallel_vacuum_cleanup_all_indexes(vacrel->pvs, vacrel->new_rel_tuples,
vacrel->num_index_scans,
- (vacrel->tupcount_pages < vacrel->rel_pages));
+ (vacrel->scanned_pages < vacrel->rel_pages));
}
}
@@ -2592,7 +2801,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
* should_attempt_truncation - should we attempt to truncate the heap?
*
* Don't even think about it unless we have a shot at releasing a goodly
- * number of pages. Otherwise, the time taken isn't worth it.
+ * number of pages. Otherwise, the time taken isn't worth it, mainly because
+ * an AccessExclusive lock must be replayed on any hot standby, where it can
+ * be particularly disruptive.
*
* Also don't attempt it if wraparound failsafe is in effect. It's hard to
* predict how long lazy_truncate_heap will take. Don't take any chances.
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
index cdbe7f3a6..ce55376e7 100644
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ b/src/test/isolation/expected/vacuum-reltuples.out
@@ -45,7 +45,7 @@ step stats:
relpages|reltuples
--------+---------
- 1| 20
+ 1| 21
(1 row)
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
index ae2f79b8f..a2a461f2f 100644
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ b/src/test/isolation/specs/vacuum-reltuples.spec
@@ -2,9 +2,10 @@
# to page pins. We absolutely need to avoid setting reltuples=0 in
# such cases, since that interferes badly with planning.
#
-# Expected result in second permutation is 20 tuples rather than 21 as
-# for the others, because vacuum should leave the previous result
-# (from before the insert) in place.
+# Expected result for all three permutations is 21 tuples, including
+# the second permutation. VACUUM is able to count the concurrently
+# inserted tuple in its final reltuples, even when a cleanup lock
+# cannot be acquired on the affected heap page.
setup {
create table smalltbl
--
2.30.2
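One note on the above: the excerpt doesn't include the heap_vacuum_rel
hunk that consumes the new vacrel->freeze_cutoffs_valid flag. The shape
is roughly the following (simplified sketch only; variable names are
approximate, not lifted from the patch):

    /*
     * Sketch: relfrozenxid/relminmxid can only be advanced when every page
     * was either scanned or skipped as all-frozen, *and* no scanned page was
     * left with an unfrozen tuple from before FreezeLimit/MultiXactCutoff.
     */
    if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
        !vacrel->freeze_cutoffs_valid)
    {
        /* Cannot advance relfrozenxid/relminmxid this time around */
        new_frozen_xid = InvalidTransactionId;
        new_min_multi = InvalidMultiXactId;
    }
    else
    {
        new_frozen_xid = vacrel->FreezeLimit;
        new_min_multi = vacrel->MultiXactCutoff;
    }

In other words, a non-aggressive VACUUM now only refrains from advancing
relfrozenxid when it actually encountered (and could not freeze) a
sufficiently old XID -- not merely because it failed to get a cleanup
lock on some page.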
On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).
I don't see how we can get rid of these. We know that catastrophe will
ensue if we fail to freeze old XIDs for a sufficiently long time ---
where sufficiently long has to do with the number of XIDs that have
been subsequently consumed. So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are. Admittedly, that decision doesn't
need to be made at the start of the vacuum, as we do today. We could
happily skip waiting for a cleanup lock on pages that contain only
newer XIDs, but if there is a page that both contains an old XID and
stays pinned for a long time, we eventually have to sit there and wait
for that pin to be released. And the best way to decide when to switch
to that strategy is really based on the age of that XID, at least as I
see it, because it is the age of that XID reaching 2 billion that is
going to kill us.
I think vacuum_freeze_min_age also serves a useful purpose: it
prevents us from freezing data that's going to be modified again or
even deleted in the near future. Since we can't know the future, we
must base our decision on the assumption that the future will be like
the past: if the page hasn't been modified for a while, then we should
assume it's not likely to be modified again soon; otherwise not. If we
knew the time at which the page had last been modified, it would be
very reasonable to use that here - say, freeze the XIDs if the page
hasn't been touched in an hour, or whatever. But since we lack such
timestamps the XID age is the closest proxy we have.
The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?
I don't really see any reason for optimism here. There could be a lot
of unfrozen pages in the relation, and we'd have to troll through all
of those in order to find that single old XID. Moreover, there is
nothing whatsoever to focus autovacuum's attention on that single old
XID rather than anything else. Nothing in the autovacuum algorithm
will cause it to focus its efforts on that single old XID at a time
when there's no pin on the page, or at a time when that XID becomes
the thing that's holding back vacuuming throughout the cluster. A lot
of vacuum problems that users experience today would be avoided if
autovacuum had perfect knowledge of what it ought to be prioritizing
at any given time, or even some knowledge. But it doesn't, and is
often busy fiddling while Rome burns.
IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 6, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Dec 17, 2021 at 9:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
Can we fully get rid of vacuum_freeze_table_age? Maybe even get rid of
vacuum_freeze_min_age, too? Freezing tuples is a maintenance task for
physical blocks, but we use logical units (XIDs).
I don't see how we can get rid of these. We know that catastrophe will
ensue if we fail to freeze old XIDs for a sufficiently long time ---
where sufficiently long has to do with the number of XIDs that have
been subsequently consumed.
I don't really disagree with anything you've said, I think. There are
a few subtleties here. I'll try to tease them apart.
I agree that we cannot do without something like vacrel->FreezeLimit
for the foreseeable future -- but the closely related GUC
(vacuum_freeze_min_age) is another matter. Although everything you've
said in favor of the GUC seems true, the GUC is not a particularly
effective (or natural) way of constraining the problem. It just
doesn't make sense as a tunable.
One obvious reason for this is that the opportunistic freezing stuff
is expected to be the thing that usually forces freezing -- not
vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
cutoff. As you more or less pointed out yourself, we still need
FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
just come from autovacuum_freeze_max_age/2 in all cases (no separate
GUC), or something along those lines. We don't particularly expect the
value of FreezeLimit to matter, at least most of the time. It should
only noticeably affect our behavior during anti-wraparound VACUUMs,
which become rare with the patch (e.g. my pgbench_accounts example
upthread). Most individual tables will never get even one
anti-wraparound VACUUM -- it just doesn't ever come for most tables in
practice.
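To be concrete about what I mean, the derivation would be roughly the
following (a simplified sketch only -- the clamping against a safe limit
and the relminmxid analog are omitted, and the helper is invented purely
for illustration):

    #include "postgres.h"
    #include "access/transam.h"

    extern int  vacuum_freeze_min_age;      /* existing GUC */
    extern int  autovacuum_freeze_max_age;  /* existing GUC */

    /*
     * Sketch only: today FreezeLimit is derived from vacuum_freeze_min_age
     * (see vacuum_set_xid_limits); the suggestion is to derive the backstop
     * from autovacuum_freeze_max_age instead, with no separate GUC.
     */
    static TransactionId
    sketch_freeze_limit(TransactionId oldestXmin, bool use_backstop_only)
    {
        TransactionId limit;

        if (use_backstop_only)
            limit = oldestXmin - autovacuum_freeze_max_age / 2;
        else
            limit = oldestXmin - vacuum_freeze_min_age;

        if (!TransactionIdIsNormal(limit))
            limit = FirstNormalTransactionId;

        return limit;
    }

The point is just that the backstop cutoff can be derived directly from
autovacuum_freeze_max_age; nothing about it seems to require its own
user-visible knob.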
My big issue with vacuum_freeze_min_age is that it doesn't really work
with the freeze map work in 9.6, which creates problems that I'm
trying to address by freezing early and so on. After all, HEAD (and
all stable branches) can easily set a page to all-visible (but not
all-frozen) in the VM, meaning that the page's tuples won't be
considered for freezing until the next aggressive VACUUM. This means
that vacuum_freeze_min_age is already frequently ignored by the
implementation -- it's conditioned on other things that are practically
impossible to predict.
Curious about your thoughts on this existing issue with
vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
it creates.
So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are.
I agree, but again, it's only a backstop. With the patch we'd have to
be rather unlucky to ever need to wait like this.
What are the chances that we keep failing to freeze an old XID from
one particular page, again and again? My testing indicates that it's a
negligible concern in practice (barring pathological cases with idle
cursors, etc).
I think vacuum_freeze_min_age also serves a useful purpose: it
prevents us from freezing data that's going to be modified again or
even deleted in the near future. Since we can't know the future, we
must base our decision on the assumption that the future will be like
the past: if the page hasn't been modified for a while, then we should
assume it's not likely to be modified again soon; otherwise not.
But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.
If we
knew the time at which the page had last been modified, it would be
very reasonable to use that here - say, freeze the XIDs if the page
hasn't been touched in an hour, or whatever. But since we lack such
timestamps the XID age is the closest proxy we have.
XID age is a *terrible* proxy. The age of an XID in a tuple header may
advance quickly, even when nobody modifies the same table at all.
I concede that it is true that we are (in some sense) "gambling" by
freezing early -- we may end up freezing a tuple that we subsequently
update anyway. But aren't we also "gambling" by *not* freezing early?
By not freezing, we risk getting into "freezing debt" that will have
to be paid off in one ruinously large installment. I would much rather
"gamble" on something where we can tolerate consistently "losing" than
gamble on something where I cannot ever afford to lose (even if it's
much less likely that I'll lose during any given VACUUM operation).
Besides all this, I think that we have a rather decent chance of
coming out ahead in practice by freezing early. In practice the
marginal cost of freezing early is consistently pretty low.
Cost-control-driven (as opposed to need-driven) freezing is *supposed*
to be cheaper, of course. And like it or not, freezing is really just part of
the cost of storing data using Postgres (for the time being, at least).
The
risk mostly comes from how much total work we still need to do to
advance relfrozenxid. If the single old XID is quite old indeed (~1.5
billion XIDs), but there is only one, then we just have to freeze one
tuple to be able to safely advance relfrozenxid (maybe advance it by a
huge amount!). How long can it take to freeze one tuple, with the
freeze map, etc?
I don't really see any reason for optimism here.
IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.
On second thought I agree that my specific example of 1.5 billion XIDs
was a little too optimistic of me. But 50 million XIDs (i.e. the
vacuum_freeze_min_age default) is too pessimistic. The important point
is that FreezeLimit could plausibly become nothing more than a
backstop mechanism, with the design from the patch series -- something
that typically has no effect on what tuples actually get frozen.
--
Peter Geoghegan
On Thu, Jan 6, 2022 at 2:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.
I believe that applications tend to update rows according to
predictable patterns. Andy Pavlo made an observation about this at one
point:
https://youtu.be/AD1HW9mLlrg?t=3202
I think that we don't do a good enough job of keeping logically
related tuples (tuples inserted around the same time) together, on the
same original heap page, which motivated a lot of my experiments with
the FSM from last year. Even still, it seems like a good idea for us
to err in the direction of assuming that tuples on the same heap page
are logically related. The tuples should all be frozen together when
possible. And *not* frozen early when the heap page as a whole can't
be frozen (barring cases with one *much* older XID before
FreezeLimit).
--
Peter Geoghegan
On Thu, Jan 6, 2022 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
One obvious reason for this is that the opportunistic freezing stuff
is expected to be the thing that usually forces freezing -- not
vacuum_freeze_min_age, nor FreezeLimit, nor any other XID-based
cutoff. As you more or less pointed out yourself, we still need
FreezeLimit as a backstop mechanism. But the value of FreezeLimit can
just come from autovacuum_freeze_max_age/2 in all cases (no separate
GUC), or something along those lines. We don't particularly expect the
value of FreezeLimit to matter, at least most of the time. It should
only noticeably affect our behavior during anti-wraparound VACUUMs,
which become rare with the patch (e.g. my pgbench_accounts example
upthread). Most individual tables will never get even one
anti-wraparound VACUUM -- it just doesn't ever come for most tables in
practice.
This seems like a weak argument. Sure, you COULD hard-code the limit
to be autovacuum_freeze_max_age/2 rather than making it a separate
tunable, but I don't think it's better. I am generally very skeptical
about the idea of using the same GUC value for multiple purposes,
because it often turns out that the optimal value for one purpose is
different than the optimal value for some other purpose. For example,
the optimal amount of memory for a hash table is likely different than
the optimal amount for a sort, which is why we now have
hash_mem_multiplier. When it's not even the same value that's being
used in both places, but the original value in one place and a value
derived from some formula in the other, the chances of things working
out are even less.
I feel generally that a lot of the argument you're making here
supposes that tables are going to get vacuumed regularly. I agree that
IF tables are being vacuumed on a regular basis, and if as part of
that we always push relfrozenxid forward as far as we can, we will
rarely have a situation where aggressive strategies to avoid
wraparound are required. However, I disagree strongly with the idea
that we can assume that tables will get vacuumed regularly. That can
fail to happen for all sorts of reasons. One of the common ones is a
poor choice of autovacuum configuration. The most common problem in my
experience is a cost limit that is too low to permit the amount of
vacuuming that is actually required, but other kinds of problems like
not enough workers (so tables get starved), too many workers (so the
cost limit is being shared between many processes), autovacuum=off
either globally or on one table (because of ... reasons),
autovacuum_vacuum_insert_threshold = -1 plus not many updates (so
nothing ever triggers the vacuum), autovacuum_naptime=1d (actually seen
in the real world! ... and, no, it didn't work well), or stats
collector problems are all possible. We can *hope* that there are
going to be regular vacuums of the table long before wraparound
becomes a danger, but realistically, we better not assume that in our
choice of algorithms, because the real world is a messy place where
all sorts of crazy things happen.
Now, I agree with you in part: I don't think it's obvious that it's
useful to tune vacuum_freeze_table_age. When I advise customers on how
to fix vacuum problems, I am usually telling them to increase
autovacuum_vacuum_cost_limit, possibly also with an increase in
autovacuum_workers; or to increase or decrease
autovacuum_freeze_max_age depending on which problem they have; or
occasionally to adjust settings like autovacuum_naptime. It doesn't
often seem to be necessary to change vacuum_freeze_table_age or, for
that matter, vacuum_freeze_min_age. But if we remove them and then
discover scenarios where tuning them would have been useful, we'll
have no options for fixing PostgreSQL systems in the field. Waiting
for the next major release in such a scenario, or even the next minor
release, is not good. We should be VERY conservative about removing
existing settings if there's any chance that somebody could use them
to tune their way out of trouble.
My big issue with vacuum_freeze_min_age is that it doesn't really work
with the freeze map work in 9.6, which creates problems that I'm
trying to address by freezing early and so on. After all, HEAD (and
all stable branches) can easily set a page to all-visible (but not
all-frozen) in the VM, meaning that the page's tuples won't be
considered for freezing until the next aggressive VACUUM. This means
that vacuum_freeze_min_age is already frequently ignored by the
implementation -- it's conditioned on other things that are practically
impossible to predict.
Curious about your thoughts on this existing issue with
vacuum_freeze_min_age. I am concerned about the "freezing cliff" that
it creates.
So, let's see: if we see a page where the tuples are all-visible and
we seize the opportunity to freeze it, we can spare ourselves the need
to ever visit that page again (unless it gets modified). But if we
only mark it all-visible and leave the freezing for later, the next
aggressive vacuum will have to scan and dirty the page. I'm prepared
to believe that it's worth the cost of freezing the page in that
scenario. We've already dirtied the page and written some WAL and
maybe generated an FPW, so doing the rest of the work now rather than
saving it until later seems likely to be a win. I think it's OK to
behave, in this situation, as if vacuum_freeze_min_age=0.
There's another situation in which vacuum_freeze_min_age could apply,
though: suppose the page isn't all-visible yet. I'd argue that in that
case we don't want to run around freezing stuff unless it's quite old
- like older than vacuum_freeze_table_age, say. Because we know we're
going to have to revisit this page in the next vacuum anyway, and
expending effort to freeze tuples that may be about to be modified
again doesn't seem prudent. So, hmm, on further reflection, maybe it's
OK to remove vacuum_freeze_min_age. But if we do, then I think we had
better carefully distinguish between the case where the page can
thereby be marked all-frozen and the case where it cannot. I guess you
say the same, further down.
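To pin down the policy I have in mind in rough code terms (an invented
helper, purely illustrative -- not anything from the patch):

    extern int  vacuum_freeze_table_age;    /* existing GUC */

    /*
     * Sketch only: pick an effective freeze cutoff age for a page.  When the
     * page is going to be marked all-visible anyway, behave as though
     * vacuum_freeze_min_age were 0; otherwise only freeze XIDs that are
     * already quite old.
     */
    static int
    effective_freeze_min_age(bool page_will_be_all_visible)
    {
        if (page_will_be_all_visible)
            return 0;

        return vacuum_freeze_table_age;
    }

The exact threshold in the second branch matters less to me than
preserving the distinction between the two cases.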
So it's natural to decide whether or not
we're going to wait for cleanup locks on pages on the basis of how old
the XIDs they contain actually are.
I agree, but again, it's only a backstop. With the patch we'd have to
be rather unlucky to ever need to wait like this.
What are the chances that we keep failing to freeze an old XID from
one particular page, again and again? My testing indicates that it's a
negligible concern in practice (barring pathological cases with idle
cursors, etc).
I mean, those kinds of pathological cases happen *all the time*. Sure,
there are plenty of users who don't leave cursors open. But the ones
who do don't leave them around for short periods of time on randomly
selected pages of the table. They are disproportionately likely to
leave them on the same table pages over and over, just like data can't
in general be assumed to be uniformly accessed. And not uncommonly,
they leave them around until the snow melts.
And we need to worry about those kinds of users, actually much more
than we need to worry about users doing normal things. Honestly,
autovacuum on a system where things are mostly "normal" - no
long-running transactions, adequate resources for autovacuum to do its
job, reasonable configuration settings - isn't that bad. It's true
that there are people who get surprised by an aggressive autovacuum
kicking off unexpectedly, but it's usually the first one during the
cluster lifetime (which is typically the biggest, since the initial
load tends to be bigger than later ones) and it's usually annoying but
survivable. The places where autovacuum becomes incredibly frustrating
are the pathological cases. When insufficient resources are available
to complete the work in a timely fashion, or difficult trade-offs have
to be made, autovacuum is too dumb to make the right choices. And even
if you call your favorite PostgreSQL support provider and they provide
an expert, once it gets behind, autovacuum isn't very tractable: it
will insist on vacuuming everything, right now, in an order that it
chooses, and it's not going to listen to take any nonsense from some
human being who thinks they might have some useful advice to provide!
But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.
Hmm, I didn't realize that we had that. Is that an existing thing or
something new you're proposing to do? If existing, where is it?
IOW, the time that it takes to freeze that one tuple *in theory* might
be small. But in practice it may be very large, because we won't
necessarily get around to it on any meaningful time frame.
On second thought I agree that my specific example of 1.5 billion XIDs
was a little too optimistic of me. But 50 million XIDs (i.e. the
vacuum_freeze_min_age default) is too pessimistic. The important point
is that FreezeLimit could plausibly become nothing more than a
backstop mechanism, with the design from the patch series -- something
that typically has no effect on what tuples actually get frozen.
I agree that it's OK for this to become a purely backstop mechanism
... but again, I think that the design of such backstop mechanisms
should be done as carefully as we know how, because users seem to hit
the backstop all the time. We want it to be made of, you know, nylon
twine, rather than, say, sharp nails. :-)
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jan 7, 2022 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
This seems like a weak argument. Sure, you COULD hard-code the limit
to be autovacuum_freeze_max_age/2 rather than making it a separate
tunable, but I don't think it's better. I am generally very skeptical
about the idea of using the same GUC value for multiple purposes,
because it often turns out that the optimal value for one purpose is
different than the optimal value for some other purpose.
I thought I was being conservative by suggesting
autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
these two concepts really *are* the same thing: vacrel->FreezeLimit
becomes a backstop, just as anti-wraparound autovacuum (the
autovacuum_freeze_max_age cutoff) becomes a backstop.
Of course, an anti-wraparound VACUUM will do early freezing in the
same way as any other VACUUM will (with the patch series). So even
when the FreezeLimit backstop XID cutoff actually affects the behavior
of a given VACUUM operation, it may well not be the reason why most
individual tuples that we freeze get frozen. That is, most individual
heap pages will probably have tuples frozen for some other reason.
Though it depends on workload characteristics, most individual heap
pages will typically be frozen as a group, even here. This is a
logical consequence of the fact that tuple freezing and advancing
relfrozenxid are now only loosely coupled -- it's about as loose as
the current relfrozenxid invariant will allow.
I feel generally that a lot of the argument you're making here
supposes that tables are going to get vacuumed regularly.
I agree that
IF tables are being vacuumed on a regular basis, and if as part of
that we always push relfrozenxid forward as far as we can, we will
rarely have a situation where aggressive strategies to avoid
wraparound are required.
It's all relative. We hope that (with the patch) cases that only ever
get anti-wraparound VACUUMs are limited to tables where nothing else
drives VACUUM, for sensible reasons related to workload
characteristics (like the pgbench_accounts example upthread). It's
inevitable that some users will misconfigure the system, though -- no
question about that.
I don't see why users that misconfigure the system in this way should
be any worse off than they would be today. They probably won't do
substantially less freezing (usually somewhat more), and will advance
pg_class.relfrozenxid in exactly the same way as today (usually a bit
better, actually). What have I missed?
Admittedly the design of the "Freeze tuples early to advance
relfrozenxid" patch (i.e. v5-0005-*patch) is still unsettled; I need
to verify that my claims about it are really robust. But as far as I
know they are. Reviewers should certainly look at that with a critical
eye.
Now, I agree with you in part: I don't think it's obvious that it's
useful to tune vacuum_freeze_table_age.
That's definitely the easier argument to make. After all,
vacuum_freeze_table_age will do nothing unless VACUUM runs before the
anti-wraparound threshold (autovacuum_freeze_max_age) is reached. The
patch series should be strictly better than that. Primarily because
it's "continuous", and so isn't limited to cases where the table age
falls within the "vacuum_freeze_table_age - autovacuum_freeze_max_age"
goldilocks age range.
We should be VERY conservative about removing
existing settings if there's any chance that somebody could use them
to tune their way out of trouble.
I agree, I suppose, but right now I honestly can't think of a reason
why they would be useful.
If I am wrong about this then I'm probably also wrong about some basic
facet of the high-level design, in which case I should change course
altogether. In other words, removing the GUCs is not an incidental
thing. It's possible that I would never have pursued this project if I
didn't first notice how wrong-headed the GUCs are.
So, let's see: if we see a page where the tuples are all-visible and
we seize the opportunity to freeze it, we can spare ourselves the need
to ever visit that page again (unless it gets modified). But if we
only mark it all-visible and leave the freezing for later, the next
aggressive vacuum will have to scan and dirty the page. I'm prepared
to believe that it's worth the cost of freezing the page in that
scenario.
That's certainly the most compelling reason to perform early freezing.
It's not completely free of downsides, but it's pretty close.
There's another situation in which vacuum_freeze_min_age could apply,
though: suppose the page isn't all-visible yet. I'd argue that in that
case we don't want to run around freezing stuff unless it's quite old
- like older than vacuum_freeze_table_age, say. Because we know we're
going to have to revisit this page in the next vacuum anyway, and
expending effort to freeze tuples that may be about to be modified
again doesn't seem prudent. So, hmm, on further reflection, maybe it's
OK to remove vacuum_freeze_min_age. But if we do, then I think we had
better carefully distinguish between the case where the page can
thereby be marked all-frozen and the case where it cannot. I guess you
say the same, further down.
I do. Although v5-0005-*patch still freezes early when the page is
dirtied by pruning, I have my doubts about that particular "freeze
early" criteria. I believe that everything I just said about
misconfigured autovacuums doesn't rely on anything more than the "most
compelling scenario for early freezing" mechanism that arranges to
make us set the all-frozen bit (not just the all-visible bit).
I mean, those kinds of pathological cases happen *all the time*. Sure,
there are plenty of users who don't leave cursors open. But the ones
who do don't leave them around for short periods of time on randomly
selected pages of the table. They are disproportionately likely to
leave them on the same table pages over and over, just like data can't
in general be assumed to be uniformly accessed. And not uncommonly,
they leave them around until the snow melts.
And we need to worry about those kinds of users, actually much more
than we need to worry about users doing normal things.
I couldn't agree more. In fact, I was mostly thinking about how to
*help* these users. Insisting on waiting for a cleanup lock before it
becomes strictly necessary (when the table age is only 50
million/vacuum_freeze_min_age) is actually a big part of the problem
for these users. vacuum_freeze_min_age enforces a false dichotomy on
aggressive VACUUMs that just isn't helpful. Why should waiting on a
cleanup lock fix anything?
Even in the extreme case where we are guaranteed to eventually have a
wraparound failure in the end (due to an idle cursor in an
unsupervised database), the user is still much better off, I think. We
will have at least managed to advance relfrozenxid to the exact oldest
XID on the one heap page that somebody holds an idle cursor
(conflicting buffer pin) on. And we'll usually have frozen most of the
tuples that need to be frozen. Sure, the user may need to use
single-user mode to run a manual VACUUM, but at least this process
only needs to freeze approximately one tuple to get the system back
online again.
If the DBA notices the problem before the database starts to refuse to
allocate XIDs, then they'll have a much better chance of avoiding a
wraparound failure through simple intervention (like killing the
backend with the idle cursor). We can pay down 99.9% of the "freeze
debt" independently of this intractable problem of something holding
onto an idle cursor.
Honestly,
autovacuum on a system where things are mostly "normal" - no
long-running transactions, adequate resources for autovacuum to do its
job, reasonable configuration settings - isn't that bad.
Right. Autovacuum is "too big to fail".
But the "freeze early" heuristics work a bit like that anyway. We
won't freeze all the tuples on a whole heap page early if we won't
otherwise set the heap page to all-visible (not all-frozen) in the VM
anyway.
Hmm, I didn't realize that we had that. Is that an existing thing or
something new you're proposing to do? If existing, where is it?
It's part of v5-0005-*patch. Still in flux to some degree, because
it's necessary to balance a few things. That shouldn't undermine the
arguments I've made here.
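To make the shape of that check concrete, here's a minimal sketch of
the decision rule, modelled on the v6-0005 patch posted later in this
thread (the helper and parameter names are mine, purely illustrative):
freeze everything eligible on the page when we're already freezing at
least one tuple there anyway, or when the page would otherwise become
all-visible without also becoming all-frozen.

#include <stdbool.h>

/*
 * Illustrative sketch only -- the real v6-0005 logic also escalates
 * when pruning dirtied the page, and implements "freeze early" by
 * retrying the page with FreezeLimit/MultiXactCutoff swapped out for
 * OldestXmin/OldestMxact.
 */
static bool
should_freeze_page_early(int nfrozen, int num_tuples,
                         bool all_visible, bool all_frozen)
{
    /* Already freezing some tuples here, so finish the job */
    if (nfrozen > 0 && nfrozen < num_tuples)
        return true;

    /* Page would be all-visible but not all-frozen: avoid freeze debt */
    if (all_visible && !all_frozen)
        return true;

    return false;
}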
I agree that it's OK for this to become a purely backstop mechanism
... but again, I think that the design of such backstop mechanisms
should be done as carefully as we know how, because users seem to hit
the backstop all the time. We want it to be made of, you know, nylon
twine, rather than, say, sharp nails. :-)
Absolutely. But if autovacuum can only ever run due to
age(relfrozenxid) reaching autovacuum_freeze_max_age, then I can't see
a downside.
Again, the v5-0005-*patch needs to meet the standard that I've laid
out. If it doesn't then I've messed up already.
--
Peter Geoghegan
On Fri, Jan 7, 2022 at 5:20 PM Peter Geoghegan <pg@bowt.ie> wrote:
I thought I was being conservative by suggesting
autovacuum_freeze_max_age/2. My first thought was to teach VACUUM to
make its FreezeLimit "OldestXmin - autovacuum_freeze_max_age". To me
these two concepts really *are* the same thing: vacrel->FreezeLimit
becomes a backstop, just as anti-wraparound autovacuum (the
autovacuum_freeze_max_age cutoff) becomes a backstop.
I can't follow this. If the idea is that we're going to
opportunistically freeze a page whenever that allows us to mark it
all-visible, then the remaining question is what XID age we should use
to force freezing when that rule doesn't apply. It seems to me that
there is a rebuttable presumption that that case ought to work just as
it does today - and I think I hear you saying that it should NOT work
as it does today, but should use some other threshold. Yet I can't
understand why you think that.
I couldn't agree more. In fact, I was mostly thinking about how to
*help* these users. Insisting on waiting for a cleanup lock before it
becomes strictly necessary (when the table age is only 50
million/vacuum_freeze_min_age) is actually a big part of the problem
for these users. vacuum_freeze_min_age enforces a false dichotomy on
aggressive VACUUMs that just isn't helpful. Why should waiting on a
cleanup lock fix anything?
Because waiting on a lock means that we'll acquire it as soon as it's
available. If you repeatedly call your local Pizzeria Uno's and ask
whether there is a wait, and head to the restaurant only when the
answer is in the negative, you may never get there, because they may
be busy every time you call - especially if you always call around
lunch or dinner time. Even if you eventually get there, it may take
multiple days before you find a time when a table is immediately
available, whereas if you had just gone over there and stood in line,
you likely would have been seated in under an hour and savoring the
goodness of quality deep-dish pizza not too long thereafter. The same
principle applies here.
I do think that waiting for a cleanup lock when the age of the page is
only vacuum_freeze_min_age seems like it might be too aggressive, but
I don't think that's how it works. AFAICS, it's based on whether the
vacuum is marked as aggressive, which has to do with
vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the
question around: if the age of the oldest XID on the page is >150
million transactions and the buffer cleanup lock is not available now,
what makes you think that it's any more likely to be available when
the XID age reaches 200 million or 300 million or 700 million? There
is perhaps an argument for some kind of tunable that eventually shoots
the other session in the head (if we can identify it, anyway) but it
seems to me that regardless of what threshold we pick, polling is
strictly less likely to find a time when the page is available than
waiting for the cleanup lock. It has the counterbalancing advantage of
allowing the autovacuum worker to do other useful work in the meantime
and that is indeed a significant upside, but at some point you're
going to have to give up and admit that polling is a failed strategy,
and it's unclear why 150 million XIDs - or probably even 50 million
XIDs - isn't long enough to say that we're not getting the job done
with half measures.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 13, 2022 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
I can't follow this. If the idea is that we're going to
opportunistically freeze a page whenever that allows us to mark it
all-visible, then the remaining question is what XID age we should use
to force freezing when that rule doesn't apply.
That is the idea, yes.
It seems to me that
there is a rebuttable presumption that that case ought to work just as
it does today - and I think I hear you saying that it should NOT work
as it does today, but should use some other threshold. Yet I can't
understand why you think that.
Cases where we can not get a cleanup lock fall into 2 sharply distinct
categories in my mind:
1. Cases where our inability to get a cleanup lock signifies nothing
at all about the page in question, or any page in the same table, with
the same workload.
2. Pathological cases. Cases where we're at least at the mercy of the
application to do something about an idle cursor, where the situation
may be entirely hopeless on a long enough timeline. (Whether or not it
actually happens in the end is less significant.)
As far as I can tell, based on testing, category 1 cases are fixed by
the patch series: while a small number of pages from tables in
category 1 cannot be cleanup-locked during each VACUUM, even with the
patch series, it happens at random, with no discernable pattern. The
overall result is that our ability to advance relfrozenxid is really
not impacted *over time*. It's reasonable to suppose that lightning
will not strike in the same place twice -- and it would really have to
strike several times to invalidate this assumption. It's not
impossible, but the chances over time are infinitesimal -- and the
aggregate effect over time (not any one VACUUM operation) is what
matters.
There are seldom more than 5 or so of these pages, even on large
tables. What are the chances that some random not-yet-all-frozen block
(that we cannot freeze tuples on) will also have the oldest
couldn't-be-frozen XID, even once? And when it is the oldest, why
should it be the oldest by very many XIDs? And what are the chances
that the same page has the same problem, again and again, without that
being due to some pathological workload thing?
Admittedly you may see a blip from this -- you might notice that the
final relfrozenxid value for that one single VACUUM isn't quite as new
as you'd like. But then the next VACUUM should catch up with the
stable long term average again. It's hard to describe exactly why this
effect is robust, but as I said, empirically, in practice, it appears
to be robust. That might not be good enough as an explanation that
justifies committing the patch series, but that's what I see. And I
think I will be able to nail it down.
AFAICT that just leaves concern for cases in category 2. More on that below.
Even if you eventually get there, it may take
multiple days before you find a time when a table is immediately
available, whereas if you had just gone over there and stood in line,
you likely would have been seated in under an hour and savoring the
goodness of quality deep-dish pizza not too long thereafter. The same
principle applies here.
I think that you're focussing on individual VACUUM operations, whereas
I'm more concerned about the aggregate effect of a particular policy
over time.
Let's assume for a moment that the only thing that we really care
about is reliably keeping relfrozenxid reasonably recent. Even then,
waiting for a cleanup lock (to freeze some tuples) might be the wrong
thing to do. Waiting in line means that we're not freezing other
tuples (nobody else can either). So we're allowing ourselves to fall
behind on necessary, routine maintenance work that allows us to
advance relfrozenxid....in order to advance relfrozenxid.
I do think that waiting for a cleanup lock when the age of the page is
only vacuum_freeze_min_age seems like it might be too aggressive, but
I don't think that's how it works. AFAICS, it's based on whether the
vacuum is marked as aggressive, which has to do with
vacuum_freeze_table_age, not vacuum_freeze_min_age. Let's turn the
question around: if the age of the oldest XID on the page is >150
million transactions and the buffer cleanup lock is not available now,
what makes you think that it's any more likely to be available when
the XID age reaches 200 million or 300 million or 700 million?
This is my concern -- what I've called category 2 cases have this
exact quality. So given that, why not freeze what you can, elsewhere,
on other pages that don't have the same issue (presumably the vast
vast majority in the table)? That way you have the best possible
chance of recovering once the DBA gets a clue and fixes the issue.
There
is perhaps an argument for some kind of tunable that eventually shoots
the other session in the head (if we can identify it, anyway) but it
seems to me that regardless of what threshold we pick, polling is
strictly less likely to find a time when the page is available than
waiting for the cleanup lock. It has the counterbalancing advantage of
allowing the autovacuum worker to do other useful work in the meantime
and that is indeed a significant upside, but at some point you're
going to have to give up and admit that polling is a failed strategy,
and it's unclear why 150 million XIDs - or probably even 50 million
XIDs - isn't long enough to say that we're not getting the job done
with half measures.
That's kind of what I meant. The difference between 50 million and 150
million is rather unclear indeed. So having accepted that that might
be true, why not be open to the possibility that it won't turn out to
be true in the long run, for any given table? With the enhancements
from the patch series in place (particularly the early freezing
stuff), what do we have to lose by making the FreezeLimit XID cutoff
for freezing much higher than your typical vacuum_freeze_min_age?
Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age
(it can't be higher than that without also making these other settings
become meaningless, of course).
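To be concrete about the arithmetic, here's a rough sketch (mine, not
from the patch series) of the kind of backstop FreezeLimit I have in
mind, assuming the "OldestXmin - autovacuum_freeze_max_age/2"
suggestion from earlier in the thread; the exact age to use is
precisely what's still up for debate:

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch of a backstop freeze cutoff derived from a large age (the
 * divisor of 2 is just the earlier suggestion, not settled policy),
 * with the usual precaution against computing a "permanent" XID.
 */
static TransactionId
backstop_freeze_limit(TransactionId OldestXmin, int freeze_max_age)
{
    TransactionId limit = OldestXmin - (freeze_max_age / 2);

    if (!TransactionIdIsNormal(limit))
        limit = FirstNormalTransactionId;

    return limit;
}

Either way, the point is that FreezeLimit becomes a backstop derived
from the same kind of age that drives anti-wraparound autovacuums,
rather than from vacuum_freeze_min_age.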
Taking a wait-and-see approach like this (not being too quick to
decide that a table is in category 1 or category 2) doesn't seem to
make wraparound failure any more likely in any particular scenario,
but makes it less likely in other scenarios. It also gives us early
visibility into the problem, because we'll see that autovacuum can no
longer advance relfrozenxid (using the enhanced log output) where
that's generally expected.
--
Peter Geoghegan
On Thu, Jan 13, 2022 at 1:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
Admittedly you may see a blip from this -- you might notice that the
final relfrozenxid value for that one single VACUUM isn't quite as new
as you'd like. But then the next VACUUM should catch up with the
stable long term average again. It's hard to describe exactly why this
effect is robust, but as I said, empirically, in practice, it appears
to be robust. That might not be good enough as an explanation that
justifies committing the patch series, but that's what I see. And I
think I will be able to nail it down.
Attached is v6, which like v5 is a rebased version that I'm posting to
keep CFTester happy. I pushed a commit that consolidates VACUUM
VERBOSE and autovacuum logging earlier (commit 49c9d9fc), which bitrot
v5. So no real changes, nothing to note.
Although it technically has nothing to do with this patch series, I
will point out that it's now a lot easier to debug using VACUUM
VERBOSE, which will directly display information about how we've
advanced relfrozenxid, tuples frozen, etc:
pg@regression:5432 =# delete from mytenk2 where hundred < 15;
DELETE 1500
pg@regression:5432 =# vacuum VERBOSE mytenk2;
INFO: vacuuming "regression.public.mytenk2"
INFO: finished vacuuming "regression.public.mytenk2": index scans: 1
pages: 0 removed, 345 remain, 0 skipped using visibility map (0.00% of total)
tuples: 1500 removed, 8500 remain (8500 newly frozen), 0 are dead but
not yet removable
removable cutoff: 17411, which is 0 xids behind next
new relfrozenxid: 17411, which is 3 xids ahead of previous value
index scan needed: 341 pages from table (98.84% of total) had 1500
dead item identifiers removed
index "mytenk2_unique1_idx": pages: 39 in total, 0 newly deleted, 0
currently deleted, 0 reusable
index "mytenk2_unique2_idx": pages: 30 in total, 0 newly deleted, 0
currently deleted, 0 reusable
index "mytenk2_hundred_idx": pages: 11 in total, 1 newly deleted, 1
currently deleted, 0 reusable
I/O timings: read: 0.011 ms, write: 0.000 ms
avg read rate: 1.428 MB/s, avg write rate: 2.141 MB/s
buffer usage: 1133 hits, 2 misses, 3 dirtied
WAL usage: 1446 records, 1 full page images, 199702 bytes
system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
VACUUM
--
Peter Geoghegan
Attachments:
v6-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch
From 17a28a0ea04b9b6f1e0b7177b13fe931874e5dca Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Nov 2021 10:02:30 -0800
Subject: [PATCH v6 4/5] Loosen coupling between relfrozenxid and tuple
freezing.
Stop using tuple freezing (and MultiXact freezing) tuple header cutoffs
to determine the final relfrozenxid (and relminmxid) values that we set
for heap relations in pg_class. Use "optimal" values instead.
Optimal values are the most recent values that are less than or equal to
any remaining XID/MultiXact in a tuple header (not counting frozen
xmin/xmax values). This is now kept track of by VACUUM. "Optimal"
values are always >= the tuple header FreezeLimit in an aggressive
VACUUM. For a non-aggressive VACUUM, they can be less than or greater
than the tuple header FreezeLimit cutoff (though we still often pass
invalid values to indicate that we cannot advance relfrozenxid during
the VACUUM).
---
src/include/access/heapam.h | 4 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 186 ++++++++++++++++++++-------
src/backend/access/heap/vacuumlazy.c | 78 +++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 34 ++++-
7 files changed, 231 insertions(+), 81 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0ad87730e..d35402f9f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -168,7 +168,9 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf);
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..ae55c90f7 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ec57f3d8..521eb8044 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6087,12 +6087,24 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "NewRelfrozenxid" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain NewRelfrozenxid. We need to
+ * push maintenance of NewRelfrozenxid down this far, since in general xmin
+ * might have been frozen by an earlier VACUUM operation, in which case our
+ * caller will not have factored-in xmin when maintaining NewRelfrozenxid.
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *NewRelfrozenxid)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6104,6 +6116,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId tempNewRelfrozenxid;
*flags = 0;
@@ -6198,13 +6211,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ tempNewRelfrozenxid = *NewRelfrozenxid;
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
}
/*
@@ -6213,6 +6226,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *NewRelfrozenxid = tempNewRelfrozenxid;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6222,6 +6236,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ tempNewRelfrozenxid = *NewRelfrozenxid;
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6303,7 +6318,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
+ }
}
else
{
@@ -6313,6 +6332,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6341,6 +6361,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Caller manages NewRelfrozenxid directly when we return an XID */
}
else
{
@@ -6350,6 +6371,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *NewRelfrozenxid = tempNewRelfrozenxid;
}
pfree(newmembers);
@@ -6368,6 +6390,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will actually go on to freeze as indicated by our *frz output, so
+ * any (xmin, xmax, xvac) XIDs that we indicate need to be frozen won't need
+ * to be counted here. Values are valid lower bounds at the point that the
+ * ongoing VACUUM finishes.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6392,7 +6421,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6436,6 +6467,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
}
/*
@@ -6453,10 +6489,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId temp = *NewRelfrozenxid;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi, &flags, &temp);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6474,6 +6511,24 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+
+ if (TransactionIdPrecedes(newxmax, *NewRelfrozenxid))
+ {
+ /* New xmax is an XID older than new NewRelfrozenxid */
+ *NewRelfrozenxid = newxmax;
+ }
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * Changing nothing, so might have to ratchet back NewRelminmxid,
+ * NewRelfrozenxid, or both together
+ */
+ if (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, *NewRelminmxid))
+ *NewRelminmxid = xid;
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6495,6 +6550,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+
+ /*
+ * New multixact might have remaining XID older than
+ * NewRelfrozenxid
+ */
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6522,7 +6584,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6569,6 +6638,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * Since we always freeze here, NewRelfrozenxid doesn't need to be
+ * maintained.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6646,11 +6718,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId NewRelfrozenxid = FirstNormalTransactionId;
+ MultiXactId NewRelminmxid = FirstMultiXactId;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &NewRelfrozenxid, &NewRelminmxid);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7080,6 +7155,15 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should. If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
@@ -7088,74 +7172,86 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
*/
bool
heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf)
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf)
{
TransactionId xid;
+ bool needs_freeze = false;
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
/*
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
+ *
+ * (Actually, we maintain NewRelminmxid differently here, because we
+ * assume that XIDs that should be frozen according to cutoff_xid won't
+ * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
*/
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ if (MultiXactIdIsValid(multi) &&
+ MultiXactIdPrecedes(multi, *NewRelminmxid))
+ *NewRelminmxid = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
return true;
else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+ needs_freeze = true;
+
+ /* need to check whether any member of the mxact is too old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = members[i].xid;
}
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 539214fcb..dd557fddb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -171,8 +171,10 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+
+ /* Track new pg_class.relfrozenxid/pg_class.relminmxid values */
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
/* Error reporting state */
char *relnamespace;
@@ -330,6 +332,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -363,8 +366,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -471,8 +474,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+
+ /* Initialize values used to advance relfrozenxid/relminmxid at the end */
+ vacrel->NewRelfrozenxid = OldestXmin;
+ vacrel->NewRelminmxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -525,16 +530,18 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might only be able to
+ * advance relfrozenxid to an XID from before FreezeLimit (or a relminmxid
+ * from before MultiXactCutoff) when it wasn't possible to freeze some
+ * tuples due to our inability to acquire a cleanup lock, but the effect
+ * is usually insignificant -- NewRelfrozenxid value still has a decent
+ * chance of being much more recent than the existing relfrozenxid.
*
* NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
* the rel_pages used by lazy_scan_heap, which won't match when we
* happened to truncate the relation afterwards.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
@@ -551,7 +558,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenxid, vacrel->NewRelminmxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -657,17 +664,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenxid - vacrel->relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenxid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminmxid - vacrel->relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminmxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1581,6 +1588,8 @@ lazy_scan_prune(LVRelState *vacrel,
int nfrozen;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1589,6 +1598,8 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
/* Initialize (or reset) page-level counters */
+ NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ NewRelminmxid = vacrel->NewRelminmxid;
tuples_deleted = 0;
lpdead_items = 0;
recently_dead_tuples = 0;
@@ -1798,7 +1809,9 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenxid,
+ &NewRelminmxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1812,13 +1825,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1962,9 +1978,9 @@ retry:
* We'll always return true for a non-aggressive VACUUM, even when we know
* that this will cause them to miss out on freezing tuples from before
* vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
- * lock. This does mean that they definitely won't be able to advance
- * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
- * relminmxid). Caller waits for full cleanup lock when we return false.
+ * lock. This does mean that they will have NewRelfrozenxid ratcheting back
+ * to a known-safe value (same applies to NewRelminmxid). Caller waits for
+ * full cleanup lock when we return false.
*
* See lazy_scan_prune for an explanation of hastup return flag. The
* hasfreespace flag instructs caller on whether or not it should do generic
@@ -1988,6 +2004,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ MultiXactId NewRelminmxid = vacrel->NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2034,7 +2052,8 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenxid, &NewRelminmxid, buf))
{
if (vacrel->aggressive)
{
@@ -2044,10 +2063,11 @@ lazy_scan_noprune(LVRelState *vacrel,
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * A non-aggressive VACUUM doesn't have to wait on a cleanup lock
+ * to ensure that it advances relfrozenxid to a sufficiently
+ * recent XID that happens to be present on this page. It can
+ * just accept an older New/final relfrozenxid instead.
*/
- vacrel->freeze_cutoffs_valid = false;
}
num_tuples++;
@@ -2097,6 +2117,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * We have committed to not freezing the tuples on this page (always
+ * happens with a non-aggressive VACUUM), so make sure that the target
+ * relfrozenxid/relminmxid values reflect the XIDs/MXIDs we encountered
+ */
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy).
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 61ced4413..fb76aac42 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f19e4a561..44cb50bc7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -950,10 +950,28 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
* - multiXactCutoff is the value below which all MultiXactIds are removed
* from Xmax.
+ *
+ * oldestXmin and oldestMxact can be thought of as the most recent values that
+ * can ever be passed to vac_update_relstats() as frozenxid and minmulti
+ * arguments. These exact values will be used when no newer XIDs or
+ * MultiXacts remain in the heap relation (e.g., with an empty table). It's
+ * typical for vacuumlazy.c caller to notice that older XIDs/Multixacts remain
+ * in the table, which will force it to use older value. These older final
+ * values may not be any newer than the preexisting frozenxid/minmulti values
+ * from pg_class in extreme cases. The final values are frequently fairly
+ * close to the optimal values that we give to vacuumlazy.c, though.
+ *
+ * An aggressive VACUUM always provides vac_update_relstats() arguments that
+ * are >= freezeLimit and >= multiXactCutoff. A non-aggressive VACUUM may
+ * provide arguments that are either newer or older than freezeLimit and
+ * multiXactCutoff, or non-valid values (indicating that pg_class level
+ * cutoffs cannot be advanced at all).
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -962,6 +980,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -970,7 +989,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1066,9 +1084,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1083,8 +1103,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
--
2.30.2
v6-0005-Freeze-tuples-early-to-advance-relfrozenxid.patch
From 0ebde19d306488c59d8b3b6e0913c5bb51c5c5e6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v6 5/5] Freeze tuples early to advance relfrozenxid.
Freeze whenever pruning modified the page, or whenever we see that we're
going to mark the page all-visible without also marking it all-frozen.
There has been plenty of discussion of early/opportunistic freezing in
the past. It is generally considered important as a way of minimizing
repeated dirtying of heap pages (or the total volume of FPIs in the WAL
stream) over time. While that goal is certainly very important, this
patch has another priority: making VACUUM advance relfrozenxid sooner
and more frequently.
The overall effect is that tables like pgbench's history table can be
vacuumed very frequently, and have most individual vacuum operations
generate 0 FPIs in WAL -- they will never need an aggressive VACUUM.
GUCs like vacuum_freeze_min_age never made much sense after the freeze
map work in PostgreSQL 9.6. The default is 50 million transactions,
which currently tends to result in our being unable to freeze tuples
before the page is marked all-visible (but not all-frozen). This
creates a huge performance cliff later on, during the first aggressive
VACUUM. Freezing early effectively avoids accumulating "debt" from very
old unfrozen tuples.
---
src/include/access/heapam.h | 1 +
src/backend/access/heap/pruneheap.c | 8 ++-
src/backend/access/heap/vacuumlazy.c | 87 +++++++++++++++++++++++++---
3 files changed, 88 insertions(+), 8 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index d35402f9f..ba094507c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -188,6 +188,7 @@ extern int heap_page_prune(Relation relation, Buffer buffer,
struct GlobalVisState *vistest,
TransactionId old_snap_xmin,
TimestampTz old_snap_ts_ts,
+ bool *modified,
int *nnewlpdead,
OffsetNumber *off_loc);
extern void heap_page_prune_execute(Buffer buffer,
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 3201fcc52..7d2b72e89 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -202,11 +202,12 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
*/
if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
{
+ bool modified;
int ndeleted,
nnewlpdead;
ndeleted = heap_page_prune(relation, buffer, vistest, limited_xmin,
- limited_ts, &nnewlpdead, NULL);
+ limited_ts, &modified, &nnewlpdead, NULL);
/*
* Report the number of tuples reclaimed to pgstats. This is
@@ -264,6 +265,7 @@ heap_page_prune(Relation relation, Buffer buffer,
GlobalVisState *vistest,
TransactionId old_snap_xmin,
TimestampTz old_snap_ts,
+ bool *modified,
int *nnewlpdead,
OffsetNumber *off_loc)
{
@@ -445,6 +447,8 @@ heap_page_prune(Relation relation, Buffer buffer,
PageSetLSN(BufferGetPage(buffer), recptr);
}
+
+ *modified = true;
}
else
{
@@ -457,12 +461,14 @@ heap_page_prune(Relation relation, Buffer buffer,
* point in repeating the prune/defrag process until something else
* happens to the page.
*/
+ *modified = false;
if (((PageHeader) page)->pd_prune_xid != prstate.new_prune_xid ||
PageIsFull(page))
{
((PageHeader) page)->pd_prune_xid = prstate.new_prune_xid;
PageClearFull(page);
MarkBufferDirtyHint(buffer, true);
+ *modified = true;
}
}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dd557fddb..a7704977a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -168,6 +168,7 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for pruning */
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -355,11 +356,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Get cutoffs that determine which tuples we need to freeze during the
- * VACUUM operation.
+ * VACUUM operation. This includes information that is used during
+ * opportunistic freezing, where the most aggressive possible cutoffs
+ * (OldestXmin and OldestMxact) are used for some heap pages, based on
+ * considerations about cost.
*
* Also determines if this is to be an aggressive VACUUM. This will
* eventually be required for any table where (for whatever reason) no
* non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
+ * This used to be much more common, but we now work hard to advance
+ * relfrozenxid in non-aggressive VACUUMs.
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
@@ -472,6 +478,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Set cutoffs for entire VACUUM */
vacrel->OldestXmin = OldestXmin;
+ vacrel->OldestMxact = OldestMxact;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
@@ -1590,6 +1597,10 @@ lazy_scan_prune(LVRelState *vacrel,
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
TransactionId NewRelfrozenxid;
MultiXactId NewRelminmxid;
+ bool modified;
+ TransactionId FreezeLimit = vacrel->FreezeLimit;
+ MultiXactId MultiXactCutoff = vacrel->MultiXactCutoff;
+ bool earlyfreezing = false;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1616,8 +1627,19 @@ retry:
* that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vistest,
- InvalidTransactionId, 0, &nnewlpdead,
- &vacrel->offnum);
+ InvalidTransactionId, 0, &modified,
+ &nnewlpdead, &vacrel->offnum);
+
+ /*
+ * If page was modified during pruning, then perform early freezing
+ * opportunistically
+ */
+ if (!earlyfreezing && modified)
+ {
+ earlyfreezing = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ }
/*
* Now scan the page to collect LP_DEAD items and check for tuples
@@ -1672,7 +1694,7 @@ retry:
if (ItemIdIsDead(itemid))
{
deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
+ /* Don't set all_visible to false just yet */
prunestate->has_lpdead_items = true;
continue;
}
@@ -1806,8 +1828,8 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
- vacrel->FreezeLimit,
- vacrel->MultiXactCutoff,
+ FreezeLimit,
+ MultiXactCutoff,
&frozen[nfrozen],
&tuple_totally_frozen,
&NewRelfrozenxid,
@@ -1827,6 +1849,57 @@ retry:
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * Reconsider applying early freezing before committing to processing the
+ * page as currently planned. There are 2 reasons to change our mind:
+ *
+ * 1. The standard FreezeLimit cutoff generally indicates that we should
+ * freeze XIDs that are more than freeze_min_age XIDs in the past
+ * (relative to OldestXmin). But that should only be treated as a rough
+ * guideline; it makes sense to freeze all eligible tuples on pages where
+ * we're going to freeze at least one in any case.
+ *
+ * 2. If the page is now eligible to be marked all_visible, but is not
+ * also eligible to be marked all_frozen, then we freeze early to make
+ * sure that the page becomes all_frozen. We should avoid building up
+ * "freeze debt" that can only be paid off by an aggressive VACUUM, later
+ * on. This makes it much less likely that an aggressive VACUUM will ever
+ * be required.
+ *
+ * Note: We deliberately track all_visible in a way that excludes LP_DEAD
+ * items here. Any page that is "all_visible for tuples with storage"
+ * will be eligible to have its visibility map bit set during the ongoing
+ * VACUUM, one way or another. LP_DEAD items only make it unsafe to set
+ * the page all_visible during the first heap pass, but the second heap
+ * pass should be able to perform equivalent processing. (The second heap
+ * pass cannot freeze tuples, though.)
+ */
+ if (!earlyfreezing &&
+ ((nfrozen > 0 && nfrozen < num_tuples) ||
+ (prunestate->all_visible && !prunestate->all_frozen)))
+ {
+ /*
+ * XXX Need to worry about leaking MultiXacts in FreezeMultiXactId()
+ * now (via heap_prepare_freeze_tuple calls)? That was already
+ * possible, but presumably this makes it much more likely.
+ *
+ * On the other hand, that's only possible when we need to replace an
+ * existing MultiXact with a new one. Even then, we won't have
+ * preallocated a new MultiXact (which we now risk leaking) if there
+ * was only one remaining XID, and the XID is for an updater (we'll
+ * only prepare to replace xmax with the XID directly). So maybe it's
+ * still a narrow enough problem to be ignored.
+ */
+ earlyfreezing = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ goto retry;
+ }
+
+ /* Time to define all_visible in a way that accounts for LP_DEAD items */
+ if (lpdead_items > 0)
+ prunestate->all_visible = false;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
@@ -1872,7 +1945,7 @@ retry:
{
XLogRecPtr recptr;
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+ recptr = log_heap_freeze(vacrel->rel, buf, FreezeLimit,
frozen, nfrozen);
PageSetLSN(page, recptr);
}
--
2.30.2
v6-0003-Consolidate-VACUUM-xid-cutoff-logic.patch
From b1c5bda8102a3a8e7c7b5548b8110485092d3795 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 11 Dec 2021 17:39:45 -0800
Subject: [PATCH v6 3/5] Consolidate VACUUM xid cutoff logic.
Push the logic for determining whether or not any given VACUUM operation
will be aggressive down into vacuum_set_xid_limits(). This makes its
function signature significantly simpler.
This refactoring work will make it easier to set/return an "oldestMxact"
value to the function's vacuumlazy.c caller in a later commit that teaches
VACUUM to intelligently set relfrozenxid and relminmxid to the oldest
remaining XID/MultiXactId.
A VACUUM operation's oldestMxact can be thought of as the MultiXactId
equivalent of its OldestXmin: just as OldestXmin is used as our initial
target relfrozenxid (which we'll ratchet back as the VACUUM progresses
and notices that it'll leave older XIDs in place), oldestMxact will be
our initial target MultiXactId (for a target MultiXactId that is itself
ratcheted back in the same way).
---
src/include/commands/vacuum.h | 6 +-
src/backend/access/heap/vacuumlazy.c | 32 +++----
src/backend/commands/cluster.c | 3 +-
src/backend/commands/vacuum.c | 134 +++++++++++++--------------
4 files changed, 79 insertions(+), 96 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e5e548d6b..d64f6268f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -286,15 +286,13 @@ extern void vac_update_relstats(Relation relation,
bool *frozenxid_updated,
bool *minmulti_updated,
bool in_outer_xact);
-extern void vacuum_set_xid_limits(Relation rel,
+extern bool vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit);
+ MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
MultiXactId relminmxid);
extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 49847bc00..539214fcb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -323,8 +323,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
- TransactionId xidFullScanLimit;
- MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
@@ -352,24 +350,22 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
RelationGetRelid(rel));
- vacuum_set_xid_limits(rel,
- params->freeze_min_age,
- params->freeze_table_age,
- params->multixact_freeze_min_age,
- params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit, &xidFullScanLimit,
- &MultiXactCutoff, &mxactFullScanLimit);
-
/*
- * We request an aggressive scan if the table's frozen Xid is now older
- * than or equal to the requested Xid full-table scan limit; or if the
- * table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
+ * Get cutoffs that determine which tuples we need to freeze during the
+ * VACUUM operation.
+ *
+ * Also determines if this is to be an aggressive VACUUM. This will
+ * eventually be required for any table where (for whatever reason) no
+ * non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
*/
- aggressive = TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
- xidFullScanLimit);
- aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
- mxactFullScanLimit);
+ aggressive = vacuum_set_xid_limits(rel,
+ params->freeze_min_age,
+ params->freeze_table_age,
+ params->multixact_freeze_min_age,
+ params->multixact_freeze_table_age,
+ &OldestXmin, &FreezeLimit,
+ &MultiXactCutoff);
+
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 61853e6de..61ced4413 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -857,8 +857,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* not to be aggressive about this.
*/
vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
- NULL);
+ &OldestXmin, &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b72ce01c5..f19e4a561 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -942,25 +942,20 @@ get_all_vacuum_rels(int options)
*
* Input parameters are the target relation, applicable freeze age settings.
*
+ * Return value indicates whether caller should do an aggressive VACUUM or
+ * not. This is a VACUUM that cannot skip any pages using the visibility map
+ * (except all-frozen pages), which is guaranteed to be able to advance
+ * relfrozenxid and relminmxid.
+ *
* The output parameters are:
- * - oldestXmin is the cutoff value used to distinguish whether tuples are
- * DEAD or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * - oldestXmin is the Xid below which tuples deleted by any xact (that
+ * committed) should be considered DEAD, not just RECENTLY_DEAD.
* - freezeLimit is the Xid below which all Xids are replaced by
* FrozenTransactionId during vacuum.
- * - xidFullScanLimit (computed from freeze_table_age parameter)
- * represents a minimum Xid value; a table whose relfrozenxid is older than
- * this will have a full-table vacuum applied to it, to freeze tuples across
- * the whole table. Vacuuming a table younger than this value can use a
- * partial scan.
- * - multiXactCutoff is the value below which all MultiXactIds are removed from
- * Xmax.
- * - mxactFullScanLimit is a value against which a table's relminmxid value is
- * compared to produce a full-table vacuum, as with xidFullScanLimit.
- *
- * xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is
- * not interested.
+ * - multiXactCutoff is the value below which all MultiXactIds are removed
+ * from Xmax.
*/
-void
+bool
vacuum_set_xid_limits(Relation rel,
int freeze_min_age,
int freeze_table_age,
@@ -968,9 +963,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit)
+ MultiXactId *multiXactCutoff)
{
int freezemin;
int mxid_freezemin;
@@ -980,6 +973,7 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
+ int freezetable;
/*
* We can always ignore processes running lazy vacuum. This is because we
@@ -1097,64 +1091,60 @@ vacuum_set_xid_limits(Relation rel,
*multiXactCutoff = mxactLimit;
- if (xidFullScanLimit != NULL)
- {
- int freezetable;
+ /*
+ * Done setting output parameters; just need to figure out if caller needs
+ * to do an aggressive VACUUM or not.
+ *
+ * Determine the table freeze age to use: as specified by the caller, or
+ * vacuum_freeze_table_age, but in any case not more than
+ * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
+ * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
+ * before anti-wraparound autovacuum is launched.
+ */
+ freezetable = freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_freeze_table_age;
+ freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- Assert(mxactFullScanLimit != NULL);
+ /*
+ * Compute XID limit causing an aggressive vacuum, being careful not to
+ * generate a "permanent" XID
+ */
+ limit = ReadNextTransactionId() - freezetable;
+ if (!TransactionIdIsNormal(limit))
+ limit = FirstNormalTransactionId;
+ if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
+ limit))
+ return true;
- /*
- * Determine the table freeze age to use: as specified by the caller,
- * or vacuum_freeze_table_age, but in any case not more than
- * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
- * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
- * before anti-wraparound autovacuum is launched.
- */
- freezetable = freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_freeze_table_age;
- freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
+ /*
+ * Similar to the above, determine the table freeze age to use for
+ * multixacts: as specified by the caller, or
+ * vacuum_multixact_freeze_table_age, but in any case not more than
+ * autovacuum_multixact_freeze_table_age * 0.95, so that if you have e.g.
+ * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
+ * multixacts before anti-wraparound autovacuum is launched.
+ */
+ freezetable = multixact_freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_multixact_freeze_table_age;
+ freezetable = Min(freezetable,
+ effective_multixact_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- /*
- * Compute XID limit causing a full-table vacuum, being careful not to
- * generate a "permanent" XID.
- */
- limit = ReadNextTransactionId() - freezetable;
- if (!TransactionIdIsNormal(limit))
- limit = FirstNormalTransactionId;
+ /*
+ * Compute MultiXact limit causing an aggressive vacuum, being careful to
+ * generate a valid MultiXact value
+ */
+ mxactLimit = ReadNextMultiXactId() - freezetable;
+ if (mxactLimit < FirstMultiXactId)
+ mxactLimit = FirstMultiXactId;
+ if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
+ mxactLimit))
+ return true;
- *xidFullScanLimit = limit;
-
- /*
- * Similar to the above, determine the table freeze age to use for
- * multixacts: as specified by the caller, or
- * vacuum_multixact_freeze_table_age, but in any case not more than
- * autovacuum_multixact_freeze_table_age * 0.95, so that if you have
- * e.g. nightly VACUUM schedule, the nightly VACUUM gets a chance to
- * freeze multixacts before anti-wraparound autovacuum is launched.
- */
- freezetable = multixact_freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_multixact_freeze_table_age;
- freezetable = Min(freezetable,
- effective_multixact_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
-
- /*
- * Compute MultiXact limit causing a full-table vacuum, being careful
- * to generate a valid MultiXact value.
- */
- mxactLimit = ReadNextMultiXactId() - freezetable;
- if (mxactLimit < FirstMultiXactId)
- mxactLimit = FirstMultiXactId;
-
- *mxactFullScanLimit = mxactLimit;
- }
- else
- {
- Assert(mxactFullScanLimit == NULL);
- }
+ return false;
}
/*
--
2.30.2
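To make the revised vacuum_set_xid_limits() contract concrete, here is a
minimal caller-side sketch modeled on the heap_vacuum_rel() hunk above.
The wrapper function decide_aggressive() is hypothetical and exists only
for illustration; vacuum_set_xid_limits(), VacuumParams, and
VACOPT_DISABLE_PAGE_SKIPPING are the real names used by the patch.

/*
 * Hypothetical helper, shown only to illustrate the new calling
 * convention.  The real code inlines this logic in heap_vacuum_rel(),
 * per the hunk above.
 */
static bool
decide_aggressive(Relation rel, VacuumParams *params,
                  TransactionId *OldestXmin, TransactionId *FreezeLimit,
                  MultiXactId *MultiXactCutoff, bool *skipwithvm)
{
    bool        aggressive;

    /* One call now computes the freeze cutoffs *and* the aggressive flag */
    aggressive = vacuum_set_xid_limits(rel,
                                       params->freeze_min_age,
                                       params->freeze_table_age,
                                       params->multixact_freeze_min_age,
                                       params->multixact_freeze_table_age,
                                       OldestXmin, FreezeLimit,
                                       MultiXactCutoff);

    *skipwithvm = true;
    if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
    {
        /* Force aggressive mode, and disable all skipping via the VM */
        aggressive = true;
        *skipwithvm = false;
    }

    return aggressive;
}

The xidFullScanLimit/mxactFullScanLimit output parameters are gone
entirely, which is why CLUSTER's call in copy_table_data() simply drops
its two NULL arguments.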
Attachment: v6-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch (application/octet-stream)
From ba4d5dd0b280d189d4c1dbe45a7dbb1970fb66c6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Nov 2021 21:27:06 -0800
Subject: [PATCH v6 1/5] Simplify lazy_scan_heap's handling of scanned pages.
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap. Pages counted by scanned_pages are
now the complement of the pages that are skipped over using the
visibility map. This new definition significantly simplifies quite a
few things.
Now heap relation truncation, visibility map bit setting, tuple counting
(e.g., for pg_class.reltuples), and tuple freezing all share a common
definition of scanned_pages. That makes it possible to remove certain
special cases that never made much sense. We no longer need to track
tupcount_pages separately (see bugfix commit 1914c5ea for details),
since we now always count tuples from pages that are scanned_pages. We
also no longer needlessly distinguish between aggressive and
non-aggressive VACUUM operations when we cannot immediately acquire a
cleanup lock.
Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
relfrozenxid, we now make non-aggressive VACUUMs work just a little
harder in order to make that desirable outcome more likely in practice.
Aggressive VACUUMs have long checked contended pages with only a shared
lock, to avoid needlessly waiting on a cleanup lock (in the common case
where the contended page has no tuples that need to be frozen anyway).
We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
course -- if we did that they'd no longer be non-aggressive. But we now
make the non-aggressive case notice that a failure to acquire a cleanup
lock on one particular heap page does not in itself make it unsafe to
advance relfrozenxid for the whole relation (which is what we usually
see in the aggressive case already).
We now also collect LP_DEAD items in the dead_items array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
so we should not squander any opportunity to do that. Commit 8523492d4e
taught VACUUM to set LP_DEAD line pointers to LP_UNUSED while only
holding an exclusive lock (not a cleanup lock), so we can expect to set
existing LP_DEAD items to LP_UNUSED reliably, even when we cannot
acquire our own cleanup lock at either pass over the heap (unless we opt
to skip index vacuuming, which implies that there is no second pass over
the heap).
We no longer report on "pin skipped pages" in log output. A later patch
will add back an improved version of the same instrumentation. We don't
want to show any information about any failures to acquire cleanup locks
unless we actually failed to do useful work as a consequence. A page
that we could not acquire a cleanup lock on is now treated as equivalent
to any other scanned page in most cases.
---
src/backend/access/heap/vacuumlazy.c | 805 +++++++++++-------
.../isolation/expected/vacuum-reltuples.out | 2 +-
.../isolation/specs/vacuum-reltuples.spec | 7 +-
3 files changed, 517 insertions(+), 297 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1749cc2a4..16f88bab0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -143,6 +143,10 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ /* Aggressive VACUUM (scan all unfrozen pages)? */
+ bool aggressive;
+ /* Use visibility map to skip? (disabled via reloption) */
+ bool skipwithvm;
/* Wraparound failsafe has been triggered? */
bool failsafe_active;
/* Consider index vacuuming bypass optimization? */
@@ -167,6 +171,8 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
+ /* Are FreezeLimit/MultiXactCutoff still valid? */
+ bool freeze_cutoffs_valid;
/* Error reporting state */
char *relnamespace;
@@ -188,10 +194,8 @@ typedef struct LVRelState
*/
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* number of pages we examined */
- BlockNumber pinskipped_pages; /* # of pages skipped due to a pin */
- BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
- BlockNumber tupcount_pages; /* pages whose tuples we counted */
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -204,6 +208,7 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
+ /* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
int64 new_dead_tuples; /* new estimated total # of dead items in
@@ -240,19 +245,22 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
- bool aggressive);
+static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool sharelock, Buffer vmbuffer);
static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
GlobalVisState *vistest,
LVPagePruneState *prunestate);
+static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool *hastup, bool *hasfreespace);
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, int index, Buffer *vmbuffer);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
- LVRelState *vacrel);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -307,16 +315,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive; /* should we scan all unfrozen pages? */
- bool scanned_all_unfrozen; /* actually scanned all such pages? */
+ bool aggressive,
+ skipwithvm;
+ BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
- TransactionId new_frozen_xid;
- MultiXactId new_min_multi;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -359,8 +366,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
xidFullScanLimit);
aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
mxactFullScanLimit);
+ skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+ {
+ /*
+ * Force aggressive mode, and disable skipping blocks using the
+ * visibility map (even those set all-frozen)
+ */
aggressive = true;
+ skipwithvm = false;
+ }
/*
* Setup error traceback support for ereport() first. The idea is to set
@@ -423,6 +438,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
params->truncate != VACOPTVALUE_AUTO);
+ vacrel->aggressive = aggressive;
+ vacrel->skipwithvm = skipwithvm;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
vacrel->do_index_vacuuming = true;
@@ -454,35 +471,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
+ /* Track if cutoffs became invalid (possible in !aggressive case only) */
+ vacrel->freeze_cutoffs_valid = true;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params, aggressive);
+ lazy_scan_heap(vacrel, params->nworkers);
/* Done with indexes */
vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
/*
- * Compute whether we actually scanned the all unfrozen pages. If we did,
- * we can adjust relfrozenxid and relminmxid.
- *
- * NB: We need to check this before truncating the relation, because that
- * will change ->rel_pages.
- */
- if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
- < vacrel->rel_pages)
- {
- Assert(!aggressive);
- scanned_all_unfrozen = false;
- }
- else
- scanned_all_unfrozen = true;
-
- /*
- * Optionally truncate the relation.
+ * Optionally truncate the relation. But remember the relation size used
+ * by lazy_scan_prune for later first.
*/
+ orig_rel_pages = vacrel->rel_pages;
if (should_attempt_truncation(vacrel))
{
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
@@ -508,28 +513,44 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
- *
- * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
- * since then we don't know for certain that all tuples have a newer xmin.
*/
- new_rel_pages = vacrel->rel_pages;
+ new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
-
- vac_update_relstats(rel,
- new_rel_pages,
- new_live_tuples,
- new_rel_allvisible,
- vacrel->nindexes > 0,
- new_frozen_xid,
- new_min_multi,
- false);
+ /*
+ * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
+ * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
+ * provided we didn't skip any all-visible (not all-frozen) pages using
+ * the visibility map, and assuming that we didn't fail to get a cleanup
+ * lock that made it unsafe with respect to FreezeLimit (or perhaps our
+ * MultiXactCutoff) established for VACUUM operation.
+ *
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_heap, which won't match when we
+ * happened to truncate the relation afterwards.
+ */
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+ !vacrel->freeze_cutoffs_valid)
+ {
+ /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+ Assert(!aggressive);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+ /* Can safely advance relfrozen and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+ orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ FreezeLimit, MultiXactCutoff, false);
+ }
/*
* Report results to the stats collector, too.
@@ -557,7 +578,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
- BlockNumber orig_rel_pages;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -609,10 +629,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->pinskipped_pages,
vacrel->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
@@ -620,7 +639,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
(long long) vacrel->new_rel_tuples,
(long long) vacrel->new_dead_tuples,
OldestXmin);
- orig_rel_pages = vacrel->rel_pages + vacrel->pages_removed;
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -737,7 +755,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, int nworkers)
{
VacDeadItems *dead_items;
BlockNumber nblocks,
@@ -756,14 +774,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
GlobalVisState *vistest;
nblocks = RelationGetNumberOfBlocks(vacrel->rel);
- next_unskippable_block = 0;
- next_failsafe_block = 0;
- next_fsm_block_to_vacuum = 0;
vacrel->rel_pages = nblocks;
vacrel->scanned_pages = 0;
- vacrel->pinskipped_pages = 0;
vacrel->frozenskipped_pages = 0;
- vacrel->tupcount_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
vacrel->nonempty_pages = 0;
@@ -787,14 +800,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* dangerously old.
*/
lazy_check_wraparound_failsafe(vacrel);
+ next_failsafe_block = 0;
/*
* Allocate the space for dead_items. Note that this handles parallel
* VACUUM initialization as part of allocating shared memory space used
* for dead_items.
*/
- dead_items_alloc(vacrel, params->nworkers);
+ dead_items_alloc(vacrel, nworkers);
dead_items = vacrel->dead_items;
+ next_fsm_block_to_vacuum = 0;
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -803,7 +818,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/*
- * Except when aggressive is set, we want to skip pages that are
+ * Set things up for skipping blocks using visibility map.
+ *
+ * Except when vacrel->aggressive is set, we want to skip pages that are
* all-visible according to the visibility map, but only when we can skip
* at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
* sequentially, the OS should be doing readahead for us, so there's no
@@ -812,8 +829,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* page means that we can't update relfrozenxid, so we only want to do it
* if we can skip a goodly number of pages.
*
- * When aggressive is set, we can't skip pages just because they are
- * all-visible, but we can still skip pages that are all-frozen, since
+ * When vacrel->aggressive is set, we can't skip pages just because they
+ * are all-visible, but we can still skip pages that are all-frozen, since
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
@@ -836,17 +853,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* just added to that page are necessarily newer than the GlobalXmin we
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- *
- * We will scan the table's last page, at least to the extent of
- * determining whether it has tuples or not, even if it should be skipped
- * according to the above rules; except when we've already determined that
- * it's not worth trying to truncate the table. This avoids having
- * lazy_truncate_heap() take access-exclusive lock on the table to attempt
- * a truncation that just fails immediately because there are tuples in
- * the last page. This is worth avoiding mainly because such a lock must
- * be replayed on any hot standby, where it can be disruptive.
*/
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ next_unskippable_block = 0;
+ if (vacrel->skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -855,7 +864,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmstatus = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -882,13 +891,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;
- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
@@ -898,7 +900,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
{
/* Time to advance next_unskippable_block */
next_unskippable_block++;
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ if (vacrel->skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -907,7 +909,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmskipflags = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -936,19 +938,25 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* it's not all-visible. But in an aggressive vacuum we know only
* that it's not all-frozen, so it might still be all-visible.
*/
- if (aggressive && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive &&
+ VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
all_visible_according_to_vm = true;
}
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current page can be skipped if we've seen a long enough run
+ * of skippable blocks to justify skipping it -- provided it's not
+ * the last page in the relation (according to rel_pages/nblocks).
+ *
+ * We always scan the table's last page to determine whether it
+ * has tuples or not, even if it would otherwise be skipped
+ * (unless we're skipping every single page in the relation). This
+ * avoids having lazy_truncate_heap() take access-exclusive lock
+ * on the table to attempt a truncation that just fails
+ * immediately because there are tuples on the last page.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -957,18 +965,32 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* careful to count it as a skipped all-frozen page in that
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * know whether it was initially all-frozen, so we have to
+ * recheck.
*/
- if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive ||
+ VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
+
+ /*
+ * Otherwise it must be an all-visible (and possibly even
+ * all-frozen) page that we decided to process regardless
+ * (SKIP_PAGES_THRESHOLD must not have been crossed).
+ */
all_visible_according_to_vm = true;
}
vacuum_delay_point();
+ /*
+ * We're not skipping this page using the visibility map, and so it is
+ * (by definition) a scanned page. Any tuples from this page are now
+ * guaranteed to be counted below, after some preparatory checks.
+ */
+ vacrel->scanned_pages++;
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1023,174 +1045,78 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
}
/*
- * Set up visibility map page as needed.
- *
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
- * already have the correct page pinned anyway. However, it's
- * possible that (a) next_unskippable_block is covered by a different
- * VM page than the current block or (b) we released our pin and did a
- * cycle of index vacuuming.
+ * already have the correct page pinned anyway.
*/
visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+ /* Finished preparatory checks. Actually scan the page. */
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing using lazy_scan_noprune.
*/
if (!ConditionalLockBufferForCleanup(buf))
{
- bool hastup;
+ bool hastup,
+ hasfreespace;
- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
- {
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
- continue;
- }
-
- /*
- * Read the page with share lock to see if any xids on it need to
- * be frozen. If not we just skip the page, after updating our
- * scan statistics. If there are some, we wait for cleanup lock.
- *
- * We could defer the lock request further by remembering the page
- * and coming back to it later, or we could even register
- * ourselves for multiple buffers and then service whichever one
- * is received first. For now, this seems good enough.
- *
- * If we get here with aggressive false, then we're just forcibly
- * checking the page, and so we don't want to insist on getting
- * the lock; we only need to know if the page contains tuples, so
- * that we can update nonempty_pages correctly. It's convenient
- * to use lazy_check_needs_freeze() for both situations, though.
- */
LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- UnlockReleaseBuffer(buf);
- vacrel->scanned_pages++;
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
- if (!aggressive)
+
+ /* Collect LP_DEAD items in dead_items array, count tuples */
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
+ &hasfreespace))
{
+ Size freespace;
+
/*
- * Here, we must not advance scanned_pages; that would amount
- * to claiming that the page contains no freezable tuples.
+ * Processed page successfully (without cleanup lock) -- just
+ * need to perform rel truncation and FSM steps, much like the
+ * lazy_scan_prune case. Don't bother trying to match its
+ * visibility map setting steps, though.
*/
- UnlockReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
if (hastup)
vacrel->nonempty_pages = blkno + 1;
+ if (hasfreespace)
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ if (hasfreespace)
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
continue;
}
+
+ /*
+ * lazy_scan_noprune could not do all required processing. Wait
+ * for a cleanup lock, and call lazy_scan_prune in the usual way.
+ */
+ Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /* drop through to normal processing */
}
- /*
- * By here we definitely have enough dead_items space for whatever
- * LP_DEAD tids are on this page, we have the visibility map page set
- * up in case we need to set this page's all_visible/all_frozen bit,
- * and we have a cleanup lock. Any tuples on this page are now sure
- * to be "counted" by this VACUUM.
- *
- * One last piece of preamble needs to take place before we can prune:
- * we need to consider new and empty pages.
- */
- vacrel->scanned_pages++;
- vacrel->tupcount_pages++;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
- /*
- * All-zeroes pages can be left over if either a backend extends
- * the relation by a single page, but crashes before the newly
- * initialized page has been written out, or when bulk-extending
- * the relation (which creates a number of empty pages at the tail
- * end of the relation, but enters them into the FSM).
- *
- * Note we do not enter the page into the visibilitymap. That has
- * the downside that we repeatedly visit this page in subsequent
- * vacuums, but otherwise we'll never not discover the space on a
- * promoted standby. The harm of repeated checking ought to
- * normally not be too bad - the space usually should be used at
- * some point, otherwise there wouldn't be any regular vacuums.
- *
- * Make sure these pages are in the FSM, to ensure they can be
- * reused. Do that by testing if there's any space recorded for
- * the page. If not, enter it. We do so after releasing the lock
- * on the heap page, the FSM is approximate, after all.
- */
- UnlockReleaseBuffer(buf);
-
- if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
- {
- Size freespace = BLCKSZ - SizeOfPageHeaderData;
-
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- }
- continue;
- }
-
- if (PageIsEmpty(page))
- {
- Size freespace = PageGetHeapFreeSpace(page);
-
- /*
- * Empty pages are always all-visible and all-frozen (note that
- * the same is currently not true for new pages, see above).
- */
- if (!PageIsAllVisible(page))
- {
- START_CRIT_SECTION();
-
- /* mark buffer dirty before writing a WAL record */
- MarkBufferDirty(buf);
-
- /*
- * It's possible that another backend has extended the heap,
- * initialized the page, and then failed to WAL-log the page
- * due to an ERROR. Since heap extension is not WAL-logged,
- * recovery might try to replay our record setting the page
- * all-visible and find that the page isn't initialized, which
- * will cause a PANIC. To prevent that, check whether the
- * page has been previously WAL-logged, and if not, do that
- * now.
- */
- if (RelationNeedsWAL(vacrel->rel) &&
- PageGetLSN(page) == InvalidXLogRecPtr)
- log_newpage_buffer(buf, true);
-
- PageSetAllVisible(page);
- visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
- END_CRIT_SECTION();
- }
-
- UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
/*
- * Prune and freeze tuples.
+ * Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
* dead_items array. This includes LP_DEAD line pointers that we
@@ -1398,7 +1324,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
- vacrel->tupcount_pages,
+ vacrel->scanned_pages,
vacrel->live_tuples);
/*
@@ -1447,6 +1373,137 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
update_index_statistics(vacrel);
}
+/*
+ * lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
+ *
+ * Must call here to handle both new and empty pages before calling
+ * lazy_scan_prune or lazy_scan_noprune, since they're not prepared to deal
+ * with new or empty pages.
+ *
+ * It's necessary to consider new pages as a special case, since the rules for
+ * maintaining the visibility map and FSM with empty pages are a little
+ * different (though new pages can be truncated based on the usual rules).
+ *
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
+ *
+ * Caller must hold at least a shared lock. We might need to escalate the
+ * lock in that case, so the type of lock caller holds needs to be specified
+ * using 'sharelock' argument.
+ *
+ * Returns false in common case where caller should go on to call
+ * lazy_scan_prune (or lazy_scan_noprune). Otherwise returns true, indicating
+ * that lazy_scan_heap is done processing the page, releasing lock on caller's
+ * behalf.
+ */
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
+{
+ Size freespace;
+
+ if (PageIsNew(page))
+ {
+ /*
+ * All-zeroes pages can be left over if either a backend extends the
+ * relation by a single page, but crashes before the newly initialized
+ * page has been written out, or when bulk-extending the relation
+ * (which creates a number of empty pages at the tail end of the
+ * relation), and then enters them into the FSM.
+ *
+ * Note we do not enter the page into the visibilitymap. That has the
+ * downside that we repeatedly visit this page in subsequent vacuums,
+ * but otherwise we'll never discover the space on a promoted standby.
+ * The harm of repeated checking ought to normally not be too bad. The
+ * space usually should be used at some point, otherwise there
+ * wouldn't be any regular vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused.
+ * Do that by testing if there's any space recorded for the page. If
+ * not, enter it. We do so after releasing the lock on the heap page,
+ * the FSM is approximate, after all.
+ */
+ UnlockReleaseBuffer(buf);
+
+ if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
+ {
+ freespace = BLCKSZ - SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ }
+
+ return true;
+ }
+
+ if (PageIsEmpty(page))
+ {
+ /*
+ * It seems likely that caller will always be able to get a cleanup
+ * lock on an empty page. But don't take any chances -- escalate to
+ * an exclusive lock (still don't need a cleanup lock, though).
+ */
+ if (sharelock)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!PageIsEmpty(page))
+ {
+ /* page isn't new or empty -- keep lock and pin for now */
+ return false;
+ }
+ }
+ else
+ {
+ /* Already have a full cleanup lock (which is more than enough) */
+ }
+
+ /*
+ * Unlike new pages, empty pages are always set all-visible and
+ * all-frozen.
+ */
+ if (!PageIsAllVisible(page))
+ {
+ START_CRIT_SECTION();
+
+ /* mark buffer dirty before writing a WAL record */
+ MarkBufferDirty(buf);
+
+ /*
+ * It's possible that another backend has extended the heap,
+ * initialized the page, and then failed to WAL-log the page due
+ * to an ERROR. Since heap extension is not WAL-logged, recovery
+ * might try to replay our record setting the page all-visible and
+ * find that the page isn't initialized, which will cause a PANIC.
+ * To prevent that, check whether the page has been previously
+ * WAL-logged, and if not, do that now.
+ */
+ if (RelationNeedsWAL(vacrel->rel) &&
+ PageGetLSN(page) == InvalidXLogRecPtr)
+ log_newpage_buffer(buf, true);
+
+ PageSetAllVisible(page);
+ visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+ END_CRIT_SECTION();
+ }
+
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ return true;
+ }
+
+ /* page isn't new or empty -- keep lock and pin */
+ return false;
+}
+
/*
* lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
*
@@ -1491,6 +1548,8 @@ lazy_scan_prune(LVRelState *vacrel,
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
maxoff = PageGetMaxOffsetNumber(page);
retry:
@@ -1553,10 +1612,9 @@ retry:
* LP_DEAD items are processed outside of the loop.
*
* Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
- * count_nondeletable_pages() do it -- they only consider pages empty
- * when they only have LP_UNUSED items, which is important for
- * correctness.
+ * LP_DEAD item here, which is not how count_nondeletable_pages() does
+ * it -- it only considers pages empty/truncatable when they have no
+ * items at all (except LP_UNUSED items).
*
* Our assumption is that any LP_DEAD items we encounter here will
* become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
@@ -1843,6 +1901,226 @@ retry:
vacrel->live_tuples += live_tuples;
}
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() variant without pruning
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
+ *
+ * While pruning isn't performed here, we can at least collect existing
+ * LP_DEAD items into the dead_items array for removal from indexes. It's
+ * quite possible that earlier opportunistic pruning left LP_DEAD items
+ * behind, and we shouldn't miss out on an opportunity to make them reusable
+ * (VACUUM alone is capable of cleaning up line pointer bloat like this).
+ * Note that we'll only require an exclusive lock (not a cleanup lock) later
+ * on when we set these LP_DEAD items to LP_UNUSED in lazy_vacuum_heap_page.
+ *
+ * Freezing isn't performed here either. For aggressive VACUUM callers, we
+ * may return false to indicate that a full cleanup lock is required. This is
+ * necessary because pruning requires a cleanup lock, and because VACUUM
+ * cannot freeze a page's tuples until after pruning takes place (freezing
+ * tuples effectively requires a cleanup lock, though we don't need a cleanup
+ * lock in lazy_vacuum_heap_page or in lazy_scan_new_or_empty to set a heap
+ * page all-frozen in the visibility map).
+ *
+ * Returns true to indicate that all required processing has been performed.
+ * We'll always return true for a non-aggressive VACUUM, even when we know
+ * that this will cause them to miss out on freezing tuples from before
+ * vacrel->FreezeLimit cutoff -- they should never have to wait for a cleanup
+ * lock. This does mean that they definitely won't be able to advance
+ * relfrozenxid opportunistically (same applies to vacrel->MultiXactCutoff and
+ * relminmxid). Caller waits for full cleanup lock when we return false.
+ *
+ * See lazy_scan_prune for an explanation of hastup return flag. The
+ * hasfreespace flag instructs caller on whether or not it should do generic
+ * FSM processing for page, which is determined based on almost the same
+ * criteria as the lazy_scan_prune case.
+ */
+static bool
+lazy_scan_noprune(LVRelState *vacrel,
+ Buffer buf,
+ BlockNumber blkno,
+ Page page,
+ bool *hastup,
+ bool *hasfreespace)
+{
+ OffsetNumber offnum,
+ maxoff;
+ int lpdead_items,
+ num_tuples,
+ live_tuples,
+ new_dead_tuples;
+ HeapTupleHeader tupleheader;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
+ *hastup = false; /* for now */
+ *hasfreespace = false; /* for now */
+
+ lpdead_items = 0;
+ num_tuples = 0;
+ live_tuples = 0;
+ new_dead_tuples = 0;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+ HeapTupleData tuple;
+
+ vacrel->offnum = offnum;
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true;
+ continue;
+ }
+
+ if (ItemIdIsDead(itemid))
+ {
+ /*
+ * Deliberately don't set hastup=true here. See same point in
+ * lazy_scan_prune for an explanation.
+ */
+ deadoffsets[lpdead_items++] = offnum;
+ continue;
+ }
+
+ *hastup = true; /* page prevents rel truncation */
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (heap_tuple_needs_freeze(tupleheader,
+ vacrel->FreezeLimit,
+ vacrel->MultiXactCutoff, buf))
+ {
+ if (vacrel->aggressive)
+ {
+ /* Going to have to get cleanup lock for lazy_scan_prune */
+ vacrel->offnum = InvalidOffsetNumber;
+ return false;
+ }
+
+ /*
+ * Current non-aggressive VACUUM operation definitely won't be
+ * able to advance relfrozenxid or relminmxid
+ */
+ vacrel->freeze_cutoffs_valid = false;
+ }
+
+ num_tuples++;
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(vacrel->rel);
+
+ switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+ {
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ case HEAPTUPLE_LIVE:
+
+ /*
+ * Count both cases as live, just like lazy_scan_prune
+ */
+ live_tuples++;
+
+ break;
+ case HEAPTUPLE_DEAD:
+
+ /*
+ * There is some useful work for pruning to do, that won't be
+ * done due to failure to get a cleanup lock.
+ *
+ * TODO Add dedicated instrumentation for this case
+ */
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * Count in new_dead_tuples, just like lazy_scan_prune
+ */
+ new_dead_tuples++;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+ /*
+ * Do not count these rows as live, just like lazy_scan_prune
+ */
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+
+ }
+
+ vacrel->offnum = InvalidOffsetNumber;
+
+ /*
+ * Now save details of the LP_DEAD items from the page in vacrel (though
+ * only when VACUUM uses two-pass strategy).
+ */
+ if (vacrel->nindexes == 0)
+ {
+ /*
+ * Using one-pass strategy.
+ *
+ * We are not prepared to handle the corner case where a single pass
+ * strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
+ * items.
+ */
+ if (lpdead_items > 0)
+ *hastup = true;
+ *hasfreespace = true;
+ num_tuples += lpdead_items;
+ /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ }
+ else if (lpdead_items > 0)
+ {
+ VacDeadItems *dead_items = vacrel->dead_items;
+ ItemPointerData tmp;
+
+ vacrel->lpdead_item_pages++;
+
+ ItemPointerSetBlockNumber(&tmp, blkno);
+
+ for (int i = 0; i < lpdead_items; i++)
+ {
+ ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+ dead_items->items[dead_items->num_items++] = tmp;
+ }
+
+ Assert(dead_items->num_items <= dead_items->max_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ dead_items->num_items);
+
+ vacrel->lpdead_items += lpdead_items;
+ }
+ else
+ {
+ /*
+ * Caller won't be vacuuming this page later, so tell it to record
+ * page's freespace in the FSM now
+ */
+ *hasfreespace = true;
+ }
+
+ /*
+ * Finally, add relevant page-local counts to whole-VACUUM counts
+ */
+ vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->num_tuples += num_tuples;
+ vacrel->live_tuples += live_tuples;
+
+ /* Caller won't need to call lazy_scan_prune with same page */
+ return true;
+}
+
/*
* Main entry point for index vacuuming and heap vacuuming.
*
@@ -2286,67 +2564,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
return index;
}
-/*
- * lazy_check_needs_freeze() -- scan page to see if any tuples
- * need to be cleaned to avoid wraparound
- *
- * Returns true if the page needs to be vacuumed using cleanup lock.
- * Also returns a flag indicating whether page contains any tuples at all.
- */
-static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
-{
- Page page = BufferGetPage(buf);
- OffsetNumber offnum,
- maxoff;
- HeapTupleHeader tupleheader;
-
- *hastup = false;
-
- /*
- * New and empty pages, obviously, don't contain tuples. We could make
- * sure that the page is registered in the FSM, but it doesn't seem worth
- * waiting for a cleanup lock just for that, especially because it's
- * likely that the pin holder will do so.
- */
- if (PageIsNew(page) || PageIsEmpty(page))
- return false;
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = FirstOffsetNumber;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid;
-
- /*
- * Set the offset number so that we can display it along with any
- * error that occurred while processing this tuple.
- */
- vacrel->offnum = offnum;
- itemid = PageGetItemId(page, offnum);
-
- /* this should match hastup test in count_nondeletable_pages() */
- if (ItemIdIsUsed(itemid))
- *hastup = true;
-
- /* dead and redirect items never need freezing */
- if (!ItemIdIsNormal(itemid))
- continue;
-
- tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-
- if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
- break;
- } /* scan along page */
-
- /* Clear the offset information once we have processed the given page. */
- vacrel->offnum = InvalidOffsetNumber;
-
- return (offnum <= maxoff);
-}
-
/*
* Trigger the failsafe to avoid wraparound failure when vacrel table has a
* relfrozenxid and/or relminmxid that is dangerously far in the past.
@@ -2412,7 +2629,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
bool estimated_count =
- vacrel->tupcount_pages < vacrel->rel_pages;
+ vacrel->scanned_pages < vacrel->rel_pages;
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2429,7 +2646,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
/* Outsource everything to parallel variant */
parallel_vacuum_cleanup_all_indexes(vacrel->pvs, vacrel->new_rel_tuples,
vacrel->num_index_scans,
- (vacrel->tupcount_pages < vacrel->rel_pages));
+ (vacrel->scanned_pages < vacrel->rel_pages));
}
}
@@ -2536,7 +2753,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
* should_attempt_truncation - should we attempt to truncate the heap?
*
* Don't even think about it unless we have a shot at releasing a goodly
- * number of pages. Otherwise, the time taken isn't worth it.
+ * number of pages. Otherwise, the time taken isn't worth it, mainly because
+ * an AccessExclusive lock must be replayed on any hot standby, where it can
+ * be particularly disruptive.
*
* Also don't attempt it if wraparound failsafe is in effect. It's hard to
* predict how long lazy_truncate_heap will take. Don't take any chances.
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
index cdbe7f3a6..ce55376e7 100644
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ b/src/test/isolation/expected/vacuum-reltuples.out
@@ -45,7 +45,7 @@ step stats:
relpages|reltuples
--------+---------
- 1| 20
+ 1| 21
(1 row)
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
index ae2f79b8f..a2a461f2f 100644
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ b/src/test/isolation/specs/vacuum-reltuples.spec
@@ -2,9 +2,10 @@
# to page pins. We absolutely need to avoid setting reltuples=0 in
# such cases, since that interferes badly with planning.
#
-# Expected result in second permutation is 20 tuples rather than 21 as
-# for the others, because vacuum should leave the previous result
-# (from before the insert) in place.
+# Expected result for all three permutation is 21 tuples, including
+# the second permutation. VACUUM is able to count the concurrently
+# inserted tuple in its final reltuples, even when a cleanup lock
+# cannot be acquired on the affected heap page.
setup {
create table smalltbl
--
2.30.2
Attachment: v6-0002-Improve-VACUUM-instrumentation.patch (application/octet-stream)
From a3c4fdb7f87dcc96c59fc9bcb584c622124b6215 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 21 Nov 2021 14:47:11 -0800
Subject: [PATCH v6 2/5] Improve VACUUM instrumentation.
Add instrumentation of "missed dead tuples", and the number of pages
that had at least one such tuple. These are fully DEAD (not just
RECENTLY_DEAD) tuples with storage that could not be pruned due to an
inability to acquire a cleanup lock. This is a replacement for the
"skipped due to pin" instrumentation removed by the previous commit.
Note that the new instrumentation says nothing about a page that we
failed to acquire a cleanup lock on when the page turns out to have no
missed dead tuples.
Also report the number of pages that VACUUM skipped using the
visibility map, without regard for whether the pages were all-frozen or
just all-visible.
Also report when and how relfrozenxid is advanced by VACUUM, including
non-aggressive VACUUM. Apart from being useful on its own, this might
enable future work that teaches non-aggressive VACUUM to be more
concerned about advancing relfrozenxid sooner rather than later.
Also report number of tuples frozen. This will become more important
when a later patch that freezes tuples early is committed.
Also enhance how we report the OldestXmin cutoff by putting it in
context: show how far behind the next XID it is at the _end_ of the
VACUUM operation.
---
src/include/commands/vacuum.h | 2 +
src/backend/access/heap/vacuumlazy.c | 105 +++++++++++++++++++--------
src/backend/commands/analyze.c | 3 +
src/backend/commands/vacuum.c | 9 +++
4 files changed, 89 insertions(+), 30 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d0bdfa42..e5e548d6b 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -283,6 +283,8 @@ extern void vac_update_relstats(Relation relation,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated,
+ bool *minmulti_updated,
bool in_outer_xact);
extern void vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 16f88bab0..49847bc00 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -198,6 +198,7 @@ typedef struct LVRelState
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber pages_removed; /* pages remove by truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
/* Statistics output by us, for table */
@@ -210,9 +211,10 @@ typedef struct LVRelState
int num_index_scans;
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
+ int64 tuples_frozen; /* # frozen by us */
int64 lpdead_items; /* # deleted from indexes */
- int64 new_dead_tuples; /* new estimated total # of dead items in
- * table */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
int64 num_tuples; /* total number of nonremovable tuples */
int64 live_tuples; /* live tuples (reltuples estimate) */
} LVRelState;
@@ -317,6 +319,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
write_rate;
bool aggressive,
skipwithvm;
+ bool frozenxid_updated,
+ minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
@@ -538,9 +542,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
+ frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId, false);
+ InvalidTransactionId, InvalidMultiXactId,
+ NULL, NULL, false);
}
else
{
@@ -549,7 +555,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff, false);
+ FreezeLimit, MultiXactCutoff,
+ &frozenxid_updated, &minmulti_updated, false);
}
/*
@@ -565,7 +572,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(new_live_tuples, 0),
- vacrel->new_dead_tuples);
+ vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -578,6 +586,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
+ int32 diff;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -629,16 +638,41 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped using visibility map (%.2f%% of total)\n"),
vacrel->pages_removed,
vacrel->rel_pages,
- vacrel->frozenskipped_pages);
+ orig_rel_pages - vacrel->scanned_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * (orig_rel_pages - vacrel->scanned_pages) / orig_rel_pages);
appendStringInfo(&buf,
- _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+ _("tuples: %lld removed, %lld remain (%lld newly frozen), %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->new_dead_tuples,
- OldestXmin);
+ (long long) vacrel->tuples_frozen,
+ (long long) vacrel->recently_dead_tuples);
+ if (vacrel->missed_dead_tuples > 0)
+ appendStringInfo(&buf,
+ _("tuples missed: %lld dead from %u contended pages\n"),
+ (long long) vacrel->missed_dead_tuples,
+ vacrel->missed_dead_pages);
+ diff = (int32) (ReadNextTransactionId() - OldestXmin);
+ appendStringInfo(&buf,
+ _("removable cutoff: %u, which is %d xids behind next\n"),
+ OldestXmin, diff);
+ if (frozenxid_updated)
+ {
+ diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ appendStringInfo(&buf,
+ _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
+ FreezeLimit, diff);
+ }
+ if (minmulti_updated)
+ {
+ diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ appendStringInfo(&buf,
+ _("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
+ MultiXactCutoff, diff);
+ }
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -779,13 +813,16 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->frozenskipped_pages = 0;
vacrel->pages_removed = 0;
vacrel->lpdead_item_pages = 0;
+ vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
/* Initialize instrumentation counters */
vacrel->num_index_scans = 0;
vacrel->tuples_deleted = 0;
+ vacrel->tuples_frozen = 0;
vacrel->lpdead_items = 0;
- vacrel->new_dead_tuples = 0;
+ vacrel->recently_dead_tuples = 0;
+ vacrel->missed_dead_tuples = 0;
vacrel->num_tuples = 0;
vacrel->live_tuples = 0;
@@ -1332,7 +1369,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* (unlikely) scenario that new_live_tuples is -1, take it as zero.
*/
vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
+ Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples;
/*
* Release any remaining pin on visibility map page.
@@ -1540,7 +1578,7 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
lpdead_items,
- new_dead_tuples,
+ recently_dead_tuples,
num_tuples,
live_tuples;
int nnewlpdead;
@@ -1557,7 +1595,7 @@ retry:
/* Initialize (or reset) page-level counters */
tuples_deleted = 0;
lpdead_items = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
num_tuples = 0;
live_tuples = 0;
@@ -1716,11 +1754,11 @@ retry:
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * If tuple is recently deleted then we must not remove it
- * from relation. (We only remove items that are LP_DEAD from
+ * If tuple is recently dead then we must not remove it from
+ * the relation. (We only remove items that are LP_DEAD from
* pruning.)
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
prunestate->all_visible = false;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1895,8 +1933,9 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
+ vacrel->tuples_frozen += nfrozen;
vacrel->lpdead_items += lpdead_items;
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
}
@@ -1949,7 +1988,8 @@ lazy_scan_noprune(LVRelState *vacrel,
int lpdead_items,
num_tuples,
live_tuples,
- new_dead_tuples;
+ recently_dead_tuples,
+ missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
@@ -1961,7 +2001,8 @@ lazy_scan_noprune(LVRelState *vacrel,
lpdead_items = 0;
num_tuples = 0;
live_tuples = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
+ missed_dead_tuples = 0;
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
@@ -2035,16 +2076,15 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* There is some useful work for pruning to do, that won't be
* done due to failure to get a cleanup lock.
- *
- * TODO Add dedicated instrumentation for this case
*/
+ missed_dead_tuples++;
break;
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * Count in new_dead_tuples, just like lazy_scan_prune
+ * Count in recently_dead_tuples, just like lazy_scan_prune
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -2072,13 +2112,15 @@ lazy_scan_noprune(LVRelState *vacrel,
*
* We are not prepared to handle the corner case where a single pass
* strategy VACUUM cannot get a cleanup lock, and we then find LP_DEAD
- * items.
+ * items. Count the LP_DEAD items as missed_dead_tuples instead. This
+ * is slightly dishonest, but it's better than maintaining code to do
+ * heap vacuuming for this one narrow corner case.
*/
if (lpdead_items > 0)
*hastup = true;
*hasfreespace = true;
num_tuples += lpdead_items;
- /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ missed_dead_tuples += lpdead_items;
}
else if (lpdead_items > 0)
{
@@ -2113,9 +2155,12 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
+ vacrel->missed_dead_tuples += missed_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
+ if (missed_dead_tuples > 0)
+ vacrel->missed_dead_pages++;
/* Caller won't need to call lazy_scan_prune with same page */
return true;
@@ -2194,8 +2239,8 @@ lazy_vacuum(LVRelState *vacrel)
* dead_items space is not CPU cache resident.
*
* We don't take any special steps to remember the LP_DEAD items (such
- * as counting them in new_dead_tuples report to the stats collector)
- * when the optimization is applied. Though the accounting used in
+ * as counting them in our final report to the stats collector) when
+ * the optimization is applied. Though the accounting used in
* analyze.c's acquire_sample_rows() will recognize the same LP_DEAD
* items as dead rows in its own stats collector report, that's okay.
* The discrepancy should be negligible. If this optimization is ever
@@ -3322,7 +3367,7 @@ update_index_statistics(LVRelState *vacrel)
false,
InvalidTransactionId,
InvalidMultiXactId,
- false);
+ NULL, NULL, false);
}
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index cc9705d06..7bf7f6e86 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -651,6 +651,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
/* Same for indexes */
@@ -667,6 +668,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
false,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
}
@@ -679,6 +681,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 283ffaea7..b72ce01c5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1315,6 +1315,7 @@ vac_update_relstats(Relation relation,
BlockNumber num_all_visible_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated, bool *minmulti_updated,
bool in_outer_xact)
{
Oid relid = RelationGetRelid(relation);
@@ -1390,22 +1391,30 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ if (frozenxid_updated)
+ *frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
/* Similarly for relminmxid */
+ if (minmulti_updated)
+ *minmulti_updated = false;
if (MultiXactIdIsValid(minmulti) &&
pgcform->relminmxid != minmulti &&
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
{
+ if (minmulti_updated)
+ *minmulti_updated = true;
pgcform->relminmxid = minmulti;
dirty = true;
}
--
2.30.2
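The new "xids behind next" and "xids ahead of previous value" log lines above all rely on the same trick: subtract two 32-bit XIDs as unsigned integers and cast the result to int32. Here is a minimal standalone sketch (toy code, not taken from the patch) of why that cast yields a sensible signed distance even when the raw counter has wrapped -- the real code also has to account for the reserved XIDs below FirstNormalTransactionId, which this toy ignores:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint32_t next_xid = 100;                 /* counter recently wrapped */
    uint32_t oldest_xmin = UINT32_MAX - 50;  /* an older XID near the top of the range */

    /* unsigned subtraction wraps modulo 2^32; the cast recovers the signed distance */
    int32_t  diff = (int32_t) (next_xid - oldest_xmin);

    printf("removable cutoff is %d xids behind next\n", (int) diff);   /* prints 151 */
    return 0;
}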
On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
1. Cases where our inability to get a cleanup lock signifies nothing
at all about the page in question, or any page in the same table, with
the same workload.
2. Pathological cases. Cases where we're at least at the mercy of the
application to do something about an idle cursor, where the situation
may be entirely hopeless on a long enough timeline. (Whether or not it
actually happens in the end is less significant.)
Sure. I'm worrying about case (2). I agree that in case (1) waiting
for the lock is almost always the wrong idea.
I think that you're focussing on individual VACUUM operations, whereas
I'm more concerned about the aggregate effect of a particular policy
over time.
I don't think so. I think I'm worrying about the aggregate effect of a
particular policy over time *in the pathological cases* i.e. (2).
This is my concern -- what I've called category 2 cases have this
exact quality. So given that, why not freeze what you can, elsewhere,
on other pages that don't have the same issue (presumably the vast
vast majority in the table)? That way you have the best possible
chance of recovering once the DBA gets a clue and fixes the issue.
That's the part I'm not sure I believe. Imagine a table with a
gigantic number of pages that are not yet all-visible, a small number
of all-visible pages, and one page containing very old XIDs on which a
cursor holds a pin. I don't think it's obvious that not waiting is
best. Maybe you're going to end up vacuuming the table repeatedly and
doing nothing useful. If you avoid vacuuming it repeatedly, you still
have a lot of work to do once the DBA locates a clue.
I think there's probably an important principle buried in here: the
XID threshold that forces a vacuum had better also force waiting for
pins. If it doesn't, you can tight-loop on that table without getting
anything done.
That's kind of what I meant. The difference between 50 million and 150
million is rather unclear indeed. So having accepted that that might
be true, why not be open to the possibility that it won't turn out to
be true in the long run, for any given table? With the enhancements
from the patch series in place (particularly the early freezing
stuff), what do we have to lose by making the FreezeLimit XID cutoff
for freezing much higher than your typical vacuum_freeze_min_age?
Maybe the same as autovacuum_freeze_max_age or vacuum_freeze_table_age
(it can't be higher than that without also making these other settings
become meaningless, of course).
We should probably distinguish between the situation where (a) an
adverse pin is held continuously and effectively forever and (b)
adverse pins are held frequently but for short periods of time. I
think it's possible to imagine a small, very hot table (or portion of
a table) where very high concurrency means there are often pins. In
case (a), it's not obvious that waiting will ever resolve anything,
although it might prevent other problems like infinite looping. In
case (b), a brief wait will do a lot of good. But maybe that doesn't
even matter. I think part of your argument is that if we fail to
update relfrozenxid for a while, that really isn't that bad.
I think I agree, up to a point. One consequence of failing to
immediately advance relfrozenxid might be that pg_clog and friends are
bigger, but that's pretty minor. Another consequence might be that we
might vacuum the table more times, which is more serious. I'm not
really sure that can happen to a degree that is meaningful, apart from
the infinite loop case already described, but I'm also not entirely
sure that it can't.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jan 17, 2022 at 7:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 13, 2022 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
1. Cases where our inability to get a cleanup lock signifies nothing
at all about the page in question, or any page in the same table, with
the same workload.
2. Pathological cases. Cases where we're at least at the mercy of the
application to do something about an idle cursor, where the situation
may be entirely hopeless on a long enough timeline. (Whether or not it
actually happens in the end is less significant.)
Sure. I'm worrying about case (2). I agree that in case (1) waiting
for the lock is almost always the wrong idea.
I don't doubt that we'd each have little difficulty determining
which category (1 or 2) a given real world case should be placed in,
using a variety of methods that put the issue in context (e.g.,
looking at the application code, talking to the developers or the
DBA). Of course, it doesn't follow that it would be easy to teach
vacuumlazy.c how to determine which category the same "can't get
cleanup lock" falls under, since (just for starters) there is no
practical way for VACUUM to see all that context.
That's what I'm effectively trying to work around with this "wait and
see approach" that demotes FreezeLimit to a backstop (and so justifies
removing the vacuum_freeze_min_age GUC that directly dictates our
FreezeLimit today). The cure may be worse than the disease, and the cure
isn't actually all that great at the best of times, so we should wait
until the disease visibly gets pretty bad before being
"interventionist" by waiting for a cleanup lock.
I've already said plenty about why I don't like vacuum_freeze_min_age
(or FreezeLimit) due to XIDs being fundamentally the wrong unit. But
that's not the only fundamental problem that I see. The other problem
is this: vacuum_freeze_min_age also dictates when an aggressive VACUUM
will start to wait for a cleanup lock. But why should the first thing
be the same as the second thing? I see absolutely no reason for it.
(Hence the idea of making FreezeLimit a backstop, and getting rid of
the GUC itself.)
This is my concern -- what I've called category 2 cases have this
exact quality. So given that, why not freeze what you can, elsewhere,
on other pages that don't have the same issue (presumably the vast
vast majority in the table)? That way you have the best possible
chance of recovering once the DBA gets a clue and fixes the issue.
That's the part I'm not sure I believe.
To be clear, I think that I have yet to adequately demonstrate that
this is true. It's a bit tricky to do so -- absence of evidence isn't
evidence of absence. I think that your principled skepticism makes
sense right now.
Fortunately the early refactoring patches should be uncontroversial.
The controversial parts are all in the last patch in the patch series,
which isn't too much code. (Plus another patch to at least get rid of
vacuum_freeze_min_age, and maybe vacuum_freeze_table_age too, that
hasn't been written just yet.)
Imagine a table with a
gigantic number of pages that are not yet all-visible, a small number
of all-visible pages, and one page containing very old XIDs on which a
cursor holds a pin. I don't think it's obvious that not waiting is
best. Maybe you're going to end up vacuuming the table repeatedly and
doing nothing useful. If you avoid vacuuming it repeatedly, you still
have a lot of work to do once the DBA locates a clue.
Maybe this is a simpler way of putting it: I want to delay waiting on
a pin until it's pretty clear that we truly have a pathological case,
which should in practice be limited to an anti-wraparound VACUUM,
which will now be naturally rare -- most individual tables will
literally never have even one anti-wraparound VACUUM.
We don't need to reason about the vacuuming schedule this way, since
anti-wraparound VACUUMs are driven by age(relfrozenxid) -- we don't
really have to predict anything. Maybe we'll need to do an
anti-wraparound VACUUM immediately after a non-aggressive autovacuum
runs, without getting a cleanup lock (due to an idle cursor
pathological case). We won't be able to advance relfrozenxid until the
anti-wraparound VACUUM runs (at the earliest) in this scenario, but it
makes no difference. Rather than predicting the future, we're covering
every possible outcome (at least to the extent that that's possible).
I think there's probably an important principle buried in here: the
XID threshold that forces a vacuum had better also force waiting for
pins. If it doesn't, you can tight-loop on that table without getting
anything done.
I absolutely agree -- that's why I think that we still need
FreezeLimit. Just as a backstop, that in practice very rarely
influences our behavior. Probably just in those remaining cases that
are never vacuumed except for the occasional anti-wraparound VACUUM
(even then it might not be very important).
We should probably distinguish between the situation where (a) an
adverse pin is held continuously and effectively forever and (b)
adverse pins are held frequently but for short periods of time.
I agree. It's just hard to do that from vacuumlazy.c, during a routine
non-aggressive VACUUM operation.
I think it's possible to imagine a small, very hot table (or portion of
a table) where very high concurrency means there are often pins. In
case (a), it's not obvious that waiting will ever resolve anything,
although it might prevent other problems like infinite looping. In
case (b), a brief wait will do a lot of good. But maybe that doesn't
even matter. I think part of your argument is that if we fail to
update relfrozenxid for a while, that really isn't that bad.
Yeah, that is a part of it -- it doesn't matter (until it really
matters), and we should be careful to avoid making the situation worse
by waiting for a cleanup lock unnecessarily. That's actually a very
drastic thing to do, at least in a world where freezing has been
decoupled from advancing relfrozenxid.
Updating relfrozenxid should now be thought of as a continuous thing,
not a discrete thing. And so it's highly unlikely that any given
VACUUM will ever *completely* fail to advance relfrozenxid -- that
fact alone signals a pathological case (things that are supposed to be
continuous should not ever appear to be discrete). But you need multiple
VACUUMs to see this "signal". It is only revealed over time.
It seems wise to make the most modest possible assumptions about
what's going on here. We might well "get lucky" before the next VACUUM
comes around when we encounter what at first appears to be a
problematic case involving an idle cursor -- for all kinds of reasons.
Like maybe an opportunistic prune gets rid of the old XID for us,
without any freezing, during some brief window where the application
doesn't have a cursor. We're only talking about one or two heap pages
here.
We might also *not* "get lucky" with the application and its use of
idle cursors, of course. But in that case we must have been doomed all
along. And we'll at least have put things on a much better footing in
this disaster scenario -- there is relatively little freezing left to
do in single user mode, and relfrozenxid should already be the same as
the exact oldest XID in that one page.
I think I agree, up to a point. One consequence of failing to
immediately advance relfrozenxid might be that pg_clog and friends are
bigger, but that's pretty minor.
My arguments are probabilistic (sort of), which makes it tricky.
Actual test cases/benchmarks should bear out the claims that I've
made. If anything fully convinces you, it'll be that, I think.
Another consequence might be that we
might vacuum the table more times, which is more serious. I'm not
really sure that can happen to a degree that is meaningful, apart from
the infinite loop case already described, but I'm also not entirely
sure that it can't.
It's definitely true that this overall strategy could result in there
being more individual VACUUM operations. But that naturally
follows from teaching VACUUM to avoid waiting indefinitely.
Obviously the important question is whether we'll do
meaningfully more work for less benefit (in Postgres 15, relative to
Postgres 14). Your concern is very reasonable. I just can't imagine
how we could lose out to any notable degree. Which is a start.
--
Peter Geoghegan
On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
Updating relfrozenxid should now be thought of as a continuous thing,
not a discrete thing.
I think that's pretty nearly 100% wrong. The most simplistic way of
expressing that is to say - clearly it can only happen when VACUUM
runs, which is not all the time. That's a bit facile, though; let me
try to say something a little smarter. There are real production
systems that exist today where essentially all vacuums are
anti-wraparound vacuums. And there are also real production systems
that exist today where virtually none of the vacuums are
anti-wraparound vacuums. So if we ship your proposed patches, the
frequency with which relfrozenxid gets updated is going to increase by
a large multiple, perhaps 100x, for the second group of people, who
will then perceive the movement of relfrozenxid to be much closer to
continuous than it is today even though, technically, it's still a
step function. But the people in the first category are not going to
see any difference at all.
And therefore the reasoning that says - anti-wraparound vacuums just
aren't going to happen any more - or - relfrozenxid will advance
continuously seems like dangerous wishful thinking to me. It's only
true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need
not be true in any particular environment, which to me means that all
conclusions based on the idea that it has to be true are pretty
dubious. There's no doubt in my mind that advancing relfrozenxid
opportunistically is a good idea. However, I'm not sure how reasonable
it is to change any other behavior on the basis of the fact that we're
doing it, because we don't know how often it really happens.
If someone says "every time I travel to Europe on business, I will use
the opportunity to bring you back a nice present," you can't evaluate
how much impact that will have on your life without knowing how often
they travel to Europe on business. And that varies radically from
"never" to "a lot" based on the person.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jan 17, 2022 at 2:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 17, 2022 at 4:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
Updating relfrozenxid should now be thought of as a continuous thing,
not a discrete thing.
I think that's pretty nearly 100% wrong. The most simplistic way of
expressing that is to say - clearly it can only happen when VACUUM
runs, which is not all the time.
That just seems like semantics to me. The very next sentence after the
one you quoted in your reply was "And so it's highly unlikely that any
given VACUUM will ever *completely* fail to advance relfrozenxid".
It's continuous *within* each VACUUM. As far as I can tell there is
pretty much no way that the patch series will ever fail to advance
relfrozenxid *by at least a little bit*, barring pathological cases
with cursors and whatnot.
That's a bit facile, though; let me
try to say something a little smarter. There are real production
systems that exist today where essentially all vacuums are
anti-wraparound vacuums. And there are also real production systems
that exist today where virtually none of the vacuums are
anti-wraparound vacuums. So if we ship your proposed patches, the
frequency with which relfrozenxid gets updated is going to increase by
a large multiple, perhaps 100x, for the second group of people, who
will then perceive the movement of relfrozenxid to be much closer to
continuous than it is today even though, technically, it's still a
step function. But the people in the first category are not going to
see any difference at all.
Actually, I think that even the people in the first category might
well have about the same improved experience. Not just because of this
patch series, mind you. It would also have a lot to do with the
autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to
mention the freeze map. What version are these users on?
I have actually seen this for myself. With BenchmarkSQL, the largest
table (the order lines table) starts out having its autovacuums driven
entirely by autovacuum_vacuum_insert_scale_factor, even though there
is a fair amount of bloat from updates. It stays like that for hours
on HEAD. But even with my reasonably tuned setup, there is eventually
a switchover point. Eventually all autovacuums end up as aggressive
anti-wraparound VACUUMs -- this happens once the table gets
sufficiently large (this is one of the two that is append-only, with
one update to every inserted row from the delivery transaction, which
happens hours after the initial insert).
With the patch series, we have a kind of virtuous circle with freezing
and with advancing relfrozenxid with the same order lines table. As
far as I can tell, we fix the problem with the patch series. Because
there are about 10 tuples inserted per new order transaction, the
actual "XID consumption rate of the table" is much lower than the
"worst case XID consumption" for such a table.
It's also true that even with the patch we still get anti-wraparound
VACUUMs for two fixed-size, hot-update-only tables: the stock table,
and the customers table. But that's no big deal. It only happens
because nothing else will ever trigger an autovacuum, no matter the
autovacuum_freeze_max_age setting.
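For concreteness, here is a rough sketch of the triggering conditions being described -- toy code, not the actual autovacuum.c logic, which also weighs multixact age and per-table reloptions. A table that only ever satisfies the last test gets nothing but anti-wraparound VACUUMs, with or without the patch:

#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the per-table numbers autovacuum looks at */
typedef struct TableStats
{
    double   dead_tuples;       /* bloat accumulated since the last VACUUM */
    double   inserted_tuples;   /* inserts since the last VACUUM */
    double   reltuples;
    uint32_t relfrozenxid_age;  /* age(relfrozenxid) */
} TableStats;

static bool
table_needs_autovacuum(const TableStats *t,
                       double vac_threshold, double vac_scale_factor,
                       double ins_threshold, double ins_scale_factor,
                       uint32_t freeze_max_age)
{
    if (t->dead_tuples > vac_threshold + vac_scale_factor * t->reltuples)
        return true;            /* "bloat" trigger */
    if (t->inserted_tuples > ins_threshold + ins_scale_factor * t->reltuples)
        return true;            /* insert trigger (Postgres 13+) */
    if (t->relfrozenxid_age > freeze_max_age)
        return true;            /* anti-wraparound trigger */
    return false;
}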
And therefore the reasoning that says - anti-wraparound vacuums just
aren't going to happen any more - or - relfrozenxid will advance
continuously seems like dangerous wishful thinking to me.
I never said that anti-wraparound vacuums just won't happen anymore. I
said that they'll be limited to cases like the stock table or
customers table case. I was very clear on that point.
With pgbench, whether or not you ever see any anti-wraparound VACUUMs
will depend on the heap fillfactor for the accounts table -- set it
low enough (maybe to 90) and you will still get them, since there
won't be any other reason to VACUUM. As for the branches table, and
the tellers table, they'll get VACUUMs in any case, regardless of heap
fillfactor. And so they'll always advance relfrozenxid during each
VACUUM, and never have even one anti-wraparound VACUUM.
It's only
true if (# of vacuums) / (# of wraparound vacuums) >> 1. And that need
not be true in any particular environment, which to me means that all
conclusions based on the idea that it has to be true are pretty
dubious. There's no doubt in my mind that advancing relfrozenxid
opportunistically is a good idea. However, I'm not sure how reasonable
it is to change any other behavior on the basis of the fact that we're
doing it, because we don't know how often it really happens.
It isn't that hard to see that the cases where we continue to get any
anti-wraparound VACUUMs with the patch seem to be limited to cases
like the stock/customers table, or cases like the pathological idle
cursor cases we've been discussing. Pretty narrow cases, overall.
Don't take my word for it - see for yourself.
--
Peter Geoghegan
On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
That just seems like semantics to me. The very next sentence after the
one you quoted in your reply was "And so it's highly unlikely that any
given VACUUM will ever *completely* fail to advance relfrozenxid".
It's continuous *within* each VACUUM. As far as I can tell there is
pretty much no way that the patch series will ever fail to advance
relfrozenxid *by at least a little bit*, barring pathological cases
with cursors and whatnot.
I mean this boils down to saying that VACUUM will advance relfrozenxid
except when it doesn't.
Actually, I think that even the people in the first category might
well have about the same improved experience. Not just because of this
patch series, mind you. It would also have a lot to do with the
autovacuum_vacuum_insert_scale_factor stuff in Postgres 13. Not to
mention the freeze map. What version are these users on?
I think it varies. I expect the increase in the default cost limit to
have had a much more salutary effect than
autovacuum_vacuum_insert_scale_factor, but I don't know for sure. At
any rate, if you make the database big enough and generate dirty data
fast enough, it doesn't matter what the default limits are.
I never said that anti-wraparound vacuums just won't happen anymore. I
said that they'll be limited to cases like the stock table or
customers table case. I was very clear on that point.
I don't know how I'm supposed to sensibly respond to a statement like
this. If you were very clear, then I'm being deliberately obtuse if I
fail to understand. If I say you weren't very clear, then we're just
contradicting each other.
It isn't that hard to see that the cases where we continue to get any
anti-wraparound VACUUMs with the patch seem to be limited to cases
like the stock/customers table, or cases like the pathological idle
cursor cases we've been discussing. Pretty narrow cases, overall.
Don't take my word for it - see for yourself.
I don't think that's really possible. Words like "narrow" and
"pathological" are value judgments, not factual statements. If I do an
experiment where no wraparound autovacuums happen, as I'm sure I can,
then those are the normal cases where the patch helps. If I do an
experiment where they do happen, as I'm sure that I also can, you'll
probably say either that the case in question is like the
stock/customers table, or that it's pathological. What will any of
this prove?
I think we're reaching the point of diminishing returns in this
conversation. What I want to know is that users aren't going to be
harmed - even in cases where they have behavior that is like the
stock/customers table, or that you consider pathological, or whatever
other words we want to use to describe the weird things that happen to
people. And I think we've made perhaps a bit of modest progress in
exploring that issue, but certainly less than I'd like. I don't want
to spend the next several days going around in circles about it
though. That does not seem likely to make anyone happy.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jan 17, 2022 at 8:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 17, 2022 at 5:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
That just seems like semantics to me. The very next sentence after the
one you quoted in your reply was "And so it's highly unlikely that any
given VACUUM will ever *completely* fail to advance relfrozenxid".
It's continuous *within* each VACUUM. As far as I can tell there is
pretty much no way that the patch series will ever fail to advance
relfrozenxid *by at least a little bit*, barring pathological cases
with cursors and whatnot.
I mean this boils down to saying that VACUUM will advance relfrozenxid
except when it doesn't.
It actually doesn't boil down, at all. The world is complicated and
messy, whether we like it or not.
I never said that anti-wraparound vacuums just won't happen anymore. I
said that they'll be limited to cases like the stock table or
customers table case. I was very clear on that point.
I don't know how I'm supposed to sensibly respond to a statement like
this. If you were very clear, then I'm being deliberately obtuse if I
fail to understand.
I don't know if I'd accuse you of being obtuse, exactly. Mostly I just
think it's strange that you don't seem to take what I say seriously
when it cannot be proven very easily. I don't think that you intend
this to be disrespectful, and I don't take it personally. I just don't
understand it.
It isn't that hard to see that the cases where we continue to get any
anti-wraparound VACUUMs with the patch seem to be limited to cases
like the stock/customers table, or cases like the pathological idle
cursor cases we've been discussing. Pretty narrow cases, overall.
Don't take my word for it - see for yourself.
I don't think that's really possible. Words like "narrow" and
"pathological" are value judgments, not factual statements. If I do an
experiment where no wraparound autovacuums happen, as I'm sure I can,
then those are the normal cases where the patch helps. If I do an
experiment where they do happen, as I'm sure that I also can, you'll
probably say either that the case in question is like the
stock/customers table, or that it's pathological. What will any of
this prove?
You seem to be suggesting that I used words like "pathological" in
some kind of highly informal, totally subjective way, when I did no
such thing.
I quite clearly said that you'll only get an anti-wraparound VACUUM
with the patch applied when the only factor that *ever* causes *any*
autovacuum worker to VACUUM the table (assuming the workload is
stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With
a table like this, even increasing autovacuum_freeze_max_age to its
absolute maximum of 2 billion would not make it any more likely that
we'd get a non-aggressive VACUUM -- it would merely make the
anti-wraparound VACUUMs less frequent. No big change should be
expected with a table like that.
Also, since the patch is not magic, and doesn't even change the basic
invariants for relfrozenxid, it's still true that any scenario in
which it's fundamentally impossible for VACUUM to keep up will also
have anti-wraparound VACUUMs. But that's the least of the user's
trouble -- in the long run we're going to have the system refuse to
allocate new XIDs with such a workload.
The claim that I have made is 100% testable. Even if it was flat out
incorrect, not getting anti-wraparound VACUUMs per se is not the
important part. The important part is that the work is managed
intelligently, and the burden is spread out over time. I am
particularly concerned about the "freezing cliff" we get when many
pages are all-visible but not also all-frozen. Consistently avoiding
an anti-wraparound VACUUM (except with very particular workload
characteristics) is really just a side effect -- it's something that
makes the overall benefit relatively obvious, and relatively easy to
measure. I thought that you'd appreciate that.
--
Peter Geoghegan
On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
I quite clearly said that you'll only get an anti-wraparound VACUUM
with the patch applied when the only factor that *ever* causes *any*
autovacuum worker to VACUUM the table (assuming the workload is
stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With
a table like this, even increasing autovacuum_freeze_max_age to its
absolute maximum of 2 billion would not make it any more likely that
we'd get a non-aggressive VACUUM -- it would merely make the
anti-wraparound VACUUMs less frequent. No big change should be
expected with a table like that.
Sure, I don't disagree with any of that. I don't see how I could. But
I don't see how it detracts from the points I was trying to make
either.
Also, since the patch is not magic, and doesn't even change the basic
invariants for relfrozenxid, it's still true that any scenario in
which it's fundamentally impossible for VACUUM to keep up will also
have anti-wraparound VACUUMs. But that's the least of the user's
trouble -- in the long run we're going to have the system refuse to
allocate new XIDs with such a workload.
Also true. But again, it's just about making sure that the patch
doesn't make other decisions that make things worse for people in that
situation. That's what I was expressing uncertainty about.
The claim that I have made is 100% testable. Even if it was flat out
incorrect, not getting anti-wraparound VACUUMs per se is not the
important part. The important part is that the work is managed
intelligently, and the burden is spread out over time. I am
particularly concerned about the "freezing cliff" we get when many
pages are all-visible but not also all-frozen. Consistently avoiding
an anti-wraparound VACUUM (except with very particular workload
characteristics) is really just a side effect -- it's something that
makes the overall benefit relatively obvious, and relatively easy to
measure. I thought that you'd appreciate that.
I do.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 6:11 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Jan 18, 2022 at 12:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
I quite clearly said that you'll only get an anti-wraparound VACUUM
with the patch applied when the only factor that *ever* causes *any*
autovacuum worker to VACUUM the table (assuming the workload is
stable) is the anti-wraparound/autovacuum_freeze_max_age cutoff. With
a table like this, even increasing autovacuum_freeze_max_age to its
absolute maximum of 2 billion would not make it any more likely that
we'd get a non-aggressive VACUUM -- it would merely make the
anti-wraparound VACUUMs less frequent. No big change should be
expected with a table like that.
Sure, I don't disagree with any of that. I don't see how I could. But
I don't see how it detracts from the points I was trying to make
either.
You said "...the reasoning that says - anti-wraparound vacuums just
aren't going to happen any more - or - relfrozenxid will advance
continuously seems like dangerous wishful thinking to me". You then
proceeded to attack a straw man -- a view that I couldn't possibly
hold. This certainly surprised me, because my actual claims seemed
well within the bounds of what is possible, and in any case can be
verified with a fairly modest effort.
That's what I was reacting to -- it had nothing to do with any
concerns you may have had. I wasn't thinking about long-idle cursors
at all. I was defending myself, because I was put in a position where
I had to defend myself.
Also, since the patch is not magic, and doesn't even change the basic
invariants for relfrozenxid, it's still true that any scenario in
which it's fundamentally impossible for VACUUM to keep up will also
have anti-wraparound VACUUMs. But that's the least of the user's
trouble -- in the long run we're going to have the system refuse to
allocate new XIDs with such a workload.
Also true. But again, it's just about making sure that the patch
doesn't make other decisions that make things worse for people in that
situation. That's what I was expressing uncertainty about.
I am not just trying to avoid making things worse when users are in
this situation. I actually want to give users every chance to avoid
being in this situation in the first place. In fact, almost everything
I've said about this aspect of things was about improving things for
these users. It was not about covering myself -- not at all. It would
be easy for me to throw up my hands, and change nothing here (keep the
behavior that makes FreezeLimit derived from the vacuum_freeze_min
GUC), since it's all incidental to the main goals of this patch
series.
I still don't understand why you think that my idea (not yet
implemented) of making FreezeLimit into a backstop (making it
autovacuum_freeze_max_age/2 or something) and relying on the new
"early freezing" criteria for almost everything is going to make the
situation worse in this scenario with long idle cursors. It's intended
to make it better.
Why do you think that the current vacuum_freeze_min_age-based
FreezeLimit isn't actually the main problem in these scenarios? I
think that the way that that works right now (in particular during
aggressive VACUUMs) is just an accident of history. It's all path
dependence -- each incremental step may have made sense, but what we
have now doesn't seem to. Waiting for a cleanup lock might feel like
the diligent thing to do, but that doesn't make it so.
My sense is that there are very few apps that are hopelessly incapable
of advancing relfrozenxid from day one. I find it much easier to
believe that users that had this experience got away with it for a
very long time, until their luck ran out, somehow. I would like to
minimize the chance of that ever happening, to the extent that that's
possible within the confines of the basic heapam/vacuumlazy.c
invariants.
--
Peter Geoghegan
On Tue, Jan 18, 2022 at 1:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
That's what I was reacting to -- it had nothing to do with any
concerns you may have had. I wasn't thinking about long-idle cursors
at all. I was defending myself, because I was put in a position where
I had to defend myself.
I don't think I've said anything on this thread that is an attack on
you. I am getting pretty frustrated with the tenor of the discussion,
though. I feel like you're the one attacking me, and I don't like it.
I still don't understand why you think that my idea (not yet
implemented) of making FreezeLimit into a backstop (making it
autovacuum_freeze_max_age/2 or something) and relying on the new
"early freezing" criteria for almost everything is going to make the
situation worse in this scenario with long idle cursors. It's intended
to make it better.
I just don't understand how I haven't been able to convey my concern
here by now. I've already written multiple emails about it. If none of
them were clear enough for you to understand, I'm not sure how saying
the same thing over again can help. When I say I've already written
about this, I'm referring specifically to the following:
- /messages/by-id/CA+TgmobKJm9BsZR3ETeb6MJdLKWxKK5ZXx0XhLf-W9kUgvOcNA@mail.gmail.com
in the second-to-last paragraph, beginning with "I don't really see"
- /messages/by-id/CA+TgmoaGoZ2wX6T4sj0eL5YAOQKW3tS8ViMuN+tcqWJqFPKFaA@mail.gmail.com
in the second paragraph beginning with "Because waiting on a lock"
- /messages/by-id/CA+TgmoZYri_LUp4od_aea=A8RtjC+-Z1YmTc7ABzTf+tRD2Opw@mail.gmail.com
in the paragraph beginning with "That's the part I'm not sure I
believe."
For all of that, I'm not even convinced that you're wrong. I just
think you might be wrong. I don't really know. It seems to me however
that you're understating the value of waiting, which I've tried to
explain in the above places. Waiting does have the very real
disadvantage of starving the rest of the system of the work that
autovacuum worker would have been doing, and that's why I think you
might be right. However, there are cases where waiting, and only
waiting, gets the job done. If you're not willing to admit that those
cases exist, or you think they don't matter, then we disagree. If you
admit that they exist and think they matter but believe that there's
some reason why increasing FreezeLimit can't cause any damage, then
either (a) you have a good reason for that belief which I have thus
far been unable to understand or (b) you're more optimistic about the
proposed change than can be entirely justified.
My sense is that there are very few apps that are hopelessly incapable
of advancing relfrozenxid from day one. I find it much easier to
believe that users that had this experience got away with it for a
very long time, until their luck ran out, somehow. I would like to
minimize the chance of that ever happening, to the extent that that's
possible within the confines of the basic heapam/vacuumlazy.c
invariants.
I agree with the idea that most people are OK at the beginning and
then at some point their luck runs out and catastrophe strikes. I
think there are a couple of different kinds of catastrophe that can
happen. For instance, somebody could park a cursor in the middle of a
table someplace and leave it there until the snow melts. Or, somebody
could take a table lock and sit on it forever. Or, there could be a
corrupted page in the table that causes VACUUM to error out every time
it's reached. In the second and third situations, it doesn't matter a
bit what we do with FreezeLimit, but in the first one it might. If the
user is going to leave that cursor sitting there literally forever,
the best solution is to raise FreezeLimit as high as we possibly can.
The system is bound to shut down due to wraparound at some point, but
we at least might as well vacuum other stuff while we're waiting for
that to happen. On the other hand if that user is going to close that
cursor after 10 minutes and open a new one in the same place 10
seconds later, the best thing to do is to keep FreezeLimit as low as
possible, because the first time we wait for the pin to be released
we're guaranteed to advance relfrozenxid within 10 minutes, whereas if
we don't do that we may keep missing the brief windows in which no
cursor is held for a very long time. But we have absolutely no way of
knowing which of those things is going to happen on any particular
system, or of estimating which one is more common in general.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jan 19, 2022 at 6:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
I don't think I've said anything on this thread that is an attack on
you. I am getting pretty frustrated with the tenor of the discussion,
though. I feel like you're the one attacking me, and I don't like it.
"Attack" is a strong word (much stronger than "defend"), and I don't
think I'd use it to describe anything that has happened on this
thread. All I said was that you misrepresented my views when you
pounced on my use of the word "continuous". Which, honestly, I was
very surprised by.
For all of that, I'm not even convinced that you're wrong. I just
think you might be wrong. I don't really know.
I agree that I might be wrong, though of course I think that I'm
probably correct. I value your input as a critical voice -- that's
generally how we get really good designs.
However, there are cases where waiting, and only
waiting, gets the job done. If you're not willing to admit that those
cases exist, or you think they don't matter, then we disagree.
They exist, of course. That's why I don't want to completely eliminate
the idea of waiting for a cleanup lock. Rather, I want to change the
design to recognize that that's an extreme measure, that should be
delayed for as long as possible. There are many ways that the problem
could naturally resolve itself.
Waiting for a cleanup lock after only 50 million XIDs (the
vacuum_freeze_min_age default) is like performing brain surgery to
treat somebody with a headache (at least with the infrastructure from
the earlier patches in place). It's not impossible that "surgery"
could help, in theory (could be a tumor, better to catch these things
early!), but that fact alone can hardly justify such a drastic
measure. That doesn't mean that brain surgery isn't ever appropriate,
of course. It should be delayed until it starts to become obvious that
it's really necessary (but before it really is too late).
If you
admit that they exist and think they matter but believe that there's
some reason why increasing FreezeLimit can't cause any damage, then
either (a) you have a good reason for that belief which I have thus
far been unable to understand or (b) you're more optimistic about the
proposed change than can be entirely justified.
I don't deny that it's just about possible that the changes that I'm
thinking of could make the situation worse in some cases, but I think
that the overwhelming likelihood is that things will be improved
across the board.
Consider the age of the tables from BenchmarkSQL, with the patch series:
     relname      │     age     │ mxid_age
──────────────────┼─────────────┼──────────
 bmsql_district   │         657 │        0
 bmsql_warehouse  │         696 │        0
 bmsql_item       │   1,371,978 │        0
 bmsql_config     │   1,372,061 │        0
 bmsql_new_order  │   3,754,163 │        0
 bmsql_history    │  11,545,940 │        0
 bmsql_order_line │  23,095,678 │        0
 bmsql_oorder     │  40,653,743 │        0
 bmsql_customer   │  51,371,610 │        0
 bmsql_stock      │  51,371,610 │        0
(10 rows)
We see significant "natural variation" here, unlike HEAD, where the
age of all tables is exactly the same at all times, or close to it
(incidentally, this leads to the largest tables all being
anti-wraparound VACUUMed at the same time). There is a kind of natural
ebb and flow for each table over time, as relfrozenxid is advanced,
due in part to workload characteristics. Less than half of all XIDs
will ever modify the two largest tables, for example, and so
autovacuum should probably never be launched because of the age of
either table (barring some change in workload conditions, perhaps). As
I've said a few times now, XIDs are generally "the wrong unit", except
when needed as a backstop against wraparound failure.
The natural variation that I see contributes to my optimism. A
situation where we cannot get a cleanup lock may well resolve itself,
for many reasons, that are hard to precisely nail down but are
nevertheless very real.
The vacuum_freeze_min_age design (particularly within an aggressive
VACUUM) is needlessly rigid, probably just because the assumption
before now has always been that we can only advance relfrozenxid in an
aggressive VACUUM (it might happen in a non-aggressive VACUUM if we
get very lucky, which cannot be accounted for). Because it is rigid,
it is brittle. Because it is brittle, it will (on a long enough
timeline, for a susceptible workload) actually break.
On the other hand if that user is going to close that
cursor after 10 minutes and open a new one in the same place 10
seconds later, the best thing to do is to keep FreezeLimit as low as
possible, because the first time we wait for the pin to be released
we're guaranteed to advance relfrozenxid within 10 minutes, whereas if
we don't do that we may keep missing the brief windows in which no
cursor is held for a very long time. But we have absolutely no way of
knowing which of those things is going to happen on any particular
system, or of estimating which one is more common in general.
I agree with all that, and I think that this particular scenario is
the crux of the issue.
The first time this happens (and we don't get a cleanup lock), then we
will at least be able to set relfrozenxid to the exact oldest unfrozen
XID. So that'll already have bought us some wallclock time -- often a
great deal (why should the oldest XID on such a page be particularly
old?). Furthermore, there will often be many more VACUUMs before we
need to do an aggressive VACUUM -- each of these VACUUM operations is
an opportunity to freeze the oldest tuple that holds up cleanup. Or
maybe this XID is in a dead tuple, and so somebody's opportunistic
pruning operation does the right thing for us. Never underestimate the
power of dumb luck, especially in a situation where there are many
individual "trials", and we only have to get lucky once.
If and when that doesn't work out, and we actually have to do an
anti-wraparound VACUUM, then something will have to give. Since
anti-wraparound VACUUMs are naturally confined to certain kinds of
tables/workloads with the patch series, we can now be pretty confident
that the problem really is with this one problematic heap page, with
the idle cursor. We could even verify this directly if we wanted to,
by noticing that the preexisting relfrozenxid is an exact match for
one XID on some can't-cleanup-lock page -- we could emit a WARNING
about the page/tuple if we wanted to. To return to my colorful analogy
from earlier, we now know that the patient almost certainly has a
brain tumor.
What new risk is implied by delaying the wait like this? Very little,
I believe. Let's say we derive FreezeLimit from
autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We
still ought to have the opportunity to wait for the cleanup lock for
rather a long time -- if the XID consumption rate is so high that that
isn't true, then we're doomed anyway. All told, there seems to be a
huge net reduction in risk with this design.
--
Peter Geoghegan
On Wed, Jan 19, 2022 at 2:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
On the other hand if that user is going to close that
cursor after 10 minutes and open a new one in the same place 10
seconds later, the best thing to do is to keep FreezeLimit as low as
possible, because the first time we wait for the pin to be released
we're guaranteed to advance relfrozenxid within 10 minutes, whereas if
we don't do that we may keep missing the brief windows in which no
cursor is held for a very long time. But we have absolutely no way of
knowing which of those things is going to happen on any particular
system, or of estimating which one is more common in general.
I agree with all that, and I think that this particular scenario is
the crux of the issue.
Great, I'm glad we agree on that much. I would be interested in
hearing what other people think about this scenario.
The first time this happens (and we don't get a cleanup lock), then we
will at least be able to set relfrozenxid to the exact oldest unfrozen
XID. So that'll already have bought us some wallclock time -- often a
great deal (why should the oldest XID on such a page be particularly
old?). Furthermore, there will often be many more VACUUMs before we
need to do an aggressive VACUUM -- each of these VACUUM operations is
an opportunity to freeze the oldest tuple that holds up cleanup. Or
maybe this XID is in a dead tuple, and so somebody's opportunistic
pruning operation does the right thing for us. Never underestimate the
power of dumb luck, especially in a situation where there are many
individual "trials", and we only have to get lucky once.If and when that doesn't work out, and we actually have to do an
anti-wraparound VACUUM, then something will have to give. Since
anti-wraparound VACUUMs are naturally confined to certain kinds of
tables/workloads with the patch series, we can now be pretty confident
that the problem really is with this one problematic heap page, with
the idle cursor. We could even verify this directly if we wanted to,
by noticing that the preexisting relfrozenxid is an exact match for
one XID on some can't-cleanup-lock page -- we could emit a WARNING
about the page/tuple if we wanted to. To return to my colorful analogy
from earlier, we now know that the patient almost certainly has a
brain tumor.
What new risk is implied by delaying the wait like this? Very little,
I believe. Let's say we derive FreezeLimit from
autovacuum_freeze_max_age/2 (instead of vacuum_freeze_min_age). We
still ought to have the opportunity to wait for the cleanup lock for
rather a long time -- if the XID consumption rate is so high that that
isn't true, then we're doomed anyway. All told, there seems to be a
huge net reduction in risk with this design.
I'm just being honest here when I say that I can't see any huge
reduction in risk. Nor a huge increase in risk. It just seems
speculative to me. If I knew something about the system or the
workload, then I could say what would likely work out best on that
system, but in the abstract I neither know nor understand how it's
possible to know.
My gut feeling is that it's going to make very little difference
either way. People who never release their cursors or locks or
whatever are going to be sad either way, and people who usually do
will be happy either way. There's some in-between category of people
who release sometimes but not too often for whom it may matter,
possibly quite a lot. It also seems possible that one decision rather
than another will make the happy people MORE happy, or the sad people
MORE sad. For most people, though, I think it's going to be
irrelevant. The fact that you seem to view the situation quite
differently is a big part of what worries me here. At least one of us
is missing something.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 6:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
Great, I'm glad we agree on that much. I would be interested in
hearing what other people think about this scenario.
Agreed.
I'm just being honest here when I say that I can't see any huge
reduction in risk. Nor a huge increase in risk. It just seems
speculative to me. If I knew something about the system or the
workload, then I could say what would likely work out best on that
system, but in the abstract I neither know nor understand how it's
possible to know.
I think that it's very hard to predict the timeline with a scenario
like this -- no question. But I often imagine idealized scenarios like
the one you brought up with cursors, with the intention of lowering
the overall exposure to problems to the extent that that's possible;
if it was obvious, we'd have fixed it by now already. I cannot think
of any reason why making FreezeLimit into what I've been calling a
backstop introduces any new risk, but I can think of ways in which it
avoids risk. We shouldn't be waiting indefinitely for something
totally outside our control or understanding, and so blocking all
freezing and other maintenance on the table, until it's provably
necessary.
More fundamentally, freezing should be thought of as an overhead of
storing tuples in heap blocks, as opposed to an overhead of
transactions (that allocate XIDs). Meaning that FreezeLimit becomes
almost an emergency thing, closely associated with aggressive
anti-wraparound VACUUMs.
My gut feeling is that it's going to make very little difference
either way. People who never release their cursors or locks or
whatever are going to be sad either way, and people who usually do
will be happy either way.
In a real world scenario, the rate at which XIDs are used could be
very low. Buying a few hundred million more XIDs until the pain begins
could amount to buying weeks or months for the user in practice. Plus
they have visibility into the issue, in that they can potentially see
exactly when they stopped being able to advance relfrozenxid by
looking at the autovacuum logs.
My thinking on vacuum_freeze_min_age has shifted very slightly. I now
think that I'll probably need to keep it around, just so things like
VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally)
continue to work. So maybe its default should be changed to -1, which
is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it
should still be greatly deemphasized in user docs.
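As a rough sketch of how that -1 default might be resolved (hypothetical code, since this part hasn't been written yet, and the function name is made up):

static int
resolve_freeze_min_age(int freeze_min_age, int autovacuum_freeze_max_age)
{
    if (freeze_min_age < 0)     /* proposed -1 default */
        freeze_min_age = autovacuum_freeze_max_age / 2;

    /* keep the existing cap of half autovacuum_freeze_max_age */
    if (freeze_min_age > autovacuum_freeze_max_age / 2)
        freeze_min_age = autovacuum_freeze_max_age / 2;

    return freeze_min_age;      /* VACUUM FREEZE would still pass 0 explicitly */
}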
--
Peter Geoghegan
On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote:
My thinking on vacuum_freeze_min_age has shifted very slightly. I now
think that I'll probably need to keep it around, just so things like
VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally)
continue to work. So maybe its default should be changed to -1, which
is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it
should still be greatly deemphasized in user docs.
I like that better, because it lets us retain an escape valve in case
we should need it. I suggest that the documentation should say things
like "The default is believed to be suitable for most use cases" or
"We are not aware of a reason to change the default" rather than
something like "There is almost certainly no good reason to change
this" or "What kind of idiot are you, anyway?" :-)
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 20, 2022 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote:
My thinking on vacuum_freeze_min_age has shifted very slightly. I now
think that I'll probably need to keep it around, just so things like
VACUUM FREEZE (which sets vacuum_freeze_min_age to 0 internally)
continue to work. So maybe its default should be changed to -1, which
is interpreted as "whatever autovacuum_freeze_max_age/2 is". But it
should still be greatly deemphasized in user docs.
I like that better, because it lets us retain an escape valve in case
we should need it.
I do see some value in that, too. Though it's not going to be a way of
turning off the early freezing stuff, which seems unnecessary (though
I do still have work to do on getting the overhead for that down).
I suggest that the documentation should say things
like "The default is believed to be suitable for most use cases" or
"We are not aware of a reason to change the default" rather than
something like "There is almost certainly no good reason to change
this" or "What kind of idiot are you, anyway?" :-)
I will admit to having a big bias here: I absolutely *loathe* these
GUCs. I really, really hate them.
Consider how we have to include messy caveats about
autovacuum_freeze_min_age when talking about
autovacuum_vacuum_insert_scale_factor. Then there's the fact that you
really cannot think about the rate of XID consumption intuitively --
it has at best a weak, unpredictable relationship with anything that
users can understand, such as data stored or wall clock time.
Then there are the problems with the equivalent MultiXact GUCs, which
somehow, against all odds, are even worse:
https://buttondown.email/nelhage/archive/notes-on-some-postgresql-implementation-details/
--
Peter Geoghegan
On Thu, 20 Jan 2022 at 17:01, Peter Geoghegan <pg@bowt.ie> wrote:
Then there's the fact that you
really cannot think about the rate of XID consumption intuitively --
it has at best a weak, unpredictable relationship with anything that
users can understand, such as data stored or wall clock time.
This confuses me. "Transactions per second" is a headline database
metric that lots of users actually focus on quite heavily -- rather
too heavily imho. Ok, XID consumption is only a subset of transactions
that are not read-only but that's a detail that's pretty easy to
explain and users get pretty quickly.
There are corner cases like transactions that look read-only but are
actually read-write or transactions that consume multiple xids but
complex systems are full of corner cases and people don't seem too
surprised about these things.
What I find confuses people much more is the concept of the
oldestxmin. I think most of the autovacuum problems I've seen come
from cases where autovacuum is happily kicking off useless vacuums
because the oldestxmin hasn't actually advanced enough for them to do
any useful work.
--
greg
On Fri, Jan 21, 2022 at 12:07 PM Greg Stark <stark@mit.edu> wrote:
This confuses me. "Transactions per second" is a headline database
metric that lots of users actually focus on quite heavily -- rather
too heavily imho.
But transactions per second is for the whole database, not for
individual tables. It's also really a benchmarking thing, where the
size and variety of transactions is fixed. With something like pgbench
it actually is exactly the same thing, but such a workload is not at
all realistic. Even BenchmarkSQL/TPC-C isn't like that, despite the
fact that it is a fairly synthetic workload (it's just not super
synthetic).
Ok, XID consumption is only a subset of transactions
that are not read-only but that's a detail that's pretty easy to
explain and users get pretty quickly.
My point was mostly this: the number of distinct extant unfrozen tuple
headers (and the range of the relevant XIDs) is generally highly
unpredictable today. And the number of tuples we'll have to freeze to
be able to advance relfrozenxid by a good amount is quite variable, in
general.
For example, if we bulk extend a relation as part of an ETL process,
then the number of distinct XIDs could be as low as 1, even though we
can expect a great deal of "freeze debt" that will have to be paid off
at some point (with the current design, in the common case where the
user doesn't account for this effect because they're not already an
expert). There are other common cases that are not quite as extreme as
that, that still have the same effect -- even an expert will find it
hard or impossible to tune autovacuum_freeze_min_age for that.
Another case of interest (that illustrates the general principle) is
something like pgbench_tellers. We'll never have an aggressive VACUUM
of the table with the patch, and we shouldn't ever need to freeze any
tuples. But, owing to workload characteristics, we'll constantly be
able to keep its relfrozenxid very current, because (even if we
introduce skew) each individual row cannot go very long without being
updated, allowing old XIDs to age out that way.
There is also an interesting middle ground, where you get a mixture of
both tendencies due to skew. The tuple that's most likely to get
updated was the one that was just updated. How are you as a DBA ever
supposed to tune autovacuum_freeze_min_age if tuples happen to be
qualitatively different in this way?
What I find confuses people much more is the concept of the
oldestxmin. I think most of the autovacuum problems I've seen come
from cases where autovacuum is happily kicking off useless vacuums
because the oldestxmin hasn't actually advanced enough for them to do
any useful work.
As it happens, the proposed log output won't use the term oldestxmin
anymore -- I think that it makes sense to rename it to "removable
cutoff". Here's an example:
LOG: automatic vacuum of table "regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 317308 remain, 250258 skipped using visibility map
(78.87% of total)
tuples: 70 removed, 34105925 remain (6830471 newly frozen), 2528 are
dead but not yet removable
removable cutoff: 37574752, which is 230115 xids behind next
new relfrozenxid: 35221275, which is 5219310 xids ahead of previous value
index scan needed: 55540 pages from table (17.50% of total) had
3339809 dead item identifiers removed
index "bmsql_oorder_pkey": pages: 144257 in total, 0 newly deleted, 0
currently deleted, 0 reusable
index "bmsql_oorder_idx2": pages: 330083 in total, 0 newly deleted, 0
currently deleted, 0 reusable
I/O timings: read: 7928.207 ms, write: 1386.662 ms
avg read rate: 33.107 MB/s, avg write rate: 26.218 MB/s
buffer usage: 220825 hits, 443331 misses, 351084 dirtied
WAL usage: 576110 records, 364797 full page images, 2046767817 bytes
system usage: CPU: user: 10.62 s, system: 7.56 s, elapsed: 104.61 s
Note also that I deliberately made the "new relfrozenxid" line that
immediately follows (information that we haven't shown before now)
similar, to highlight that they're now closely related concepts. Now
if you VACUUM a table that is either empty or has only frozen tuples,
VACUUM will set relfrozenxid to oldestxmin/removable cutoff.
Internally, oldestxmin is the "starting point" for our final/target
relfrozenxid for the table. We ratchet it back dynamically, whenever
we see an older-than-current-target XID that cannot be immediately
frozen (e.g., when we can't easily get a cleanup lock on the page).
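Concretely, the ratchet can be sketched like this (standalone,
illustrative C; xid_precedes is a simplified stand-in for
TransactionIdPrecedes, and the XIDs are partly borrowed from the log
output above, partly invented):
/*
 * Sketch of the "ratchet back" behavior -- illustrative only.  The real
 * code works per tuple inside vacuumlazy.c and handles wraparound and
 * special XIDs.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
typedef uint32_t TransactionId;
/* simplified stand-in for TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}
int
main(void)
{
    TransactionId oldest_xmin = 37574752;   /* removable cutoff: the best case */
    TransactionId target_relfrozenxid = oldest_xmin;
    TransactionId unfrozen_xids[] = {37500000, 35221275, 37000000};
    for (size_t i = 0; i < sizeof(unfrozen_xids) / sizeof(unfrozen_xids[0]); i++)
    {
        /* an XID we saw but couldn't freeze: ratchet the target back */
        if (xid_precedes(unfrozen_xids[i], target_relfrozenxid))
            target_relfrozenxid = unfrozen_xids[i];
    }
    /* prints 35221275 -- matching the "new relfrozenxid" line above */
    printf("new relfrozenxid: %u\n", (unsigned) target_relfrozenxid);
    return 0;
}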
--
Peter Geoghegan
On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
I do see some value in that, too. Though it's not going to be a way of
turning off the early freezing stuff, which seems unnecessary (though
I do still have work to do on getting the overhead for that down).
Attached is v7, a revision that overhauls the algorithm that decides
what to freeze. I'm now calling it block-driven freezing in the commit
message. Also included is a new patch, that makes VACUUM record zero
free space in the FSM for an all-visible page, unless the total amount
of free space happens to be greater than one half of BLCKSZ.
The fact that I am now including this new FSM patch (v7-0006-*patch)
may seem like a case of expanding the scope of something that could
well do without it. But hear me out! It's true that the new FSM patch
isn't essential. I'm including it now because it seems relevant to the
approach taken with block-driven freezing -- it may even make my
general approach easier to understand. The new approach to freezing is
to freeze every tuple on a block that is about to be set all-visible
(and thus set it all-frozen too), or to not freeze anything on the
page at all (at least until one XID gets really old, which should be
rare). This approach has all the benefits that I described upthread,
and a new benefit: it effectively encourages the application to allow
pages to "become settled".
The main difference in how we freeze here (relative to v6 of the
patch) is that I'm *not* freezing a page just because it was
dirtied/pruned. I now think about freezing as an essentially
page-level thing, barring edge cases where we have to freeze
individual tuples, just because the XIDs really are getting old (it's
an edge case when we can't freeze all the tuples together due to a mix
of new and old, which is something we specifically set out to avoid
now).
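To make that rule concrete, here's a tiny standalone sketch (illustrative
C only, not the patch itself; should_freeze_page and both flags are
made-up names):
/*
 * Sketch of the block-driven freezing rule -- illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>
static bool
should_freeze_page(bool page_will_be_all_visible, bool has_xid_older_than_freeze_limit)
{
    if (page_will_be_all_visible)
        return true;    /* freeze everything, so the page can be set all-frozen too */
    if (has_xid_older_than_freeze_limit)
        return true;    /* FreezeLimit backstop: some XID really is getting old */
    return false;       /* otherwise, don't freeze anything on the page at all */
}
int
main(void)
{
    printf("%d\n", should_freeze_page(true, false));    /* 1: settled page, freeze it */
    printf("%d\n", should_freeze_page(false, true));    /* 1: backstop kicks in (rare) */
    printf("%d\n", should_freeze_page(false, false));   /* 0: leave the page alone */
    return 0;
}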
Freezing whole pages
====================
When VACUUM sees that all remaining/unpruned tuples on a page are
all-visible, it isn't just important because of cost control
considerations. It's deeper than that. It's also treated as a
tentative signal from the application itself, about the data itself.
Which is: this page looks "settled" -- it may never be updated again,
but if there is an update it likely won't change too much about the
whole page. Also, if the page is ever updated in the future, it's
likely that that will happen at a much later time than you should
expect for those *other* nearby pages, that *don't* appear to be
settled. And so VACUUM infers that the page is *qualitatively*
different to these other nearby pages. VACUUM therefore makes it hard
(though not impossible) for future inserts or updates to disturb these
settled pages, via this FSM behavior -- it is short sighted to just
see the space remaining on the page as free space, equivalent to any
other. This holistic approach seems to work well for
TPC-C/BenchmarkSQL, and perhaps even in general. More on TPC-C below.
This is not unlike the approach taken by other DB systems, where free
space management is baked into concurrency control, and the concept of
physical data independence as we know it from Postgres never really
existed. My approach also seems related to the concept of a "tenured
generation", which is key to generational garbage collection. The
whole basis of generational garbage collection is the generational
hypothesis: "most objects die young". This is an empirical observation
about how applications written in GC'd programming languages actually
behave, not a rigorous principle, and yet in practice it appears to
always hold. Intuitively, it seems to me like the hypothesis must work
in practice because if it didn't then a counterexample nemesis
application's behavior would be totally chaotic, in every way.
Theoretically possible, but of no real concern, since the program
makes zero practical sense *as an actual program*. A Java program must
make sense to *somebody* (at least the person that wrote it), which,
it turns out, helpfully constrains the space of possibilities that any
industrial strength GC implementation needs to handle well.
The same principles seem to apply here, with VACUUM. Grouping logical
rows into pages that become their "permanent home until further
notice" may be somewhat arbitrary, at first, but that doesn't mean it
won't end up sticking. Just like with generational garbage collection,
where the application isn't expected to instruct the GC about its
plans for the memory that it allocates, which can nevertheless be
usefully organized into distinct generations through an adaptive process.
Second order effects
====================
Relating the FSM to page freezing/all-visible setting makes much more
sense if you consider the second order effects.
There is bound to be competition for free space among backends that
access the free space map. By *not* freezing a page during VACUUM
because it looks unsettled, we make its free space available in the
traditional way instead. It follows that unsettled pages (in tables
with lots of updates) are now the only place that backends that need
more free space from the FSM can look -- unsettled pages therefore
become a hot commodity, freespace-wise. A page that initially appeared
"unsettled", that went on to become settled in this newly competitive
environment might have that happen by pure chance -- but probably not.
It *could* happen by chance, of course -- in which case the page will
get dirtied again, and the cycle continues, for now. There will be
further opportunities to figure it out, and freezing the tuples on the
page "prematurely" still has plenty of benefits.
Locality matters a lot, obviously. The goal with the FSM stuff is
merely to make it *possible* for pages to settle naturally, to the
extent that we can. We really just want to avoid hindering a naturally
occurring process -- we want to avoid destroying naturally occurring
locality. We must be willing to accept some cost for that. Even if it
takes a few attempts for certain pages, constraining the application's
choice of where to get free space from (can't be a page marked
all-visible) allows pages to *systematically* become settled over
time.
The application is in charge, really -- not VACUUM. This is already
the case, whether we like it or not. VACUUM needs to learn to live in
that reality, rather than fighting it. When VACUUM considers a page
settled, and the physical page still has a relatively large amount of
free space (say 45% of BLCKSZ, a borderline case in the new FSM
patch), "losing" so much free space certainly is unappealing. We set
the free space to 0 in the free space map all the same, because we're
cutting our losses at that point. While the exact threshold I've
proposed is tentative, the underlying theory seems pretty sound to me.
The BLCKSZ/2 cutoff (and the way that it extends the general rules for
whole-page freezing) is intended to catch pages that are qualitatively
different, as well as quantitatively different. It is a balancing act,
between not wasting space, and the risk of systemic problems involving
excessive amounts of non-HOT updates that must move a successor
version to another page.
It's possible that a higher cutoff (for example a cutoff of 80% of
BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
addition to the downsides from fragmentation -- it's far from a simple
trade-off. (Not that you should believe that 50% is special, it's just
a starting point for me.)
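For concreteness, here's a minimal standalone sketch of that FSM rule
(illustrative C only; the real code goes through the free space map APIs
such as RecordPageWithFreeSpace, and freespace_to_record is a made-up
name):
/*
 * Sketch of the FSM behavior for pages being set all-visible -- illustrative only.
 */
#include <stddef.h>
#include <stdio.h>
#define BLCKSZ 8192
static size_t
freespace_to_record(size_t actual_freespace, int setting_all_visible)
{
    if (setting_all_visible && actual_freespace <= BLCKSZ / 2)
        return 0;               /* settled page: cut our losses, keep it closed */
    return actual_freespace;    /* otherwise advertise free space as usual */
}
int
main(void)
{
    printf("%zu\n", freespace_to_record(3686, 1));  /* ~45% of BLCKSZ -> records 0 */
    printf("%zu\n", freespace_to_record(5000, 1));  /* > BLCKSZ/2 -> records 5000 */
    printf("%zu\n", freespace_to_record(3686, 0));  /* page not all-visible -> 3686 */
    return 0;
}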
TPC-C
=====
I'm going to talk about a benchmark that ran throughout the week,
starting on Monday. Each run lasted 24 hours, and there were 2 runs in
total, for both the patch and for master/baseline. So this benchmark
lasted 4 days, not including the initial bulk loading, with databases
that were over 450GB in size by the time I was done (that's 450GB+ for
both the patch and master). Benchmarking for days at a time is pretty
inconvenient, but it seems necessary to see certain effects in play.
We need to wait until the baseline/master case starts to have
anti-wraparound VACUUMs with default, realistic settings, which just
takes days and days.
I'm making available all of my data for the benchmark in question, which
is way more information than anybody is likely to want -- I dump
anything that might even be useful from the system views in an
automated way. There are html reports for all four 24-hour-long runs.
Google drive link:
https://drive.google.com/drive/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR?usp=sharing
While the patch did well overall, and I will get to the particulars
towards the end of the email, I want to start with what I consider to
be the important part: the user/admin experience with VACUUM, and
VACUUM's performance stability. This is about making VACUUM less
scary.
As I've said several times now, with an append-only table like
pgbench_history we see a consistent pattern where relfrozenxid is set
to a value very close to the same VACUUM's OldestXmin value (even
precisely equal to OldestXmin) during each VACUUM operation, again and
again, forever -- that case is easy to understand and appreciate, and
has already been discussed. Now (with v7's new approach to freezing),
a related pattern can be seen in the case of the two big, troublesome
TPC-C tables, the orders and order lines tables.
To recap, these tables are somewhat like the history table, in that
new orders insert into both tables, again and again, forever. But they
also have one huge difference to simple append-only tables too, which
is the source of most of our problems with TPC-C. The difference is:
there are also delayed, correlated updates of each row from each
table. Exactly one such update per row for both tables, which takes
place hours after each order's insert, when the earlier order is
processed by TPC-C's delivery transaction. In the long run we need the
data to age out and not get re-dirtied, as the table grows and grows
indefinitely, much like with a simple append-only table. At the same
time, we don't want to have poor free space management for these
deferred updates. It's adversarial, sort of, but in a way that is
grounded in reality.
With the order and order lines tables, relfrozenxid tends to be
advanced up to the OldestXmin used by the *previous* VACUUM operation
-- an unmistakable pattern. I'll show you all of the autovacuum log
output for the orders table during the second 24 hour long benchmark
run:
2022-01-27 01:46:27 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1205349 remain, 887225 skipped using visibility map
(73.61% of total)
tuples: 253872 removed, 134182902 remain (26482225 newly frozen),
27193 are dead but not yet removable
removable cutoff: 243783407, older by 728844 xids when operation ended
new relfrozenxid: 215400514, which is 26840669 xids ahead of previous value
...
2022-01-27 05:54:39 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1345302 remain, 993924 skipped using visibility map
(73.88% of total)
tuples: 261656 removed, 150022816 remain (29757570 newly frozen),
29216 are dead but not yet removable
removable cutoff: 276319403, older by 826850 xids when operation ended
new relfrozenxid: 243838706, which is 28438192 xids ahead of previous value
...
2022-01-27 10:37:24 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1504707 remain, 1110002 skipped using visibility map
(73.77% of total)
tuples: 316086 removed, 167990124 remain (33754949 newly frozen),
33326 are dead but not yet removable
removable cutoff: 313328445, older by 987732 xids when operation ended
new relfrozenxid: 276309397, which is 32470691 xids ahead of previous value
...
2022-01-27 15:49:51 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1680649 remain, 1250525 skipped using visibility map
(74.41% of total)
tuples: 343946 removed, 187739072 remain (37346315 newly frozen),
38037 are dead but not yet removable
removable cutoff: 354149019, older by 1222160 xids when operation ended
new relfrozenxid: 313332249, which is 37022852 xids ahead of previous value
...
2022-01-27 21:55:34 PST LOG: automatic vacuum of table
"regression.public.bmsql_oorder": index scans: 1
pages: 0 removed, 1886336 remain, 1403800 skipped using visibility map
(74.42% of total)
tuples: 389748 removed, 210899148 remain (43453900 newly frozen),
45802 are dead but not yet removable
removable cutoff: 401955979, older by 1458514 xids when operation ended
new relfrozenxid: 354134615, which is 40802366 xids ahead of previous value
This mostly speaks for itself, I think. (Anybody that's interested can
drill down to the logs for order lines, which look similar.)
The effect we see with the order/order lines table isn't perfectly
reliable. Actually, it depends on how you define it. It's possible
that we won't be able to acquire a cleanup lock on the wrong page at
the wrong time, and as a result fail to advance relfrozenxid by the
usual amount, once. But that effect appears to be both rare and of no
real consequence. One could reasonably argue that we never fell
behind, because we still did 99.9%+ of the required freezing -- we
just didn't immediately get to advance relfrozenxid, because of a
temporary hiccup on one page. We will still advance relfrozenxid by a
small amount. Sometimes it'll be by only hundreds of XIDs when
millions or tens of millions of XIDs were expected. Once we advance it
by some amount, we can reasonably suppose that the issue was just a
hiccup.
On the master branch, the first 24 hour period has no anti-wraparound
VACUUMs, and so looking at that first 24 hour period gives you some
idea of how worse off we are in the short term -- the freezing stuff
won't really start to pay for itself until the second 24 hour run with
these mostly-default freeze related settings. The second 24 hour run
on master almost exclusively has anti-wraparound VACUUMs for all the
largest tables, though -- all at the same time. And not just the first
time, either! This causes big spikes that the patch totally avoids,
simply by avoiding anti-wraparound VACUUMs. With the patch, there are
no anti-wraparound VACUUMs, barring tables that will never be vacuumed
for any other reason, where it's still inevitable -- in practice just
the stock table and the customers table.
It was a mistake for me to emphasize "no anti-wraparound VACUUMs
outside pathological cases" before now. I stand by those statements as
accurate, but anti-wraparound VACUUMs should not have been given so
much emphasis. Let's assume that somehow we really were to get an
anti-wraparound VACUUM against one of the tables where that's just not
expected, like this orders table -- let's suppose that I got that part
wrong, in some way. It would hardly matter at all! We'd still have
avoided the freezing cliff during this anti-wraparound VACUUM, which
is the real benefit. Chances are good that we needed to VACUUM anyway,
just to clean any very old garbage tuples up -- relfrozenxid is now
predictive of the age of the oldest garbage tuples, which might have
been a good enough reason to VACUUM anyway. The stampede of
anti-wraparound VACUUMs against multiple tables seems like it would
still be fixed, since relfrozenxid now actually tells us something
about the table (as opposed to telling us only about what the user set
vacuum_freeze_min_age to). The only concerns that this leaves for me
are all usability related, and not of primary importance (e.g. do we
really need to make anti-wraparound VACUUMs non-cancelable now?).
TPC-C raw numbers
=================
The single most important number for the patch might be the decrease
in both buffer misses and buffer hits, which I believe is caused by
the patch being able to use index-only scans much more effectively
(with modifications to BenchmarkSQL to improve the indexing strategy
[1]).
Patch:
xact_commit | 440,515,133
xact_rollback | 1,871,142
blks_read | 3,754,614,188
blks_hit | 174,551,067,731
tup_returned | 341,222,714,073
tup_fetched | 124,797,772,450
tup_inserted | 2,900,197,655
tup_updated | 4,549,948,092
tup_deleted | 165,222,130
Here is the same pg_stat_database info for master:
xact_commit | 440,402,505
xact_rollback | 1,871,536
blks_read | 4,002,682,052
blks_hit | 283,015,966,386
tup_returned | 346,448,070,798
tup_fetched | 237,052,965,901
tup_inserted | 2,899,735,420
tup_updated | 4,547,220,642
tup_deleted | 165,103,426
The blks_read is x0.938 of master/baseline for the patch -- not bad.
More importantly, blks_hit is x0.616 for the patch -- quite a
significant reduction in a key cost. Note that we start to get this
particular benefit for individual read queries pretty early on --
avoiding unsetting visibility map bits like this matters right from
the start. In TPC-C terms, the ORDER_STATUS transaction will have much
lower latency, particularly tail latency, since it uses index-only
scans to good effect. There are 5 distinct transaction types from the
benchmark, and an improvement to one particular transaction type isn't
unusual -- so you often have to drill down, and look at the full html
report. The latency situation is improved across the board with the
patch, by quite a bit, especially after the second run. This server
can sustain much more throughput than the TPC-C spec formally permits,
even with the benchmark's TPM rate increased to 10x the spec-legal
limit, so query latency is the main TPC-C metric of
interest here.
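(For reference, those ratios come straight from the pg_stat_database
numbers above: blks_read is 3,754,614,188 / 4,002,682,052 ≈ 0.94 of
master, and blks_hit is 174,551,067,731 / 283,015,966,386 ≈ 0.62.)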
WAL
===
Then there's the WAL overhead. Like practically any workload, the WAL
consumption for this workload is dominated by FPIs, despite the fact
that I've tuned checkpoints reasonably well. The patch *does* write
more WAL in the first set of runs -- it writes a total of ~3.991 TiB,
versus ~3.834 TiB for master. In other words, during the first 24 hour
run (before the trouble with the anti-wraparound freeze cliff even
begins for the master branch), the patch writes x1.040 as much WAL in
total. The good news is that the patch comes out ahead by the end,
after the second set of 24 hour runs. By the time the second run
finishes, it's 8.332 TiB of WAL total for the patch, versus 8.409 TiB
for master, putting the patch at x0.990 in the end -- a small
improvement. I believe that most of the WAL doesn't get generated by
VACUUM here anyway -- opportunistic pruning works well for this
workload.
I expect to be able to commit the first 2 patches in a couple of
weeks, since that won't need to block on making the case for the final
3 or 4 patches from the patch series. The early stuff is mostly just
refactoring work that removes needless differences between aggressive
and non-aggressive VACUUM operations. It makes a lot of sense on its
own.
[1]: https://github.com/pgsql-io/benchmarksql/pull/16
--
Peter Geoghegan
Attachments:
v7-0004-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch
From 8a6624b51960019dd3001050e06dc1716b67816f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Nov 2021 10:02:30 -0800
Subject: [PATCH v7 4/6] Loosen coupling between relfrozenxid and tuple
freezing.
The pg_class.relfrozenxid invariant for heap relations is as follows:
relfrozenxid must be less than or equal to the oldest extant XID in the
table, and must never wraparound (it must be advanced by VACUUM before
wraparound, or in extreme cases the system must be forced to stop
allocating new XIDs).
Before now, VACUUM always set relfrozenxid to whatever value it happened
to use when determining which tuples to freeze (the VACUUM operation's
FreezeLimit cutoff). But there was no inherent reason why the oldest
extant XID in the table should be anywhere near as old as that.
Furthermore, even if it really was almost as old as FreezeLimit, that
tells us much more about the mechanism that VACUUM used to determine
which tuples to freeze than anything else. Depending on the details of
the table and workload, it may have been possible to safely advance
relfrozenxid by many more XIDs, at a relatively small cost in freezing
(possibly no extra cost at all) -- but VACUUM rigidly coupled freezing
with advancing relfrozenxid, missing all this.
Teach VACUUM to track the newest possible safe final relfrozenxid
dynamically (and to track a new value for relminmxid). In the extreme
though common case where all tuples are already frozen, or became frozen
(or were removed by pruning), the final relfrozenxid value will be
exactly equal to the OldestXmin value used by the same VACUUM operation.
A later patch will overhaul the strategy that VACUUM uses for freezing
so that relfrozenxid will tend to get set to a value that's relatively
close to OldestXmin in almost all cases.
Final relfrozenxid values still follow the same rules as before. They
must still be >= FreezeLimit in an aggressive VACUUM. Non-aggressive
VACUUMs can set relfrozenxid to any value that's greater than the
preexisting relfrozenxid, which could be either much earlier or much
later than FreezeLimit. Much depends on workload characteristics. In
practice there is significant natural variation that we can take
advantage of.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 4 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 186 ++++++++++++++++++++-------
src/backend/access/heap/vacuumlazy.c | 86 +++++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 34 ++++-
7 files changed, 241 insertions(+), 79 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0ad87730e..d35402f9f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -168,7 +168,9 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf);
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..ae55c90f7 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 98230aac4..d85a817ff 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6087,12 +6087,24 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "NewRelfrozenxid" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain NewRelfrozenxid. We need to
+ * push maintenance of NewRelfrozenxid down this far, since in general xmin
+ * might have been frozen by an earlier VACUUM operation, in which case our
+ * caller will not have factored-in xmin when maintaining NewRelfrozenxid.
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *NewRelfrozenxid)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6104,6 +6116,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId tempNewRelfrozenxid;
*flags = 0;
@@ -6198,13 +6211,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ tempNewRelfrozenxid = *NewRelfrozenxid;
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
}
/*
@@ -6213,6 +6226,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *NewRelfrozenxid = tempNewRelfrozenxid;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6222,6 +6236,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ tempNewRelfrozenxid = *NewRelfrozenxid;
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6303,7 +6318,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
+ }
}
else
{
@@ -6313,6 +6332,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6341,6 +6361,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Caller manages NewRelfrozenxid directly when we return an XID */
}
else
{
@@ -6350,6 +6371,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *NewRelfrozenxid = tempNewRelfrozenxid;
}
pfree(newmembers);
@@ -6368,6 +6390,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will actually go on to freeze as indicated by our *frz output, so
+ * any (xmin, xmax, xvac) XIDs that we indicate need to be frozen won't need
+ * to be counted here. Values are valid lower bounds at the point that the
+ * ongoing VACUUM finishes.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6392,7 +6421,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6436,6 +6467,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
}
/*
@@ -6453,10 +6489,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId temp = *NewRelfrozenxid;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi, &flags, &temp);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6474,6 +6511,24 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+
+ if (TransactionIdPrecedes(newxmax, *NewRelfrozenxid))
+ {
+ /* New xmax is an XID older than new NewRelfrozenxid */
+ *NewRelfrozenxid = newxmax;
+ }
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * Changing nothing, so might have to ratchet back NewRelminmxid,
+ * NewRelfrozenxid, or both together
+ */
+ if (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, *NewRelminmxid))
+ *NewRelminmxid = xid;
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6495,6 +6550,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+
+ /*
+ * New multixact might have remaining XID older than
+ * NewRelfrozenxid
+ */
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6522,7 +6584,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6569,6 +6638,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * Since we always freeze here, NewRelfrozenxid doesn't need to be
+ * maintained.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6646,11 +6718,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId NewRelfrozenxid = FirstNormalTransactionId;
+ MultiXactId NewRelminmxid = FirstMultiXactId;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &NewRelfrozenxid, &NewRelminmxid);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7080,6 +7155,15 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should. If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
@@ -7088,74 +7172,86 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
*/
bool
heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf)
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf)
{
TransactionId xid;
+ bool needs_freeze = false;
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
/*
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
+ *
+ * (Actually, we maintain NewRelminmxid differently here, because we
+ * assume that XIDs that should be frozen according to cutoff_xid won't
+ * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
*/
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ if (MultiXactIdIsValid(multi) &&
+ MultiXactIdPrecedes(multi, *NewRelminmxid))
+ *NewRelminmxid = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
return true;
else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+ needs_freeze = true;
+
+ /* need to check whether any member of the mxact is too old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
}
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d804e2553..9cc5742ad 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -171,8 +171,10 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+
+ /* Track new pg_class.relfrozenxid/pg_class.relminmxid values */
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
/* Error reporting state */
char *relnamespace;
@@ -329,6 +331,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -362,8 +365,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -470,8 +473,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+
+ /* Initialize values used to advance relfrozenxid/relminmxid at the end */
+ vacrel->NewRelfrozenxid = OldestXmin;
+ vacrel->NewRelminmxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -524,16 +529,18 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might only be able to
+ * advance relfrozenxid to an XID from before FreezeLimit (or a relminmxid
+ * from before MultiXactCutoff) when it wasn't possible to freeze some
+ * tuples due to our inability to acquire a cleanup lock, but the effect
+ * is usually insignificant -- NewRelfrozenxid value still has a decent
+ * chance of being much more recent than the existing relfrozenxid.
*
* NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
* the rel_pages used by lazy_scan_heap, which won't match when we
* happened to truncate the relation afterwards.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
@@ -545,12 +552,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
else
{
- /* Can safely advance relfrozen and relminmxid, too */
+ /*
+ * Aggressive case is strictly required to advance relfrozenxid, at
+ * least up to FreezeLimit (same applies with relminmxid and its
+ * cutoff, MultiXactCutoff). Assert that we got this right now.
+ */
Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
orig_rel_pages);
+ Assert(!aggressive ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenxid));
+ Assert(!aggressive ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminmxid));
+
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenxid, vacrel->NewRelminmxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -655,17 +673,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenxid - vacrel->relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenxid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminmxid - vacrel->relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminmxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1580,6 +1598,8 @@ lazy_scan_prune(LVRelState *vacrel,
int nfrozen;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1588,6 +1608,8 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
/* Initialize (or reset) page-level counters */
+ NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ NewRelminmxid = vacrel->NewRelminmxid;
tuples_deleted = 0;
lpdead_items = 0;
recently_dead_tuples = 0;
@@ -1797,7 +1819,9 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenxid,
+ &NewRelminmxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1811,13 +1835,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1978,6 +2005,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ MultiXactId NewRelminmxid = vacrel->NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2024,7 +2053,8 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenxid, &NewRelminmxid, buf))
{
if (vacrel->aggressive)
{
@@ -2034,10 +2064,12 @@ lazy_scan_noprune(LVRelState *vacrel,
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * A non-aggressive VACUUM doesn't have to wait on a cleanup lock
+ * to ensure that it advances relfrozenxid to a sufficiently
+ * recent XID that happens to be present on this page. It can
+ * just accept an older New/final relfrozenxid instead. There is
+ * a decent chance that the problem will go away naturally.
*/
- vacrel->freeze_cutoffs_valid = false;
}
num_tuples++;
@@ -2087,6 +2119,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * We have committed to not freezing the tuples on this page (always
+ * happens with a non-aggressive VACUUM), so make sure that the target
+ * relfrozenxid/relminmxid values reflect the XIDs/MXIDs we encountered
+ */
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy).
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d1cadf126..4f07e426e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -950,10 +950,28 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
* - multiXactCutoff is the value below which all MultiXactIds are removed
* from Xmax.
+ *
+ * oldestXmin and oldestMxact can be thought of as the most recent values that
+ * can ever be passed to vac_update_relstats() as frozenxid and minmulti
+ * arguments. These exact values will be used when no newer XIDs or
+ * MultiXacts remain in the heap relation (e.g., with an empty table). It's
+ * typical for the vacuumlazy.c caller to notice that older XIDs/Multixacts remain
+ * in the table, which will force it to use an older value. These older final
+ * values may not be any newer than the preexisting frozenxid/minmulti values
+ * from pg_class in extreme cases. The final values are frequently fairly
+ * close to the optimal values that we give to vacuumlazy.c, though.
+ *
+ * An aggressive VACUUM always provides vac_update_relstats() arguments that
+ * are >= freezeLimit and >= multiXactCutoff. A non-aggressive VACUUM may
+ * provide arguments that are either newer or older than freezeLimit and
+ * multiXactCutoff, or non-valid values (indicating that pg_class level
+ * cutoffs cannot be advanced at all).
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -962,6 +980,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -970,7 +989,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1066,9 +1084,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1083,8 +1103,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
--
2.30.2
v7-0005-Make-block-level-characteristics-drive-freezing.patch
From bb7ae7d77b55902f3003b5dc8851bd3f949c1d09 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v7 5/6] Make block-level characteristics drive freezing.
Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen. VACUUM won't freeze _any_ tuples on the page unless
_all_ tuples (that remain after pruning) are all-visible. It may
occasionally be necessary to freeze the page due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff. But the
FreezeLimit mechanism will seldom have any impact on which pages are
frozen anymore -- it is just a backstop now.
Freezing can now informally be thought of as something that takes place
at the level of an entire page, or not at all -- differences in XIDs
among tuples on the same page are not interesting, barring extreme
cases. Freezing a page is now practically synonymous with setting the
page to all-visible in the visibility map, at least to users.
The main upside of the new approach to freezing is that it makes the
overhead of vacuuming much more predictable over time. We avoid the
need for large balloon payments, since the system no longer accumulates
"freezing debt" that can only be paid off by anti-wraparound vacuuming.
This seems to have been particularly troublesome with append-only
tables, especially in the common case where XIDs from pages that are
marked all-visible for the first time are still fairly young (in
particular, not as old as indicated by VACUUM's vacuum_freeze_min_age
freezing cutoff). Before now, nothing stopped these pages from being
set to all-visible (without also being set to all-frozen) the first time
they were reached by VACUUM, which meant that they just couldn't be
frozen until the next anti-wraparound VACUUM -- at which point the XIDs
from the unfrozen tuples might be much older than vacuum_freeze_min_age.
In summary, the old vacuum_freeze_min_age-based FreezeLimit cutoff could
not _reliably_ limit freezing debt unless the GUC was set to 0.
There is a virtuous cycle enabled by the new approach to freezing:
freezing more tuples earlier during non-aggressive VACUUMs allows us to
advance relfrozenxid eagerly, which buys time. This creates every
opportunity for the workload to naturally generate enough dead tuples
(or newly inserted tuples) to make the autovacuum launcher launch a
non-aggressive autovacuum. The overall effect is that most individual
tables no longer require _any_ anti-wraparound vacuum operations. This
effect also owes much to the enhancement added by commit ?????, which
loosened the coupling between freezing and advancing relfrozenxid,
allowing VACUUM to precisely determine a new relfrozenxid.
It's still possible (and sometimes even likely) that VACUUM won't be
able to freeze a tuple with a somewhat older XID due only to a cleanup
lock not being immediately available. It's even possible that some
VACUUM operations will fail to advance relfrozenxid by very many XIDs as
a consequence. But the impact over time should be negligible. The next
VACUUM operation for the table will effectively get a new opportunity to
freeze (or perhaps remove) the same tuple that was originally missed.
Once that happens, relfrozenxid will completely catch up. (Actually, one
could reasonably argue that we never really "fell behind" in the first
place -- the amount of freezing needed to significantly advance
relfrozenxid won't have measurably increased at any point. A once-off
drop in the extent to which VACUUM can advance relfrozenxid is almost
certainly harmless noise.)
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 101 ++++++++++++++++++++++-----
1 file changed, 84 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9cc5742ad..52cfb00ea 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -168,6 +168,7 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for pruning */
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -199,6 +200,7 @@ typedef struct LVRelState
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber newly_frozen_pages; /* # pages with tuples frozen by us */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -353,12 +355,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
RelationGetRelid(rel));
/*
- * Get cutoffs that determine which tuples we need to freeze during the
- * VACUUM operation.
+ * Determine if this is to be an aggressive VACUUM. This will eventually
+ * be required for any table where (for whatever reason) no non-aggressive
+ * VACUUM ran to completion and advanced relfrozenxid below. This became
+ * much rarer when the strategy used to determine what to freeze was
+ * taught to focus on freezing whole physical pages as the page was about
+ * to be set all-visible (to avoid big cliffs during aggressive VACUUMs).
*
- * Also determines if this is to be an aggressive VACUUM. This will
- * eventually be required for any table where (for whatever reason) no
- * non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
+ * Also gets cutoffs that determine which tuples we should definitely
+ * freeze. If any one tuple is from before FreezeLimit, we will freeze
+ * the whole page, even when we wouldn't otherwise freeze because the page
+ * can't be set all-visible. (Actually, we still won't freeze individual
+ * tuples on the page that are not yet all-visible themselves, since that
+ * would render those tuples all-visible prematurely.)
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
@@ -471,6 +480,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Set cutoffs for entire VACUUM */
vacrel->OldestXmin = OldestXmin;
+ vacrel->OldestMxact = OldestMxact;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
@@ -651,12 +661,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total), %u newly frozen (%.2f%% of total)\n"),
vacrel->removed_pages,
vacrel->rel_pages,
vacrel->scanned_pages,
orig_rel_pages == 0 ? 0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->scanned_pages / orig_rel_pages,
+ vacrel->newly_frozen_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * vacrel->newly_frozen_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
@@ -824,6 +837,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->scanned_pages = 0;
vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
+ vacrel->newly_frozen_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
@@ -1025,7 +1039,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
/*
- * Otherwise it must be an all-visible (and possibly even
+ * Otherwise it must be an all-visible (and very likely
* all-frozen) page that we decided to process regardless
* (SKIP_PAGES_THRESHOLD must not have been crossed).
*/
@@ -1589,7 +1603,7 @@ lazy_scan_prune(LVRelState *vacrel,
ItemId itemid;
HeapTupleData tuple;
HTSV_Result res;
- int tuples_deleted,
+ int tuples_deleted = 0,
lpdead_items,
recently_dead_tuples,
num_tuples,
@@ -1600,6 +1614,9 @@ lazy_scan_prune(LVRelState *vacrel,
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
TransactionId NewRelfrozenxid;
MultiXactId NewRelminmxid;
+ TransactionId FreezeLimit = vacrel->FreezeLimit;
+ MultiXactId MultiXactCutoff = vacrel->MultiXactCutoff;
+ bool freezeblk = false;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1610,7 +1627,6 @@ retry:
/* Initialize (or reset) page-level counters */
NewRelfrozenxid = vacrel->NewRelfrozenxid;
NewRelminmxid = vacrel->NewRelminmxid;
- tuples_deleted = 0;
lpdead_items = 0;
recently_dead_tuples = 0;
num_tuples = 0;
@@ -1625,9 +1641,9 @@ retry:
* lpdead_items's final value can be thought of as the number of tuples
* that were deleted from indexes.
*/
- tuples_deleted = heap_page_prune(rel, buf, vistest,
- InvalidTransactionId, 0, &nnewlpdead,
- &vacrel->offnum);
+ tuples_deleted += heap_page_prune(rel, buf, vistest,
+ InvalidTransactionId, 0, &nnewlpdead,
+ &vacrel->offnum);
/*
* Now scan the page to collect LP_DEAD items and check for tuples
@@ -1678,11 +1694,16 @@ retry:
* vacrel->nonempty_pages value) is inherently race-prone. It must be
* treated as advisory/unreliable, so we might as well be slightly
* optimistic.
+ *
+ * We delay setting all_visible to false due to seeing an LP_DEAD
+ * item. We need to test "is the page all_visible if we just consider
+ * remaining tuples with tuple storage?" below, when considering if we
+ * should freeze the tuples on the page. (all_visible will be set to
+ * false for caller once we've decided on what to freeze.)
*/
if (ItemIdIsDead(itemid))
{
deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
prunestate->has_lpdead_items = true;
continue;
}
@@ -1816,8 +1837,8 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
- vacrel->FreezeLimit,
- vacrel->MultiXactCutoff,
+ FreezeLimit,
+ MultiXactCutoff,
&frozen[nfrozen],
&tuple_totally_frozen,
&NewRelfrozenxid,
@@ -1837,6 +1858,50 @@ retry:
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * Freeze the whole page using OldestXmin (not FreezeLimit) as our cutoff
+ * if the page is now eligible to be marked all_visible (barring any
+ * LP_DEAD items) when the page is not already eligible to be marked
+ * all_frozen. We generally expect to freeze all of a block's tuples
+ * together and at once, or none at all. FreezeLimit is just a backstop
+ * mechanism that makes sure that we don't overlook one or two older
+ * tuples.
+ *
+ * For example, it's just about possible that successive VACUUM operations
+ * will never quite manage to use the main block-level logic to freeze one
+ * old tuple from a page where all other tuples are continually updated.
+ * We should not be in any hurry to freeze such a tuple. Even still, it's
+ * better if we take care of it before an anti-wraparound VACUUM becomes
+ * necessary -- that would mean that we'd have to wait for a cleanup lock
+ * during the aggressive VACUUM, which has risks of its own.
+ *
+ * FIXME This code structure has been used for prototyping and testing the
+ * algorithm, details of which have settled. Code itself to be rewritten,
+ * though. It is backwards right now -- should be _starting_ with
+ * OldestXmin (not FreezeLimit), since that's what happens at the
+ * conceptual level.
+ *
+ * TODO Make vacuum_freeze_min_age GUC/reloption default -1, which will be
+ * interpreted as "whatever autovacuum_freeze_max_age/2 is". Idea is to
+ * make FreezeLimit into a true backstop, and to do our best to avoid
+ * waiting for a cleanup lock (always prefer to punt to the next VACUUM,
+ * since we can advance relfrozenxid to the oldest XID on the page inside
+ * lazy_scan_noprune).
+ */
+ if (!freezeblk &&
+ ((nfrozen > 0 && nfrozen < num_tuples) ||
+ (prunestate->all_visible && !prunestate->all_frozen)))
+ {
+ freezeblk = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ goto retry;
+ }
+
+ /* Time to define all_visible in a way that accounts for LP_DEAD items */
+ if (lpdead_items > 0)
+ prunestate->all_visible = false;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
@@ -1854,6 +1919,8 @@ retry:
{
Assert(prunestate->hastup);
+ vacrel->newly_frozen_pages++;
+
/*
* At least one tuple with storage needs to be frozen -- execute that
* now.
@@ -1882,7 +1949,7 @@ retry:
{
XLogRecPtr recptr;
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+ recptr = log_heap_freeze(vacrel->rel, buf, FreezeLimit,
frozen, nfrozen);
PageSetLSN(page, recptr);
}
--
2.30.2
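
(Aside: the retry-based structure in lazy_scan_prune can be hard to follow
from the diff alone, so here is the heart of the new decision, condensed by
me from the hunk above -- same condition, just with a summarizing comment.)

    /*
     * First pass over the page used FreezeLimit/MultiXactCutoff.  Escalate
     * to freezing the whole page with the OldestXmin/OldestMxact cutoffs
     * when either some-but-not-all tuples got frozen, or the page is about
     * to become all-visible without also becoming all-frozen.
     */
    if (!freezeblk &&
        ((nfrozen > 0 && nfrozen < num_tuples) ||
         (prunestate->all_visible && !prunestate->all_frozen)))
    {
        freezeblk = true;
        FreezeLimit = vacrel->OldestXmin;
        MultiXactCutoff = vacrel->OldestMxact;
        goto retry;             /* redo the page using page-level cutoffs */
    }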
Attachment: v7-0006-Add-all-visible-FSM-heuristic.patch
From b091b9051f64d28a376768dbbf17479dc1238f89 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 23 Jan 2022 21:10:38 -0800
Subject: [PATCH v7 6/6] Add all-visible FSM heuristic.
When recording free space for an all-visible page, record that the page has
zero free space whenever it has less than half BLCKSZ worth of free space
according to the traditional definition. Otherwise record free space as
usual.
Making all-visible pages resistant to change like this can be thought of
as a form of hysteresis. The page is given an opportunity to "settle"
and permanently stay in the same state when the tuples on the page will
never be updated or deleted. But when they are updated or deleted, the
page can once again be used to store any tuple. Over time, most pages
tend to settle permanently in many workloads, perhaps only on the second
or third attempt.
---
src/backend/access/heap/vacuumlazy.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 52cfb00ea..5608a6e19 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1238,6 +1238,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
*/
freespace = PageGetHeapFreeSpace(page);
+ /*
+ * An all-visible page should not have its free space
+ * available from FSM unless it's more than half empty
+ */
+ if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+ freespace = 0;
+
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
continue;
@@ -1375,6 +1382,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
{
Size freespace = PageGetHeapFreeSpace(page);
+ /*
+ * An all-visible page should not have its free space available
+ * from FSM unless it's more than half empty
+ */
+ if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+ freespace = 0;
+
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
}
@@ -2549,6 +2563,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
page = BufferGetPage(buf);
freespace = PageGetHeapFreeSpace(page);
+ /*
+ * An all-visible page should not have its free space available from
+ * FSM unless it's more than half empty
+ */
+ if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+ freespace = 0;
+
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
vacuumed_pages++;
--
2.30.2
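
(Aside: the heuristic is the same three-line test at each of the
RecordPageWithFreeSpace() call sites touched above. If it survives review it
could arguably live in a tiny helper, along these lines -- hypothetical
helper name, not in the patch.)

    /*
     * Free space to report to the FSM for a heap page: don't advertise an
     * all-visible page's free space unless the page is more than half
     * empty, so that "settled" pages tend to stay settled.
     */
    static Size
    freespace_to_record(Page page)
    {
        Size        freespace = PageGetHeapFreeSpace(page);

        if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
            return 0;

        return freespace;
    }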
Attachment: v7-0003-Consolidate-VACUUM-xid-cutoff-logic.patch
From ff523df6b910f4a51377c9ec75c616e64372850b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 11 Dec 2021 17:39:45 -0800
Subject: [PATCH v7 3/6] Consolidate VACUUM xid cutoff logic.
Push the logic for determining whether or not any given VACUUM operation
will be aggressive down into vacuum_set_xid_limits(). This makes its
function signature significantly simpler.
This refactoring work will make it easier to set/return an "oldestMxact"
value to the function's vacuumlazy.c caller in a later commit that teaches
VACUUM to intelligently set relfrozenxid and relminmxid to the oldest
remaining XID/MultiXactId.
A VACUUM operation's oldestMxact can be thought of as the MultiXactId
equivalent of its OldestXmin: just as OldestXmin is used as our initial
target relfrozenxid (which we'll ratchet back as the VACUUM progresses
and notices that it'll leave older XIDs in place), oldestMxact will be
our initial target MultiXactId (for a target MultiXactId that is itself
ratcheted back in the same way).
---
src/include/commands/vacuum.h | 6 +-
src/backend/access/heap/vacuumlazy.c | 32 +++----
src/backend/commands/cluster.c | 3 +-
src/backend/commands/vacuum.c | 134 +++++++++++++--------------
4 files changed, 79 insertions(+), 96 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index e5e548d6b..d64f6268f 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -286,15 +286,13 @@ extern void vac_update_relstats(Relation relation,
bool *frozenxid_updated,
bool *minmulti_updated,
bool in_outer_xact);
-extern void vacuum_set_xid_limits(Relation rel,
+extern bool vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit);
+ MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
MultiXactId relminmxid);
extern void vac_update_datfrozenxid(void);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 71378740c..d804e2553 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -322,8 +322,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
- TransactionId xidFullScanLimit;
- MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
@@ -351,24 +349,22 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
RelationGetRelid(rel));
- vacuum_set_xid_limits(rel,
- params->freeze_min_age,
- params->freeze_table_age,
- params->multixact_freeze_min_age,
- params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit, &xidFullScanLimit,
- &MultiXactCutoff, &mxactFullScanLimit);
-
/*
- * We request an aggressive scan if the table's frozen Xid is now older
- * than or equal to the requested Xid full-table scan limit; or if the
- * table's minimum MultiXactId is older than or equal to the requested
- * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
+ * Get cutoffs that determine which tuples we need to freeze during the
+ * VACUUM operation.
+ *
+ * Also determines if this is to be an aggressive VACUUM. This will
+ * eventually be required for any table where (for whatever reason) no
+ * non-aggressive VACUUM ran to completion, and advanced relfrozenxid.
*/
- aggressive = TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
- xidFullScanLimit);
- aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
- mxactFullScanLimit);
+ aggressive = vacuum_set_xid_limits(rel,
+ params->freeze_min_age,
+ params->freeze_table_age,
+ params->multixact_freeze_min_age,
+ params->multixact_freeze_table_age,
+ &OldestXmin, &FreezeLimit,
+ &MultiXactCutoff);
+
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
{
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 2e8efe4f8..02a7e94bf 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -857,8 +857,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* not to be aggressive about this.
*/
vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
- NULL);
+ &OldestXmin, &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 37413dd43..d1cadf126 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -942,25 +942,20 @@ get_all_vacuum_rels(int options)
*
* Input parameters are the target relation, applicable freeze age settings.
*
+ * Return value indicates whether caller should do an aggressive VACUUM or
+ * not. This is a VACUUM that cannot skip any pages using the visibility map
+ * (except all-frozen pages), which is guaranteed to be able to advance
+ * relfrozenxid and relminmxid.
+ *
* The output parameters are:
- * - oldestXmin is the cutoff value used to distinguish whether tuples are
- * DEAD or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ * - oldestXmin is the Xid below which tuples deleted by any xact (that
+ * committed) should be considered DEAD, not just RECENTLY_DEAD.
* - freezeLimit is the Xid below which all Xids are replaced by
* FrozenTransactionId during vacuum.
- * - xidFullScanLimit (computed from freeze_table_age parameter)
- * represents a minimum Xid value; a table whose relfrozenxid is older than
- * this will have a full-table vacuum applied to it, to freeze tuples across
- * the whole table. Vacuuming a table younger than this value can use a
- * partial scan.
- * - multiXactCutoff is the value below which all MultiXactIds are removed from
- * Xmax.
- * - mxactFullScanLimit is a value against which a table's relminmxid value is
- * compared to produce a full-table vacuum, as with xidFullScanLimit.
- *
- * xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is
- * not interested.
+ * - multiXactCutoff is the value below which all MultiXactIds are removed
+ * from Xmax.
*/
-void
+bool
vacuum_set_xid_limits(Relation rel,
int freeze_min_age,
int freeze_table_age,
@@ -968,9 +963,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
TransactionId *freezeLimit,
- TransactionId *xidFullScanLimit,
- MultiXactId *multiXactCutoff,
- MultiXactId *mxactFullScanLimit)
+ MultiXactId *multiXactCutoff)
{
int freezemin;
int mxid_freezemin;
@@ -980,6 +973,7 @@ vacuum_set_xid_limits(Relation rel,
MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
+ int freezetable;
/*
* We can always ignore processes running lazy vacuum. This is because we
@@ -1097,64 +1091,60 @@ vacuum_set_xid_limits(Relation rel,
*multiXactCutoff = mxactLimit;
- if (xidFullScanLimit != NULL)
- {
- int freezetable;
+ /*
+ * Done setting output parameters; just need to figure out if caller needs
+ * to do an aggressive VACUUM or not.
+ *
+ * Determine the table freeze age to use: as specified by the caller, or
+ * vacuum_freeze_table_age, but in any case not more than
+ * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
+ * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
+ * before anti-wraparound autovacuum is launched.
+ */
+ freezetable = freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_freeze_table_age;
+ freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- Assert(mxactFullScanLimit != NULL);
+ /*
+ * Compute XID limit causing an aggressive vacuum, being careful not to
+ * generate a "permanent" XID
+ */
+ limit = ReadNextTransactionId() - freezetable;
+ if (!TransactionIdIsNormal(limit))
+ limit = FirstNormalTransactionId;
+ if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
+ limit))
+ return true;
- /*
- * Determine the table freeze age to use: as specified by the caller,
- * or vacuum_freeze_table_age, but in any case not more than
- * autovacuum_freeze_max_age * 0.95, so that if you have e.g nightly
- * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
- * before anti-wraparound autovacuum is launched.
- */
- freezetable = freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_freeze_table_age;
- freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
+ /*
+ * Similar to the above, determine the table freeze age to use for
+ * multixacts: as specified by the caller, or
+ * vacuum_multixact_freeze_table_age, but in any case not more than
+ * autovacuum_multixact_freeze_table_age * 0.95, so that if you have e.g.
+ * nightly VACUUM schedule, the nightly VACUUM gets a chance to freeze
+ * multixacts before anti-wraparound autovacuum is launched.
+ */
+ freezetable = multixact_freeze_table_age;
+ if (freezetable < 0)
+ freezetable = vacuum_multixact_freeze_table_age;
+ freezetable = Min(freezetable,
+ effective_multixact_freeze_max_age * 0.95);
+ Assert(freezetable >= 0);
- /*
- * Compute XID limit causing a full-table vacuum, being careful not to
- * generate a "permanent" XID.
- */
- limit = ReadNextTransactionId() - freezetable;
- if (!TransactionIdIsNormal(limit))
- limit = FirstNormalTransactionId;
+ /*
+ * Compute MultiXact limit causing an aggressive vacuum, being careful to
+ * generate a valid MultiXact value
+ */
+ mxactLimit = ReadNextMultiXactId() - freezetable;
+ if (mxactLimit < FirstMultiXactId)
+ mxactLimit = FirstMultiXactId;
+ if (MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
+ mxactLimit))
+ return true;
- *xidFullScanLimit = limit;
-
- /*
- * Similar to the above, determine the table freeze age to use for
- * multixacts: as specified by the caller, or
- * vacuum_multixact_freeze_table_age, but in any case not more than
- * autovacuum_multixact_freeze_table_age * 0.95, so that if you have
- * e.g. nightly VACUUM schedule, the nightly VACUUM gets a chance to
- * freeze multixacts before anti-wraparound autovacuum is launched.
- */
- freezetable = multixact_freeze_table_age;
- if (freezetable < 0)
- freezetable = vacuum_multixact_freeze_table_age;
- freezetable = Min(freezetable,
- effective_multixact_freeze_max_age * 0.95);
- Assert(freezetable >= 0);
-
- /*
- * Compute MultiXact limit causing a full-table vacuum, being careful
- * to generate a valid MultiXact value.
- */
- mxactLimit = ReadNextMultiXactId() - freezetable;
- if (mxactLimit < FirstMultiXactId)
- mxactLimit = FirstMultiXactId;
-
- *mxactFullScanLimit = mxactLimit;
- }
- else
- {
- Assert(mxactFullScanLimit == NULL);
- }
+ return false;
}
/*
--
2.30.2
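
(Aside: the tail of vacuum_set_xid_limits() now reduces to two early-return
checks. Here is the XID half, condensed by me from the hunk above; the
MultiXactId half follows the same pattern.)

    /* Clamp the table freeze age, then derive the aggressive-VACUUM limit */
    freezetable = (freeze_table_age >= 0) ?
        freeze_table_age : vacuum_freeze_table_age;
    freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);

    limit = ReadNextTransactionId() - freezetable;
    if (!TransactionIdIsNormal(limit))
        limit = FirstNormalTransactionId;

    if (TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid, limit))
        return true;            /* caller must perform an aggressive VACUUM */

    /* ... same dance for relminmxid, using ReadNextMultiXactId() ... */

    return false;               /* non-aggressive VACUUM will do */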
Attachment: v7-0002-Add-VACUUM-instrumentation-for-scanned-pages-relf.patch
From 18380fbc60810eeea2b60ba33e7a2850d0212fdd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 21 Nov 2021 14:47:11 -0800
Subject: [PATCH v7 2/6] Add VACUUM instrumentation for scanned pages,
relfrozenxid.
Report on scanned pages within VACUUM VERBOSE and autovacuum logging.
These are pages that were physically scanned during the VACUUM operation,
including pages that are marked all-visible in the visibility map but were
nevertheless scanned (typically because the visibility map skipping logic
declined to skip a skippable page that is physically surrounded by
non-skippable pages).
Also report when relfrozenxid is advanced by VACUUM, and by how much.
Rename the user-visible OldestXmin output field to "removable cutoff", and
show some supplementary information: how far the cutoff had fallen behind
(in XIDs) by the time the VACUUM operation finished. This should give users
some chance of figuring out what's _not_ working, and highlights the
relationship between OldestXmin and relfrozenxid.
Finally, add instrumentation of "missed dead tuples", and the number of
pages that had at least one such tuple. These are fully DEAD (not just
RECENTLY_DEAD) tuples with storage that could not be pruned due to an
inability to acquire a cleanup lock. This is a replacement for the
"skipped due to pin" instrumentation removed by the previous commit. It
shows more details than before for pages where failing to get a cleanup
lock actually mattered (those with missed dead tuples), and shows
nothing for pages where no real work was missed. In practice there
seems to be far more of the latter than of the former, so the signal to
noise ratio is much improved.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wznp=c=Opj8Z7RMR3G=ec3_JfGYMN_YvmCEjoPCHzWbx0g@mail.gmail.com
---
src/include/commands/vacuum.h | 2 +
src/backend/access/heap/vacuumlazy.c | 97 +++++++++++++++++++---------
src/backend/commands/analyze.c | 3 +
src/backend/commands/vacuum.c | 9 +++
4 files changed, 82 insertions(+), 29 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d0bdfa42..e5e548d6b 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -283,6 +283,8 @@ extern void vac_update_relstats(Relation relation,
bool hasindex,
TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated,
+ bool *minmulti_updated,
bool in_outer_xact);
extern void vacuum_set_xid_limits(Relation rel,
int freeze_min_age, int freeze_table_age,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a9c83f6dc..71378740c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -198,6 +198,7 @@ typedef struct LVRelState
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
+ BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
/* Statistics output by us, for table */
@@ -211,8 +212,8 @@ typedef struct LVRelState
/* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
- int64 new_dead_tuples; /* new estimated total # of dead items in
- * table */
+ int64 recently_dead_tuples; /* # dead, but not yet removable */
+ int64 missed_dead_tuples; /* # removable, but not removed */
int64 num_tuples; /* total number of nonremovable tuples */
int64 live_tuples; /* live tuples (reltuples estimate) */
} LVRelState;
@@ -317,6 +318,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
write_rate;
bool aggressive,
skipwithvm;
+ bool frozenxid_updated,
+ minmulti_updated;
BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
@@ -538,9 +541,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
/* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
Assert(!aggressive);
+ frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId, false);
+ InvalidTransactionId, InvalidMultiXactId,
+ NULL, NULL, false);
}
else
{
@@ -549,7 +554,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
orig_rel_pages);
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff, false);
+ FreezeLimit, MultiXactCutoff,
+ &frozenxid_updated, &minmulti_updated, false);
}
/*
@@ -565,7 +571,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
Max(new_live_tuples, 0),
- vacrel->new_dead_tuples);
+ vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples);
pgstat_progress_end_command();
if (instrument)
@@ -578,6 +585,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
+ int32 diff;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -629,16 +637,40 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
vacrel->removed_pages,
vacrel->rel_pages,
- vacrel->frozenskipped_pages);
+ vacrel->scanned_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * vacrel->scanned_pages / orig_rel_pages);
appendStringInfo(&buf,
- _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
+ _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
(long long) vacrel->new_rel_tuples,
- (long long) vacrel->new_dead_tuples,
- OldestXmin);
+ (long long) vacrel->recently_dead_tuples);
+ if (vacrel->missed_dead_tuples > 0)
+ appendStringInfo(&buf,
+ _("tuples missed: %lld dead from %u pages not removed due to cleanup lock contention\n"),
+ (long long) vacrel->missed_dead_tuples,
+ vacrel->missed_dead_pages);
+ diff = (int32) (ReadNextTransactionId() - OldestXmin);
+ appendStringInfo(&buf,
+ _("removable cutoff: %u, older by %d xids when operation ended\n"),
+ OldestXmin, diff);
+ if (frozenxid_updated)
+ {
+ diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ appendStringInfo(&buf,
+ _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
+ FreezeLimit, diff);
+ }
+ if (minmulti_updated)
+ {
+ diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ appendStringInfo(&buf,
+ _("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
+ MultiXactCutoff, diff);
+ }
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -779,13 +811,15 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
+ vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
/* Initialize instrumentation counters */
vacrel->num_index_scans = 0;
vacrel->tuples_deleted = 0;
vacrel->lpdead_items = 0;
- vacrel->new_dead_tuples = 0;
+ vacrel->recently_dead_tuples = 0;
+ vacrel->missed_dead_tuples = 0;
vacrel->num_tuples = 0;
vacrel->live_tuples = 0;
@@ -1334,7 +1368,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* (unlikely) scenario that new_live_tuples is -1, take it as zero.
*/
vacrel->new_rel_tuples =
- Max(vacrel->new_live_tuples, 0) + vacrel->new_dead_tuples;
+ Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
+ vacrel->missed_dead_tuples;
/*
* Release any remaining pin on visibility map page.
@@ -1542,7 +1577,7 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
lpdead_items,
- new_dead_tuples,
+ recently_dead_tuples,
num_tuples,
live_tuples;
int nnewlpdead;
@@ -1559,7 +1594,7 @@ retry:
/* Initialize (or reset) page-level counters */
tuples_deleted = 0;
lpdead_items = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
num_tuples = 0;
live_tuples = 0;
@@ -1718,11 +1753,11 @@ retry:
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * If tuple is recently deleted then we must not remove it
- * from relation. (We only remove items that are LP_DEAD from
+ * If tuple is recently dead then we must not remove it from
+ * the relation. (We only remove items that are LP_DEAD from
* pruning.)
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
prunestate->all_visible = false;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -1898,7 +1933,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->lpdead_items += lpdead_items;
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
}
@@ -1943,7 +1978,8 @@ lazy_scan_noprune(LVRelState *vacrel,
int lpdead_items,
num_tuples,
live_tuples,
- new_dead_tuples;
+ recently_dead_tuples,
+ missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
@@ -1955,7 +1991,8 @@ lazy_scan_noprune(LVRelState *vacrel,
lpdead_items = 0;
num_tuples = 0;
live_tuples = 0;
- new_dead_tuples = 0;
+ recently_dead_tuples = 0;
+ missed_dead_tuples = 0;
maxoff = PageGetMaxOffsetNumber(page);
for (offnum = FirstOffsetNumber;
@@ -2029,16 +2066,15 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* There is some useful work for pruning to do, that won't be
* done due to failure to get a cleanup lock.
- *
- * TODO Add dedicated instrumentation for this case
*/
+ missed_dead_tuples++;
break;
case HEAPTUPLE_RECENTLY_DEAD:
/*
- * Count in new_dead_tuples, just like lazy_scan_prune
+ * Count in recently_dead_tuples, just like lazy_scan_prune
*/
- new_dead_tuples++;
+ recently_dead_tuples++;
break;
case HEAPTUPLE_INSERT_IN_PROGRESS:
@@ -2074,7 +2110,7 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
*hastup = true;
num_tuples += lpdead_items;
- /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ missed_dead_tuples += lpdead_items;
}
/* Caller records free space, with or without LP_DEAD items */
@@ -2120,9 +2156,12 @@ lazy_scan_noprune(LVRelState *vacrel,
/*
* Finally, add relevant page-local counts to whole-VACUUM counts
*/
- vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->recently_dead_tuples += recently_dead_tuples;
+ vacrel->missed_dead_tuples += missed_dead_tuples;
vacrel->num_tuples += num_tuples;
vacrel->live_tuples += live_tuples;
+ if (missed_dead_tuples > 0)
+ vacrel->missed_dead_pages++;
/* Caller won't need to call lazy_scan_prune with same page */
return true;
@@ -2201,8 +2240,8 @@ lazy_vacuum(LVRelState *vacrel)
* dead_items space is not CPU cache resident.
*
* We don't take any special steps to remember the LP_DEAD items (such
- * as counting them in new_dead_tuples report to the stats collector)
- * when the optimization is applied. Though the accounting used in
+ * as counting them in our final report to the stats collector) when
+ * the optimization is applied. Though the accounting used in
* analyze.c's acquire_sample_rows() will recognize the same LP_DEAD
* items as dead rows in its own stats collector report, that's okay.
* The discrepancy should be negligible. If this optimization is ever
@@ -3329,7 +3368,7 @@ update_index_statistics(LVRelState *vacrel)
false,
InvalidTransactionId,
InvalidMultiXactId,
- false);
+ NULL, NULL, false);
}
}
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index a0da998c2..736479295 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -645,6 +645,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
hasindex,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
/* Same for indexes */
@@ -661,6 +662,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
false,
InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
}
@@ -673,6 +675,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
vac_update_relstats(onerel, -1, totalrows,
0, hasindex, InvalidTransactionId,
InvalidMultiXactId,
+ NULL, NULL,
in_outer_xact);
}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d1dadc54e..37413dd43 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1315,6 +1315,7 @@ vac_update_relstats(Relation relation,
BlockNumber num_all_visible_pages,
bool hasindex, TransactionId frozenxid,
MultiXactId minmulti,
+ bool *frozenxid_updated, bool *minmulti_updated,
bool in_outer_xact)
{
Oid relid = RelationGetRelid(relation);
@@ -1390,22 +1391,30 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ if (frozenxid_updated)
+ *frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
/* Similarly for relminmxid */
+ if (minmulti_updated)
+ *minmulti_updated = false;
if (MultiXactIdIsValid(minmulti) &&
pgcform->relminmxid != minmulti &&
(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
{
+ if (minmulti_updated)
+ *minmulti_updated = true;
pgcform->relminmxid = minmulti;
dirty = true;
}
--
2.30.2
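
(Aside: one instrumentation detail worth spelling out is that the "%d xids"
/ "%d mxids" deltas are computed as a plain int32 cast of the unsigned
difference, which stays meaningful across wraparound as long as the two
values are within about two billion of each other. The removable cutoff
line, more or less verbatim from the hunk above:)

    int32       diff;

    /* How far OldestXmin had fallen behind by the time the VACUUM finished */
    diff = (int32) (ReadNextTransactionId() - OldestXmin);
    appendStringInfo(&buf,
                     _("removable cutoff: %u, older by %d xids when operation ended\n"),
                     OldestXmin, diff);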
Attachment: v7-0001-Simplify-lazy_scan_heap-s-handling-of-scanned-pag.patch
From 34a497a44a96aac5a5f2f35b52af7464d32e4d5b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 17 Nov 2021 21:27:06 -0800
Subject: [PATCH v7 1/6] Simplify lazy_scan_heap's handling of scanned pages.
Redefine a scanned page as any heap page that actually gets pinned by
VACUUM's first pass over the heap. Pages counted by scanned_pages are
now the complement of the pages that are skipped over using the
visibility map. This new definition significantly simplifies quite a
few things.
Now heap relation truncation, visibility map bit setting, tuple counting
(e.g., for pg_class.reltuples), and tuple freezing all share a common
definition of scanned_pages. That makes it possible to remove certain
special cases that never made much sense. We no longer need to track
tupcount_pages separately (see bugfix commit 1914c5ea for details),
since we now always count tuples from pages that are scanned_pages. We
also don't need to needlessly distinguish between aggressive and
non-aggressive VACUUM operations when we cannot immediately acquire a
cleanup lock.
Since any VACUUM (not just an aggressive VACUUM) can sometimes advance
relfrozenxid, we now make non-aggressive VACUUMs work just a little
harder in order to make that desirable outcome more likely in practice.
Aggressive VACUUMs have long checked contended pages with only a shared
lock, to avoid needlessly waiting on a cleanup lock (in the common case
where the contended page has no tuples that need to be frozen anyway).
We still don't make non-aggressive VACUUMs wait for a cleanup lock, of
course -- if we did that they'd no longer be non-aggressive. But we now
make the non-aggressive case notice that a failure to acquire a cleanup
lock on one particular heap page does not in itself make it unsafe to
advance relfrozenxid for the whole relation (which is what we usually
see in the aggressive case already).
We now also collect LP_DEAD items in the dead_items array in the case
where we cannot immediately get a cleanup lock on the buffer. We cannot
prune without a cleanup lock, but opportunistic pruning may well have
left some LP_DEAD items behind in the past -- no reason to miss those.
Only VACUUM can mark these LP_DEAD items LP_UNUSED (no opportunistic
technique is independently capable of cleaning up line pointer bloat),
so we should not squander any opportunity to do that. Commit 8523492d4e
taught VACUUM to set LP_DEAD line pointers to LP_UNUSED while only
holding an exclusive lock (not a cleanup lock), so we can expect to set
existing LP_DEAD items to LP_UNUSED reliably, even when we cannot
acquire our own cleanup lock during either pass over the heap (unless we opt
to skip index vacuuming, which implies that there is no second pass over
the heap).
We no longer report on "pin skipped pages" in log output. A later patch
will add back an improved version of the same instrumentation. We don't
want to show any information about any failures to acquire cleanup locks
unless we actually failed to do useful work as a consequence. A page
that we could not acquire a cleanup lock on is now treated as equivalent
to any other scanned page in most cases.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wznp=c=Opj8Z7RMR3G=ec3_JfGYMN_YvmCEjoPCHzWbx0g@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 812 +++++++++++-------
.../isolation/expected/vacuum-reltuples.out | 2 +-
.../isolation/specs/vacuum-reltuples.spec | 7 +-
3 files changed, 524 insertions(+), 297 deletions(-)
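
(Aside, for anyone skimming the big diff that follows: the per-block control
flow in lazy_scan_heap ends up roughly like the sketch below. It is my
paraphrase, built around the lazy_scan_noprune prototype the patch adds; the
real code handles the details, including the case where lazy_scan_noprune
returns false, as shown further down.)

    buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
                             vacrel->bstrategy);
    page = BufferGetPage(buf);

    if (!ConditionalLockBufferForCleanup(buf))
    {
        bool        hastup,
                    recordfreespace;

        /*
         * No cleanup lock, but the page still counts as a scanned page.
         * Process it with only a shared lock: collect any preexisting
         * LP_DEAD items, and work out whether relfrozenxid can still be
         * advanced despite our not pruning or freezing here.
         */
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
                              &recordfreespace))
            continue;           /* lazy_scan_prune not needed for this page */

        /* Otherwise a cleanup lock is needed after all (see the diff) */
    }

    /* Cleanup lock in hand: prune and freeze via lazy_scan_prune */
    lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate);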
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1c2f30b68..a9c83f6dc 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -143,6 +143,10 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ /* Aggressive VACUUM (scan all unfrozen pages)? */
+ bool aggressive;
+ /* Use visibility map to skip? (disabled via reloption) */
+ bool skipwithvm;
/* Wraparound failsafe has been triggered? */
bool failsafe_active;
/* Consider index vacuuming bypass optimization? */
@@ -167,6 +171,8 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
+ /* Are FreezeLimit/MultiXactCutoff still valid? */
+ bool freeze_cutoffs_valid;
/* Error reporting state */
char *relnamespace;
@@ -188,10 +194,8 @@ typedef struct LVRelState
*/
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
- BlockNumber scanned_pages; /* number of pages we examined */
- BlockNumber pinskipped_pages; /* # of pages skipped due to a pin */
- BlockNumber frozenskipped_pages; /* # of frozen pages we skipped */
- BlockNumber tupcount_pages; /* # pages whose tuples we counted */
+ BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
+ BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -204,6 +208,7 @@ typedef struct LVRelState
/* Instrumentation counters */
int num_index_scans;
+ /* Counters that follow are only for scanned_pages */
int64 tuples_deleted; /* # deleted from table */
int64 lpdead_items; /* # deleted from indexes */
int64 new_dead_tuples; /* new estimated total # of dead items in
@@ -240,19 +245,22 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, VacuumParams *params,
- bool aggressive);
+static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool sharelock, Buffer vmbuffer);
static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
GlobalVisState *vistest,
LVPagePruneState *prunestate);
+static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
+ BlockNumber blkno, Page page,
+ bool *hastup, bool *recordfreespace);
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
Buffer buffer, int index, Buffer *vmbuffer);
-static bool lazy_check_needs_freeze(Buffer buf, bool *hastup,
- LVRelState *vacrel);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -307,16 +315,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive; /* should we scan all unfrozen pages? */
- bool scanned_all_unfrozen; /* actually scanned all such pages? */
+ bool aggressive,
+ skipwithvm;
+ BlockNumber orig_rel_pages;
char **indnames = NULL;
TransactionId xidFullScanLimit;
MultiXactId mxactFullScanLimit;
BlockNumber new_rel_pages;
BlockNumber new_rel_allvisible;
double new_live_tuples;
- TransactionId new_frozen_xid;
- MultiXactId new_min_multi;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -359,8 +366,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
xidFullScanLimit);
aggressive |= MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
mxactFullScanLimit);
+ skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
+ {
+ /*
+ * Force aggressive mode, and disable skipping blocks using the
+ * visibility map (even those set all-frozen)
+ */
aggressive = true;
+ skipwithvm = false;
+ }
/*
* Setup error traceback support for ereport() first. The idea is to set
@@ -423,6 +438,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->index_cleanup != VACOPTVALUE_UNSPECIFIED);
Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
params->truncate != VACOPTVALUE_AUTO);
+ vacrel->aggressive = aggressive;
+ vacrel->skipwithvm = skipwithvm;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
vacrel->do_index_vacuuming = true;
@@ -454,35 +471,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
+ /* Track if cutoffs became invalid (possible in !aggressive case only) */
+ vacrel->freeze_cutoffs_valid = true;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params, aggressive);
+ lazy_scan_heap(vacrel, params->nworkers);
/* Done with indexes */
vac_close_indexes(vacrel->nindexes, vacrel->indrels, NoLock);
/*
- * Compute whether we actually scanned the all unfrozen pages. If we did,
- * we can adjust relfrozenxid and relminmxid.
- *
- * NB: We need to check this before truncating the relation, because that
- * will change ->rel_pages.
- */
- if ((vacrel->scanned_pages + vacrel->frozenskipped_pages)
- < vacrel->rel_pages)
- {
- Assert(!aggressive);
- scanned_all_unfrozen = false;
- }
- else
- scanned_all_unfrozen = true;
-
- /*
- * Optionally truncate the relation.
+ * Optionally truncate the relation. But remember the relation size used
+ * by lazy_scan_prune for later first.
*/
+ orig_rel_pages = vacrel->rel_pages;
if (should_attempt_truncation(vacrel))
{
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_TRUNCATE,
@@ -508,28 +513,44 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
- *
- * Also, don't change relfrozenxid/relminmxid if we skipped any pages,
- * since then we don't know for certain that all tuples have a newer xmin.
*/
- new_rel_pages = vacrel->rel_pages;
+ new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
- new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
- new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
-
- vac_update_relstats(rel,
- new_rel_pages,
- new_live_tuples,
- new_rel_allvisible,
- vacrel->nindexes > 0,
- new_frozen_xid,
- new_min_multi,
- false);
+ /*
+ * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
+ * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
+ * provided we didn't skip any all-visible (not all-frozen) pages using
+ * the visibility map, and assuming that we didn't fail to get a cleanup
+ * lock that made it unsafe with respect to FreezeLimit (or perhaps our
+ * MultiXactCutoff) established for VACUUM operation.
+ *
+ * NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
+ * the rel_pages used by lazy_scan_heap, which won't match when we
+ * happened to truncate the relation afterwards.
+ */
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
+ !vacrel->freeze_cutoffs_valid)
+ {
+ /* Cannot advance relfrozenxid/relminmxid -- just update pg_class */
+ Assert(!aggressive);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ InvalidTransactionId, InvalidMultiXactId, false);
+ }
+ else
+ {
+ /* Can safely advance relfrozen and relminmxid, too */
+ Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
+ orig_rel_pages);
+ vac_update_relstats(rel, new_rel_pages, new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ FreezeLimit, MultiXactCutoff, false);
+ }
/*
* Report results to the stats collector, too.
@@ -557,7 +578,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
StringInfoData buf;
char *msgfmt;
- BlockNumber orig_rel_pages;
TimestampDifference(starttime, endtime, &secs, &usecs);
@@ -609,10 +629,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped due to pins, %u skipped frozen\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u skipped frozen\n"),
vacrel->removed_pages,
vacrel->rel_pages,
- vacrel->pinskipped_pages,
vacrel->frozenskipped_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable, oldest xmin: %u\n"),
@@ -620,7 +639,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
(long long) vacrel->new_rel_tuples,
(long long) vacrel->new_dead_tuples,
OldestXmin);
- orig_rel_pages = vacrel->rel_pages + vacrel->removed_pages;
if (orig_rel_pages > 0)
{
if (vacrel->do_index_vacuuming)
@@ -737,7 +755,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
+lazy_scan_heap(LVRelState *vacrel, int nworkers)
{
VacDeadItems *dead_items;
BlockNumber nblocks,
@@ -756,14 +774,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
GlobalVisState *vistest;
nblocks = RelationGetNumberOfBlocks(vacrel->rel);
- next_unskippable_block = 0;
- next_failsafe_block = 0;
- next_fsm_block_to_vacuum = 0;
vacrel->rel_pages = nblocks;
vacrel->scanned_pages = 0;
- vacrel->pinskipped_pages = 0;
vacrel->frozenskipped_pages = 0;
- vacrel->tupcount_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->nonempty_pages = 0;
@@ -787,14 +800,16 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* dangerously old.
*/
lazy_check_wraparound_failsafe(vacrel);
+ next_failsafe_block = 0;
/*
* Allocate the space for dead_items. Note that this handles parallel
* VACUUM initialization as part of allocating shared memory space used
* for dead_items.
*/
- dead_items_alloc(vacrel, params->nworkers);
+ dead_items_alloc(vacrel, nworkers);
dead_items = vacrel->dead_items;
+ next_fsm_block_to_vacuum = 0;
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -803,7 +818,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/*
- * Except when aggressive is set, we want to skip pages that are
+ * Set things up for skipping blocks using visibility map.
+ *
+ * Except when vacrel->aggressive is set, we want to skip pages that are
* all-visible according to the visibility map, but only when we can skip
* at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
* sequentially, the OS should be doing readahead for us, so there's no
@@ -812,8 +829,8 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* page means that we can't update relfrozenxid, so we only want to do it
* if we can skip a goodly number of pages.
*
- * When aggressive is set, we can't skip pages just because they are
- * all-visible, but we can still skip pages that are all-frozen, since
+ * When vacrel->aggressive is set, we can't skip pages just because they
+ * are all-visible, but we can still skip pages that are all-frozen, since
* such pages do not need freezing and do not affect the value that we can
* safely set for relfrozenxid or relminmxid.
*
@@ -836,17 +853,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* just added to that page are necessarily newer than the GlobalXmin we
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- *
- * We will scan the table's last page, at least to the extent of
- * determining whether it has tuples or not, even if it should be skipped
- * according to the above rules; except when we've already determined that
- * it's not worth trying to truncate the table. This avoids having
- * lazy_truncate_heap() take access-exclusive lock on the table to attempt
- * a truncation that just fails immediately because there are tuples in
- * the last page. This is worth avoiding mainly because such a lock must
- * be replayed on any hot standby, where it can be disruptive.
*/
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ next_unskippable_block = 0;
+ if (vacrel->skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -855,7 +864,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmstatus = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -882,13 +891,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
bool all_visible_according_to_vm = false;
LVPagePruneState prunestate;
- /*
- * Consider need to skip blocks. See note above about forcing
- * scanning of last page.
- */
-#define FORCE_CHECK_PAGE() \
- (blkno == nblocks - 1 && should_attempt_truncation(vacrel))
-
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
@@ -898,7 +900,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
{
/* Time to advance next_unskippable_block */
next_unskippable_block++;
- if ((params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
+ if (vacrel->skipwithvm)
{
while (next_unskippable_block < nblocks)
{
@@ -907,7 +909,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
vmskipflags = visibilitymap_get_status(vacrel->rel,
next_unskippable_block,
&vmbuffer);
- if (aggressive)
+ if (vacrel->aggressive)
{
if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
@@ -936,19 +938,27 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* it's not all-visible. But in an aggressive vacuum we know only
* that it's not all-frozen, so it might still be all-visible.
*/
- if (aggressive && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive &&
+ VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
all_visible_according_to_vm = true;
}
else
{
/*
- * The current block is potentially skippable; if we've seen a
- * long enough run of skippable blocks to justify skipping it, and
- * we're not forced to check it, then go ahead and skip.
- * Otherwise, the page must be at least all-visible if not
- * all-frozen, so we can set all_visible_according_to_vm = true.
+ * The current page can be skipped if we've seen a long enough run
+ * of skippable blocks to justify skipping it -- provided it's not
+ * the last page in the relation (according to rel_pages/nblocks).
+ *
+ * We always scan the table's last page to determine whether it
+ * has tuples or not, even if it would otherwise be skipped. This
+ * avoids having lazy_truncate_heap() take access-exclusive lock
+ * on the table to attempt a truncation that just fails
+ * immediately because there are tuples on the last page.
+ *
+ * XXX Do we need to skip even the last block when every page in
+ * the relation is all-visible? We don't do that currently.
*/
- if (skipping_blocks && !FORCE_CHECK_PAGE())
+ if (skipping_blocks && blkno < nblocks - 1)
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
@@ -957,18 +967,32 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
* careful to count it as a skipped all-frozen page in that
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was all-frozen, so we have to recheck; but
- * in this case an approximate answer is OK.
+ * know whether it was initially all-frozen, so we have to
+ * recheck.
*/
- if (aggressive || VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
+ if (vacrel->aggressive ||
+ VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
+
+ /*
+ * Otherwise it must be an all-visible (and possibly even
+ * all-frozen) page that we decided to process regardless
+ * (SKIP_PAGES_THRESHOLD must not have been crossed).
+ */
all_visible_according_to_vm = true;
}
vacuum_delay_point();
+ /*
+ * We're not skipping this page using the visibility map, and so it is
+ * (by definition) a scanned page. Any tuples from this page are now
+ * guaranteed to be counted below, after some preparatory checks.
+ */
+ vacrel->scanned_pages++;
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1023,174 +1047,78 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
}
/*
- * Set up visibility map page as needed.
- *
* Pin the visibility map page in case we need to mark the page
* all-visible. In most cases this will be very cheap, because we'll
- * already have the correct page pinned anyway. However, it's
- * possible that (a) next_unskippable_block is covered by a different
- * VM page than the current block or (b) we released our pin and did a
- * cycle of index vacuuming.
+ * already have the correct page pinned anyway.
*/
visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+ /* Finished preparatory checks. Actually scan the page. */
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, vacrel->bstrategy);
+ page = BufferGetPage(buf);
/*
- * We need buffer cleanup lock so that we can prune HOT chains and
- * defragment the page.
+ * We need a buffer cleanup lock to prune HOT chains and defragment
+ * the page in lazy_scan_prune. But when it's not possible to acquire
+ * a cleanup lock right away, we may be able to settle for reduced
+ * processing using lazy_scan_noprune.
*/
if (!ConditionalLockBufferForCleanup(buf))
{
- bool hastup;
+ bool hastup,
+ recordfreespace;
- /*
- * If we're not performing an aggressive scan to guard against XID
- * wraparound, and we don't want to forcibly check the page, then
- * it's OK to skip vacuuming pages we get a lock conflict on. They
- * will be dealt with in some future vacuum.
- */
- if (!aggressive && !FORCE_CHECK_PAGE())
- {
- ReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
- continue;
- }
-
- /*
- * Read the page with share lock to see if any xids on it need to
- * be frozen. If not we just skip the page, after updating our
- * scan statistics. If there are some, we wait for cleanup lock.
- *
- * We could defer the lock request further by remembering the page
- * and coming back to it later, or we could even register
- * ourselves for multiple buffers and then service whichever one
- * is received first. For now, this seems good enough.
- *
- * If we get here with aggressive false, then we're just forcibly
- * checking the page, and so we don't want to insist on getting
- * the lock; we only need to know if the page contains tuples, so
- * that we can update nonempty_pages correctly. It's convenient
- * to use lazy_check_needs_freeze() for both situations, though.
- */
LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (!lazy_check_needs_freeze(buf, &hastup, vacrel))
+
+ /* Check for new or empty pages before lazy_scan_noprune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
+ vmbuffer))
{
- UnlockReleaseBuffer(buf);
- vacrel->scanned_pages++;
- vacrel->pinskipped_pages++;
- if (hastup)
- vacrel->nonempty_pages = blkno + 1;
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
- if (!aggressive)
+
+ /* Collect LP_DEAD items in dead_items array, count tuples */
+ if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
+ &recordfreespace))
{
+ Size freespace;
+
/*
- * Here, we must not advance scanned_pages; that would amount
- * to claiming that the page contains no freezable tuples.
+ * Processed page successfully (without cleanup lock) -- just
+ * need to perform rel truncation and FSM steps, much like the
+ * lazy_scan_prune case. Don't bother trying to match its
+ * visibility map setting steps, though.
*/
- UnlockReleaseBuffer(buf);
- vacrel->pinskipped_pages++;
if (hastup)
vacrel->nonempty_pages = blkno + 1;
+ if (recordfreespace)
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ if (recordfreespace)
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
continue;
}
+
+ /*
+ * lazy_scan_noprune could not do all required processing. Wait
+ * for a cleanup lock, and call lazy_scan_prune in the usual way.
+ */
+ Assert(vacrel->aggressive);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBufferForCleanup(buf);
- /* drop through to normal processing */
}
- /*
- * By here we definitely have enough dead_items space for whatever
- * LP_DEAD tids are on this page, we have the visibility map page set
- * up in case we need to set this page's all_visible/all_frozen bit,
- * and we have a cleanup lock. Any tuples on this page are now sure
- * to be "counted" by this VACUUM.
- *
- * One last piece of preamble needs to take place before we can prune:
- * we need to consider new and empty pages.
- */
- vacrel->scanned_pages++;
- vacrel->tupcount_pages++;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
+ /* Check for new or empty pages before lazy_scan_prune call */
+ if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
{
- /*
- * All-zeroes pages can be left over if either a backend extends
- * the relation by a single page, but crashes before the newly
- * initialized page has been written out, or when bulk-extending
- * the relation (which creates a number of empty pages at the tail
- * end of the relation, but enters them into the FSM).
- *
- * Note we do not enter the page into the visibilitymap. That has
- * the downside that we repeatedly visit this page in subsequent
- * vacuums, but otherwise we'll never not discover the space on a
- * promoted standby. The harm of repeated checking ought to
- * normally not be too bad - the space usually should be used at
- * some point, otherwise there wouldn't be any regular vacuums.
- *
- * Make sure these pages are in the FSM, to ensure they can be
- * reused. Do that by testing if there's any space recorded for
- * the page. If not, enter it. We do so after releasing the lock
- * on the heap page, the FSM is approximate, after all.
- */
- UnlockReleaseBuffer(buf);
-
- if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
- {
- Size freespace = BLCKSZ - SizeOfPageHeaderData;
-
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
- }
- continue;
- }
-
- if (PageIsEmpty(page))
- {
- Size freespace = PageGetHeapFreeSpace(page);
-
- /*
- * Empty pages are always all-visible and all-frozen (note that
- * the same is currently not true for new pages, see above).
- */
- if (!PageIsAllVisible(page))
- {
- START_CRIT_SECTION();
-
- /* mark buffer dirty before writing a WAL record */
- MarkBufferDirty(buf);
-
- /*
- * It's possible that another backend has extended the heap,
- * initialized the page, and then failed to WAL-log the page
- * due to an ERROR. Since heap extension is not WAL-logged,
- * recovery might try to replay our record setting the page
- * all-visible and find that the page isn't initialized, which
- * will cause a PANIC. To prevent that, check whether the
- * page has been previously WAL-logged, and if not, do that
- * now.
- */
- if (RelationNeedsWAL(vacrel->rel) &&
- PageGetLSN(page) == InvalidXLogRecPtr)
- log_newpage_buffer(buf, true);
-
- PageSetAllVisible(page);
- visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
- vmbuffer, InvalidTransactionId,
- VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
- END_CRIT_SECTION();
- }
-
- UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ /* Processed as new/empty page (lock and pin released) */
continue;
}
/*
- * Prune and freeze tuples.
+ * Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
* dead_items array. This includes LP_DEAD line pointers that we
@@ -1398,7 +1326,7 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, nblocks,
- vacrel->tupcount_pages,
+ vacrel->scanned_pages,
vacrel->live_tuples);
/*
@@ -1447,6 +1375,137 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
update_index_statistics(vacrel);
}
+/*
+ * lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
+ *
+ * Must call here to handle both new and empty pages before calling
+ * lazy_scan_prune or lazy_scan_noprune, since they're not prepared to deal
+ * with new or empty pages.
+ *
+ * It's necessary to consider new pages as a special case, since the rules for
+ * maintaining the visibility map and FSM with empty pages are a little
+ * different (though new pages can be truncated based on the usual rules).
+ *
+ * Empty pages are not really a special case -- they're just heap pages that
+ * have no allocated tuples (including even LP_UNUSED items). You might
+ * wonder why we need to handle them here all the same. It's only necessary
+ * because of a corner-case involving a hard crash during heap relation
+ * extension. If we ever make relation-extension crash safe, then it should
+ * no longer be necessary to deal with empty pages here (or new pages, for
+ * that matter).
+ *
+ * Caller must hold at least a shared lock. We might need to escalate the
+ * lock in that case, so the type of lock caller holds needs to be specified
+ * using 'sharelock' argument.
+ *
+ * Returns false in common case where caller should go on to call
+ * lazy_scan_prune (or lazy_scan_noprune). Otherwise returns true, indicating
+ * that lazy_scan_heap is done processing the page, releasing lock on caller's
+ * behalf.
+ */
+static bool
+lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
+ Page page, bool sharelock, Buffer vmbuffer)
+{
+ Size freespace;
+
+ if (PageIsNew(page))
+ {
+ /*
+ * All-zeroes pages can be left over if either a backend extends the
+ * relation by a single page, but crashes before the newly initialized
+ * page has been written out, or when bulk-extending the relation
+ * (which creates a number of empty pages at the tail end of the
+ * relation), and then enters them into the FSM.
+ *
+ * Note we do not enter the page into the visibilitymap. That has the
+ * downside that we repeatedly visit this page in subsequent vacuums,
+ * but otherwise we'll never discover the space on a promoted standby.
+ * The harm of repeated checking ought to normally not be too bad. The
+ * space usually should be used at some point, otherwise there
+ * wouldn't be any regular vacuums.
+ *
+ * Make sure these pages are in the FSM, to ensure they can be reused.
+ * Do that by testing if there's any space recorded for the page. If
+ * not, enter it. We do so after releasing the lock on the heap page,
+ * the FSM is approximate, after all.
+ */
+ UnlockReleaseBuffer(buf);
+
+ if (GetRecordedFreeSpace(vacrel->rel, blkno) == 0)
+ {
+ freespace = BLCKSZ - SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ }
+
+ return true;
+ }
+
+ if (PageIsEmpty(page))
+ {
+ /*
+ * It seems likely that caller will always be able to get a cleanup
+ * lock on an empty page. But don't take any chances -- escalate to
+ * an exclusive lock (still don't need a cleanup lock, though).
+ */
+ if (sharelock)
+ {
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (!PageIsEmpty(page))
+ {
+ /* page isn't new or empty -- keep lock and pin for now */
+ return false;
+ }
+ }
+ else
+ {
+ /* Already have a full cleanup lock (which is more than enough) */
+ }
+
+ /*
+ * Unlike new pages, empty pages are always set all-visible and
+ * all-frozen.
+ */
+ if (!PageIsAllVisible(page))
+ {
+ START_CRIT_SECTION();
+
+ /* mark buffer dirty before writing a WAL record */
+ MarkBufferDirty(buf);
+
+ /*
+ * It's possible that another backend has extended the heap,
+ * initialized the page, and then failed to WAL-log the page due
+ * to an ERROR. Since heap extension is not WAL-logged, recovery
+ * might try to replay our record setting the page all-visible and
+ * find that the page isn't initialized, which will cause a PANIC.
+ * To prevent that, check whether the page has been previously
+ * WAL-logged, and if not, do that now.
+ */
+ if (RelationNeedsWAL(vacrel->rel) &&
+ PageGetLSN(page) == InvalidXLogRecPtr)
+ log_newpage_buffer(buf, true);
+
+ PageSetAllVisible(page);
+ visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
+ vmbuffer, InvalidTransactionId,
+ VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
+ END_CRIT_SECTION();
+ }
+
+ freespace = PageGetHeapFreeSpace(page);
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+ return true;
+ }
+
+ /* page isn't new or empty -- keep lock and pin */
+ return false;
+}
+
/*
* lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
*
@@ -1491,6 +1550,8 @@ lazy_scan_prune(LVRelState *vacrel,
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
maxoff = PageGetMaxOffsetNumber(page);
retry:
@@ -1553,10 +1614,9 @@ retry:
* LP_DEAD items are processed outside of the loop.
*
* Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
- * count_nondeletable_pages() do it -- they only consider pages empty
- * when they only have LP_UNUSED items, which is important for
- * correctness.
+ * LP_DEAD item here, which is not how count_nondeletable_pages() does
+ * it -- it only considers pages empty/truncatable when they have no
+ * items at all (except LP_UNUSED items).
*
* Our assumption is that any LP_DEAD items we encounter here will
* become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
@@ -1843,6 +1903,231 @@ retry:
vacrel->live_tuples += live_tuples;
}
+/*
+ * lazy_scan_noprune() -- lazy_scan_prune() without pruning or freezing
+ *
+ * Caller need only hold a pin and share lock on the buffer, unlike
+ * lazy_scan_prune, which requires a full cleanup lock.
+ *
+ * While pruning isn't performed here, we can at least collect existing
+ * LP_DEAD items into the dead_items array for removal from indexes. It's
+ * quite possible that earlier opportunistic pruning left LP_DEAD items
+ * behind, and we shouldn't miss out on an opportunity to make them reusable
+ * (VACUUM alone is capable of cleaning up line pointer bloat like this).
+ * Note that we'll only require an exclusive lock (not a cleanup lock) later
+ * on when we set these LP_DEAD items to LP_UNUSED.
+ *
+ * Freezing isn't performed here either. For aggressive VACUUM callers, we
+ * may return false to indicate that a full cleanup lock is required. This is
+ * necessary because pruning requires a cleanup lock, and because VACUUM
+ * cannot freeze a page's tuples until after pruning takes place (freezing
+ * tuples effectively requires a cleanup lock, though we don't need a cleanup
+ * lock in lazy_vacuum_heap_page or in lazy_scan_new_or_empty to set a heap
+ * page all-frozen in the visibility map). Returns true to indicate that all
+ * required processing has been performed.
+ *
+ * See lazy_scan_prune for an explanation of hastup return flag.
+ * recordfreespace flag instructs caller on whether or not it should do
+ * generic FSM processing for page.
+ */
+static bool
+lazy_scan_noprune(LVRelState *vacrel,
+ Buffer buf,
+ BlockNumber blkno,
+ Page page,
+ bool *hastup,
+ bool *recordfreespace)
+{
+ OffsetNumber offnum,
+ maxoff;
+ int lpdead_items,
+ num_tuples,
+ live_tuples,
+ new_dead_tuples;
+ HeapTupleHeader tupleheader;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+
+ Assert(BufferGetBlockNumber(buf) == blkno);
+
+ *hastup = false; /* for now */
+ *recordfreespace = false; /* for now */
+
+ lpdead_items = 0;
+ num_tuples = 0;
+ live_tuples = 0;
+ new_dead_tuples = 0;
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemid;
+ HeapTupleData tuple;
+
+ vacrel->offnum = offnum;
+ itemid = PageGetItemId(page, offnum);
+
+ if (!ItemIdIsUsed(itemid))
+ continue;
+
+ if (ItemIdIsRedirected(itemid))
+ {
+ *hastup = true;
+ continue;
+ }
+
+ if (ItemIdIsDead(itemid))
+ {
+ /*
+ * Deliberately don't set hastup=true here. See same point in
+ * lazy_scan_prune for an explanation.
+ */
+ deadoffsets[lpdead_items++] = offnum;
+ continue;
+ }
+
+ *hastup = true; /* page prevents rel truncation */
+ tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
+ if (heap_tuple_needs_freeze(tupleheader,
+ vacrel->FreezeLimit,
+ vacrel->MultiXactCutoff, buf))
+ {
+ if (vacrel->aggressive)
+ {
+ /* Going to have to get cleanup lock for lazy_scan_prune */
+ vacrel->offnum = InvalidOffsetNumber;
+ return false;
+ }
+
+ /*
+ * Current non-aggressive VACUUM operation definitely won't be
+ * able to advance relfrozenxid or relminmxid
+ */
+ vacrel->freeze_cutoffs_valid = false;
+ }
+
+ num_tuples++;
+ ItemPointerSet(&(tuple.t_self), blkno, offnum);
+ tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+ tuple.t_len = ItemIdGetLength(itemid);
+ tuple.t_tableOid = RelationGetRelid(vacrel->rel);
+
+ switch (HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf))
+ {
+ case HEAPTUPLE_DELETE_IN_PROGRESS:
+ case HEAPTUPLE_LIVE:
+
+ /*
+ * Count both cases as live, just like lazy_scan_prune
+ */
+ live_tuples++;
+
+ break;
+ case HEAPTUPLE_DEAD:
+
+ /*
+ * There is some useful work for pruning to do, that won't be
+ * done due to failure to get a cleanup lock.
+ *
+ * TODO Add dedicated instrumentation for this case
+ */
+ break;
+ case HEAPTUPLE_RECENTLY_DEAD:
+
+ /*
+ * Count in new_dead_tuples, just like lazy_scan_prune
+ */
+ new_dead_tuples++;
+ break;
+ case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+ /*
+ * Do not count these rows as live, just like lazy_scan_prune
+ */
+ break;
+ default:
+ elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+ break;
+ }
+
+ }
+
+ vacrel->offnum = InvalidOffsetNumber;
+
+ /*
+ * Now save details of the LP_DEAD items from the page in vacrel (though
+ * only when VACUUM uses two-pass strategy).
+ */
+ if (vacrel->nindexes == 0)
+ {
+ /* Using one-pass strategy (since table has no indexes) */
+ if (lpdead_items > 0)
+ {
+ /*
+ * Perfunctory handling for the corner case where a single pass
+ * strategy VACUUM cannot get a cleanup lock, and it turns out
+ * that there is one or more LP_DEAD items: just count the LP_DEAD
+ * items as missed_dead_tuples instead. (This is a bit dishonest,
+ * but it beats having to maintain specialized heap vacuuming code
+ * forever, for vanishingly little benefit.)
+ */
+ *hastup = true;
+ num_tuples += lpdead_items;
+ /* TODO HEAPTUPLE_DEAD style instrumentation needed here, too */
+ }
+
+ /* Caller records free space, with or without LP_DEAD items */
+ *recordfreespace = true;
+ }
+ else if (lpdead_items > 0)
+ {
+ VacDeadItems *dead_items = vacrel->dead_items;
+ ItemPointerData tmp;
+
+ /*
+ * Page has LP_DEAD items, and so any references/TIDs that remain in
+ * indexes will be deleted during index vacuuming (and then marked
+ * LP_UNUSED in the heap).
+ *
+ * Don't record free space now -- leave it until second heap pass.
+ */
+ vacrel->lpdead_item_pages++;
+
+ ItemPointerSetBlockNumber(&tmp, blkno);
+
+ for (int i = 0; i < lpdead_items; i++)
+ {
+ ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
+ dead_items->items[dead_items->num_items++] = tmp;
+ }
+
+ Assert(dead_items->num_items <= dead_items->max_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+ dead_items->num_items);
+
+ vacrel->lpdead_items += lpdead_items;
+ }
+ else
+ {
+ /*
+ * Caller won't be vacuuming this page later, so tell it to record
+ * page's freespace in the FSM now, even though we didn't prune it
+ */
+ *recordfreespace = true;
+ }
+
+ /*
+ * Finally, add relevant page-local counts to whole-VACUUM counts
+ */
+ vacrel->new_dead_tuples += new_dead_tuples;
+ vacrel->num_tuples += num_tuples;
+ vacrel->live_tuples += live_tuples;
+
+ /* Caller won't need to call lazy_scan_prune with same page */
+ return true;
+}
+
/*
* Main entry point for index vacuuming and heap vacuuming.
*
@@ -2286,67 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
return index;
}
-/*
- * lazy_check_needs_freeze() -- scan page to see if any tuples
- * need to be cleaned to avoid wraparound
- *
- * Returns true if the page needs to be vacuumed using cleanup lock.
- * Also returns a flag indicating whether page contains any tuples at all.
- */
-static bool
-lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelState *vacrel)
-{
- Page page = BufferGetPage(buf);
- OffsetNumber offnum,
- maxoff;
- HeapTupleHeader tupleheader;
-
- *hastup = false;
-
- /*
- * New and empty pages, obviously, don't contain tuples. We could make
- * sure that the page is registered in the FSM, but it doesn't seem worth
- * waiting for a cleanup lock just for that, especially because it's
- * likely that the pin holder will do so.
- */
- if (PageIsNew(page) || PageIsEmpty(page))
- return false;
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (offnum = FirstOffsetNumber;
- offnum <= maxoff;
- offnum = OffsetNumberNext(offnum))
- {
- ItemId itemid;
-
- /*
- * Set the offset number so that we can display it along with any
- * error that occurred while processing this tuple.
- */
- vacrel->offnum = offnum;
- itemid = PageGetItemId(page, offnum);
-
- /* this should match hastup test in count_nondeletable_pages() */
- if (ItemIdIsUsed(itemid))
- *hastup = true;
-
- /* dead and redirect items never need freezing */
- if (!ItemIdIsNormal(itemid))
- continue;
-
- tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
-
- if (heap_tuple_needs_freeze(tupleheader, vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
- break;
- } /* scan along page */
-
- /* Clear the offset information once we have processed the given page. */
- vacrel->offnum = InvalidOffsetNumber;
-
- return (offnum <= maxoff);
-}
-
/*
* Trigger the failsafe to avoid wraparound failure when vacrel table has a
* relfrozenxid and/or relminmxid that is dangerously far in the past.
@@ -2412,7 +2636,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
{
double reltuples = vacrel->new_rel_tuples;
bool estimated_count =
- vacrel->tupcount_pages < vacrel->rel_pages;
+ vacrel->scanned_pages < vacrel->rel_pages;
for (int idx = 0; idx < vacrel->nindexes; idx++)
{
@@ -2429,7 +2653,7 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
/* Outsource everything to parallel variant */
parallel_vacuum_cleanup_all_indexes(vacrel->pvs, vacrel->new_rel_tuples,
vacrel->num_index_scans,
- (vacrel->tupcount_pages < vacrel->rel_pages));
+ (vacrel->scanned_pages < vacrel->rel_pages));
}
}
@@ -2536,7 +2760,9 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
* should_attempt_truncation - should we attempt to truncate the heap?
*
* Don't even think about it unless we have a shot at releasing a goodly
- * number of pages. Otherwise, the time taken isn't worth it.
+ * number of pages. Otherwise, the time taken isn't worth it, mainly because
+ * an AccessExclusive lock must be replayed on any hot standby, where it can
+ * be particularly disruptive.
*
* Also don't attempt it if wraparound failsafe is in effect. It's hard to
* predict how long lazy_truncate_heap will take. Don't take any chances.
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
index cdbe7f3a6..ce55376e7 100644
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ b/src/test/isolation/expected/vacuum-reltuples.out
@@ -45,7 +45,7 @@ step stats:
relpages|reltuples
--------+---------
- 1| 20
+ 1| 21
(1 row)
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
index ae2f79b8f..a2a461f2f 100644
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ b/src/test/isolation/specs/vacuum-reltuples.spec
@@ -2,9 +2,10 @@
# to page pins. We absolutely need to avoid setting reltuples=0 in
# such cases, since that interferes badly with planning.
#
-# Expected result in second permutation is 20 tuples rather than 21 as
-# for the others, because vacuum should leave the previous result
-# (from before the insert) in place.
+# Expected result for all three permutations is 21 tuples, including
+# the second permutation. VACUUM is able to count the concurrently
+# inserted tuple in its final reltuples, even when a cleanup lock
+# cannot be acquired on the affected heap page.
setup {
create table smalltbl
--
2.30.2
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Jan 20, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
I do see some value in that, too. Though it's not going to be a way of
turning off the early freezing stuff, which seems unnecessary (though
I do still have work to do on getting the overhead for that down).
Attached is v7, a revision that overhauls
what to freeze. I'm now calling it block-driven freezing in the commit
message. Also included is a new patch that makes VACUUM record zero
free space in the FSM for an all-visible page, unless the total amount
of free space happens to be greater than one half of BLCKSZ.
The fact that I am now including this new FSM patch (v7-0006-*patch)
may seem like a case of expanding the scope of something that could
well do without it. But hear me out! It's true that the new FSM patch
isn't essential. I'm including it now because it seems relevant to the
approach taken with block-driven freezing -- it may even make my
general approach easier to understand.
Without having looked at the latest patches, there was something in
the back of my mind while following the discussion upthread -- the
proposed opportunistic freezing made a lot more sense if the
earlier-proposed open/closed pages concept was already available.
Freezing whole pages
====================
It's possible that a higher cutoff (for example a cutoff of 80% of
BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
addition to the downsides from fragmentation -- it's far from a simple
trade-off. (Not that you should believe that 50% is special, it's just
a starting point for me.)
How was the space utilization with the 50% cutoff in the TPC-C test?
TPC-C raw numbers
=================
The single most important number for the patch might be the decrease
in both buffer misses and buffer hits, which I believe is caused by
the patch being able to use index-only scans much more effectively
(with modifications to BenchmarkSQL to improve the indexing strategy
[1]). This is quite clear from pg_stat_database state at the end.
Patch:
blks_hit | 174,551,067,731
tup_fetched | 124,797,772,450
Here is the same pg_stat_database info for master:
blks_hit | 283,015,966,386
tup_fetched | 237,052,965,901
That's impressive.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 2:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
Without having looked at the latest patches, there was something in
the back of my mind while following the discussion upthread -- the
proposed opportunistic freezing made a lot more sense if the
earlier-proposed open/closed pages concept was already available.
Yeah, sorry about that. The open/closed pages concept is still
something I plan on working on. My prototype (which I never posted to
the list) will be rebased, and I'll try to target Postgres 16.
Freezing whole pages
====================
It's possible that a higher cutoff (for example a cutoff of 80% of
BLCKSZ, not 50%) will actually lead to *worse* space utilization, in
addition to the downsides from fragmentation -- it's far from a simple
trade-off. (Not that you should believe that 50% is special, it's just
a starting point for me.)
How was the space utilization with the 50% cutoff in the TPC-C test?
The picture was mixed. To get the raw numbers, compare
pg-relation-sizes-after-patch-2.out and
pg-relation-sizes-after-master-2.out files from the drive link I
provided (to repeat, get them from
https://drive.google.com/drive/u/1/folders/1A1g0YGLzluaIpv-d_4o4thgmWbVx3LuR)
Highlights: the largest table (the bmsql_order_line table) had a total
size of x1.006 relative to master, meaning that we did slightly worse
there. However, the index on the same table was slightly smaller
instead, probably because reducing heap fragmentation tends to make
the index deletion stuff work a bit better than before.
Certain small tables (bmsql_district and bmsql_warehouse) were
actually significantly smaller (less than half their size on master),
probably just because the patch can reliably remove LP_DEAD items from
heap pages, even when a cleanup lock isn't available.
The bmsql_new_order table was quite a bit larger, but it's not that
large anyway (1250 MB on master at the very end, versus 1433 MB with
the patch). This is a clear trade-off, since we get much less
fragmentation in the same table (as evidenced by the VACUUM output,
where there are fewer pages with any LP_DEAD items per VACUUM with the
patch). The workload for that table is characterized by inserting new
orders together, and deleting the same orders as a group later on. So
we're bound to pay a cost in space utilization to lower the
fragmentation.
blks_hit | 174,551,067,731
tup_fetched | 124,797,772,450
Here is the same pg_stat_database info for master:
blks_hit | 283,015,966,386
tup_fetched | 237,052,965,901
That's impressive.
Thanks!
It's still possible to get a big improvement like that with something
like TPC-C because there are certain behaviors that are clearly
suboptimal -- once you look at the details of the workload, and
compare an imaginary ideal to the actual behavior of the system. In
particular, there is really only one way that the free space
management can work for the two big tables that will perform
acceptably -- the orders have to be stored in the same place to begin
with, and stay in the same place forever (at least to the extent that
that's possible).
--
Peter Geoghegan
On Sat, Jan 29, 2022 at 11:43 PM Peter Geoghegan <pg@bowt.ie> wrote:
When VACUUM sees that all remaining/unpruned tuples on a page are
all-visible, it isn't just important because of cost control
considerations. It's deeper than that. It's also treated as a
tentative signal from the application itself, about the data itself.
Which is: this page looks "settled" -- it may never be updated again,
but if there is an update it likely won't change too much about the
whole page.
While I agree that there's some case to be made for leaving settled
pages well enough alone, your criterion for settled seems pretty much
accidental. Imagine a system where there are two applications running,
A and B. Application A runs all the time and all the transactions
which it performs are short. Therefore, when a certain page is not
modified by transaction A for a short period of time, the page will
become all-visible and will be considered settled. Application B runs
once a month and performs various transactions all of which are long,
perhaps on a completely separate set of tables. While application B is
running, pages take longer to settle not only for application B but
also for application A. It doesn't make sense to say that the
application is in control of the behavior when, in reality, it may be
some completely separate application that is controlling the behavior.
The application is in charge, really -- not VACUUM. This is already
the case, whether we like it or not. VACUUM needs to learn to live in
that reality, rather than fighting it. When VACUUM considers a page
settled, and the physical page still has a relatively large amount of
free space (say 45% of BLCKSZ, a borderline case in the new FSM
patch), "losing" so much free space certainly is unappealing. We set
the free space to 0 in the free space map all the same, because we're
cutting our losses at that point. While the exact threshold I've
proposed is tentative, the underlying theory seems pretty sound to me.
The BLCKSZ/2 cutoff (and the way that it extends the general rules for
whole-page freezing) is intended to catch pages that are qualitatively
different, as well as quantitatively different. It is a balancing act,
between not wasting space, and the risk of systemic problems involving
excessive amounts of non-HOT updates that must move a successor
version to another page.
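For illustration, the cutoff described above might look something like the
following. This is only a sketch of the idea, with a made-up helper name,
not the actual v7-0006 patch code:

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Hypothetical helper: how much free space should VACUUM report to the
 * FSM for a page it considers "settled" (all remaining tuples are
 * all-visible)?  Per the BLCKSZ/2 cutoff discussed above, anything up
 * to half a block is deliberately given up on.
 */
static Size
settled_page_fsm_space(Page page)
{
	Size		freespace = PageGetHeapFreeSpace(page);

	if (freespace <= BLCKSZ / 2)
	{
		/*
		 * Cut our losses: advertise no free space at all, so unrelated
		 * new tuples don't migrate here and fragment the page.
		 */
		return 0;
	}

	/* Plenty of space left -- report it as usual */
	return freespace;
}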
I can see that this could have significant advantages under some
circumstances. But I think it could easily be far worse under other
circumstances. I mean, you can have workloads where you do some amount
of read-write work on a table and then go read only and sequential
scan it an infinite number of times. An algorithm that causes the
table to be smaller at the point where we switch to read-only
operations, even by a modest amount, wins infinitely over anything
else. But even if you have no change in the access pattern, is it a
good idea to allow the table to be, say, 5% larger if it means that
correlated data is colocated? In general, probably yes. If that means
that the table fails to fit in shared_buffers instead of fitting, no.
If that means that the table fails to fit in the OS cache instead of
fitting, definitely no.
And to me, that kind of effect is why it's hard to gain much
confidence in regards to stuff like this via laboratory testing. I
mean, I'm glad you're doing such tests. But in a laboratory test, you
tend not to have things like a sudden and complete change in the
workload, or a random other application sometimes sharing the machine,
or only being on the edge of running out of memory. I think in general
people tend to avoid such things in benchmarking scenarios, but even
if you include stuff like this, it's hard to know what to include that
would be representative of real life, because just about anything
*could* happen in real life.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 2:45 PM Robert Haas <robertmhaas@gmail.com> wrote:
While I agree that there's some case to be made for leaving settled
pages well enough alone, your criterion for settled seems pretty much
accidental.
I fully admit that I came up with the FSM heuristic with TPC-C in
mind. But you have to start somewhere.
Fortunately, the main benefit of this patch series (avoiding the
freeze cliff during anti-wraparound VACUUMs, often avoiding
anti-wraparound VACUUMs altogether) don't depend on the experimental
FSM patch at all. I chose to post that now because it seemed to help
with my more general point about qualitatively different pages, and
freezing at the page level.
Imagine a system where there are two applications running,
A and B. Application A runs all the time and all the transactions
which it performs are short. Therefore, when a certain page is not
modified by application A for a short period of time, the page will
become all-visible and will be considered settled. Application B runs
once a month and performs various transactions all of which are long,
perhaps on a completely separate set of tables. While application B is
running, pages take longer to settle not only for application B but
also for application A. It doesn't make sense to say that the
application is in control of the behavior when, in reality, it may be
some completely separate application that is controlling the behavior.
Application B will already block pruning by VACUUM operations against
application A's table, and so effectively blocks recording of the
resultant free space in the FSM in your scenario. And so application A
and application B should be considered the same application already.
That's just how VACUUM works.
VACUUM isn't a passive observer of the system -- it's another
participant. It both influences and is influenced by almost everything
else in the system.
I can see that this could have significant advantages under some
circumstances. But I think it could easily be far worse under other
circumstances. I mean, you can have workloads where you do some amount
of read-write work on a table and then go read only and sequential
scan it an infinite number of times. An algorithm that causes the
table to be smaller at the point where we switch to read-only
operations, even by a modest amount, wins infinitely over anything
else. But even if you have no change in the access pattern, is it a
good idea to allow the table to be, say, 5% larger if it means that
correlated data is colocated? In general, probably yes. If that means
that the table fails to fit in shared_buffers instead of fitting, no.
If that means that the table fails to fit in the OS cache instead of
fitting, definitely no.
5% larger seems like a lot more than would be typical, based on what
I've seen. I don't think that the regression in this scenario can be
characterized as "infinitely worse", or anything like it. On a long
enough timeline, the potential upside of something like this is nearly
unlimited -- it could avoid a huge amount of write amplification. But
the potential downside seems to be small and fixed -- which is the
point (bounding the downside). The mere possibility of getting that
big benefit (avoiding the costs from heap fragmentation) is itself a
benefit, even when it turns out not to pay off in your particular
case. It can be seen as insurance.
And to me, that kind of effect is why it's hard to gain much
confidence in regards to stuff like this via laboratory testing. I
mean, I'm glad you're doing such tests. But in a laboratory test, you
tend not to have things like a sudden and complete change in the
workload, or a random other application sometimes sharing the machine,
or only being on the edge of running out of memory. I think in general
people tend to avoid such things in benchmarking scenarios, but even
if you include stuff like this, it's hard to know what to include that
would be representative of real life, because just about anything
*could* happen in real life.
Then what could you have confidence in?
--
Peter Geoghegan
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
Application B will already block pruning by VACUUM operations against
application A's table, and so effectively blocks recording of the
resultant free space in the FSM in your scenario. And so application A
and application B should be considered the same application already.
That's just how VACUUM works.
Sure ... but that also sucks. If we consider application A and
application B to be the same application, then we're basing our
decision about what to do on information that is inaccurate.
5% larger seems like a lot more than would be typical, based on what
I've seen. I don't think that the regression in this scenario can be
characterized as "infinitely worse", or anything like it. On a long
enough timeline, the potential upside of something like this is nearly
unlimited -- it could avoid a huge amount of write amplification. But
the potential downside seems to be small and fixed -- which is the
point (bounding the downside). The mere possibility of getting that
big benefit (avoiding the costs from heap fragmentation) is itself a
benefit, even when it turns out not to pay off in your particular
case. It can be seen as insurance.
I don't see it that way. There are cases where avoiding writes is
better, and cases where trying to cram everything into the fewest
possible pages is better. With the right test case you can make either
strategy look superior. What I think your test case has going for it
is that it is similar to something that a lot of people, really a ton
of people, actually do with PostgreSQL. However, it's not going to be
an accurate model of what everybody does, and therein lies some
element of danger.
Then what could you have confidence in?
Real-world experience. Which is hard to get if we don't ever commit
any patches, but a good argument for (a) having them tested by
multiple different hackers who invent test cases independently and (b)
some configurability where we can reasonably include it, so that if
anyone does experience problems they have an escape.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 4:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Feb 4, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
Application B will already block pruning by VACUUM operations against
application A's table, and so effectively blocks recording of the
resultant free space in the FSM in your scenario. And so application A
and application B should be considered the same application already.
That's just how VACUUM works.
Sure ... but that also sucks. If we consider application A and
application B to be the same application, then we're basing our
decision about what to do on information that is inaccurate.
I agree that it sucks, but I don't think that it's particularly
relevant to the FSM prototype patch that I included with v7 of the
patch series. A heap page cannot be considered "closed" (either in the
specific sense from the patch, or in any informal sense) when it has
recently dead tuples.
At some point we should invent a fallback path for pruning, that
migrates recently dead tuples to some other subsidiary structure,
retaining only forwarding information in the heap page. But even that
won't change what I just said about closed pages (it'll just make it
easier to return and fix things up later on).
I don't see it that way. There are cases where avoiding writes is
better, and cases where trying to cram everything into the fewest
possible pages is better. With the right test case you can make either
strategy look superior.
The cost of reads is effectively much lower than writes with modern
SSDs, in TCO terms. Plus when a FSM strategy like the one from the
patch does badly according to a naive measure such as total table
size, that in itself doesn't mean that we do worse with reads. In
fact, it's quite the opposite.
The benchmark showed that v7 of the patch did very slightly worse on
overall space utilization, but far, far better on reads. In fact, the
benefits for reads were far in excess of any efficiency gains for
writes/with WAL. The greatest bottleneck is almost always latency on
modern hardware [1]. It follows that keeping logically related data
grouped together is crucial. Far more important than potentially using
very slightly more space.
The story I wanted to tell with the FSM patch was about open and
closed pages being the right long term direction. More generally, we
should emphasize managing page-level costs, and deemphasize managing
tuple-level costs, which are much less meaningful.
What I think your test case has going for it
is that it is similar to something that a lot of people, really a ton
of people, actually do with PostgreSQL. However, it's not going to be
an accurate model of what everybody does, and therein lies some
element of danger.
No question -- agreed.
Then what could you have confidence in?
Real-world experience. Which is hard to get if we don't ever commit
any patches, but a good argument for (a) having them tested by
multiple different hackers who invent test cases independently and (b)
some configurability where we can reasonably include it, so that if
anyone does experience problems they have an escape.
I agree.
[1]: https://dl.acm.org/doi/10.1145/1022594.1022596
--
Peter Geoghegan
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote:
My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.
While I've seen all the above cases triggering anti-wraparound vacuums,
by far the majority of the cases are not of these pathological forms.
By far the majority of anti-wraparound vacuums are triggered by tables
that are very large and so don't trigger regular vacuums for "long
periods" of time and consistently hit the anti-wraparound threshold
first.
There's nothing limiting how long "long periods" is and nothing tying
it to the rate of xid consumption. It's quite common to have some
*very* large mostly static tables in databases that have other tables
that are *very* busy.
The worst I've seen is a table that took 36 hours to vacuum in a
database that consumed about a billion transactions per day... That's
extreme but these days it's quite common to see tables that get
anti-wraparound vacuums every week or so despite having < 1% modified
tuples. And databases are only getting bigger and transaction rates
faster...
--
greg
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote:
On Wed, 15 Dec 2021 at 15:30, Peter Geoghegan <pg@bowt.ie> wrote:
My emphasis here has been on making non-aggressive VACUUMs *always*
advance relfrozenxid, outside of certain obvious edge cases. And so
with all the patches applied, up to and including the opportunistic
freezing patch, every autovacuum of every table manages to advance
relfrozenxid during benchmarking -- usually to a fairly recent value.
I've focussed on making aggressive VACUUMs (especially anti-wraparound
autovacuums) a rare occurrence, for truly exceptional cases (e.g.,
user keeps canceling autovacuums, maybe due to automated script that
performs DDL). That has taken priority over other goals, for now.
While I've seen all the above cases triggering anti-wraparound vacuums,
by far the majority of the cases are not of these pathological forms.
Right - it's practically inevitable that you'll need an
anti-wraparound VACUUM to advance relfrozenxid right now. Technically
it's possible to advance relfrozenxid in any VACUUM, but in practice
it just never happens on a large table. You only need to get unlucky
with one heap page, either by failing to get a cleanup lock, or (more
likely) by setting even one single page all-visible but not all-frozen
just once (once in any VACUUM that takes place between anti-wraparound
VACUUMs).
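Roughly speaking, the rule on master amounts to the following check -- a
simplified sketch in terms of vacuumlazy.c's own bookkeeping fields, not
the exact code:

/*
 * Simplified sketch: relfrozenxid/relminmxid can only be advanced when
 * every page that wasn't scanned was skipped as known all-frozen.  One
 * page set all-visible but not all-frozen, or one cleanup lock conflict
 * on a page with tuples that need freezing (in a non-aggressive VACUUM),
 * is enough to forfeit advancement for the whole table.
 */
static bool
can_advance_relfrozenxid(const LVRelState *vacrel)
{
	return (vacrel->scanned_pages + vacrel->frozenskipped_pages >=
			vacrel->rel_pages);
}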
By far the majority of anti-wraparound vacuums are triggered by tables
that are very large and so don't trigger regular vacuums for "long
periods" of time and consistently hit the anti-wraparound threshold
first.
autovacuum_vacuum_insert_scale_factor can help with this on 13 and 14,
but only if you tune autovacuum_freeze_min_age with that goal in mind.
Which probably doesn't happen very often.
There's nothing limiting how long "long periods" is and nothing tying
it to the rate of xid consumption. It's quite common to have some
*very* large mostly static tables in databases that have other tables
that are *very* busy.
The worst I've seen is a table that took 36 hours to vacuum in a
database that consumed about a billion transactions per day... That's
extreme but these days it's quite common to see tables that get
anti-wraparound vacuums every week or so despite having < 1% modified
tuples. And databases are only getting bigger and transaction rates
faster...
Sounds very much like what I've been calling the freezing cliff. An
anti-wraparound VACUUM throws things off by suddenly dirtying many
more pages than the expected amount for a VACUUM against the table,
despite there being no change in workload characteristics. If you just
had to remove the dead tuples in such a table, then it probably
wouldn't matter if it happened earlier than expected.
--
Peter Geoghegan
On Fri, Feb 4, 2022 at 10:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
Right - it's practically inevitable that you'll need an
anti-wraparound VACUUM to advance relfrozenxid right now. Technically
it's possible to advance relfrozenxid in any VACUUM, but in practice
it just never happens on a large table. You only need to get unlucky
with one heap page, either by failing to get a cleanup lock, or (more
likely) by setting even one single page all-visible but not all-frozen
just once (once in any VACUUM that takes place between anti-wraparound
VACUUMs).
Minor correction: That's a slight exaggeration, since we won't skip
groups of all-visible pages that don't exceed SKIP_PAGES_THRESHOLD
blocks (32 blocks).
--
Peter Geoghegan
On Fri, Feb 4, 2022 at 10:21 PM Greg Stark <stark@mit.edu> wrote:
By far the majority of anti-wraparound vacuums are triggered by tables
that are very large and so don't trigger regular vacuums for "long
periods" of time and consistently hit the anti-wraparound threshold
first.
That's interesting, because my experience is different. Most of the
time when I get asked to look at a system, it turns out that there is
a prepared transaction or a forgotten replication slot and nobody
noticed until the system hit the wraparound threshold. Or occasionally
a long-running transaction or a failing/stuck vacuum that has the same
effect.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 4, 2022 at 10:45 PM Peter Geoghegan <pg@bowt.ie> wrote:
While I've seen all the above cases triggering anti-wraparound vacuums,
by far the majority of the cases are not of these pathological forms.
Right - it's practically inevitable that you'll need an
anti-wraparound VACUUM to advance relfrozenxid right now. Technically
it's possible to advance relfrozenxid in any VACUUM, but in practice
it just never happens on a large table. You only need to get unlucky
with one heap page, either by failing to get a cleanup lock, or (more
likely) by setting even one single page all-visible but not all-frozen
just once (once in any VACUUM that takes place between anti-wraparound
VACUUMs).
But ... if I'm not mistaken, in the kind of case that Greg is
describing, relfrozenxid will be advanced exactly as often as it is
today. That's because, if VACUUM is only ever getting triggered by XID
age advancement and not by bloat, there's no opportunity for your
patch set to advance relfrozenxid any sooner than we're doing now. So
I think that people in this kind of situation will potentially be
helped or hurt by other things the patch set does, but the eager
relfrozenxid stuff won't make any difference for them.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Feb 7, 2022 at 10:08 AM Robert Haas <robertmhaas@gmail.com> wrote:
But ... if I'm not mistaken, in the kind of case that Greg is
describing, relfrozenxid will be advanced exactly as often as it is
today.
But what happens today in a scenario like Greg's is pathological,
despite being fairly common (common in large DBs). It doesn't seem
informative to extrapolate too much from current experience for that
reason.
That's because, if VACUUM is only ever getting triggered by XID
age advancement and not by bloat, there's no opportunity for your
patch set to advance relfrozenxid any sooner than we're doing now.
We must distinguish between:
1. "VACUUM is fundamentally never going to need to run unless it is
forced to, just to advance relfrozenxid" -- this applies to tables
like the stock and customers tables from the benchmark.
and:
2. "VACUUM must sometimes run to mark newly appended heap pages
all-visible, and maybe to also remove dead tuples, but not that often
-- and yet we currently only get expensive and inconveniently timed
anti-wraparound VACUUMs, no matter what" -- this applies to all the
other big tables in the benchmark, in particular to the orders and
order lines tables, but also to simpler cases like pgbench_history.
As I've said a few times now, the patch doesn't change anything for 1.
But Greg's problem tables very much sound like they're from category
2. And what we see with the master branch for such tables is that they
always get anti-wraparound VACUUMs, past a certain size (depends on
things like exact XID rate and VACUUM settings, the insert-driven
autovacuum scheduling stuff matters). The patch, by contrast, never
reaches that point in practice during my testing -- and doesn't come close.
It is true that in theory, as the size of one of these "category 2"
tables tends to infinity, the patch ends up behaving the same as
master anyway. But I'm pretty sure that that usually doesn't matter at
all, or matters less than you'd think. As I emphasized when presenting
the recent v7 TPC-C benchmark, neither of the two "TPC-C big problem
tables" (which are particularly interesting/tricky examples of tables
from category 2) come close to getting an anti-wraparound VACUUM
(plus, as I said in the same email, wouldn't matter if they did).
So I think that people in this kind of situation will potentially be
helped or hurt by other things the patch set does, but the eager
relfrozenxid stuff won't make any difference for them.
To be clear, I think it would if everything was in place, including
the basic relfrozenxid advancement thing, plus the new freezing stuff
(though you wouldn't need the experimental FSM thing to get this
benefit).
Here is a thought experiment that may make the general idea a bit clearer:
Imagine I reran the same benchmark as before, with the same settings,
and the expectation that everything would be the same as first time
around for the patch series. But to make things more interesting, this
time I add an adversarial element: I add an adversarial gizmo that
burns XIDs steadily, without doing any useful work. This gizmo doubles
the rate of XID consumption for the database as a whole, perhaps by
calling "SELECT txid_current()" in a loop, followed by a timed sleep
(with a delay chosen with the goal of doubling XID consumption). I
imagine that this would also burn CPU cycles, but probably not enough
to make more than a noise level impact -- so we're severely stressing
the implementation by adding this gizmo, but the stress is precisely
targeted at XID consumption and related implementation details. It's a
pretty clean experiment. What happens now?
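For concreteness, such a gizmo could be as simple as the following libpq
loop. The connection string and the 1ms sleep are placeholders, not values
from any actual test:

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
	/* Placeholder connection string */
	PGconn	   *conn = PQconnectdb("dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* Burn XIDs until interrupted; tune the sleep to hit the target rate */
	for (;;)
	{
		PGresult   *res = PQexec(conn, "SELECT txid_current()");

		if (PQresultStatus(res) != PGRES_TUPLES_OK)
			fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
		PQclear(res);

		usleep(1000);			/* roughly 1ms between XID allocations */
	}
}

(Build with something like "cc xidburn.c -lpq". Each txid_current() call
allocates a new XID without doing any useful work.)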
I believe (though haven't checked for myself) that nothing important
would change. We'd still see the same VACUUM operations occur at
approximately the same times (relative to the start of the benchmark)
that we saw with the original benchmark, and each VACUUM operation
would do approximately the same amount of physical work on each
occasion. Of course, the autovacuum log output would show that the
OldestXmin for each individual VACUUM operation had larger values than
first time around for this newly initdb'd TPC-C database (purely as a
consequence of the XID burning gizmo), but it would *also* show
*concomitant* increases for our newly set relfrozenxid. The system
should therefore hardly behave differently at all compared to the
original benchmark run, despite this adversarial gizmo.
It's fair to wonder: okay, but what if it was 4x, 8x, 16x? What then?
That does get a bit more complicated, and we should get into why that
is. But for now I'll just say that I think that even that kind of
extreme would make much less difference than you might think -- since
relfrozenxid advancement has been qualitatively improved by the patch
series. It is especially likely that nothing would change if you were
willing to increase autovacuum_freeze_max_age to get a bit more
breathing room -- room to allow the autovacuums to run at their
"natural" times. You wouldn't necessarily have to go too far -- the
extra breathing room from increasing autovacuum_freeze_max_age buys
more wall clock time *between* any two successive "naturally timed
autovacuums". Again, a virtuous cycle.
Does that make sense? It's pretty subtle, admittedly, and you no doubt
have (very reasonable) concerns about the extremes, even if you accept
all that. I just want to get the general idea across here, as a
starting point for further discussion.
--
Peter Geoghegan
On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
That's because, if VACUUM is only ever getting triggered by XID
age advancement and not by bloat, there's no opportunity for your
patch set to advance relfrozenxid any sooner than we're doing now.
We must distinguish between:
1. "VACUUM is fundamentally never going to need to run unless it is
forced to, just to advance relfrozenxid" -- this applies to tables
like the stock and customers tables from the benchmark.
and:
2. "VACUUM must sometimes run to mark newly appended heap pages
all-visible, and maybe to also remove dead tuples, but not that often
-- and yet we currently only get expensive and inconveniently timed
anti-wraparound VACUUMs, no matter what" -- this applies to all the
other big tables in the benchmark, in particular to the orders and
order lines tables, but also to simpler cases like pgbench_history.
It's not really very understandable for me when you refer to the way
table X behaves in Y benchmark, because I haven't studied that in
enough detail to know. If you say things like insert-only table, or a
continuous-random-updates table, or whatever the case is, it's a lot
easier to wrap my head around it.
Does that make sense? It's pretty subtle, admittedly, and you no doubt
have (very reasonable) concerns about the extremes, even if you accept
all that. I just want to get the general idea across here, as a
starting point for further discussion.
Not really. I think you *might* be saying tables which currently get
only wraparound vacuums will end up getting other kinds of vacuums
with your patch because things will improve enough for other tables in
the system that they will be able to get more attention than they do
currently. But I'm not sure I am understanding you correctly, and even
if I am I don't understand why that would be so, and even if it is I
think it doesn't help if essentially all the tables in the system are
suffering from the problem.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Feb 7, 2022 at 12:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Feb 7, 2022 at 11:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
That's because, if VACUUM is only ever getting triggered by XID
age advancement and not by bloat, there's no opportunity for your
patch set to advance relfrozenxid any sooner than we're doing now.
We must distinguish between:
1. "VACUUM is fundamentally never going to need to run unless it is
forced to, just to advance relfrozenxid" -- this applies to tables
like the stock and customers tables from the benchmark.
and:
2. "VACUUM must sometimes run to mark newly appended heap pages
all-visible, and maybe to also remove dead tuples, but not that often
-- and yet we currently only get expensive and inconveniently timed
anti-wraparound VACUUMs, no matter what" -- this applies to all the
other big tables in the benchmark, in particular to the orders and
order lines tables, but also to simpler cases like pgbench_history.
It's not really very understandable for me when you refer to the way
table X behaves in Y benchmark, because I haven't studied that in
enough detail to know. If you say things like insert-only table, or a
continuous-random-updates table, or whatever the case is, it's a lot
easier to wrap my head around it.
What I've called category 2 tables are the vast majority of big tables
in practice. They include pure append-only tables, but also tables
that grow and grow from inserts, but also have some updates. The point
of the TPC-C order + order lines examples was to show how broad the
category really is. And how mixtures of inserts and bloat from updates
on a single table confuse the implementation in general.
Does that make sense? It's pretty subtle, admittedly, and you no doubt
have (very reasonable) concerns about the extremes, even if you accept
all that. I just want to get the general idea across here, as a
starting point for further discussion.
Not really. I think you *might* be saying tables which currently get
only wraparound vacuums will end up getting other kinds of vacuums
with your patch because things will improve enough for other tables in
the system that they will be able to get more attention than they do
currently.
Yes, I am.
But I'm not sure I am understanding you correctly, and even
if I am I don't understand why that would be so, and even if it is I
think it doesn't help if essentially all the tables in the system are
suffering from the problem.
When I say "relfrozenxid advancement has been qualitatively improved
by the patch", what I mean is that the rate of relfrozenxid advancement
is now far closer to the theoretically optimal rate for our current
design -- with freezing, with 32-bit XIDs, and with the existing
invariants for freezing.
Consider the extreme case, and generalize. In the simple append-only
table case, it is most obvious. The final relfrozenxid is very close
to OldestXmin (only tiny noise level differences appear), regardless
of XID consumption by the system in general, and even within the
append-only table in particular. Other cases are somewhat trickier,
but have roughly the same quality, to a surprising degree. Lots of
things that never really should have affected relfrozenxid to begin
with do not, for the first time.
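To illustrate the ratcheting that makes this work, here is a toy
standalone sketch -- not code from the patch. The XID values are made
up, and the real code uses TransactionIdPrecedes() to handle 32-bit
wraparound rather than a plain comparison:

#include <stdio.h>

typedef unsigned int TransactionId;

int
main(void)
{
    /* OldestXmin is the most optimistic possible new relfrozenxid */
    TransactionId OldestXmin = 1000;
    /* XIDs that will remain unfrozen in the table after this VACUUM */
    TransactionId unfrozen[] = {1200, 730, 990};
    TransactionId NewRelfrozenxid = OldestXmin;

    for (int i = 0; i < 3; i++)
    {
        /* Ratchet back only; XIDs newer than the current value are ignored */
        if (unfrozen[i] < NewRelfrozenxid)
            NewRelfrozenxid = unfrozen[i];
    }

    /* Prints 730: a lower bound on every XID left in the table, and
     * typically well ahead of any FreezeLimit-based value */
    printf("new relfrozenxid: %u\n", NewRelfrozenxid);
    return 0;
}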
--
Peter Geoghegan
On Sat, Jan 29, 2022 at 8:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v7, a revision that overhauls the algorithm that decides
what to freeze. I'm now calling it block-driven freezing in the commit
message. Also included is a new patch that makes VACUUM record zero
free space in the FSM for an all-visible page, unless the total amount
of free space happens to be greater than one half of BLCKSZ.
I pushed the earlier refactoring and instrumentation patches today.
Attached is v8. No real changes -- just a rebased version.
It will be easier to benchmark and test the page-driven freezing stuff
now, since the master/baseline case will now output instrumentation
showing how relfrozenxid has been advanced (if at all) -- whether (and
to what extent) each VACUUM operation advances relfrozenxid can now be
directly compared, just by monitoring the log_autovacuum_min_duration
output for a given table over time.
--
Peter Geoghegan
Attachments:
v8-0003-Add-all-visible-FSM-heuristic.patch (application/x-patch)
From 41136d2a8af434a095ce3e6dfdfbe4b48b9ec338 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 23 Jan 2022 21:10:38 -0800
Subject: [PATCH v8 3/3] Add all-visible FSM heuristic.
When recording free space in an all-frozen page, record that the page has
zero free space when it has less than half BLCKSZ worth of space,
according to the traditional definition. Otherwise record free space as
usual.
Making all-visible pages resistant to change like this can be thought of
as a form of hysteresis. The page is given an opportunity to "settle"
and permanently stay in the same state when the tuples on the page will
never be updated or deleted. But when they are updated or deleted, the
page can once again be used to store any tuple. Over time, most pages
tend to settle permanently in many workloads, perhaps only on the second
or third attempt.
---
src/backend/access/heap/vacuumlazy.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ea4b75189..95049ed25 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1231,6 +1231,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
*/
freespace = PageGetHeapFreeSpace(page);
+ /*
+ * An all-visible page should not have its free space
+ * available from FSM unless it's more than half empty
+ */
+ if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+ freespace = 0;
+
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
continue;
@@ -1368,6 +1375,13 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
{
Size freespace = PageGetHeapFreeSpace(page);
+ /*
+ * An all-visible page should not have its free space available
+ * from FSM unless it's more than half empty
+ */
+ if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+ freespace = 0;
+
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
}
@@ -2537,6 +2551,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
page = BufferGetPage(buf);
freespace = PageGetHeapFreeSpace(page);
+ /*
+ * An all-visible page should not have its free space available from
+ * FSM unless it's more than half empty
+ */
+ if (PageIsAllVisible(page) && freespace < BLCKSZ / 2)
+ freespace = 0;
+
UnlockReleaseBuffer(buf);
RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
vacuumed_pages++;
--
2.30.2
v8-0002-Make-block-level-characteristics-drive-freezing.patch (application/x-patch)
From 4838bd1f11b748d2082caedfe4da506b8fe3f67a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v8 2/3] Make block-level characteristics drive freezing.
Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen. VACUUM won't freeze _any_ tuples on the page unless
_all_ tuples (that remain after pruning) are all-visible. It may
occasionally be necessary to freeze the page due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff. But the
FreezeLimit mechanism will seldom have any impact on which pages are
frozen anymore -- it is just a backstop now.
Freezing can now informally be thought of as something that takes place
at the level of an entire page, or not at all -- differences in XIDs
among tuples on the same page are not interesting, barring extreme
cases. Freezing a page is now practically synonymous with setting the
page to all-visible in the visibility map, at least to users.
The main upside of the new approach to freezing is that it makes the
overhead of vacuuming much more predictable over time. We avoid the
need for large balloon payments, since the system no longer accumulates
"freezing debt" that can only be paid off by anti-wraparound vacuuming.
This seems to have been particularly troublesome with append-only
tables, especially in the common case where XIDs from pages that are
marked all-visible for the first time are still fairly young (in
particular, not as old as indicated by VACUUM's vacuum_freeze_min_age
freezing cutoff). Before now, nothing stopped these pages from being
set to all-visible (without also being set to all-frozen) the first time
they were reached by VACUUM, which meant that they just couldn't be
frozen until the next anti-wraparound VACUUM -- at which point the XIDs
from the unfrozen tuples might be much older than vacuum_freeze_min_age.
In summary, the old vacuum_freeze_min_age-based FreezeLimit cutoff could
not _reliably_ limit freezing debt unless the GUC was set to 0.
There is a virtuous cycle enabled by the new approach to freezing:
freezing more tuples earlier during non-aggressive VACUUMs allows us to
advance relfrozenxid eagerly, which buys time. This creates every
opportunity for the workload to naturally generate enough dead tuples
(or newly inserted tuples) to make the autovacuum launcher launch a
non-aggressive autovacuum. The overall effect is that most individual
tables no longer require _any_ anti-wraparound vacuum operations. This
effect also owes much to the enhancement added by commit ?????, which
loosened the coupling between freezing and advancing relfrozenxid,
allowing VACUUM to precisely determine a new relfrozenxid.
It's still possible (and sometimes even likely) that VACUUM won't be
able to freeze a tuple with a somewhat older XID due only to a cleanup
lock not being immediately available. It's even possible that some
VACUUM operations will fail to advance relfrozenxid by very many XIDs as
a consequence. But the impact over time should be negligible. The next
VACUUM operation for the table will effectively get a new opportunity to
freeze (or perhaps remove) the same tuple that was originally missed.
Once that happens, relfrozenxid will completely catch up. (Actually, one
could reasonably argue that we never really "fell behind" in the first
place -- the amount of freezing needed to significantly advance
relfrozenxid won't have measurably increased at any point. A once-off
drop in the extent to which VACUUM can advance relfrozenxid is almost
certainly harmless noise.)
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 84 ++++++++++++++++++++++++----
1 file changed, 72 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d481a300b..ea4b75189 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,6 +169,7 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for pruning */
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -200,6 +201,7 @@ typedef struct LVRelState
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber newly_frozen_pages; /* # pages with tuples frozen by us */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -474,6 +476,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Set cutoffs for entire VACUUM */
vacrel->OldestXmin = OldestXmin;
+ vacrel->OldestMxact = OldestMxact;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
@@ -654,12 +657,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total), %u newly frozen (%.2f%% of total)\n"),
vacrel->removed_pages,
vacrel->rel_pages,
vacrel->scanned_pages,
orig_rel_pages == 0 ? 0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->scanned_pages / orig_rel_pages,
+ vacrel->newly_frozen_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * vacrel->newly_frozen_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
@@ -827,6 +833,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->scanned_pages = 0;
vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
+ vacrel->newly_frozen_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
@@ -1027,7 +1034,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* SKIP_PAGES_THRESHOLD (threshold for skipping) was not
* crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
+ * though it's all-visible (and likely all-frozen, too).
*/
all_visible_according_to_vm = true;
}
@@ -1589,7 +1596,7 @@ lazy_scan_prune(LVRelState *vacrel,
ItemId itemid;
HeapTupleData tuple;
HTSV_Result res;
- int tuples_deleted,
+ int tuples_deleted = 0,
lpdead_items,
recently_dead_tuples,
num_tuples,
@@ -1600,6 +1607,9 @@ lazy_scan_prune(LVRelState *vacrel,
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
TransactionId NewRelfrozenxid;
MultiXactId NewRelminmxid;
+ TransactionId FreezeLimit = vacrel->FreezeLimit;
+ MultiXactId MultiXactCutoff = vacrel->MultiXactCutoff;
+ bool freezeblk = false;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1610,7 +1620,6 @@ retry:
/* Initialize (or reset) page-level counters */
NewRelfrozenxid = vacrel->NewRelfrozenxid;
NewRelminmxid = vacrel->NewRelminmxid;
- tuples_deleted = 0;
lpdead_items = 0;
recently_dead_tuples = 0;
num_tuples = 0;
@@ -1625,9 +1634,9 @@ retry:
* lpdead_items's final value can be thought of as the number of tuples
* that were deleted from indexes.
*/
- tuples_deleted = heap_page_prune(rel, buf, vistest,
- InvalidTransactionId, 0, &nnewlpdead,
- &vacrel->offnum);
+ tuples_deleted += heap_page_prune(rel, buf, vistest,
+ InvalidTransactionId, 0, &nnewlpdead,
+ &vacrel->offnum);
/*
* Now scan the page to collect LP_DEAD items and check for tuples
@@ -1678,11 +1687,16 @@ retry:
* vacrel->nonempty_pages value) is inherently race-prone. It must be
* treated as advisory/unreliable, so we might as well be slightly
* optimistic.
+ *
+ * We delay setting all_visible to false due to seeing an LP_DEAD
+ * item. We need to test "is the page all_visible if we just consider
+ * remaining tuples with tuple storage?" below, when considering if we
+ * should freeze the tuples on the page. (all_visible will be set to
+ * false for caller once we've decided on what to freeze.)
*/
if (ItemIdIsDead(itemid))
{
deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
prunestate->has_lpdead_items = true;
continue;
}
@@ -1816,8 +1830,8 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
- vacrel->FreezeLimit,
- vacrel->MultiXactCutoff,
+ FreezeLimit,
+ MultiXactCutoff,
&frozen[nfrozen],
&tuple_totally_frozen,
&NewRelfrozenxid,
@@ -1837,6 +1851,50 @@ retry:
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * Freeze the whole page using OldestXmin (not FreezeLimit) as our cutoff
+ * if the page is now eligible to be marked all_visible (barring any
+ * LP_DEAD items) when the page is not already eligible to be marked
+ * all_frozen. We generally expect to freeze all of a block's tuples
+ * together and at once, or none at all. FreezeLimit is just a backstop
+ * mechanism that makes sure that we don't overlook one or two older
+ * tuples.
+ *
+ * For example, it's just about possible that successive VACUUM operations
+ * will never quite manage to use the main block-level logic to freeze one
+ * old tuple from a page where all other tuples are continually updated.
+ * We should not be in any hurry to freeze such a tuple. Even still, it's
+ * better if we take care of it before an anti-wraparound VACUUM becomes
+ * necessary -- that would mean that we'd have to wait for a cleanup lock
+ * during the aggressive VACUUM, which has risks of its own.
+ *
+ * FIXME This code structure has been used for prototyping and testing the
+ * algorithm, details of which have settled. Code itself to be rewritten,
+ * though. It is backwards right now -- should be _starting_ with
+ * OldestXmin (not FreezeLimit), since that's what happens at the
+ * conceptual level.
+ *
+ * TODO Make vacuum_freeze_min_age GUC/reloption default -1, which will be
+ * interpreted as "whatever autovacuum_freeze_max_age/2 is". Idea is to
+ * make FreezeLimit into a true backstop, and to do our best to avoid
+ * waiting for a cleanup lock (always prefer to punt to the next VACUUM,
+ * since we can advance relfrozenxid to the oldest XID on the page inside
+ * lazy_scan_noprune).
+ */
+ if (!freezeblk &&
+ ((nfrozen > 0 && nfrozen < num_tuples) ||
+ (prunestate->all_visible && !prunestate->all_frozen)))
+ {
+ freezeblk = true;
+ FreezeLimit = vacrel->OldestXmin;
+ MultiXactCutoff = vacrel->OldestMxact;
+ goto retry;
+ }
+
+ /* Time to define all_visible in a way that accounts for LP_DEAD items */
+ if (lpdead_items > 0)
+ prunestate->all_visible = false;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
@@ -1854,6 +1912,8 @@ retry:
{
Assert(prunestate->hastup);
+ vacrel->newly_frozen_pages++;
+
/*
* At least one tuple with storage needs to be frozen -- execute that
* now.
@@ -1882,7 +1942,7 @@ retry:
{
XLogRecPtr recptr;
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+ recptr = log_heap_freeze(vacrel->rel, buf, FreezeLimit,
frozen, nfrozen);
PageSetLSN(page, recptr);
}
--
2.30.2
v8-0001-Loosen-coupling-between-relfrozenxid-and-tuple-fr.patch (application/x-patch)
From 6c8cb32e074e7de2414b067fcf4011acb4cca121 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Nov 2021 10:02:30 -0800
Subject: [PATCH v8 1/3] Loosen coupling between relfrozenxid and tuple
freezing.
The pg_class.relfrozenxid invariant for heap relations is as follows:
relfrozenxid must be less than or equal to the oldest extant XID in the
table, and must never wraparound (it must be advanced by VACUUM before
wraparound, or in extreme cases the system must be forced to stop
allocating new XIDs).
Before now, VACUUM always set relfrozenxid to whatever value it happened
to use when determining which tuples to freeze (the VACUUM operation's
FreezeLimit cutoff). But there was no inherent reason why the oldest
extant XID in the table should be anywhere near as old as that.
Furthermore, even if it really was almost as old as FreezeLimit, that
tells us much more about the mechanism that VACUUM used to determine
which tuples to freeze than anything else. Depending on the details of
the table and workload, it may have been possible to safely advance
relfrozenxid by many more XIDs, at a relatively small cost in freezing
(possibly no extra cost at all) -- but VACUUM rigidly coupled freezing
with advancing relfrozenxid, missing all this.
Teach VACUUM to track the newest possible safe final relfrozenxid
dynamically (and to track a new value for relminmxid). In the extreme
though common case where all tuples are already frozen, or became frozen
(or were removed by pruning), the final relfrozenxid value will be
exactly equal to the OldestXmin value used by the same VACUUM operation.
A later patch will overhaul the strategy that VACUUM uses for freezing
so that relfrozenxid will tend to get set to a value that's relatively
close to OldestXmin in almost all cases.
Final relfrozenxid values still follow the same rules as before. They
must still be >= FreezeLimit in an aggressive VACUUM. Non-aggressive
VACUUMs can set relfrozenxid to any value that's greater than the
preexisting relfrozenxid, which could be either much earlier or much
later than FreezeLimit. Much depends on workload characteristics. In
practice there is significant natural variation that we can take
advantage of.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 4 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 186 ++++++++++++++++++++-------
src/backend/access/heap/vacuumlazy.c | 85 ++++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 34 +++--
7 files changed, 238 insertions(+), 81 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0ad87730e..d35402f9f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -168,7 +168,9 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf);
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..ae55c90f7 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 98230aac4..d85a817ff 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6087,12 +6087,24 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "NewRelfrozenxid" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain NewRelfrozenxid. We need to
+ * push maintenance of NewRelfrozenxid down this far, since in general xmin
+ * might have been frozen by an earlier VACUUM operation, in which case our
+ * caller will not have factored-in xmin when maintaining NewRelfrozenxid.
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *NewRelfrozenxid)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6104,6 +6116,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId tempNewRelfrozenxid;
*flags = 0;
@@ -6198,13 +6211,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ tempNewRelfrozenxid = *NewRelfrozenxid;
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
}
/*
@@ -6213,6 +6226,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *NewRelfrozenxid = tempNewRelfrozenxid;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6222,6 +6236,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ tempNewRelfrozenxid = *NewRelfrozenxid;
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6303,7 +6318,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, tempNewRelfrozenxid))
+ tempNewRelfrozenxid = members[i].xid;
+ }
}
else
{
@@ -6313,6 +6332,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6341,6 +6361,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Caller manages NewRelfrozenxid directly when we return an XID */
}
else
{
@@ -6350,6 +6371,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *NewRelfrozenxid = tempNewRelfrozenxid;
}
pfree(newmembers);
@@ -6368,6 +6390,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will actually go on to freeze as indicated by our *frz output, so
+ * any (xmin, xmax, xvac) XIDs that we indicate need to be frozen won't need
+ * to be counted here. Values are valid lower bounds at the point that the
+ * ongoing VACUUM finishes.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6392,7 +6421,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen_p,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6436,6 +6467,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
}
/*
@@ -6453,10 +6489,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId temp = *NewRelfrozenxid;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi, &flags, &temp);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6474,6 +6511,24 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+
+ if (TransactionIdPrecedes(newxmax, *NewRelfrozenxid))
+ {
+ /* New xmax is an XID older than new NewRelfrozenxid */
+ *NewRelfrozenxid = newxmax;
+ }
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * Changing nothing, so might have to ratchet back NewRelminmxid,
+ * NewRelfrozenxid, or both together
+ */
+ if (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, *NewRelminmxid))
+ *NewRelminmxid = xid;
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6495,6 +6550,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+
+ /*
+ * New multixact might have remaining XID older than
+ * NewRelfrozenxid
+ */
+ if (TransactionIdPrecedes(temp, *NewRelfrozenxid))
+ *NewRelfrozenxid = temp;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6522,7 +6584,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ {
+ /* won't be frozen, but older than current NewRelfrozenxid */
+ *NewRelfrozenxid = xid;
+ }
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6569,6 +6638,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * Since we always freeze here, NewRelfrozenxid doesn't need to be
+ * maintained.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6646,11 +6718,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId NewRelfrozenxid = FirstNormalTransactionId;
+ MultiXactId NewRelminmxid = FirstMultiXactId;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &NewRelfrozenxid, &NewRelminmxid);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7080,6 +7155,15 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should. If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
@@ -7088,74 +7172,86 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
*/
bool
heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi, Buffer buf)
+ MultiXactId cutoff_multi,
+ TransactionId *NewRelfrozenxid,
+ MultiXactId *NewRelminmxid, Buffer buf)
{
TransactionId xid;
+ bool needs_freeze = false;
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
/*
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
+ *
+ * (Actually, we maintain NewRelminmxid differently here, because we
+ * assume that XIDs that should be frozen according to cutoff_xid won't
+ * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
*/
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ if (MultiXactIdIsValid(multi) &&
+ MultiXactIdPrecedes(multi, *NewRelminmxid))
+ *NewRelminmxid = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
return true;
else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+ needs_freeze = true;
+
+ /* need to check whether any member of the mxact is too old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(members[i].xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
}
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *NewRelfrozenxid))
+ *NewRelfrozenxid = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d57055674..d481a300b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -172,8 +172,10 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+
+ /* Track new pg_class.relfrozenxid/pg_class.relminmxid values */
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
/* Error reporting state */
char *relnamespace;
@@ -330,6 +332,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -365,8 +368,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -473,8 +476,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+
+ /* Initialize values used to advance relfrozenxid/relminmxid at the end */
+ vacrel->NewRelfrozenxid = OldestXmin;
+ vacrel->NewRelminmxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -527,16 +532,18 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might only be able to
+ * advance relfrozenxid to an XID from before FreezeLimit (or a relminmxid
+ * from before MultiXactCutoff) when it wasn't possible to freeze some
+ * tuples due to our inability to acquire a cleanup lock, but the effect
+ * is usually insignificant -- NewRelfrozenxid value still has a decent
+ * chance of being much more recent that the existing relfrozenxid.
*
* NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
* the rel_pages used by lazy_scan_heap, which won't match when we
* happened to truncate the relation afterwards.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
/* Cannot advance relfrozenxid/relminmxid */
Assert(!aggressive);
@@ -548,11 +555,23 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
else
{
+ /*
+ * Aggressive case is strictly required to advance relfrozenxid, at
+ * least up to FreezeLimit (same applies with relminmxid and its
+ * cutoff, MultiXactCutoff). Assert that we got this right now.
+ */
Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
orig_rel_pages);
+ Assert(!aggressive ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenxid));
+ Assert(!aggressive ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminmxid));
+
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenxid, vacrel->NewRelminmxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -657,17 +676,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenxid - vacrel->relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenxid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminmxid - vacrel->relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminmxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1579,6 +1598,8 @@ lazy_scan_prune(LVRelState *vacrel,
int nfrozen;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid;
+ MultiXactId NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1587,6 +1608,8 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
/* Initialize (or reset) page-level counters */
+ NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ NewRelminmxid = vacrel->NewRelminmxid;
tuples_deleted = 0;
lpdead_items = 0;
recently_dead_tuples = 0;
@@ -1796,7 +1819,9 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenxid,
+ &NewRelminmxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1810,13 +1835,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1969,6 +1997,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenxid = vacrel->NewRelfrozenxid;
+ MultiXactId NewRelminmxid = vacrel->NewRelminmxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2015,7 +2045,8 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff, buf))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenxid, &NewRelminmxid, buf))
{
if (vacrel->aggressive)
{
@@ -2025,10 +2056,12 @@ lazy_scan_noprune(LVRelState *vacrel,
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * A non-aggressive VACUUM doesn't have to wait on a cleanup lock
+ * to ensure that it advances relfrozenxid to a sufficiently
+ * recent XID that happens to be present on this page. It can
+ * just accept an older New/final relfrozenxid instead. There is
+ * a decent chance that the problem will go away naturally.
*/
- vacrel->freeze_cutoffs_valid = false;
}
num_tuples++;
@@ -2078,6 +2111,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * We have committed to not freezing the tuples on this page (always
+ * happens with a non-aggressive VACUUM), so make sure that the target
+ * relfrozenxid/relminmxid values reflect the XIDs/MXIDs we encountered
+ */
+ vacrel->NewRelfrozenxid = NewRelfrozenxid;
+ vacrel->NewRelminmxid = NewRelminmxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b6767a5ff..d71ff21b1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,26 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact can be thought of as the most recent values that
+ * can ever be passed to vac_update_relstats() as frozenxid and minmulti
+ * arguments. These exact values can be used when no newer XIDs or MultiXacts
+ * remain in the heap relation (e.g., with an empty table). It's typical for
+ * vacuumlazy.c caller to notice that older XIDs/Multixacts remain in the
+ * table, which will force it to use the oldest extant values when it calls
+ * vac_update_relstats(). Ideally these values won't be very far behind the
+ * "optimal" oldestXmin and oldestMxact values we provide.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +973,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +982,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1077,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1096,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
--
2.30.2
On Fri, Feb 11, 2022 at 8:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v8. No real changes -- just a rebased version.
Concerns about my general approach to this project (and even the
Postgres 14 VACUUM work) were expressed by Robert and Andres over on
the "Nonrandom scanned_pages distorts pg_class.reltuples set by
VACUUM" thread. Some of what was said honestly shocked me. It now
seems unwise to pursue this project on my original timeline. I even
thought about shelving it indefinitely (which is still on the table).
I propose the following compromise: the least contentious patch alone
will be in scope for Postgres 15, while the other patches will not be.
I'm referring to the first patch from v8, which adds dynamic tracking
of the oldest extant XID in each heap table, in order to be able to
use it as our new relfrozenxid. I can't imagine that I'll have
difficulty convincing Andres of the merits of this idea, for one,
since it was his idea in the first place. It makes a lot of sense,
independent of any change to how and when we freeze.
The first patch is tricky, but at least it won't require elaborate
performance validation. It doesn't change any of the basic performance
characteristics of VACUUM. It sometimes allows us to advance
relfrozenxid to a value beyond FreezeLimit (typically only possible in
an aggressive VACUUM), which is an intrinsic good. If it isn't
effective then the overhead seems very unlikely to be noticeable. It's
pretty much a strictly additive improvement.
Are there any objections to this plan?
--
Peter Geoghegan
On Fri, Feb 18, 2022 at 3:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
Concerns about my general approach to this project (and even the
Postgres 14 VACUUM work) were expressed by Robert and Andres over on
the "Nonrandom scanned_pages distorts pg_class.reltuples set by
VACUUM" thread. Some of what was said honestly shocked me. It now
seems unwise to pursue this project on my original timeline. I even
thought about shelving it indefinitely (which is still on the table).
I propose the following compromise: the least contentious patch alone
will be in scope for Postgres 15, while the other patches will not be.
I'm referring to the first patch from v8, which adds dynamic tracking
of the oldest extant XID in each heap table, in order to be able to
use it as our new relfrozenxid. I can't imagine that I'll have
difficulty convincing Andres of the merits of this idea, for one,
since it was his idea in the first place. It makes a lot of sense,
independent of any change to how and when we freeze.
The first patch is tricky, but at least it won't require elaborate
performance validation. It doesn't change any of the basic performance
characteristics of VACUUM. It sometimes allows us to advance
relfrozenxid to a value beyond FreezeLimit (typically only possible in
an aggressive VACUUM), which is an intrinsic good. If it isn't
effective then the overhead seems very unlikely to be noticeable. It's
pretty much a strictly additive improvement.
Are there any objections to this plan?
I really like the idea of reducing the scope of what is being changed
here, and I agree that eagerly advancing relfrozenxid carries much
less risk than the other changes.
I'd like to have a clearer idea of exactly what is in each of the
remaining patches before forming a final opinion.
What's tricky about 0001? Does it change any other behavior, either as
a necessary component of advancing relfrozenxid more eagerly, or
otherwise?
If there's a way you can make the precise contents of 0002 and 0003
more clear, I would like that, too.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 18, 2022 at 12:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
I'd like to have a clearer idea of exactly what is in each of the
remaining patches before forming a final opinion.
Great.
What's tricky about 0001? Does it change any other behavior, either as
a necessary component of advancing relfrozenxid more eagerly, or
otherwise?
It does not change any other behavior. It's totally mechanical.
0001 is tricky in the sense that there are a lot of fine details, and
if you get any one of them wrong the result might be a subtle bug. For
example, the heap_tuple_needs_freeze() code path is only used when we
cannot get a cleanup lock, which is rare -- and some of the branches
within the function are relatively rare themselves. The obvious
concern is: What if some detail of how we track the new relfrozenxid
value (and new relminmxid value) in this seldom-hit codepath is just
wrong, in whatever way we didn't think of?
On the other hand, we must already be precise in almost the same way
within heap_tuple_needs_freeze() today -- it's not all that different
(we currently need to avoid leaving any XIDs < FreezeLimit behind,
which isn't made that much less complicated by the fact that it's a static
XID cutoff). Plus, we have experience with bugs like this. There was
hardening added to catch stuff like this back in 2017, following the
"freeze the dead" bug.
If there's a way you can make the precise contents of 0002 and 0003
more clear, I would like that, too.
The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible()
thing) wasn't on the table before now. 0002 is the patch that changes
the basic criteria for freezing, making it block-based rather than
based on the FreezeLimit cutoff (barring edge cases that are important
for correctness, but shouldn't noticeably affect freezing overhead).
The single biggest practical improvement from 0002 is that it
eliminates what I've called the freeze cliff, which is where many old
tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be
frozen all at once, in a balloon payment during an eventual aggressive
VACUUM. Although it's easy to see that that could be useful, it is
harder to justify (much harder) than anything else. Because we're
freezing more eagerly overall, we're also bound to do more freezing
without benefit in certain cases. Although I think that this can be
justified as the cost of doing business, that's a hard argument to
make.
In short, 0001 is mechanically tricky, but easy to understand at a
high level. Whereas 0002 is mechanically simple, but tricky to
understand at a high level (and therefore far trickier than 0001
overall).
--
Peter Geoghegan
On Fri, Feb 18, 2022 at 4:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
It does not change any other behavior. It's totally mechanical.
0001 is tricky in the sense that there are a lot of fine details, and
if you get any one of them wrong the result might be a subtle bug. For
example, the heap_tuple_needs_freeze() code path is only used when we
cannot get a cleanup lock, which is rare -- and some of the branches
within the function are relatively rare themselves. The obvious
concern is: What if some detail of how we track the new relfrozenxid
value (and new relminmxid value) in this seldom-hit codepath is just
wrong, in whatever way we didn't think of?
Right. I think we have no choice but to accept such risks if we want
to make any progress here, and every patch carries them to some
degree. I hope that someone else will review this patch in more depth
than I have just now, but what I notice reading through it is that
some of the comments seem pretty opaque. For instance:
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
"maintains" is fuzzy. I think you should be saying something much more
explicit, and the thing you are saying should make it clear that these
arguments are input-output arguments: i.e. the caller must set them
correctly before calling this function, and they will be updated by
the function. I don't think you have to spell all of that out in every
place where this comes up in the patch, but it needs to be clear from
what you do say. For example, I would be happier with a comment that
said something like "Every call to this function will either set
HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an
argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if
it is currently newer than that. Thus, after a series of calls to this
function, *NewRelfrozenxid represents a lower bound on unfrozen xmin
values in the tuples examined. Before calling this function, caller
should initialize *NewRelfrozenxid to <something>."
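Just to illustrate the contract I have in mind, a toy sketch (made-up
names and types, deliberately not your patch's actual signature) would be:
#include <stdint.h>
typedef uint32_t ToyTransactionId;
/*
 * Toy version of the ratcheting described above: caller seeds
 * *newrelfrozenxid (say, with the cutoff it would otherwise use for
 * relfrozenxid), and each unfrozen XID we leave behind can only pull it
 * backwards, so afterwards it is a lower bound on the unfrozen XIDs that
 * remain in the tuples examined.
 */
static void
toy_track_unfrozen_xid(ToyTransactionId unfrozen_xid,
					   ToyTransactionId *newrelfrozenxid)
{
	/* toy linear comparison; real code must use circular XID arithmetic */
	if (unfrozen_xid < *newrelfrozenxid)
		*newrelfrozenxid = unfrozen_xid;
}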
+ * Changing nothing, so might have to ratchet
back NewRelminmxid,
+ * NewRelfrozenxid, or both together
This comment I like.
+ * New multixact might have remaining XID older than
+ * NewRelfrozenxid
This one's good, too.
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should. If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
I don't understand this one.
+ * (Actually, we maintain NewRelminmxid differently here, because we
+ * assume that XIDs that should be frozen according to cutoff_xid won't
+ * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
This one either.
I haven't really grokked exactly what is happening in
heap_tuple_needs_freeze yet, and may not have time to study it further
in the near future. Not saying it's wrong, although improving the
comments above would likely help me out.
If there's a way you can make the precise contents of 0002 and 0003
more clear, I would like that, too.
The really big one is 0002 -- even 0003 (the FSM PageIsAllVisible()
thing) wasn't on the table before now. 0002 is the patch that changes
the basic criteria for freezing, making it block-based rather than
based on the FreezeLimit cutoff (barring edge cases that are important
for correctness, but shouldn't noticeably affect freezing overhead).
The single biggest practical improvement from 0002 is that it
eliminates what I've called the freeze cliff, which is where many old
tuples (much older than FreezeLimit/vacuum_freeze_min_age) must be
frozen all at once, in a balloon payment during an eventual aggressive
VACUUM. Although it's easy to see that that could be useful, it is
harder to justify (much harder) than anything else. Because we're
freezing more eagerly overall, we're also bound to do more freezing
without benefit in certain cases. Although I think that this can be
justified as the cost of doing business, that's a hard argument to
make.
You've used the term "freezing cliff" repeatedly in earlier emails,
and this is the first time I've been able to understand what you
meant. I'm glad I do, now.
But can you describe the algorithm that 0002 uses to accomplish this
improvement? Like "if it sees that the page meets criteria X, then it
freezes all tuples on the page, else if it sees that that individual
tuples on the page meet criteria Y, then it freezes just those." And
like explain what of that is same/different vs. now.
Thanks,
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-02-18 13:09:45 -0800, Peter Geoghegan wrote:
0001 is tricky in the sense that there are a lot of fine details, and
if you get any one of them wrong the result might be a subtle bug. For
example, the heap_tuple_needs_freeze() code path is only used when we
cannot get a cleanup lock, which is rare -- and some of the branches
within the function are relatively rare themselves. The obvious
concern is: What if some detail of how we track the new relfrozenxid
value (and new relminmxid value) in this seldom-hit codepath is just
wrong, in whatever way we didn't think of?
I think it'd be good to add a few isolationtest cases for the
can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The
slightly harder part is verifying that VACUUM did something reasonable, but
that still should be doable?
Greetings,
Andres Freund
Hi,
On 2022-02-18 15:54:19 -0500, Robert Haas wrote:
Are there any objections to this plan?
I really like the idea of reducing the scope of what is being changed
here, and I agree that eagerly advancing relfrozenxid carries much
less risk than the other changes.
Sounds good to me too!
Greetings,
Andres Freund
On Fri, Feb 18, 2022 at 1:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
"maintains" is fuzzy. I think you should be saying something much more
explicit, and the thing you are saying should make it clear that these
arguments are input-output arguments: i.e. the caller must set them
correctly before calling this function, and they will be updated by
the function.
Makes sense.
I don't think you have to spell all of that out in every
place where this comes up in the patch, but it needs to be clear from
what you do say. For example, I would be happier with a comment that
said something like "Every call to this function will either set
HEAP_XMIN_FROZEN in the xl_heap_freeze_tuple struct passed as an
argument, or else reduce *NewRelfrozenxid to the xmin of the tuple if
it is currently newer than that. Thus, after a series of calls to this
function, *NewRelfrozenxid represents a lower bound on unfrozen xmin
values in the tuples examined. Before calling this function, caller
should initialize *NewRelfrozenxid to <something>."
We have to worry about XIDs from MultiXacts (and xmax values more
generally). And we have to worry about the case where we start out
with only xmin frozen (by an earlier VACUUM), and then have to freeze
xmax too. I believe that we have to generally consider xmin and xmax
independently. For example, we cannot ignore xmax, just because we
looked at xmin, since in general xmin alone might have already been
frozen.
+ * Also maintains *NewRelfrozenxid and *NewRelminmxid, which are the current
+ * target relfrozenxid and relminmxid for the relation. Assumption is that
+ * caller will never freeze any of the XIDs from the tuple, even when we say
+ * that they should. If caller opts to go with our recommendation to freeze,
+ * then it must account for the fact that it shouldn't trust how we've set
+ * NewRelfrozenxid/NewRelminmxid. (In practice aggressive VACUUMs always take
+ * our recommendation because they must, and non-aggressive VACUUMs always opt
+ * to not freeze, preferring to ratchet back NewRelfrozenxid instead).
I don't understand this one.
+ * (Actually, we maintain NewRelminmxid differently here, because we
+ * assume that XIDs that should be frozen according to cutoff_xid won't
+ * be, whereas heap_prepare_freeze_tuple makes the opposite assumption.)
This one either.
The difference between the cleanup lock path (in
lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in
lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both
of these confusing comment blocks, really. Note that cutoff_xid is the
name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze
have for FreezeLimit (maybe we should rename every occurrence of
cutoff_xid in heapam.c to FreezeLimit).
At a high level, we aren't changing the fundamental definition of an
aggressive VACUUM in any of the patches -- we still need to advance
relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on
HEAD, today (we may be able to advance it *past* FreezeLimit, but
that's just a bonus). But in a non-aggressive VACUUM, where there is
still no strict requirement to advance relfrozenxid (by any amount),
the code added by 0001 can set relfrozenxid to any known safe value,
which could either be from before FreezeLimit, or after FreezeLimit --
almost anything is possible (provided we respect the relfrozenxid
invariant, and provided we see that we didn't skip any
all-visible-not-all-frozen pages).
Since we still need to "respect FreezeLimit" in an aggressive VACUUM,
the aggressive case might need to wait for a full cleanup lock the
hard way, having tried and failed to do it the easy way within
lazy_scan_noprune (lazy_scan_noprune will still return false when any
call to heap_tuple_needs_freeze for any tuple returns true) -- same
as on HEAD, today.
And so the difference at issue here is: FreezeLimit/cutoff_xid only
needs to affect the new NewRelfrozenxid value we use for relfrozenxid in
heap_prepare_freeze_tuple, which is involved in real freezing -- not
in heap_tuple_needs_freeze, whose main purpose is still to help us
avoid freezing where a cleanup lock isn't immediately available. While
the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze
is to determine its bool return value, which will only be of interest
to the aggressive case (which might have to get a cleanup lock and do
it the hard way), not the non-aggressive case (where ratcheting back
NewRelfrozenxid is generally possible, and generally leaves us with
almost as good of a value).
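Stated as a toy helper (made-up name and parameters, not the patch's
actual control flow), the difference amounts to:
#include <stdbool.h>
/*
 * Toy sketch: only an aggressive VACUUM ever waits for a cleanup lock, and
 * only when some tuple carries an XID before FreezeLimit that reduced
 * processing cannot deal with. A non-aggressive VACUUM just accepts the
 * page as-is and ratchets NewRelfrozenxid back accordingly.
 */
static bool
toy_must_wait_for_cleanup_lock(bool aggressive,
							   bool page_has_xid_before_freeze_limit)
{
	if (!aggressive)
		return false;			/* ratchet NewRelfrozenxid back instead */
	return page_has_xid_before_freeze_limit;
}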
In other words: the calls to heap_tuple_needs_freeze made from
lazy_scan_noprune are simply concerned with the page as it actually
is, whereas the similar/corresponding calls to
heap_prepare_freeze_tuple from lazy_scan_prune are concerned with
*what the page will actually become*, after freezing finishes, and
after lazy_scan_prune is done with the page entirely (ultimately
the final NewRelfrozenxid value set in pg_class.relfrozenxid only has
to be <= the oldest extant XID *at the time the VACUUM operation is
just about to end*, not some earlier time, so "being versus becoming"
is an interesting distinction for us).
Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
here, to make all of this less confusing. I only now fully realized
how confusing all of this stuff is -- very.
I haven't really grokked exactly what is happening in
heap_tuple_needs_freeze yet, and may not have time to study it further
in the near future. Not saying it's wrong, although improving the
comments above would likely help me out.
Definitely needs more polishing.
You've used the term "freezing cliff" repeatedly in earlier emails,
and this is the first time I've been able to understand what you
meant. I'm glad I do, now.
Ugh. I thought that a snappy term like that would catch on quickly. Guess not!
But can you describe the algorithm that 0002 uses to accomplish this
improvement? Like "if it sees that the page meets criteria X, then it
freezes all tuples on the page, else if it sees that that individual
tuples on the page meet criteria Y, then it freezes just those." And
like explain what of that is same/different vs. now.
The mechanics themselves are quite simple (again, understanding the
implications is the hard part). The approach taken within 0002 is
still rough, to be honest, but wouldn't take long to clean up (there
are XXX/FIXME comments about this in 0002).
As a general rule, we try to freeze all of the remaining live tuples
on a page (following pruning) together, as a group, or none at all.
Most of the time this is triggered by our noticing that the page is
about to be set all-visible (but not all-frozen), and doing work
sufficient to mark it fully all-frozen instead. Occasionally there is
FreezeLimit to consider, which is now more of a backstop thing, used
to make sure that we never get too far behind in terms of unfrozen
XIDs. This is useful in part because it avoids any future
non-aggressive VACUUM that is fundamentally unable to advance
relfrozenxid (you can't skip all-visible pages if there are only
all-frozen pages in the VM in practice).
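A minimal sketch of that trigger logic, with made-up names and a toy
linear XID comparison (so not 0002's actual code), looks like this:
#include <stdbool.h>
#include <stdint.h>
typedef uint32_t ToyTransactionId;
/*
 * Toy sketch of the page-level trigger described above: freeze the page's
 * remaining live tuples as a group either because the page would otherwise
 * become all-visible but not all-frozen, or because some unfrozen XID on
 * the page has fallen behind the FreezeLimit backstop.
 */
static bool
toy_should_freeze_page(bool will_be_all_visible,
					   ToyTransactionId oldest_unfrozen_xid,
					   ToyTransactionId freeze_limit)
{
	if (will_be_all_visible)
		return true;			/* make the page all-frozen instead */
	if (oldest_unfrozen_xid < freeze_limit)		/* toy comparison only */
		return true;			/* backstop against unfrozen XIDs aging */
	return false;
}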
We're generally doing a lot more freezing with 0002, but we still
manage to avoid freezing too much in tables like pgbench_tellers or
pgbench_branches -- tables where it makes the least sense. Such tables
will be updated so frequently that VACUUM is relatively unlikely to
ever mark any page all-visible, avoiding the main criteria for
freezing implicitly. It's also unlikely that they'll ever have an XID that is so
old as to trigger the fallback FreezeLimit-style criteria for freezing.
In practice, freezing tuples like this is generally not that expensive in
most tables where VACUUM freezes the majority of pages immediately
(tables that aren't like pgbench_tellers or pgbench_branches), because
they're generally big tables, where the overhead of FPIs tends
to dominate anyway (gambling that we can avoid more FPIs later on is not a
bad gamble, as gambles go). This seems to make the overhead
acceptable, on balance. Granted, you might be able to poke holes in
that argument, and reasonable people might disagree on what acceptable
should mean. There are many value judgements here, which makes it
complicated. (On the other hand we might be able to do better if there
was a particularly bad case for the 0002 work, if one came to light.)
--
Peter Geoghegan
On Fri, Feb 18, 2022 at 2:11 PM Andres Freund <andres@anarazel.de> wrote:
I think it'd be good to add a few isolationtest cases for the
can't-get-cleanup-lock paths. I think it shouldn't be hard using cursors. The
slightly harder part is verifying that VACUUM did something reasonable, but
that still should be doable?
We could even just extend existing, related tests, from vacuum-reltuples.spec.
Another testing strategy occurs to me: we could stress-test the
implementation by simulating an environment where the no-cleanup-lock
path is hit an unusually large number of times, possibly a fixed
percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
ConditionalLockBufferForCleanup() call return false randomly. Now that
we have lazy_scan_noprune for the no-cleanup-lock path (which is as
similar to the regular lazy_scan_prune path as possible), I wouldn't
expect this ConditionalLockBufferForCleanup() testing gizmo to be too
disruptive.
--
Peter Geoghegan
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
Another testing strategy occurs to me: we could stress-test the
implementation by simulating an environment where the no-cleanup-lock
path is hit an unusually large number of times, possibly a fixed
percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
ConditionalLockBufferForCleanup() call return false randomly. Now that
we have lazy_scan_noprune for the no-cleanup-lock path (which is as
similar to the regular lazy_scan_prune path as possible), I wouldn't
expect this ConditionalLockBufferForCleanup() testing gizmo to be too
disruptive.
I tried this out, using the attached patch. It was quite interesting,
even when run against HEAD. I think that I might have found a bug on
HEAD, though I'm not really sure.
If you modify the patch to simulate conditions under which
ConditionalLockBufferForCleanup() fails about 2% of the time, you get
much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
without it being so aggressive as to make "make check-world" fail --
which is exactly what I expected. If you are much more aggressive
about it, and make it 50% instead (which you can get just by using the
patch as written), then some tests will fail, mostly for reasons that
aren't surprising or interesting (e.g. plan changes). This is also
what I'd have guessed would happen.
However, it gets more interesting. One thing that I did not expect to
happen at all also happened (with the current 50% rate of simulated
ConditionalLockBufferForCleanup() failure from the patch): if I run
"make check" from the pg_surgery directory, then the Postgres backend
gets stuck in an infinite loop inside lazy_scan_prune, which has been
a symptom of several tricky bugs in the past year (not every time, but
usually). Specifically, the VACUUM statement launched by the SQL
command "vacuum freeze htab2;" from the file
contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
misbehavior.
This is a temp table, which is a choice made by the tests specifically
because they need to "use a temp table so that vacuum behavior doesn't
depend on global xmin". This is convenient way of avoiding spurious
regression tests failures (e.g. from autoanalyze), and relies on the
GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
94bc27b5.
It's quite possible that this is nothing more than a bug in my
adversarial gizmo patch -- since I don't think that
ConditionalLockBufferForCleanup() can ever fail with a temp buffer
(though even that's not completely clear right now). Even if the
behavior that I saw does not indicate a bug on HEAD, it still seems
informative. At the very least, it wouldn't hurt to Assert() that the
target table isn't a temp table inside lazy_scan_noprune, documenting
our assumptions around temp tables and
ConditionalLockBufferForCleanup().
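Concretely, such an assertion might be no more than the following
(assuming it sits at the top of lazy_scan_noprune, and that
RelationUsesLocalBuffers() remains the natural way to test for a temp
table):
/* Hypothetical hardening at the top of lazy_scan_noprune() */
Assert(!RelationUsesLocalBuffers(vacrel->rel));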
I haven't actually tried to debug the issue just yet, so take all this
with a grain of salt.
--
Peter Geoghegan
Attachments:
0001-Add-adversarial-ConditionalLockBufferForCleanup-gizm.txt
From 3f01281af3ba81b35777cb7d717f76e001fd3e10 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 19 Feb 2022 14:07:35 -0800
Subject: [PATCH] Add adversarial ConditionalLockBufferForCleanup() gizmo to
vacuumlazy.c.
---
src/backend/access/heap/vacuumlazy.c | 36 +++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 242511a23..31c6b360e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -50,6 +50,7 @@
#include "commands/dbcommands.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
+#include "common/pg_prng.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "optimizer/paths.h"
@@ -748,6 +749,39 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
}
+/*
+ * Adversarial gizmo, simulates excessive failure to get cleanup locks
+ */
+static inline bool
+lazy_conditionallockbufferforcleanup(Buffer buffer)
+{
+ /*
+ * Artificially fail to get a cleanup lock 50% of the time.
+ *
+ * XXX: What about temp tables? We simulate not getting a cleanup lock
+ * there, but is that choice actually reasonable?
+ */
+ if (pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 2))
+ return false;
+
+#if 0
+ /*
+ * 50% is very very aggressive, while 2% - 5% is still basically
+ * adversarial but in many ways less annoying.
+ *
+ * This version (which injects a failure to get a cleanup lock 2% of the
+ * time) seems to pass the regression tests, even with my parallel make
+ * check-world recipe. Expected query plans don't seem to shift on
+ * account of unexpected index bloat (nor are there any problems of a
+ * similar nature) with this variant of the gizmo.
+ */
+ if (pg_prng_uint32(&pg_global_prng_state) <= (PG_UINT32_MAX / 50))
+ return false;
+#endif
+
+ return ConditionalLockBufferForCleanup(buffer);
+}
+
/*
* lazy_scan_heap() -- workhorse function for VACUUM
*
@@ -1093,7 +1127,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* a cleanup lock right away, we may be able to settle for reduced
* processing using lazy_scan_noprune.
*/
- if (!ConditionalLockBufferForCleanup(buf))
+ if (!lazy_conditionallockbufferforcleanup(buf))
{
bool hastup,
recordfreespace;
--
2.30.2
Hi,
(On phone, so crappy formatting and no source access)
On February 19, 2022 3:08:41 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
Another testing strategy occurs to me: we could stress-test the
implementation by simulating an environment where the no-cleanup-lock
path is hit an unusually large number of times, possibly a fixed
percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
ConditionalLockBufferForCleanup() call return false randomly. Now that
we have lazy_scan_noprune for the no-cleanup-lock path (which is as
similar to the regular lazy_scan_prune path as possible), I wouldn't
expect this ConditionalLockBufferForCleanup() testing gizmo to be too
disruptive.
I tried this out, using the attached patch. It was quite interesting,
even when run against HEAD. I think that I might have found a bug on
HEAD, though I'm not really sure.
If you modify the patch to simulate conditions under which
ConditionalLockBufferForCleanup() fails about 2% of the time, you get
much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
without it being so aggressive as to make "make check-world" fail --
which is exactly what I expected. If you are much more aggressive
about it, and make it 50% instead (which you can get just by using the
patch as written), then some tests will fail, mostly for reasons that
aren't surprising or interesting (e.g. plan changes). This is also
what I'd have guessed would happen.
However, it gets more interesting. One thing that I did not expect to
happen at all also happened (with the current 50% rate of simulated
ConditionalLockBufferForCleanup() failure from the patch): if I run
"make check" from the pg_surgery directory, then the Postgres backend
gets stuck in an infinite loop inside lazy_scan_prune, which has been
a symptom of several tricky bugs in the past year (not every time, but
usually). Specifically, the VACUUM statement launched by the SQL
command "vacuum freeze htab2;" from the file
contrib/pg_surgery/sql/heap_surgery.sql, at line 54 leads to this
misbehavior.
This is a temp table, which is a choice made by the tests specifically
because they need to "use a temp table so that vacuum behavior doesn't
depend on global xmin". This is convenient way of avoiding spurious
regression tests failures (e.g. from autoanalyze), and relies on the
GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
94bc27b5.
We don't have a blocking path for cleanup locks of temporary buffers IIRC (normally not reachable). So I wouldn't be surprised if a cleanup lock failing would cause some odd behavior.
It's quite possible that this is nothing more than a bug in my
adversarial gizmo patch -- since I don't think that
ConditionalLockBufferForCleanup() can ever fail with a temp buffer
(though even that's not completely clear right now). Even if the
behavior that I saw does not indicate a bug on HEAD, it still seems
informative. At the very least, it wouldn't hurt to Assert() that the
target table isn't a temp table inside lazy_scan_noprune, documenting
our assumptions around temp tables and
ConditionalLockBufferForCleanup().
Definitely worth looking into more.
This reminds me of a recent thing I noticed in the aio patch. Spgist can end up busy looping when buffers are locked, instead of blocking. Not actually related, of course.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Feb 19, 2022 at 3:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
It's quite possible that this is nothing more than a bug in my
adversarial gizmo patch -- since I don't think that
ConditionalLockBufferForCleanup() can ever fail with a temp buffer
(though even that's not completely clear right now). Even if the
behavior that I saw does not indicate a bug on HEAD, it still seems
informative.
This very much looks like a bug in pg_surgery itself now -- attached
is a draft fix.
The temp table thing was a red herring. I found I could get exactly
the same kind of failure when htab2 was a permanent table (which was
how it originally appeared, before commit 0811f766fd made it into a
temp table due to test flappiness issues). The relevant "vacuum freeze
htab2" happens at a point after the test has already deliberately
corrupted one of its tuples using heap_force_kill(). It's not that we
aren't careful enough about the corruption at some point in
vacuumlazy.c, which was my second theory. But I quickly discarded that
idea, and came up with a third theory: the relevant heap_surgery.c
path does the relevant ItemIdSetDead() to kill items, without also
defragmenting the page to remove the tuples with storage, which is
wrong.
This meant that we depended on pruning happening (in this case during
VACUUM) and defragmenting the page in passing. But there is no reason
to not defragment the page within pg_surgery (at least no obvious
reason), since we have a cleanup lock anyway.
Theoretically you could blame this on lazy_scan_noprune instead, since
it thinks it can collect LP_DEAD items while assuming that they have
no storage, but that doesn't make much sense to me. There has never
been any way of setting a heap item to LP_DEAD without also
defragmenting the page. Since that's exactly what it means to prune a
heap page. (Actually, the same used to be true about heap vacuuming,
which worked more like heap pruning before Postgres 14, but that
doesn't seem important.)
--
Peter Geoghegan
Attachments:
0002-Fix-for-pg_surgery-s-heap_force_kill-function.txt
From 81f01ca623b115647ee78a1b09bbb4458fb35dab Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 19 Feb 2022 16:13:48 -0800
Subject: [PATCH 2/2] Fix for pg_surgery's heap_force_kill() function.
---
contrib/pg_surgery/heap_surgery.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/contrib/pg_surgery/heap_surgery.c b/contrib/pg_surgery/heap_surgery.c
index 3e641aa64..a3a193ba5 100644
--- a/contrib/pg_surgery/heap_surgery.c
+++ b/contrib/pg_surgery/heap_surgery.c
@@ -311,7 +311,8 @@ heap_force_common(FunctionCallInfo fcinfo, HeapTupleForceOption heap_force_opt)
*/
if (did_modify_page)
{
- /* Mark buffer dirty before we write WAL. */
+ /* Defragment and mark buffer dirty before we write WAL. */
+ PageRepairFragmentation(page);
MarkBufferDirty(buf);
/* XLOG stuff */
--
2.30.2
On Sat, Feb 19, 2022 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
This very much looks like a bug in pg_surgery itself now -- attached
is a draft fix.
Wait, that's not it either. I jumped the gun -- this isn't sufficient
(though the patch I posted might not be a bad idea anyway).
Looks like pg_surgery isn't processing HOT chains as whole units,
which it really should (at least in the context of killing items via
the heap_force_kill() function). Killing a root item in a HOT chain is
just hazardous -- disconnected/orphaned heap-only tuples are liable to
cause chaos, and should be avoided everywhere (including during
pruning, and within pg_surgery).
It's likely that the hardening I already planned on adding to pruning
[1] (as follow-up work to recent bugfix commit 18b87b201f) will
prevent lazy_scan_prune from getting stuck like this, whatever the
cause happens to be. The actual page image I see lazy_scan_prune choke
on (i.e. exhibit the same infinite loop unpleasantness we've seen
before on) is not in a consistent state at all (its tuples consist of
tuples from a single HOT chain, and the HOT chain is totally
inconsistent on account of having an LP_DEAD line pointer root item).
pg_surgery could in principle do the right thing here by always
treating HOT chains as whole units.
Leaving behind disconnected/orphaned heap-only tuples is pretty much
pointless anyway, since they'll never be accessible by index scans.
Even after a REINDEX, since there is no root item from the heap page
to go in the index. (A dump and restore might work better, though.)
[1]: /messages/by-id/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com
--
Peter Geoghegan
Hi,
On 2022-02-19 17:22:33 -0800, Peter Geoghegan wrote:
Looks like pg_surgery isn't processing HOT chains as whole units,
which it really should (at least in the context of killing items via
the heap_force_kill() function). Killing a root item in a HOT chain is
just hazardous -- disconnected/orphaned heap-only tuples are liable to
cause chaos, and should be avoided everywhere (including during
pruning, and within pg_surgery).
How does that cause the endless loop?
It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
me. So something must have changed with your patch?
It's likely that the hardening I already planned on adding to pruning
[1] (as follow-up work to recent bugfix commit 18b87b201f) will
prevent lazy_scan_prune from getting stuck like this, whatever the
cause happens to be.
Yea, we should pick that up again. Not just for robustness or
performance. Also because it's just a lot easier to understand.
Leaving behind disconnected/orphaned heap-only tuples is pretty much
pointless anyway, since they'll never be accessible by index scans.
Even after a REINDEX, since there is no root item from the heap page
to go in the index. (A dump and restore might work better, though.)
Given that heap_surgery's raison d'etre is correcting corruption etc, I think
it makes sense for it to do as minimal work as possible. Iterating through a
HOT chain would be a problem if you e.g. tried to repair a page with HOT
corruption.
Greetings,
Andres Freund
On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
How does that cause the endless loop?
Attached is the page image itself, dumped via gdb (and gzip'd). This
was on recent HEAD (commit 8f388f6f, actually), plus
0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No
defragmenting in pg_surgery, nothing like that.
It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
me. So something must have changed with your patch?
It doesn't always happen -- only about half the time on my machine.
Maybe it's timing sensitive?
We hit the "goto retry" on offnum 2, which is the first tuple with
storage (you can see "the ghost" of the tuple from the LP_DEAD item at
offnum 1, since the page isn't defragmented in pg_surgery). I think
that this happens because the heap-only tuple at offnum 2 is fully
DEAD to lazy_scan_prune, but hasn't been recognized as such by
heap_page_prune. There is no way that they'll ever "agree" on the
tuple being DEAD right now, because pruning still doesn't assume that
an orphaned heap-only tuple is fully DEAD.
We can either do that, or we can throw an error concerning corruption
when heap_page_prune notices orphaned tuples. Neither seems
particularly appealing. But it definitely makes no sense to allow
lazy_scan_prune to spin in a futile attempt to reach agreement with
heap_page_prune about a DEAD tuple really being DEAD.
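To spell the failure mode out with a toy model (purely illustrative, not
the real lazy_scan_prune): the retry structure only terminates once
pruning removes every tuple that the re-check still classifies as DEAD,
so a permanent disagreement spins forever.
#include <stdbool.h>
/*
 * Toy retry loop: prune, then re-check; if the re-check still sees a DEAD
 * item, retry. With an orphaned heap-only tuple that pruning can never
 * reach, the loop has no exit.
 */
static void
toy_retry_loop(bool (*prune_page) (void *page),
			   bool (*recheck_sees_dead) (void *page),
			   void *page)
{
	for (;;)
	{
		(void) prune_page(page);
		if (!recheck_sees_dead(page))
			break;				/* normal case: the two passes agree */
		/* the equivalent of lazy_scan_prune's "goto retry" */
	}
}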
Given that heap_surgery's raison d'etre is correcting corruption etc, I think
it makes sense for it to do as minimal work as possible. Iterating through a
HOT chain would be a problem if you e.g. tried to repair a page with HOT
corruption.
I guess that's also true. There is at least a legitimate argument to
be made for not leaving behind any orphaned heap-only tuples. The
interface is a TID, and so the user may already believe that they're
killing the heap-only, not just the root item (since ctid suggests
that the TID of a heap-only tuple is the TID of the root item, which
is kind of misleading).
Anyway, we can decide on what to do in heap_surgery later, once the
main issue is under control. My point was mostly just that orphaned
heap-only tuples are definitely not okay, in general. They are the
least worst option when corruption has already happened, maybe -- but
maybe not.
--
Peter Geoghegan
Attachments:
corrupt-hot-chain.page.gz
Hi,
On 2022-02-19 18:16:54 -0800, Peter Geoghegan wrote:
On Sat, Feb 19, 2022 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
How does that cause the endless loop?
Attached is the page image itself, dumped via gdb (and gzip'd). This
was on recent HEAD (commit 8f388f6f, actually), plus
0001-Add-adversarial-ConditionalLockBuff[...]. No other changes. No
defragmenting in pg_surgery, nothing like that.
It doesn't do so on HEAD + 0001-Add-adversarial-ConditionalLockBuff[...] for
me. So something must have changed with your patch?
It doesn't always happen -- only about half the time on my machine.
Maybe it's timing sensitive?
Ah, I'd only run the tests three times or so, without it happening. Trying a
few more times repro'd it.
It's kind of surprising that this needs this
0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question
of hint bits changing due to lazy_scan_noprune(), which then makes
HeapTupleHeaderIsHotUpdated() have a different return value, preventing the
"If the tuple is DEAD and doesn't chain to anything else"
path from being taken.
We hit the "goto retry" on offnum 2, which is the first tuple with
storage (you can see "the ghost" of the tuple from the LP_DEAD item at
offnum 1, since the page isn't defragmented in pg_surgery). I think
that this happens because the heap-only tuple at offnum 2 is fully
DEAD to lazy_scan_prune, but hasn't been recognized as such by
heap_page_prune. There is no way that they'll ever "agree" on the
tuple being DEAD right now, because pruning still doesn't assume that
an orphaned heap-only tuple is fully DEAD.
We can either do that, or we can throw an error concerning corruption
when heap_page_prune notices orphaned tuples. Neither seems
particularly appealing. But it definitely makes no sense to allow
lazy_scan_prune to spin in a futile attempt to reach agreement with
heap_page_prune about a DEAD tuple really being DEAD.
Yea, this sucks. I think we should go for the rewrite of the
heap_prune_chain() logic. The current approach is just never going to be
robust.
Greetings,
Andres Freund
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote:
We can either do that, or we can throw an error concerning corruption
when heap_page_prune notices orphaned tuples. Neither seems
particularly appealing. But it definitely makes no sense to allow
lazy_scan_prune to spin in a futile attempt to reach agreement with
heap_page_prune about a DEAD tuple really being DEAD.
Yea, this sucks. I think we should go for the rewrite of the
heap_prune_chain() logic. The current approach is just never going to be
robust.
No, it just isn't robust enough. But it's not that hard to fix. My
patch really wasn't invasive.
I confirmed that HeapTupleSatisfiesVacuum() and
heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum
2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no
reason to think that there is a new bug involved). The problem here is
indeed just that heap_prune_chain() can't "get to" the tuple, given
its current design.
For anybody else that doesn't follow what we're talking about:
The "doesn't chain to anything else" code at the start of
heap_prune_chain() won't get to the heap-only tuple at offnum 2, since
the tuple is itself HeapTupleHeaderIsHotUpdated() -- the expectation
is that it'll be processed later on, once we locate the HOT chain's
root item. Since, of course, the "root item" was already LP_DEAD
before we even reached heap_page_prune() (on account of the pg_surgery
corruption), there is no possible way that that can happen later on.
And so we cannot find the same heap-only tuple and mark it LP_UNUSED
(which is how we always deal with HEAPTUPLE_DEAD heap-only tuples)
during pruning.
--
Peter Geoghegan
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote:
It's kind of surprising that this needs this
0001-Add-adversarial-ConditionalLockBuff to break. I suspect it's a question
of hint bits changing due to lazy_scan_noprune(), which then makes
HeapTupleHeaderIsHotUpdated() have a different return value, preventing the
"If the tuple is DEAD and doesn't chain to anything else"
path from being taken.
That makes sense as an explanation. Goes to show just how fragile the
"DEAD and doesn't chain to anything else" logic at the top of
heap_prune_chain really is.
--
Peter Geoghegan
Hi,
On 2022-02-19 19:07:39 -0800, Peter Geoghegan wrote:
On Sat, Feb 19, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote:
We can either do that, or we can throw an error concerning corruption
when heap_page_prune notices orphaned tuples. Neither seems
particularly appealing. But it definitely makes no sense to allow
lazy_scan_prune to spin in a futile attempt to reach agreement with
heap_page_prune about a DEAD tuple really being DEAD.
Yea, this sucks. I think we should go for the rewrite of the
heap_prune_chain() logic. The current approach is just never going to be
robust.
No, it just isn't robust enough. But it's not that hard to fix. My
patch really wasn't invasive.
I think we're in agreement there. We might think at some point about
backpatching too, but I'd rather have it stew in HEAD for a bit first.
I confirmed that HeapTupleSatisfiesVacuum() and
heap_prune_satisfies_vacuum() agree that the heap-only tuple at offnum
2 is HEAPTUPLE_DEAD -- they are in agreement, as expected (so no
reason to think that there is a new bug involved). The problem here is
indeed just that heap_prune_chain() can't "get to" the tuple, given
its current design.
Right.
The reason that the "adversarial" patch makes a different is solely that it
changes the heap_surgery test to actually kill an item, which it doesn't
intend:
create temp table htab2(a int);
insert into htab2 values (100);
update htab2 set a = 200;
vacuum htab2;
-- redirected TIDs should be skipped
select heap_force_kill('htab2'::regclass, ARRAY['(0, 1)']::tid[]);
If the vacuum can get the cleanup lock due to the adversarial patch, the
heap_force_kill() doesn't do anything, because the first item is a
redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead
targets the root item. Triggering the endless loop.
Hm. I think this might be a mild regression in 14. In < 14 we'd just skip the
tuple in lazy_scan_heap(), but now we have an uninterruptible endless
loop.
We'd do completely bogus stuff later in < 14 though, I think we'd just leave
it in place despite being older than relfrozenxid, which obviously has its own
set of issues.
Greetings,
Andres Freund
On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
Given that heap_surgery's raison d'etre is correcting corruption etc, I think
it makes sense for it to do as minimal work as possible. Iterating through a
HOT chain would be a problem if you e.g. tried to repair a page with HOT
corruption.
I guess that's also true. There is at least a legitimate argument to
be made for not leaving behind any orphaned heap-only tuples. The
interface is a TID, and so the user may already believe that they're
killing the heap-only, not just the root item (since ctid suggests
that the TID of a heap-only tuple is the TID of the root item, which
is kind of misleading).
Actually, I would say that heap_surgery's raison d'etre is making
weird errors related to corruption of this or that TID go away, so
that the user can cut their losses. That's how it's advertised.
Let's assume that we don't want to make VACUUM/pruning just treat
orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
status -- let's say that we want to err in the direction of doing
nothing at all with the page. Now we have to have a weird error in
VACUUM instead (not great, but better than just spinning between
lazy_scan_prune and heap_prune_page). And we've just created natural
demand for heap_surgery to deal with the problem by deleting whole HOT
chains (not just root items).
If we allow VACUUM to treat orphaned heap-only tuples as DEAD right
away, then we might as well do the same thing in heap_surgery, since
there is little chance that the user will get to the heap-only tuples
before VACUUM does (not something to rely on, at any rate).
Either way, I think we probably end up needing to teach heap_surgery
to kill entire HOT chains as a group, given a TID.
--
Peter Geoghegan
On Sat, Feb 19, 2022 at 7:28 PM Andres Freund <andres@anarazel.de> wrote:
If the vacuum can get the cleanup lock due to the adversarial patch, the
heap_force_kill() doesn't do anything, because the first item is a
redirect. However if it *can't* get a cleanup lock, heap_force_kill() instead
targets the root item. Triggering the endless loop.
But it shouldn't matter if the root item is an LP_REDIRECT or a normal
(not heap-only) tuple with storage. Either way it's the root of a HOT
chain.
The fact that pg_surgery treats LP_REDIRECT items differently from the
other kind of root items is just arbitrary. It seems to have more to
do with freezing tuples than killing tuples.
--
Peter Geoghegan
Hi,
On 2022-02-19 19:31:21 -0800, Peter Geoghegan wrote:
On Sat, Feb 19, 2022 at 6:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
Given that heap_surgery's raison d'etre is correcting corruption etc, I think
it makes sense for it to do as minimal work as possible. Iterating through a
HOT chain would be a problem if you e.g. tried to repair a page with HOT
corruption.
I guess that's also true. There is at least a legitimate argument to
be made for not leaving behind any orphaned heap-only tuples. The
interface is a TID, and so the user may already believe that they're
killing the heap-only, not just the root item (since ctid suggests
that the TID of a heap-only tuple is the TID of the root item, which
is kind of misleading).
Actually, I would say that heap_surgery's raison d'etre is making
weird errors related to corruption of this or that TID go away, so
that the user can cut their losses. That's how it's advertised.
I'm not that sure those are that different... Imagine some corruption leading
to two hot chains ending in the same tid, which our fancy new secure pruning
algorithm might detect.
Either way, I'm a bit surprised about the logic to not allow killing redirect
items? What if you have a redirect pointing to an unused item?
Let's assume that we don't want to make VACUUM/pruning just treat
orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
status
I don't think that'd ever be a good idea. Those tuples are visible to a
seqscan after all.
-- let's say that we want to err in the direction of doing
nothing at all with the page. Now we have to have a weird error in
VACUUM instead (not great, but better than just spinning between
lazy_scan_prune and heap_prune_page).
Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
problem here is a DEAD orphaned HOT tuples, and those we should be able to
delete with the new page pruning logic, right?
I think it might be worth getting rid of the need for the retry approach by
reusing the same HTSV status array between heap_prune_page and
lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
lazy_scan_prune() would be some form of corruption. And it'd be a pretty
decent performance boost, HTSV ain't cheap.
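A rough sketch of what sharing the array could look like (toy types and
names only; the real interface would be heap_page_prune and
lazy_scan_prune agreeing on a single per-offset HTSV array):
typedef enum ToyHTSV
{
	TOY_DEAD,
	TOY_RECENTLY_DEAD,
	TOY_LIVE
} ToyHTSV;
typedef struct ToyPruneState
{
	int			ntuples;
	ToyHTSV		htsv[256];		/* one entry per line pointer (toy bound) */
} ToyPruneState;
static int
toy_prune_then_scan(ToyPruneState *ps)
{
	int			ndead = 0;
	/* pass 1 (pruning): decide what to remove from the shared array */
	for (int i = 0; i < ps->ntuples; i++)
	{
		if (ps->htsv[i] == TOY_DEAD)
			ndead++;			/* would be removed here */
	}
	/*
	 * pass 2 (scan): reads the very same ps->htsv[] entries instead of
	 * re-running the visibility checks, so it can never see a DEAD tuple
	 * that pass 1 did not.
	 */
	return ndead;
}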
Greetings,
Andres Freund
On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
I'm not that sure those are that different... Imagine some corruption leading
to two hot chains ending in the same tid, which our fancy new secure pruning
algorithm might detect.
I suppose that's possible, but it doesn't seem all that likely to ever
happen, what with the xmin -> xmax cross-tuple matching stuff.
Either way, I'm a bit surprised about the logic to not allow killing redirect
items? What if you have a redirect pointing to an unused item?
Again, I simply think it boils down to having to treat HOT chains as a
whole unit when killing TIDs.
Let's assume that we don't want to make VACUUM/pruning just treat
orphaned heap-only tuples as DEAD, regardless of their true HTSV-wise
status
I don't think that'd ever be a good idea. Those tuples are visible to a
seqscan after all.
I agree (I don't hate it completely, but it seems mostly bad). This is
what leads me to the conclusion that pg_surgery has to be able to do
this instead. Surely it's not okay to have something that makes VACUUM
always end in error, that cannot even be fixed by pg_surgery.
-- let's say that we want to err in the direction of doing
nothing at all with the page. Now we have to have a weird error in
VACUUM instead (not great, but better than just spinning between
lazy_scan_prune and heap_prune_page).
Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
problem here is a DEAD orphaned HOT tuples, and those we should be able to
delete with the new page pruning logic, right?
Right. But what good does that really do? The problematic page had a
third tuple (at offnum 3) that was LIVE. If we could have done
something about the problematic tuple at offnum 2 (which is where we
got stuck), then we'd still be left with a very unpleasant choice
about what happens to the third tuple.
I think it might be worth getting rid of the need for the retry approach by
reusing the same HTSV status array between heap_prune_page and
lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
lazy_scan_prune() would be some form of corruption. And it'd be a pretty
decent performance boost, HTSV ain't cheap.
I guess it doesn't actually matter if we leave an aborted DEAD tuple
behind, that we could have pruned away, but didn't. The important
thing is to be consistent at the level of the page.
--
Peter Geoghegan
Hi,
On February 19, 2022 7:56:53 PM PST, Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Feb 19, 2022 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
Non DEAD orphaned versions shouldn't cause a problem in lazy_scan_prune(). The
problem here is a DEAD orphaned HOT tuples, and those we should be able to
delete with the new page pruning logic, right?
Right. But what good does that really do? The problematic page had a
third tuple (at offnum 3) that was LIVE. If we could have done
something about the problematic tuple at offnum 2 (which is where we
got stuck), then we'd still be left with a very unpleasant choice
about what happens to the third tuple.
Why does anything need to happen to it from vacuum's POV? It'll not be a problem for freezing etc. Until it's deleted vacuum doesn't need to care.
Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue.
I think it might be worth getting rid of the need for the retry approach by
reusing the same HTSV status array between heap_prune_page and
lazy_scan_prune. Then the only legitimate reason for seeing a DEAD item in
lazy_scan_prune() would be some form of corruption. And it'd be a pretty
decent performance boost, HTSV ain't cheap.
I guess it doesn't actually matter if we leave an aborted DEAD tuple
behind, that we could have pruned away, but didn't. The important
thing is to be consistent at the level of the page.
That's not ok, because it opens up dangers of being interpreted differently after wraparound etc.
But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status array?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Sat, Feb 19, 2022 at 8:21 PM Andres Freund <andres@anarazel.de> wrote:
Why does anything need to happen to it from vacuum's POV? It'll not be a problem for freezing etc. Until it's deleted vacuum doesn't need to care.
Probably worth a WARNING, and amcheck definitely needs to detect it, but otherwise I think it's fine to just continue.
Maybe that's true, but it's just really weird to imagine not having an
LP_REDIRECT that points to the LIVE item here, without throwing an
error. Seems kind of iffy, to say the least.
I guess it doesn't actually matter if we leave an aborted DEAD tuple
behind, that we could have pruned away, but didn't. The important
thing is to be consistent at the level of the page.
That's not ok, because it opens up dangers of being interpreted differently after wraparound etc.
But I don't see any cases where it would happen with the new pruning logic in your patch and sharing the HTSV status array?
Right. Fundamentally, there isn't any reason why it should matter that
VACUUM reached the heap page just before (rather than concurrent with
or just after) some xact that inserted or updated on the page aborts.
Just as long as we have a consistent idea about what's going on at the
level of the whole page (or maybe the level of each HOT chain, but the
whole page level seems simpler to me).
--
Peter Geoghegan
On Sat, Feb 19, 2022 at 8:54 PM Andres Freund <andres@anarazel.de> wrote:
Leaving behind disconnected/orphaned heap-only tuples is pretty much
pointless anyway, since they'll never be accessible by index scans.
Even after a REINDEX, since there is no root item from the heap page
to go in the index. (A dump and restore might work better, though.)
Given that heap_surgery's raison d'etre is correcting corruption etc, I think
it makes sense for it to do as minimal work as possible. Iterating through a
HOT chain would be a problem if you e.g. tried to repair a page with HOT
corruption.
Yeah, I agree. I don't have time to respond to all of these emails
thoroughly right now, but I think it's really important that
pg_surgery do the exact surgery the user requested, and not any other
work. I don't think that page defragmentation should EVER be REQUIRED
as a condition of other work. If other code is relying on that, I'd
say it's busted. I'm a little more uncertain about the case where we
kill the root tuple of a HOT chain, because I can see that this might
leave the page in a state where sequential scans see the tuple at the end
of the chain and index scans don't. I'm not sure whether that should
be the responsibility of pg_surgery itself to avoid, or whether that's
your problem as a user of it -- although I lean mildly toward the
latter view, at the moment. But in any case surely the pruning code
can't just decide to go into an infinite loop if that happens. Code
that manipulates the states of data pages needs to be as robust
against arbitrary on-disk states as we can reasonably make it, because
pages get garbled on disk all the time.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 18, 2022 at 7:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
We have to worry about XIDs from MultiXacts (and xmax values more
generally). And we have to worry about the case where we start out
with only xmin frozen (by an earlier VACUUM), and then have to freeze
xmax too. I believe that we have to generally consider xmin and xmax
independently. For example, we cannot ignore xmax, just because we
looked at xmin, since in general xmin alone might have already been
frozen.
Right, so we at least need to add a similar comment to what I proposed
for MXIDs, and maybe other changes are needed, too.
The difference between the cleanup lock path (in
lazy_scan_prune/heap_prepare_freeze_tuple) and the share lock path (in
lazy_scan_noprune/heap_tuple_needs_freeze) is what is at issue in both
of these confusing comment blocks, really. Note that cutoff_xid is the
name that both heap_prepare_freeze_tuple and heap_tuple_needs_freeze
have for FreezeLimit (maybe we should rename every occurrence of
cutoff_xid in heapam.c to FreezeLimit).
At a high level, we aren't changing the fundamental definition of an
aggressive VACUUM in any of the patches -- we still need to advance
relfrozenxid up to FreezeLimit in an aggressive VACUUM, just like on
HEAD, today (we may be able to advance it *past* FreezeLimit, but
that's just a bonus). But in a non-aggressive VACUUM, where there is
still no strict requirement to advance relfrozenxid (by any amount),
the code added by 0001 can set relfrozenxid to any known safe value,
which could either be from before FreezeLimit, or after FreezeLimit --
almost anything is possible (provided we respect the relfrozenxid
invariant, and provided we see that we didn't skip any
all-visible-not-all-frozen pages).
Since we still need to "respect FreezeLimit" in an aggressive VACUUM,
the aggressive case might need to wait for a full cleanup lock the
hard way, having tried and failed to do it the easy way within
lazy_scan_noprune (lazy_scan_noprune will still return false when any
call to heap_tuple_needs_freeze for any tuple returns true) -- same
as on HEAD, today.
And so the difference at issue here is: FreezeLimit/cutoff_xid only
needs to affect the new NewRelfrozenxid value we use for relfrozenxid in
heap_prepare_freeze_tuple, which is involved in real freezing -- not
in heap_tuple_needs_freeze, whose main purpose is still to help us
avoid freezing where a cleanup lock isn't immediately available. While
the purpose of FreezeLimit/cutoff_xid within heap_tuple_needs_freeze
is to determine its bool return value, which will only be of interest
to the aggressive case (which might have to get a cleanup lock and do
it the hard way), not the non-aggressive case (where ratcheting back
NewRelfrozenxid is generally possible, and generally leaves us with
almost as good of a value).
In other words: the calls to heap_tuple_needs_freeze made from
lazy_scan_noprune are simply concerned with the page as it actually
is, whereas the similar/corresponding calls to
heap_prepare_freeze_tuple from lazy_scan_prune are concerned with
*what the page will actually become*, after freezing finishes, and
after lazy_scan_prune is done with the page entirely (ultimately
the final NewRelfrozenxid value set in pg_class.relfrozenxid only has
to be <= the oldest extant XID *at the time the VACUUM operation is
just about to end*, not some earlier time, so "being versus becoming"
is an interesting distinction for us).
Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
here, to make all of this less confusing. I only now fully realized
how confusing all of this stuff is -- very.
Right. I think I understand all of this, or at least most of it -- but
not from the comment. The question is how the comment can be more
clear. My general suggestion is that function header comments should
have more to do with the behavior of the function than how it fits
into the bigger picture. If it's clear to the reader what conditions
must hold before calling the function and which must hold on return,
it helps a lot. IMHO, it's the job of the comments in the calling
function to clarify why we then choose to call that function at the
place and in the way that we do.
As a general rule, we try to freeze all of the remaining live tuples
on a page (following pruning) together, as a group, or none at all.
Most of the time this is triggered by our noticing that the page is
about to be set all-visible (but not all-frozen), and doing work
sufficient to mark it fully all-frozen instead. Occasionally there is
FreezeLimit to consider, which is now more of a backstop thing, used
to make sure that we never get too far behind in terms of unfrozen
XIDs. This is useful in part because it avoids leaving any future
non-aggressive VACUUM fundamentally unable to advance relfrozenxid
(skipping can't hurt relfrozenxid advancement when the VM only ever
contains all-frozen pages in practice).
We're generally doing a lot more freezing with 0002, but we still
manage to avoid freezing too much in tables like pgbench_tellers or
pgbench_branches -- tables where it makes the least sense. Such tables
will be updated so frequently that VACUUM is relatively unlikely to
ever mark any page all-visible, implicitly avoiding the main criterion
for freezing. It's also unlikely that they'll ever have an XID old
enough to trigger the fallback FreezeLimit-style criterion for freezing.
In practice, freezing tuples like this is generally not that expensive in
most tables where VACUUM freezes the majority of pages immediately
(tables that aren't like pgbench_tellers or pgbench_branches), because
they're generally big tables, where the overhead of FPIs tends
to dominate anyway (gambling that we can avoid more FPIs later on is not a
bad gamble, as gambles go). This seems to make the overhead
acceptable, on balance. Granted, you might be able to poke holes in
that argument, and reasonable people might disagree on what acceptable
should mean. There are many value judgements here, which makes it
complicated. (On the other hand, we might be able to do better if a
particularly bad case for the 0002 work came to light.)
I think that the idea has potential, but I don't think that I
understand yet what the *exact* algorithm is. Maybe I need to read the
code, when I have some time for that. I can't form an intelligent
opinion at this stage about whether this is likely to be a net
positive.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Sun, Feb 20, 2022 at 7:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
Right, so we at least need to add a similar comment to what I proposed
for MXIDs, and maybe other changes are needed, too.
Agreed.
Maybe the way that FreezeLimit/cutoff_xid is overloaded can be fixed
here, to make all of this less confusing. I only now fully realized
how confusing all of this stuff is -- very.
Right. I think I understand all of this, or at least most of it -- but
not from the comment. The question is how the comment can be more
clear. My general suggestion is that function header comments should
have more to do with the behavior of the function than how it fits
into the bigger picture. If it's clear to the reader what conditions
must hold before calling the function and which must hold on return,
it helps a lot. IMHO, it's the job of the comments in the calling
function to clarify why we then choose to call that function at the
place and in the way that we do.
You've given me a lot of high quality feedback on all of this, which
I'll work through soon. It's hard to get the balance right here, but
it's made much easier by this kind of feedback.
I think that the idea has potential, but I don't think that I
understand yet what the *exact* algorithm is.
The algorithm seems to exploit a natural tendency that Andres once
described in a blog post about his snapshot scalability work [1]. To a
surprising extent, we can usefully bucket all tuples/pages into two
simple categories:
1. Very, very old ("infinitely old" for all practical purposes).
2. Very, very new.
There doesn't seem to be much need for a third "in-between" category
in practice. This seems to be at least approximately true all of the
time.
Perhaps Andres wouldn't agree with this very general statement -- he
actually said something more specific. I for one believe that the
point he made generalizes surprisingly well, though. I have my own
theories about why this appears to be true. (Executive summary: power
laws are weird, and it seems as if the sparsity-of-effects principle
makes it easy to bucket things at the highest level, in a way that
generalizes well across disparate workloads.)
Maybe I need to read the
code, when I have some time for that. I can't form an intelligent
opinion at this stage about whether this is likely to be a net
positive.
The code in the v8-0002 patch is a bit sloppy right now. I didn't
quite get around to cleaning it up -- I was focussed on performance
validation of the algorithm itself. So bear that in mind if you do
look at v8-0002 (might want to wait for v9-0002 before looking).
I believe that the only essential thing about the algorithm itself is
that it freezes all the tuples on a page when it anticipates setting
the page all-visible, or (barring edge cases) freezes none at all.
(Note that setting the page all-visible/all-frozen may happen just
after lazy_scan_prune returns, or in the second pass over the heap,
after LP_DEAD items are set to LP_UNUSED -- lazy_scan_prune doesn't
care which way it will happen.)
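If it helps, the essential trigger can be boiled down to something
like this standalone sketch (toy types and toy names, not the v9-0002
code; will_be_all_visible stands in for state that lazy_scan_prune
already has at this point):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;    /* toy stand-in for TransactionId; wraparound ignored */

/*
 * Freeze the whole page when it's about to become all-visible anyway (the
 * common trigger), or when some remaining XID is old enough that the
 * FreezeLimit backstop forces our hand (the rare trigger).  Otherwise
 * freeze nothing on the page at all.
 */
bool
toy_should_freeze_page(bool will_be_all_visible,
                       ToyXid oldest_remaining_xid,
                       ToyXid freeze_limit)
{
    if (will_be_all_visible)
        return true;
    if (oldest_remaining_xid < freeze_limit)
        return true;
    return false;
}

Once the answer is "yes", everything freezing-eligible on the page
gets frozen together.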
There are one or two other design choices that we need to make, like
what exact tuples we freeze in the edge case where FreezeLimit/XID age
forces us to freeze in lazy_scan_prune. These other design choices
don't seem relevant to the issue of central importance, which is
whether or not we come out ahead overall with this new algorithm.
FreezeLimit will seldom affect our choice to freeze or not freeze now,
and so AFAICT the exact way that FreezeLimit affects which precise
freezing-eligible tuples we freeze doesn't complicate performance
validation.
Remember when I got excited about how my big TPC-C benchmark run
showed a predictable, tick/tock style pattern across VACUUM operations
against the orders and order lines tables [2]? It seemed very
significant to me that the OldestXmin of VACUUM operation n
consistently went on to become the new relfrozenxid for the same table
in VACUUM operation n + 1. It wasn't exactly the same XID, but very
close to it (within the range of noise). This pattern was clearly
present, even though VACUUM operation n + 1 might happen as long as 4
or 5 hours after VACUUM operation n (this was a big table).
This pattern was encouraging to me because it showed (at least for the
workload and tables in question) that the amount of unnecessary extra
freezing can't have been too bad -- the fact that we can always
advance relfrozenxid in the same way is evidence of that. Note that
the vacuum_freeze_min_age setting can't have affected our choice of
what to freeze (given what we see in the logs), and yet there is a
clear pattern where the pages (it's really pages, not tuples) that the
new algorithm doesn't freeze in VACUUM operation n will reliably get
frozen in VACUUM operation n + 1 instead.
And so this pattern seems to lend support to the general idea of
letting the workload itself be the primary driver of what pages we
freeze (not FreezeLimit, and not anything based on XIDs). That's
really the underlying principle behind the new algorithm -- freezing
is driven by workload characteristics (or page/block characteristics,
if you prefer). ISTM that vacuum_freeze_min_age is almost impossible
to tune -- XID age is just too squishy a concept for that to ever
work.
[1]: https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon
[2]: /messages/by-id/CAH2-Wz=iLnf+0CsaB37efXCGMRJO1DyJw5HMzm7tp1AxG1NR2g@mail.gmail.com -- scroll down to "TPC-C", which has the relevant autovacuum log output for the orders table, covering a 24 hour period
--
Peter Geoghegan
On Sun, Feb 20, 2022 at 12:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
You've given me a lot of high quality feedback on all of this, which
I'll work through soon. It's hard to get the balance right here, but
it's made much easier by this kind of feedback.
Attached is v9. Lots of changes. Highlights:
* Much improved 0001 ("loosen coupling" dynamic relfrozenxid tracking
patch). Some of the improvements are due to recent feedback from
Robert.
* Much improved 0002 ("Make page-level characteristics drive freezing"
patch). Whole new approach to the implementation, though the same
algorithm as before.
* No more FSM patch -- that was totally separate work that I
shouldn't have attached to this project.
* There are 2 new patches (these are now 0003 and 0004), both of which
are concerned with allowing non-aggressive VACUUM to consistently
advance relfrozenxid. I think that 0003 makes sense on general
principle, but I'm much less sure about 0004. These aren't too
important.
While working on the new approach to freezing taken by v9-0002, I had
some insight about the issues that Robert raised around 0001, too. I
wasn't expecting that to happen.
0002 makes page-level freezing a first class thing.
heap_prepare_freeze_tuple now has some (limited) knowledge of how this
works. heap_prepare_freeze_tuple's cutoff_xid argument is now always
the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We
still have to pass FreezeLimit to heap_prepare_freeze_tuple, which
helps us to respect FreezeLimit as a backstop, and so now it's passed
via the new backstop_cutoff_xid argument instead. Whenever we opt to
"freeze a page", the new page-level algorithm *always* uses the most
recent possible XID and MXID values (OldestXmin and oldestMxact) to
decide what XIDs/XMIDs need to be replaced. That might sound like it'd
be too much, but it only applies to those pages that we actually
decide to freeze (since page-level characteristics drive everything
now). FreezeLimit is only one way of triggering that now (and one of
the least interesting and rarest).
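Put another way, the division of labor between the two cutoffs amounts
to something like this (again a toy sketch with stand-in types, not
the real heap_prepare_freeze_tuple):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;    /* toy stand-in for TransactionId; wraparound ignored */

/*
 * The freeze-eligibility cutoff (OldestXmin here) decides what would get
 * replaced *if* the page ends up being frozen; the backstop cutoff
 * (FreezeLimit here, always <= OldestXmin) only decides whether freezing
 * the page has become mandatory on account of this particular XID.
 */
void
toy_examine_xid(ToyXid xid, ToyXid oldest_xmin, ToyXid freeze_limit,
                bool *freeze_if_page_frozen, bool *force_page_freeze)
{
    *freeze_if_page_frozen = (xid < oldest_xmin);
    if (xid < freeze_limit)
        *force_page_freeze = true;
}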
0002 also adds an alternative set of relfrozenxid/relminmxid tracker
variables, to make the "don't freeze the page" path within
lazy_scan_prune simpler (if you don't want to freeze the page, then
use the set of tracker variables that go with that choice, which
heap_prepare_freeze_tuple knows about and helps with). With page-level
freezing, lazy_scan_prune wants to make a decision about the page as a
whole, at the last minute, after all heap_prepare_freeze_tuple calls
have already been made. So I think that heap_prepare_freeze_tuple
needs to know about that aspect of lazy_scan_prune's behavior.
When we *don't* want to freeze the page, we more or less need
everything related to freezing inside lazy_scan_prune to behave like
lazy_scan_noprune, which never freezes the page (that's mostly the
point of lazy_scan_noprune). And that's almost what we actually do --
heap_prepare_freeze_tuple now outsources maintenance of this
alternative set of "don't freeze the page" relfrozenxid/relminmxid
tracker variables to its sibling function, heap_tuple_needs_freeze.
That is the same function that lazy_scan_noprune itself actually
calls.
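A stripped-down way to picture the two tracker sets (toy types and toy
names once more, not the actual lazy_scan_prune variables):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;    /* toy stand-in for TransactionId; wraparound ignored */

typedef struct ToyTrackers
{
    ToyXid  if_frozen;      /* candidate relfrozenxid if we freeze the page */
    ToyXid  if_left_alone;  /* candidate relfrozenxid if we don't */
} ToyTrackers;

/* Called for each remaining XID on the page, before the page-level decision */
void
toy_track_xid(ToyXid xid, bool would_be_frozen, ToyTrackers *t)
{
    /* The "leave the page alone" tracker must account for every XID as-is */
    if (xid < t->if_left_alone)
        t->if_left_alone = xid;

    /* The "freeze the page" tracker can ignore XIDs that freezing removes */
    if (!would_be_frozen && xid < t->if_frozen)
        t->if_frozen = xid;
}

/* After the page-level decision is made, commit to exactly one of the two */
ToyXid
toy_final_tracker(bool freeze_page, const ToyTrackers *t)
{
    return freeze_page ? t->if_frozen : t->if_left_alone;
}

The lazy_scan_noprune case only ever needs the second tracker, of
course.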
Now back to Robert's feedback on 0001, which had very complicated
comments in the last version. This approach seems to make the "being
versus becoming" or "going to freeze versus not going to freeze"
distinctions much clearer. This is less true if you assume that 0002
won't be committed but 0001 will be. Even if that happens with
Postgres 15, I have to imagine that adding something like 0002 must be
the real goal, long term. Without 0002, the value from 0001 is far
more limited. You need both together to get the virtuous cycle I've
described.
The approach with always using OldestXmin as cutoff_xid and
oldestMxact as our cutoff_multi makes a lot of sense to me, in part
because I think that it might well cut down on the tendency of VACUUM
to allocate new MultiXacts in order to be able to freeze old ones.
AFAICT the only reason that heap_prepare_freeze_tuple does that is
because it has no flexibility on FreezeLimit and MultiXactCutoff.
These are derived from vacuum_freeze_min_age and
vacuum_multixact_freeze_min_age, respectively, and so they're two
independent though fairly meaningless cutoffs. On the other hand,
OldestXmin and OldestMxact are not independent in the same way. We get
both of them at the same time and the same place, in
vacuum_set_xid_limits. OldestMxact really is very close to OldestXmin
-- only the units differ.
It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
freezing old ones) in large part so it can NOT freeze XIDs that it
would have been useful (and much cheaper) to remove anyway. On HEAD,
FreezeMultiXactId() doesn't get passed down the VACUUM operation's
OldestXmin at all (it actually just gets FreezeLimit passed as its
cutoff_xid argument). It cannot possibly recognize any of this for
itself.
Does that theory about MultiXacts sound plausible? I'm not claiming
that the patch makes it impossible that FreezeMultiXactId() will have
to allocate a new MultiXact to freeze during VACUUM -- the
freeze-the-dead isolation tests already show that that's not true. I
just think that page-level freezing based on page characteristics with
oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff)
cutoffs might make it a lot less likely in practice. oldestXmin and
oldestMxact map to the same wall clock time, more or less -- that
seems like it might be an important distinction, independent of
everything else.
Thanks
--
Peter Geoghegan
Attachments:
v9-0002-Make-page-level-characteristics-drive-freezing.patch (application/x-patch)
From d10f42a1c091b4dc52670fca80a63fee4e73e20c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v9 2/4] Make page-level characteristics drive freezing.
Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen. VACUUM typically won't freeze _any_ tuples on the page
unless _all_ tuples (that remain after pruning) are all-visible. This
makes the overhead of vacuuming much more predictable over time. We
avoid the need for large balloon payments during aggressive VACUUMs
(typically anti-wraparound autovacuums). Freezing is proactive, so
we're much less likely to get into "freezing debt".
The new approach to freezing also enables relfrozenxid advancement in
non-aggressive VACUUMs, which might be enough to avoid aggressive
VACUUMs altogether (with many individual tables/workloads). While the
non-aggressive case continues to skip all-visible (but not all-frozen)
pages (thereby making relfrozenxid advancement impossible), that in
itself will no longer hinder relfrozenxid advancement (outside of
pg_upgrade scenarios). We now consistently avoid leaving behind
all-visible (not all-frozen) pages. This (as well as work from commit
44fa84881f) makes relfrozenxid advancement in non-aggressive VACUUMs
commonplace.
There is also a clear disadvantage to the new approach to freezing: more
eager freezing will impose overhead on cases that don't receive any
benefit. This is considered an acceptable trade-off. The new algorithm
tends to avoid freezing early on pages where it makes the least sense,
since frequently modified pages are unlikely to be all-visible.
The system accumulates freezing debt in proportion to the number of
physical heap pages with unfrozen tuples, more or less. Anything based
on XID age is likely to be a poor proxy for the eventual cost of
freezing (during the inevitable anti-wraparound autovacuum). At a high
level, freezing is now treated as one of the costs of storing tuples in
physical heap pages -- not a cost of transactions that allocate XIDs.
Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
influence what we freeze, and when, they effectively become backstops.
It may still be necessary to "freeze a page" due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff, though
that will be rare in practice -- FreezeLimit is just a backstop now. It
can only _trigger_ page-level freezing now. All XIDs < OldestXmin and
all MXIDs < OldestMxact will now be frozen on any page that VACUUM
decides to freeze, regardless of the details behind its decision.
The autovacuum logging instrumentation (and VACUUM VERBOSE) now display
the number of pages that were "newly frozen". This new metric will give
users a general sense of how much freezing VACUUM performed. It tends
to be fairly predictable (as a percentage of rel_pages) for a given
table and workload.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam_xlog.h | 7 ++-
src/backend/access/heap/heapam.c | 89 ++++++++++++++++++++++++----
src/backend/access/heap/vacuumlazy.c | 88 ++++++++++++++++++++-------
src/backend/commands/vacuum.c | 8 +++
4 files changed, 158 insertions(+), 34 deletions(-)
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 2d8a7f627..a58226e54 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -409,10 +409,15 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relminmxid,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
+ TransactionId backstop_cutoff_xid,
+ MultiXactId backstop_cutoff_multi,
xl_heap_freeze_tuple *frz,
bool *totally_frozen,
+ bool *force_freeze,
TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out);
+ MultiXactId *relminmxid_out,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 134bc408a..05253e8dd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6439,14 +6439,38 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* are older than the specified cutoff XID and cutoff MultiXactId. If so,
* setup enough state (in the *frz output argument) to later execute and
* WAL-log what we would need to do, and return true. Return false if nothing
- * is to be changed. In addition, set *totally_frozen_p to true if the tuple
+ * can be changed. In addition, set *totally_frozen_p to true if the tuple
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Although this interface is primarily tuple-based, vacuumlazy.c caller
+ * cooperates with us to decide on whether or not to freeze whole pages,
+ * together as a single group. We prepare for freezing at the level of each
+ * tuple, but the final decision is made for the page as a whole. All pages
+ * that are frozen within a given VACUUM operation are frozen according to
+ * cutoff_xid and cutoff_multi. Caller _must_ freeze the whole page when
+ * we've set *force_freeze to true!
+ *
+ * cutoff_xid must be caller's oldest xmin to ensure that any XID older than
+ * it could neither be running nor seen as running by any open transaction.
+ * This ensures that the replacement will not change anyone's idea of the
+ * tuple state. Similarly, cutoff_multi must be the smallest MultiXactId used
+ * by any open transaction (at the time that the oldest xmin was acquired).
+ *
+ * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must
+ * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs
+ * is encountered, we set *force_freeze to true, making caller freeze the page
+ * (freezing-eligible XIDs/XMIDs will be frozen, at least). "Backstop
+ * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old.
+ * This shouldn't be necessary very often. VACUUM should prefer to freeze
+ * when it's cheap (not when it's urgent).
+ *
* Maintains *relfrozenxid_out and *relminmxid_out, which are the current
- * target relfrozenxid and relminmxid for the relation. Caller should make
- * temp copies of global tracking variables before starting to process a page,
- * so that we can only scribble on copies.
+ * target relfrozenxid and relminmxid for the relation. There are also "no
+ * freeze" variants (*relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out)
+ * that are used by caller when it decides to not freeze the page. Caller
+ * should make temp copies of global tracking variables before starting to
+ * process a page, so that we can only scribble on copies.
*
* Caller is responsible for setting the offset field, if appropriate.
*
@@ -6454,13 +6478,6 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
*
- * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction. This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be less than or equal to the smallest
- * MultiXactId used by any transaction currently open.
- *
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
@@ -6472,12 +6489,18 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
+ TransactionId backstop_cutoff_xid,
+ MultiXactId backstop_cutoff_multi,
xl_heap_freeze_tuple *frz,
bool *totally_frozen_p,
+ bool *force_freeze,
TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out)
+ MultiXactId *relminmxid_out,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out)
{
bool changed = false;
+ bool xmin_already_frozen = false;
bool xmax_already_frozen = false;
bool xmin_frozen;
bool freeze_xmax;
@@ -6498,7 +6521,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
*/
xid = HeapTupleHeaderGetXmin(tuple);
if (!TransactionIdIsNormal(xid))
+ {
+ xmin_already_frozen = true;
xmin_frozen = true;
+ }
else
{
if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6564,6 +6590,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+ /*
+ * Have caller freeze the page, since setting this MultiXactId to
+ * a simple XID has some value. Long-lived MultiXacts should be
+ * avoided.
+ */
+ *force_freeze = true;
+
if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
{
/* New xmax is an XID older than new relfrozenxid_out */
@@ -6609,6 +6642,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
*/
if (TransactionIdPrecedes(temp, *relfrozenxid_out))
*relfrozenxid_out = temp;
+
+ /*
+ * We allocated a MultiXact for this, so force freezing to avoid
+ * wasting it
+ */
+ *force_freeze = true;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6713,11 +6752,28 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
+
+ /* Seems like a good idea to freeze early when this case is hit */
+ *force_freeze = true;
}
}
*totally_frozen_p = (xmin_frozen &&
(freeze_xmax || xmax_already_frozen));
+
+ /*
+ * Maintain alternative versions of relfrozenxid_out/relminmxid_out that
+ * leave caller with the option of *not* freezing the page. If caller has
+ * already lost that option (e.g. when the page has an old XID that
+ * requires backstop freezing), then we don't waste time on this.
+ */
+ if (!*force_freeze && (!xmin_already_frozen || !xmax_already_frozen))
+ *force_freeze = heap_tuple_needs_freeze(tuple,
+ backstop_cutoff_xid,
+ backstop_cutoff_multi,
+ relfrozenxid_nofreeze_out,
+ relminmxid_nofreeze_out);
+
return changed;
}
@@ -6769,15 +6825,22 @@ heap_freeze_tuple(HeapTupleHeader tuple,
{
xl_heap_freeze_tuple frz;
bool do_freeze;
+ bool force_freeze = true;
bool tuple_totally_frozen;
TransactionId relfrozenxid_out = cutoff_xid;
MultiXactId relminmxid_out = cutoff_multi;
+ TransactionId relfrozenxid_nofreeze_out = cutoff_xid;
+ MultiXactId relminmxid_nofreeze_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
+ cutoff_xid, cutoff_multi,
&frz, &tuple_totally_frozen,
- &relfrozenxid_out, &relminmxid_out);
+ &force_freeze,
+ &relfrozenxid_out, &relminmxid_out,
+ &relfrozenxid_nofreeze_out,
+ &relminmxid_nofreeze_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6ebb9c520..f14b64dfc 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -167,9 +167,10 @@ typedef struct LVRelState
MultiXactId relminmxid;
double old_live_tuples; /* previous value of pg_class.reltuples */
- /* VACUUM operation's cutoff for pruning */
+ /* Cutoffs for freezing eligibility */
TransactionId OldestXmin;
- /* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
+ MultiXactId OldestMxact;
+ /* Backstop cutoffs that force freezing of older XIDs/MXIDs */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -199,6 +200,7 @@ typedef struct LVRelState
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber newly_frozen_pages; /* # pages with tuples frozen by us */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -470,8 +472,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relminmxid = rel->rd_rel->relminmxid;
vacrel->old_live_tuples = rel->rd_rel->reltuples;
- /* Set cutoffs for entire VACUUM */
+ /* Initialize freezing cutoffs */
vacrel->OldestXmin = OldestXmin;
+ vacrel->OldestMxact = OldestMxact;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
/* Initialize state used to track oldest extant XID/XMID */
@@ -643,12 +646,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total), %u newly frozen (%.2f%% of total)\n"),
vacrel->removed_pages,
vacrel->rel_pages,
vacrel->scanned_pages,
orig_rel_pages == 0 ? 0 :
- 100.0 * vacrel->scanned_pages / orig_rel_pages);
+ 100.0 * vacrel->scanned_pages / orig_rel_pages,
+ vacrel->newly_frozen_pages,
+ orig_rel_pages == 0 ? 0 :
+ 100.0 * vacrel->newly_frozen_pages / orig_rel_pages);
appendStringInfo(&buf,
_("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
(long long) vacrel->tuples_deleted,
@@ -818,6 +824,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->scanned_pages = 0;
vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
+ vacrel->newly_frozen_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
@@ -873,7 +880,10 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* When vacrel->aggressive is set, we can't skip pages just because they
* are all-visible, but we can still skip pages that are all-frozen, since
* such pages do not need freezing and do not affect the value that we can
- * safely set for relfrozenxid or relminmxid.
+ * safely set for relfrozenxid or relminmxid. Pages that are set to
+ * all-visible but not also set to all-frozen are generally only expected
+ * in pg_upgrade scenarios (these days lazy_scan_prune freezes all of the
+ * tuples on a page when the page as a whole will be marked all-visible).
*
* Before entering the main loop, establish the invariant that
* next_unskippable_block is the next block number >= blkno that we can't
@@ -1017,7 +1027,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* SKIP_PAGES_THRESHOLD (threshold for skipping) was not
* crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
+ * though it's all-visible (and likely all-frozen, too).
*/
all_visible_according_to_vm = true;
}
@@ -1585,10 +1595,13 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
+ bool force_freeze = false;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
+ TransactionId NewRelfrozenXid,
+ NoFreezeNewRelfrozenXid;
+ MultiXactId NewRelminMxid,
+ NoFreezeNewRelminMxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1597,8 +1610,8 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
/* Initialize (or reset) page-level state */
- NewRelfrozenXid = vacrel->NewRelfrozenXid;
- NewRelminMxid = vacrel->NewRelminMxid;
+ NewRelfrozenXid = NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1669,8 +1682,15 @@ retry:
*/
if (ItemIdIsDead(itemid))
{
+ /*
+ * We delay setting all_visible to false in the event of seeing an
+ * LP_DEAD item. We need to test "is the page all_visible if we
+ * just consider remaining tuples with tuple storage?" below, when
+ * considering if we want to freeze the page. We set all_visible
+ * to false for our caller last, when doing final processing of
+ * any LP_DEAD items collected here.
+ */
deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
prunestate->has_lpdead_items = true;
continue;
}
@@ -1803,12 +1823,17 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
+ vacrel->OldestXmin,
+ vacrel->OldestMxact,
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
&tuple_totally_frozen,
+ &force_freeze,
&NewRelfrozenXid,
- &NewRelminMxid))
+ &NewRelminMxid,
+ &NoFreezeNewRelfrozenXid,
+ &NoFreezeNewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1829,9 +1854,31 @@ retry:
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
+ *
+ * Freeze the page (based on heap_prepare_freeze_tuple's instructions)
+ * when it is about to become all-visible. Also freeze in cases where
+ * heap_prepare_freeze_tuple requires it. This usually happens due to the
+ * presence of an old XID from before FreezeLimit.
*/
- vacrel->NewRelfrozenXid = NewRelfrozenXid;
- vacrel->NewRelminMxid = NewRelminMxid;
+ if (prunestate->all_visible || force_freeze)
+ {
+ /*
+ * We're freezing the page. Our final NewRelfrozenXid doesn't need to
+ * be affected by the XIDs/XMIDs that are just about to be frozen
+ * anyway.
+ */
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
+ }
+ else
+ {
+ /* This is comparable to lazy_scan_noprune's handling */
+ vacrel->NewRelfrozenXid = NoFreezeNewRelfrozenXid;
+ vacrel->NewRelminMxid = NoFreezeNewRelminMxid;
+
+ /* Forget heap_prepare_freeze_tuple's guidance on freezing */
+ nfrozen = 0;
+ }
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1839,7 +1886,7 @@ retry:
*/
if (nfrozen > 0)
{
- Assert(prunestate->hastup);
+ vacrel->newly_frozen_pages++;
/*
* At least one tuple with storage needs to be frozen -- execute that
@@ -1869,7 +1916,7 @@ retry:
{
XLogRecPtr recptr;
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+ recptr = log_heap_freeze(vacrel->rel, buf, NewRelfrozenXid,
frozen, nfrozen);
PageSetLSN(page, recptr);
}
@@ -1892,7 +1939,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible)
+ if (prunestate->all_visible && lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1900,7 +1947,6 @@ retry:
if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
Assert(false);
- Assert(lpdead_items == 0);
Assert(prunestate->all_frozen == all_frozen);
/*
@@ -1922,9 +1968,11 @@ retry:
VacDeadItems *dead_items = vacrel->dead_items;
ItemPointerData tmp;
- Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
+ /* Caller expects LP_DEAD items to unset all_visible */
+ prunestate->all_visible = false;
+
vacrel->lpdead_item_pages++;
ItemPointerSetBlockNumber(&tmp, blkno);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0ae3b4506..514658ba0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -957,6 +957,14 @@ get_all_vacuum_rels(int options)
* FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
* minimum).
*
+ * While non-aggressive VACUUMs are never required to advance relfrozenxid and
+ * relminmxid, they often do so in practice. They freeze wherever possible,
+ * based on the same criteria that aggressive VACUUMs use. FreezeLimit and
+ * multiXactCutoff are still applied as backstop cutoffs, that force freezing
+ * of older XIDs/XMIDs that did not get frozen based on the standard criteria.
+ * (Actually, the backstop cutoffs won't force freezing in rare cases where a
+ * cleanup lock cannot be acquired on a page during a non-aggressive VACUUM.)
+ *
* oldestXmin and oldestMxact are the most recent values that can ever be
* passed to vac_update_relstats() as frozenxid and minmulti arguments by our
* vacuumlazy.c caller later on. These values should be passed when it turns
--
2.30.2
v9-0004-Avoid-setting-a-page-all-visible-but-not-all-froz.patch (application/x-patch)
From 15dec1e572ac4da0540251253c3c219eadf46a83 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Feb 2022 17:21:45 -0800
Subject: [PATCH v9 4/4] Avoid setting a page all-visible but not all-frozen.
This is pretty much an addendum to the work in the "Make page-level
characteristics drive freezing" commit. It has been broken out like
this because I'm not even sure if it's necessary. It seems like we
might want to be paranoid about losing out on the chance to advance
relfrozenxid in non-aggressive VACUUMs, though.
The only test that will trigger this case is the "freeze-the-dead"
isolation test. It's incredibly narrow. On the other hand, why take a
chance? All it takes is one heap page that's all-visible (and not also
all-frozen) nestled between some all-frozen heap pages to lose out on
relfrozenxid advancement. The SKIP_PAGES_THRESHOLD stuff won't save us
then [1].
[1] For context see commit bf136cf6e3 -- SKIP_PAGES_THRESHOLD is
specifically concerned with relfrozenxid advancement in non-aggressive
VACUUMs, and always has been. This isn't directly documented right now.
---
src/backend/access/heap/vacuumlazy.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b2d3b039d..5eede8c55 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1981,6 +1981,26 @@ retry:
}
#endif
+ /*
+ * Since OldestXmin and OldestMxact are not absolutely precise, there is a
+ * tiny chance that we will consider the page all-visible while not also
+ * considering it all-frozen (having frozen the page with the expectation
+ * that that would render it all-frozen). This can happen when there is a
+ * MultiXact containing XIDs from before and after OldestXmin, for
+ * example. This risks making relfrozenxid advancement by future
+ * non-aggressive VACUUMs impossible, which is a heavy price to pay just
+ * to be able to avoid accessing one single isolated heap page.
+ *
+ * We could just live with this, but it seems prudent to avoid the problem
+ * instead. And so we deliberately throw away the opportunity to set such
+ * a page all-visible instead of allowing this case.
+ *
+ * XXX What about the lazy_vacuum_heap_page/heap_page_is_all_visible path,
+ * which could still set the page just all-visible when that happens?
+ */
+ if (prunestate->all_visible && !prunestate->all_frozen)
+ prunestate->all_visible = false;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel
*/
--
2.30.2
v9-0003-Remove-aggressive-VACUUM-skipping-special-case.patch (application/x-patch)
From d2190abf366f148bae5307442e8a6245c6922e78 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 21 Feb 2022 12:46:44 -0800
Subject: [PATCH v9 3/4] Remove aggressive VACUUM skipping special case.
Since it's simply never okay to miss out on advancing relfrozenxid
during an aggressive VACUUM (that's the whole point), the aggressive
case treated any page from a next_unskippable_block-wise skippable block
range as an all-frozen page (not a merely all-visible page) during
skipping. Such a page might not be all-visible/all-frozen at the point
that it actually gets skipped, but it could nevertheless be safely
skipped, and then counted in frozenskipped_pages (the page must have
been all-frozen back when we determined the extent of the range of
blocks to skip, since aggressive VACUUMs _must_ scan all-visible pages).
This is necessary to ensure that aggressive VACUUMs are always capable
of advancing relfrozenxid.
The non-aggressive case behaved slightly differently: it rechecked the
visibility map for each page at the point of skipping, and only counted
pages in frozenskipped_pages when they were still all-frozen at that
time. But it skipped the page either way (since we already committed to
skipping the page at the point of the recheck). This was correct, but
sometimes resulted in non-aggressive VACUUMs needlessly wasting an
opportunity to advance relfrozenxid (when a page was modified in just
the wrong way, at just the wrong time). It also resulted in a needless
recheck of the visibility map for each and every page skipped during
non-aggressive VACUUMs.
Avoid these problems by conditioning the "skippable page was definitely
all-frozen when range of skippable pages was first determined" behavior
on what the visibility map _actually said_ about the range as a whole
back when we first determined the extent of the range (don't deduce what
must have happened at that time on the basis of aggressive-ness). This
allows us to reliably count skipped pages in frozenskipped_pages when
they were initially all-frozen. In particular, when a page's visibility
map bit is unset after the point where a skippable range of pages is
initially determined, but before the point where the page is actually
skipped, non-aggressive VACUUMs now count it in frozenskipped_pages,
just like aggressive VACUUMs always have [1]. It's not critical for the
non-aggressive case to get this right, but there is no reason not to.
[1] Actually, it might not work that way when there happens to be a mix
of all-visible and all-frozen pages in a range of skippable pages.
There is no chance of VACUUM advancing relfrozenxid in this scenario
either way, though, so it doesn't matter.
---
src/backend/access/heap/vacuumlazy.c | 59 +++++++++++++++++++---------
1 file changed, 40 insertions(+), 19 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f14b64dfc..b2d3b039d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -542,7 +542,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
- /* Cannot advance relfrozenxid/relminmxid */
+ /*
+ * Skipped some all-visible pages, so definitely cannot advance
+ * relfrozenxid. This is generally only expected in pg_upgrade
+ * scenarios, since VACUUM now avoids setting a page to all-visible
+ * but not all-frozen. However, it's also possible (though quite
+ * unlikely) that we ended up here because somebody else cleared some
+ * page's all-frozen flag (without clearing its all-visible flag).
+ */
Assert(!aggressive);
frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
@@ -810,7 +817,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
next_failsafe_block,
next_fsm_block_to_vacuum;
Buffer vmbuffer = InvalidBuffer;
- bool skipping_blocks;
+ bool skipping_blocks,
+ skipping_allfrozen_blocks;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -905,27 +913,31 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* computed, so they'll have no effect on the value to which we can safely
* set relfrozenxid. A similar argument applies for MXIDs and relminmxid.
*/
+ skipping_allfrozen_blocks = true; /* iff skipping_blocks */
next_unskippable_block = 0;
if (vacrel->skipwithvm)
{
while (next_unskippable_block < nblocks)
{
- uint8 vmstatus;
+ uint8 vmskipflags;
- vmstatus = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
+ vmskipflags = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ &vmbuffer);
if (vacrel->aggressive)
{
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
+ if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
break;
}
else
{
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
break;
}
vacuum_delay_point();
+
+ if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
+ skipping_allfrozen_blocks = false;
next_unskippable_block++;
}
}
@@ -949,6 +961,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
if (blkno == next_unskippable_block)
{
+ skipping_allfrozen_blocks = true; /* iff skipping_blocks */
+
/* Time to advance next_unskippable_block */
next_unskippable_block++;
if (vacrel->skipwithvm)
@@ -971,6 +985,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
break;
}
vacuum_delay_point();
+
+ if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
+ skipping_allfrozen_blocks = false;
next_unskippable_block++;
}
}
@@ -997,8 +1014,11 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
{
/*
* The current page can be skipped if we've seen a long enough run
- * of skippable blocks to justify skipping it -- provided it's not
- * the last page in the relation (according to rel_pages/nblocks).
+ * of skippable blocks to justify skipping it. An aggressive
+ * VACUUM can only skip a range of blocks that were determined to
+ * be all-frozen (not just all-visible) as a group back when the
+ * next_unskippable_block-wise extent of the range was determined.
+ * Assert that we got this right in passing.
*
* We always scan the table's last page to determine whether it
* has tuples or not, even if it would otherwise be skipped. This
@@ -1006,19 +1026,20 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* on the table to attempt a truncation that just fails
* immediately because there are tuples on the last page.
*/
+ Assert(!vacrel->aggressive || !skipping_blocks ||
+ skipping_allfrozen_blocks);
if (skipping_blocks && blkno < nblocks - 1)
{
/*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
+ * When skipping a range of blocks with one or more blocks
+ * that are not all-frozen (expected during a non-aggressive
+ * VACUUM following pg_upgrade), we need to recheck if this
+ * block is all-frozen to maintain frozenskipped_pages. The
+ * block might not even be all-visible by now, but it's always
+ * okay to skip (see note above about visibilitymap_get_status
+ * return value being out-of-date).
*/
- if (vacrel->aggressive ||
+ if (skipping_allfrozen_blocks ||
VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
--
2.30.2
v9-0001-Loosen-coupling-between-relfrozenxid-and-freezing.patch (application/x-patch)
From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Nov 2021 10:02:30 -0800
Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit. There is no fixed
relationship between the amount of physical work performed by VACUUM to
make it safe to advance relfrozenxid (freezing and pruning), and the
actual number of XIDs that relfrozenxid can be advanced by (at least in
principle) as a result. VACUUM might have to freeze all of the tuples
from a hundred million heap pages just to enable relfrozenxid to be
advanced by no more than one or two XIDs. On the other hand, VACUUM
might end up doing little or no work, and yet still be capable of
advancing relfrozenxid by hundreds of millions of XIDs as a result.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM (FreezeLimit is still used as an XID-age based backstop there).
In non-aggressive VACUUMs (where there is still no strict guarantee that
relfrozenxid will be advanced at all), we now advance relfrozenxid by as
much as we possibly can. This exploits workload conditions that make it
easy to advance relfrozenxid by many more XIDs (for the same amount of
freezing/pruning work).
The non-aggressive case can now set relfrozenxid to any legal XID value,
which could in principle be any XID that is > the existing relfrozenxid,
and <= the VACUUM operation's OldestXmin/"removal cutoff" XID value.
FreezeLimit is still used by VACUUM to determine which tuples to freeze,
at least for now. Practical experience from the field may show that
non-aggressive VACUUMs seldom need to set relfrozenxid to an XID from
before FreezeLimit, but having the option still seems very valuable.
A later commit will teach VACUUM to determine which tuples to freeze
based on page-level characteristics. Without this improved approach to
freezing in place, most individual tables still have very little chance
of relfrozenxid advancement during non-aggressive VACUUMs (an aggressive
anti-wraparound autovacuum will still eventually be required with most
tables). All it takes is an earlier VACUUM that sets just a few pages
all-visible (but not all-frozen); later non-aggressive VACUUMs will end
up skipping those pages, as a matter of policy, making relfrozenxid
advancement impossible. This can be avoided by avoiding setting pages
all-visible (but not all-frozen) in the first place.
Once VACUUM becomes capable of consistently advancing relfrozenxid, even
during non-aggressive VACUUMs, relfrozenxid values (and especially
relminmxid values) will tend to track what's really happening in each
table much more accurately. This is expected to make anti-wraparound
autovacuums far rarer in practice. The problem of "anti-wraparound
stampedes" (where multiple anti-wraparound autovacuums are launched at
exactly the same time) is also naturally avoided by advancing
relfrozenxid early and often (since it results in "natural diversity"
among relfrozenxid values, due to table-level workload characteristics).
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 7 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 194 ++++++++++++++++++++-------
src/backend/access/heap/vacuumlazy.c | 128 +++++++++++++-----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 42 +++---
7 files changed, 280 insertions(+), 101 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b46ab7d73..10584a4ce 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,11 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi);
+extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple,
+ TransactionId backstop_cutoff_xid,
+ MultiXactId backstop_cutoff_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..2d8a7f627 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 59d43e2ba..134bc408a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6140,12 +6140,24 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "relfrozenxid_out" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain relfrozenxid_out. We need
+ * to push maintenance of relfrozenxid_out down this far, since in general
+ * xmin might have been frozen by an earlier VACUUM operation, in which case
+ * our caller will not have factored-in xmin into relfrozenxid_out's value.
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *relfrozenxid_out)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6157,6 +6169,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId temprelfrozenxid_out;
*flags = 0;
@@ -6251,13 +6264,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ temprelfrozenxid_out = *relfrozenxid_out;
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, temprelfrozenxid_out))
+ temprelfrozenxid_out = members[i].xid;
}
/*
@@ -6266,6 +6279,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *relfrozenxid_out = temprelfrozenxid_out;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6275,6 +6289,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ temprelfrozenxid_out = *relfrozenxid_out;
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6356,7 +6371,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, temprelfrozenxid_out))
+ temprelfrozenxid_out = members[i].xid;
+ }
}
else
{
@@ -6366,6 +6385,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *relfrozenxid_out));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6394,6 +6414,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Caller manages relfrozenxid_out directly when we return an XID */
}
else
{
@@ -6403,6 +6424,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *relfrozenxid_out = temprelfrozenxid_out;
}
pfree(newmembers);
@@ -6421,6 +6443,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Maintains *relfrozenxid_out and *relminmxid_out, which are the current
+ * target relfrozenxid and relminmxid for the relation. Caller should make
+ * temp copies of global tracking variables before starting to process a page,
+ * so that we can only scribble on copies.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6445,7 +6472,10 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen_p)
+ xl_heap_freeze_tuple *frz,
+ bool *totally_frozen_p,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6489,6 +6519,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ {
+ /* won't be frozen, but older than current relfrozenxid_out */
+ *relfrozenxid_out = xid;
+ }
}
/*
@@ -6506,10 +6541,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId temp = *relfrozenxid_out;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi, &flags, &temp);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
@@ -6527,6 +6563,24 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+
+ if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
+ {
+ /* New xmax is an XID older than new relfrozenxid_out */
+ *relfrozenxid_out = newxmax;
+ }
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * Changing nothing, so might have to ratchet back relminmxid_out,
+ * relfrozenxid_out, or both together
+ */
+ if (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, *relminmxid_out))
+ *relminmxid_out = xid;
+ if (TransactionIdPrecedes(temp, *relfrozenxid_out))
+ *relfrozenxid_out = temp;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6548,6 +6602,13 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+
+ /*
+ * New multixact might have remaining XID older than
+ * relfrozenxid_out
+ */
+ if (TransactionIdPrecedes(temp, *relfrozenxid_out))
+ *relfrozenxid_out = temp;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6575,7 +6636,14 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ {
+ /* won't be frozen, but older than current relfrozenxid_out */
+ *relfrozenxid_out = xid;
+ }
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6622,6 +6690,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * Since we always freeze here, relfrozenxid_out doesn't need to be
+ * maintained.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6699,11 +6770,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId relfrozenxid_out = cutoff_xid;
+ MultiXactId relminmxid_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &relfrozenxid_out, &relminmxid_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7133,6 +7207,22 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * See heap_prepare_freeze_tuple for information about the basic rules for the
+ * cutoffs used here.
+ *
+ * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which
+ * are the current target relfrozenxid and relminmxid for the relation. We
+ * assume that caller will never want to freeze its tuple, even when the tuple
+ * "needs freezing" according to our return value. Caller should make temp
+ * copies of global tracking variables before starting to process a page, so
+ * that we can only scribble on copies. That way caller can just discard the
+ * temp copies if it isn't okay with that assumption.
+ *
+ * Only aggressive VACUUM callers are expected to really care when a tuple
+ * "needs freezing" according to us. It follows that non-aggressive VACUUMs
+ * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
+ * cases.
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
@@ -7140,15 +7230,23 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* on a standby.
*/
bool
-heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_tuple_needs_freeze(HeapTupleHeader tuple,
+ TransactionId backstop_cutoff_xid,
+ MultiXactId backstop_cutoff_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out)
{
TransactionId xid;
+ bool needs_freeze = false;
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
+ needs_freeze = true;
+ }
/*
* The considerations for multixacts are complicated; look at
@@ -7158,57 +7256,59 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ if (MultiXactIdIsValid(multi) &&
+ MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
+ *relminmxid_nofreeze_out = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+ else if (MultiXactIdPrecedes(multi, backstop_cutoff_multi))
+ needs_freeze = true;
+
+ /* need to check whether any member of the mxact is too old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ if (TransactionIdPrecedes(members[i].xid, backstop_cutoff_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(members[i].xid,
+ *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = members[i].xid;
}
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 40101e0cb..6ebb9c520 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,7 +144,7 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
- /* Aggressive VACUUM (scan all unfrozen pages)? */
+ /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
bool skipwithvm;
@@ -172,8 +172,9 @@ typedef struct LVRelState
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
/* Error reporting state */
char *relnamespace;
@@ -329,6 +330,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -355,17 +357,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* used to determine which XIDs/MultiXactIds will be frozen.
*
* If this is an aggressive VACUUM, then we're strictly required to freeze
- * any and all XIDs from before FreezeLimit, so that we will be able to
- * safely advance relfrozenxid up to FreezeLimit below (we must be able to
- * advance relminmxid up to MultiXactCutoff, too).
+ * any and all XIDs from before FreezeLimit in order to be able to advance
+ * relfrozenxid to a value >= FreezeLimit below. There is an analogous
+ * requirement around MultiXact freezing, relminmxid, and MultiXactCutoff.
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -472,8 +474,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->OldestXmin = OldestXmin;
vacrel->FreezeLimit = FreezeLimit;
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+ /* Initialize state used to track oldest extant XID/XMID */
+ vacrel->NewRelfrozenXid = OldestXmin;
+ vacrel->NewRelminMxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -526,16 +529,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might advance relfrozenxid
+ * to an XID that is either older or newer than FreezeLimit (same applies
+ * to relminmxid and MultiXactCutoff).
*
* NB: We must use orig_rel_pages, not vacrel->rel_pages, since we want
* the rel_pages used by lazy_scan_heap, which won't match when we
* happened to truncate the relation afterwards.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
/* Cannot advance relfrozenxid/relminmxid */
Assert(!aggressive);
@@ -549,9 +551,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
orig_rel_pages);
+ Assert(!aggressive ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenXid));
+ Assert(!aggressive ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminMxid));
+
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -656,17 +665,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminMxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1576,6 +1587,8 @@ lazy_scan_prune(LVRelState *vacrel,
int nfrozen;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1583,7 +1596,9 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
- /* Initialize (or reset) page-level counters */
+ /* Initialize (or reset) page-level state */
+ NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1791,7 +1806,9 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenXid,
+ &NewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1805,13 +1822,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1962,6 +1982,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+ MultiXactId NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2007,20 +2029,56 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff))
+ vacrel->MultiXactCutoff,
+ &NoFreezeNewRelfrozenXid,
+ &NoFreezeNewRelminMxid))
{
if (vacrel->aggressive)
{
- /* Going to have to get cleanup lock for lazy_scan_prune */
+ /*
+ * heap_tuple_needs_freeze determined that it isn't going to
+ * be possible for the ongoing aggressive VACUUM operation to
+ * advance relfrozenxid to a value >= FreezeLimit without
+ * freezing one or more tuples with older XIDs from this page.
+ * (Or perhaps the issue was that MultiXactCutoff could not be
+ * respected. Might have even been both cutoffs, together.)
+ *
+ * Tell caller that it must acquire a full cleanup lock. It's
+ * possible that caller will have to wait a while for one, but
+ * that can't be helped -- full processing by lazy_scan_prune
+ * is required to freeze the older XIDs (and/or freeze older
+ * MultiXactIds).
+ *
+ * lazy_scan_prune expects a clean slate. Forget everything
+ * that lazy_scan_noprune learned about the page, including
+ * NewRelfrozenXid and NewRelminMxid tracking information.
+ */
vacrel->offnum = InvalidOffsetNumber;
return false;
}
-
- /*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
- */
- vacrel->freeze_cutoffs_valid = false;
+ else
+ {
+ /*
+ * This is a non-aggressive VACUUM, which is under no strict
+ * obligation to advance relfrozenxid at all (much less to
+ * advance it to a value >= FreezeLimit). Non-aggressive
+ * VACUUM advances relfrozenxid/relminmxid on a best-effort
+ * basis. It never waits for a cleanup lock.
+ *
+ * NewRelfrozenXid (and/or NewRelminMxid) will still have been
+ * ratcheted back as needed. heap_tuple_needs_freeze assumes
+ * that its caller _might_ prefer to carry on without freezing
+ * anything on the page in the event of a tuple containing an
+ * XID/MXID that "needs freezing".
+ *
+ * The fact that we won't be able to advance relfrozenxid up
+ * to FreezeLimit on this occasion is no reason to completely
+ * give up on advancing relfrozenxid. There is likely to be
+ * some benefit from advancing relfrozenxid by any amount,
+ * even if the final value is significantly older than our
+ * FreezeLimit.
+ */
+ }
}
ItemPointerSet(&(tuple.t_self), blkno, offnum);
@@ -2069,6 +2127,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * We have committed to not freezing the tuples on this page (always
+ * happens with a non-aggressive VACUUM), so make sure that the target
+ * relfrozenxid/relminmxid values reflect the XIDs/MXIDs we encountered
+ */
+ vacrel->NewRelfrozenXid = NoFreezeNewRelfrozenXid;
+ vacrel->NewRelminMxid = NoFreezeNewRelminMxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 50a4a612e..0ae3b4506 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,22 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact are the most recent values that can ever be
+ * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
+ * vacuumlazy.c caller later on. These values should be passed when it turns
+ * out that VACUUM will leave no unfrozen XIDs/XMIDs behind in the table.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +969,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +978,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1073,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1092,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
@@ -1390,14 +1400,10 @@ vac_update_relstats(Relation relation,
* Update relfrozenxid, unless caller passed InvalidTransactionId
* indicating it has no new data.
*
- * Ordinarily, we don't let relfrozenxid go backwards: if things are
- * working correctly, the only way the new frozenxid could be older would
- * be if a previous VACUUM was done with a tighter freeze_min_age, in
- * which case we don't want to forget the work it already did. However,
- * if the stored relfrozenxid is "in the future", then it must be corrupt
- * and it seems best to overwrite it with the cutoff we used this time.
- * This should match vac_update_datfrozenxid() concerning what we consider
- * to be "in the future".
+ * Ordinarily, we don't let relfrozenxid go backwards. However, if the
+ * stored relfrozenxid is "in the future", then it must be corrupt, so
+ * just overwrite it. This should match vac_update_datfrozenxid()
+ * concerning what we consider to be "in the future".
*/
if (frozenxid_updated)
*frozenxid_updated = false;
--
2.30.2
Hi,
On 2022-02-24 20:53:08 -0800, Peter Geoghegan wrote:
0002 makes page-level freezing a first class thing.
heap_prepare_freeze_tuple now has some (limited) knowledge of how this
works. heap_prepare_freeze_tuple's cutoff_xid argument is now always
the VACUUM caller's OldestXmin (not its FreezeLimit, as before). We
still have to pass FreezeLimit to heap_prepare_freeze_tuple, which
helps us to respect FreezeLimit as a backstop, and so now it's passed
via the new backstop_cutoff_xid argument instead.
I am not a fan of the backstop terminology. It's still the reason we need to
do freezing for correctness reasons. It'd make more sense to me to turn it
around and call the "non-backstop" freezing opportunistic freezing or such.
Whenever we opt to
"freeze a page", the new page-level algorithm *always* uses the most
recent possible XID and MXID values (OldestXmin and oldestMxact) to
decide what XIDs/XMIDs need to be replaced. That might sound like it'd
be too much, but it only applies to those pages that we actually
decide to freeze (since page-level characteristics drive everything
now). FreezeLimit is only one way of triggering that now (and one of
the least interesting and rarest).
That largely makes sense to me and doesn't seem weird.
I'm a tad concerned about replacing mxids that have some members that are
older than OldestXmin but not older than FreezeLimit. It's not too hard to
imagine that accelerating mxid consumption considerably. But we can probably,
if not already done, special case that.
It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
freezing old ones) in large part so it can NOT freeze XIDs that it
would have been useful (and much cheaper) to remove anyway.
Well, we may have to allocate a new mxid because some members are older than
FreezeLimit but others are still running. When do we not remove xids that
would have been cheaper to remove once we decide to actually do work?
On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
OldestXmin at all (it actually just gets FreezeLimit passed as its
cutoff_xid argument). It cannot possibly recognize any of this for itself.
It does recognize something like OldestXmin in a more precise and expensive
way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().
Does that theory about MultiXacts sound plausible? I'm not claiming
that the patch makes it impossible that FreezeMultiXactId() will have
to allocate a new MultiXact to freeze during VACUUM -- the
freeze-the-dead isolation tests already show that that's not true. I
just think that page-level freezing based on page characteristics with
oldestXmin and oldestMxact (not FreezeLimit and MultiXactCutoff)
cutoffs might make it a lot less likely in practice.
Hm. I guess I'll have to look at the code for it. It doesn't immediately
"feel" quite right.
oldestXmin and oldestMxact map to the same wall clock time, more or less --
that seems like it might be an important distinction, independent of
everything else.
Hm. Multis can be kept alive by fairly "young" member xids. So it may not be
removable (without creating a newer multi) until much later than its creation
time. So I don't think that's really true.
From 483bc8df203f9df058fcb53e7972e3912e223b30 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 22 Nov 2021 10:02:30 -0800
Subject: [PATCH v9 1/4] Loosen coupling between relfrozenxid and freezing.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit. There is no fixed
relationship between the amount of physical work performed by VACUUM to
make it safe to advance relfrozenxid (freezing and pruning), and the
actual number of XIDs that relfrozenxid can be advanced by (at least in
principle) as a result. VACUUM might have to freeze all of the tuples
from a hundred million heap pages just to enable relfrozenxid to be
advanced by no more than one or two XIDs. On the other hand, VACUUM
might end up doing little or no work, and yet still be capable of
advancing relfrozenxid by hundreds of millions of XIDs as a result.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM (FreezeLimit is still used as an XID-age based backstop there).
In non-aggressive VACUUMs (where there is still no strict guarantee that
relfrozenxid will be advanced at all), we now advance relfrozenxid by as
much as we possibly can. This exploits workload conditions that make it
easy to advance relfrozenxid by many more XIDs (for the same amount of
freezing/pruning work).
Don't we now always advance relfrozenxid as much as we can, particularly also
during aggressive vacuums?
 * FRM_RETURN_IS_MULTI
 * The return value is a new MultiXactId to set as new Xmax.
 * (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "relfrozenxid_out" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain relfrozenxid_out.
What does it mean for xmin to maintain something?
+ * See heap_prepare_freeze_tuple for information about the basic rules for the
+ * cutoffs used here.
+ *
+ * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which
+ * are the current target relfrozenxid and relminmxid for the relation. We
+ * assume that caller will never want to freeze its tuple, even when the tuple
+ * "needs freezing" according to our return value.
I don't understand the "will never want to" bit?
Caller should make temp
+ * copies of global tracking variables before starting to process a page, so
+ * that we can only scribble on copies. That way caller can just discard the
+ * temp copies if it isn't okay with that assumption.
+ *
+ * Only aggressive VACUUM callers are expected to really care when a tuple
+ * "needs freezing" according to us. It follows that non-aggressive VACUUMs
+ * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
+ * cases.
Could it make sense to track can_freeze and need_freeze separately?
@@ -7158,57 +7256,59 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ if (MultiXactIdIsValid(multi) &&
+ MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
+ *relminmxid_nofreeze_out = multi;
I may be misreading the diff, but aren't we now continuing to use multi down
below even if !MultiXactIdIsValid()?
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+ else if (MultiXactIdPrecedes(multi, backstop_cutoff_multi))
+ needs_freeze = true;
+
+ /* need to check whether any member of the mxact is too old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
Doesn't this mean we unpack the members even if the multi is old enough to
need freezing? Just to then do it again during freezing? Accessing multis
isn't cheap...
+ if (TransactionIdPrecedes(members[i].xid, backstop_cutoff_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(members[i].xid,
+ *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = members[i].xid;
}
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
+ needs_freeze = true;
+ }
}
This stanza is repeated a bunch. Perhaps put it in a small static inline
helper?
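For illustration, such a helper might look something like this (the name and
exact shape are invented here, not taken from the posted patch):

/*
 * Track the oldest extant XID seen so far, and report whether this XID is
 * old enough to force freezing (i.e. it precedes the backstop cutoff).
 */
static inline void
TrackOldestExtantXid(TransactionId xid,
                     TransactionId backstop_cutoff_xid,
                     TransactionId *relfrozenxid_nofreeze_out,
                     bool *needs_freeze)
{
    if (!TransactionIdIsNormal(xid))
        return;

    /* Ratchet back the tracked oldest extant XID as needed */
    if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
        *relfrozenxid_nofreeze_out = xid;

    /* An XID from before the backstop cutoff means freezing is required */
    if (TransactionIdPrecedes(xid, backstop_cutoff_xid))
        *needs_freeze = true;
}

Each of the repeated stanzas would then collapse to one call of that helper
for xmin, xmax, xvac, and each multixact member.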
/* VACUUM operation's cutoff for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
Struct member names starting with an upper case look profoundly ugly to
me... But this isn't the first one, so I guess... :(
From d10f42a1c091b4dc52670fca80a63fee4e73e20c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 13 Dec 2021 15:00:49 -0800
Subject: [PATCH v9 2/4] Make page-level characteristics drive freezing.
Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen. VACUUM typically won't freeze _any_ tuples on the page
unless _all_ tuples (that remain after pruning) are all-visible. This
makes the overhead of vacuuming much more predictable over time. We
avoid the need for large balloon payments during aggressive VACUUMs
(typically anti-wraparound autovacuums). Freezing is proactive, so
we're much less likely to get into "freezing debt".
I still suspect this will cause a very substantial increase in WAL traffic in
realistic workloads. It's common to have workloads where tuples are inserted
once, and deleted once/ partition dropped. Freezing all the tuples is a lot
more expensive than just marking the page all visible. It's not uncommon to be
bound by WAL traffic rather than buffer dirtying rate (since the latter may be
ameliorated by s_b and local storage, whereas WAL needs to be
streamed/archived).
This is particularly true because log_heap_visible() doesn't need an FPW if
checksums aren't enabled. A small record vs an FPI is a *huge* difference.
I think we'll have to make this less aggressive or tunable. Random ideas for
heuristics (a rough sketch follows the list):
- Is it likely that freezing would not require an FPI or conversely that
log_heap_visible() will also need an fpi? If the page already was recently
modified / checksums are enabled the WAL overhead of the freezing doesn't
play much of a role.
- #dead items / #force-frozen items on the page - if we already need to do
more than just setting all-visible, we can probably afford the WAL traffic.
- relfrozenxid vs max_freeze_age / FreezeLimit. The closer they get, the more
aggressive we should freeze all-visible pages. Might even make sense to
start vacuuming an increasing percentage of all-visible pages during
non-aggressive vacuums, the closer we get to FreezeLimit.
- Keep stats about the age of dead and frozen over time. If all tuples are
removed within a reasonable fraction of freeze_max_age, there's no point in
freezing them.
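To make the shape of these ideas concrete, here is a rough sketch of the kind
of check they point at -- every name and input below is hypothetical, nothing
here is taken from the patch series:

/*
 * Hypothetical heuristic: freeze the tuples on an all-visible-eligible page
 * only when doing so looks cheap, or when relfrozenxid is getting old.
 */
static bool
opportunistic_freeze_wanted(bool freeze_requires_fpi,
                            bool page_needs_fpi_anyway,
                            int lpdead_items,
                            TransactionId relfrozenxid,
                            TransactionId freeze_limit)
{
    /* Freezing is cheap when it needs no FPI, or when the FPI cost is sunk */
    if (!freeze_requires_fpi || page_needs_fpi_anyway)
        return true;

    /* Already doing more than just setting the page all-visible? */
    if (lpdead_items > 0)
        return true;

    /* Urgency: relfrozenxid has already fallen behind FreezeLimit */
    return TransactionIdPrecedes(relfrozenxid, freeze_limit);
}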
The new approach to freezing also enables relfrozenxid advancement in
non-aggressive VACUUMs, which might be enough to avoid aggressive
VACUUMs altogether (with many individual tables/workloads). While the
non-aggressive case continues to skip all-visible (but not all-frozen)
pages (thereby making relfrozenxid advancement impossible), that in
itself will no longer hinder relfrozenxid advancement (outside of
pg_upgrade scenarios).
I don't know how to parse "thereby making relfrozenxid advancement impossible
... will no longer hinder relfrozenxid advancement"?
We now consistently avoid leaving behind all-visible (not all-frozen) pages.
This (as well as work from commit 44fa84881f) makes relfrozenxid advancement
in non-aggressive VACUUMs commonplace.
s/consistently/try to/?
The system accumulates freezing debt in proportion to the number of
physical heap pages with unfrozen tuples, more or less. Anything based
on XID age is likely to be a poor proxy for the eventual cost of
freezing (during the inevitable anti-wraparound autovacuum). At a high
level, freezing is now treated as one of the costs of storing tuples in
physical heap pages -- not a cost of transactions that allocate XIDs.
Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
influence what we freeze, and when, they effectively become backstops.
It may still be necessary to "freeze a page" due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff, though
that will be rare in practice -- FreezeLimit is just a backstop now.
I don't really like the "rare in practice" bit. It'll be rare in some
workloads but others will likely be much less affected.
+ * Although this interface is primarily tuple-based, vacuumlazy.c caller
+ * cooperates with us to decide on whether or not to freeze whole pages,
+ * together as a single group. We prepare for freezing at the level of each
+ * tuple, but the final decision is made for the page as a whole. All pages
+ * that are frozen within a given VACUUM operation are frozen according to
+ * cutoff_xid and cutoff_multi. Caller _must_ freeze the whole page when
+ * we've set *force_freeze to true!
+ *
+ * cutoff_xid must be caller's oldest xmin to ensure that any XID older than
+ * it could neither be running nor seen as running by any open transaction.
+ * This ensures that the replacement will not change anyone's idea of the
+ * tuple state. Similarly, cutoff_multi must be the smallest MultiXactId used
+ * by any open transaction (at the time that the oldest xmin was acquired).
I think this means my concern above about increasing mxid creation rate
substantially may be warranted.
+ * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must
+ * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs
+ * is encountered, we set *force_freeze to true, making caller freeze the page
+ * (freezing-eligible XIDs/XMIDs will be frozen, at least). "Backstop
+ * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old.
+ * This shouldn't be necessary very often. VACUUM should prefer to freeze
+ * when it's cheap (not when it's urgent).
Hm. Does this mean that we might call heap_prepare_freeze_tuple and then
decide not to freeze? Doesn't that mean we might create new multis over and
over, because we don't end up pulling the trigger on freezing the page?
+
+ /*
+ * We allocated a MultiXact for this, so force freezing to avoid
+ * wasting it
+ */
+ *force_freeze = true;
Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body
of the function to figure it out...
From d2190abf366f148bae5307442e8a6245c6922e78 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 21 Feb 2022 12:46:44 -0800
Subject: [PATCH v9 3/4] Remove aggressive VACUUM skipping special case.
Since it's simply never okay to miss out on advancing relfrozenxid
during an aggressive VACUUM (that's the whole point), the aggressive
case treated any page from a next_unskippable_block-wise skippable block
range as an all-frozen page (not a merely all-visible page) during
skipping. Such a page might not be all-visible/all-frozen at the point
that it actually gets skipped, but it could nevertheless be safely
skipped, and then counted in frozenskipped_pages (the page must have
been all-frozen back when we determined the extent of the range of
blocks to skip, since aggressive VACUUMs _must_ scan all-visible pages).
This is necessary to ensure that aggressive VACUUMs are always capable
of advancing relfrozenxid.
The non-aggressive case behaved slightly differently: it rechecked the
visibility map for each page at the point of skipping, and only counted
pages in frozenskipped_pages when they were still all-frozen at that
time. But it skipped the page either way (since we already committed to
skipping the page at the point of the recheck). This was correct, but
sometimes resulted in non-aggressive VACUUMs needlessly wasting an
opportunity to advance relfrozenxid (when a page was modified in just
the wrong way, at just the wrong time). It also resulted in a needless
recheck of the visibility map for each and every page skipped during
non-aggressive VACUUMs.
Avoid these problems by conditioning the "skippable page was definitely
all-frozen when range of skippable pages was first determined" behavior
on what the visibility map _actually said_ about the range as a whole
back when we first determined the extent of the range (don't deduce what
must have happened at that time on the basis of aggressive-ness). This
allows us to reliably count skipped pages in frozenskipped_pages when
they were initially all-frozen. In particular, when a page's visibility
map bit is unset after the point where a skippable range of pages is
initially determined, but before the point where the page is actually
skipped, non-aggressive VACUUMs now count it in frozenskipped_pages,
just like aggressive VACUUMs always have [1]. It's not critical for the
non-aggressive case to get this right, but there is no reason not to.
[1] Actually, it might not work that way when there happens to be a mix
of all-visible and all-frozen pages in a range of skippable pages.
There is no chance of VACUUM advancing relfrozenxid in this scenario
either way, though, so it doesn't matter.
I think this commit message needs a good amount of polishing - it's very
convoluted. It's late and I didn't sleep well, but I've tried to read it
several times without really getting a sense of what this precisely does.
From 15dec1e572ac4da0540251253c3c219eadf46a83 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 24 Feb 2022 17:21:45 -0800
Subject: [PATCH v9 4/4] Avoid setting a page all-visible but not all-frozen.
To me the commit message body doesn't actually describe what this is doing...
This is pretty much an addendum to the work in the "Make page-level
characteristics drive freezing" commit. It has been broken out like
this because I'm not even sure if it's necessary. It seems like we
might want to be paranoid about losing out on the chance to advance
relfrozenxid in non-aggressive VACUUMs, though.
The only test that will trigger this case is the "freeze-the-dead"
isolation test. It's incredibly narrow. On the other hand, why take a
chance? All it takes is one heap page that's all-visible (and not also
all-frozen) nestled between some all-frozen heap pages to lose out on
relfrozenxid advancement. The SKIP_PAGES_THRESHOLD stuff won't save us
then [1].
FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
causing a lot of time to be spent doing IO that we never need, completely
trashing all CPU caches, while not actually causing decent readahead IO from
what I've seen.
Greetings,
Andres Freund
On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
I am not a fan of the backstop terminology. It's still the reason we need to
do freezing for correctness reasons.
Thanks for the review!
I'm not wedded to that particular terminology, but I think that we
need something like it. Open to suggestions.
How about limit-based? Something like that?
It'd make more sense to me to turn it
around and call the "non-backstop" freezing opportunistic freezing or such.
The problem with that scheme is that it leads to a world where
"standard freezing" is incredibly rare (it often literally never
happens), whereas "opportunistic freezing" is incredibly common. That
doesn't make much sense to me.
We tend to think of 50 million XIDs (the vacuum_freeze_min_age
default) as being not that many. But I think that it can be a huge
number, too. Even then, it's unpredictable -- I suspect that it can
change without very much changing in the application, from the point
of view of users. That's a big part of the problem I'm trying to
address -- freezing outside of aggressive VACUUMs is way too rare (it
might barely happen at all). FreezeLimit/vacuum_freeze_min_age was
designed at a time when there was no visibility map at all, when it
made somewhat more sense as the thing that drives freezing.
Incidentally, this is part of the problem with anti-wraparound vacuums
and freezing debt -- the fact that some quite busy databases take
weeks or months to go through 50 million XIDs (or 200 million)
increases the pain of the eventual aggressive VACUUM. It's not
completely unbounded -- autovacuum_freeze_max_age is not 100% useless
here. But the extent to which that stuff bounds the debt can vary
enormously, for not-very-good reasons.
Whenever we opt to
"freeze a page", the new page-level algorithm *always* uses the most
recent possible XID and MXID values (OldestXmin and oldestMxact) to
decide what XIDs/XMIDs need to be replaced. That might sound like it'd
be too much, but it only applies to those pages that we actually
decide to freeze (since page-level characteristics drive everything
now). FreezeLimit is only one way of triggering that now (and one of
the least interesting and rarest).
That largely makes sense to me and doesn't seem weird.
I'm very pleased that the main intuition behind 0002 makes sense to
you. That's a start, at least.
I'm a tad concerned about replacing mxids that have some members that are
older than OldestXmin but not older than FreezeLimit. It's not too hard to
imagine that accelerating mxid consumption considerably. But we can probably,
if not already done, special case that.
Let's assume for a moment that this is a real problem. I'm not sure if
it is or not myself (it's complicated), but let's say that it is. The
problem may be more than offset by the positive impact on relminmxid
advancement. I have placed a large emphasis on enabling
relfrozenxid/relminmxid advancement in every non-aggressive VACUUM,
for a number of reasons -- this is one of the reasons. Finding a way
for every VACUUM operation to be "vacrel->scanned_pages +
vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
amount of relfrozenxid/relminmxid advancement possible in every
VACUUM) has a great deal of value.
As I said recently on the "do only critical work during single-user
vacuum?" thread, why should the largest tables in databases that
consume too many MXIDs do so evenly, across all their tables? There
are usually one or two large tables, and many more smaller tables. I
think it's much more likely that the largest tables consume
approximately zero MultiXactIds in these databases -- actual
MultiXactId consumption is probably concentrated in just one or two
smaller tables (even when we burn through MultiXacts very quickly).
But we don't recognize these kinds of distinctions at all right now.
Under these conditions, we will have many more opportunities to
advance relminmxid for most of the tables (including the larger
tables) all the way up to current-oldestMxact with the patch series.
Without needing to freeze *any* MultiXacts early (just freezing some
XIDs early) to get that benefit. The patch series is not just about
spreading the burden of freezing, so that non-aggressive VACUUMs
freeze more -- it's also making relfrozenxid and relminmxid more
recent, and therefore more *reliable* indicators of where any
wraparound problems *really* are.
Does that make sense to you? This kind of "virtuous cycle" seems
really important to me. It's a subtle point, so I have to ask.
It seems that heap_prepare_freeze_tuple allocates new MXIDs (when
freezing old ones) in large part so it can NOT freeze XIDs that it
would have been useful (and much cheaper) to remove anyway.
Well, we may have to allocate a new mxid because some members are older than
FreezeLimit but others are still running. When do we not remove xids that
would have been cheaper to remove once we decide to actually do work?
My point was that today, on HEAD, there is nothing fundamentally
special about FreezeLimit (aka cutoff_xid) as far as
heap_prepare_freeze_tuple is concerned -- and yet that's the only
cutoff it knows about, really. Why can't we do better, by "exploiting
the difference" between FreezeLimit and OldestXmin?
On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
OldestXmin at all (it actually just gets FreezeLimit passed as its
cutoff_xid argument). It cannot possibly recognize any of this for itself.
It does recognize something like OldestXmin in a more precise and expensive
way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().
It doesn't look that way to me.
While it's true that FreezeMultiXactId() will call
MultiXactIdIsRunning(), that's only a cross-check. This cross-check is
made at a point where we've already determined that the MultiXact in
question is < cutoff_multi. In other words, it catches cases where a
"MultiXactId < cutoff_multi" Multi contains an XID *that's still
running* -- a correctness issue. Nothing to do with being smart about
avoiding allocating new MultiXacts during freezing, or exploiting the
fact that "FreezeLimit < OldestXmin" (which is almost always true,
very true).
This correctness issue is the same issue discussed in "NB: cutoff_xid
*must* be <= the current global xmin..." comments that appear at the
top of heap_prepare_freeze_tuple. That's all.
Hm. I guess I'll have to look at the code for it. It doesn't immediately
"feel" quite right.
I kinda think it might be. Please let me know if you see a problem
with what I've said.
oldestXmin and oldestMxact map to the same wall clock time, more or less --
that seems like it might be an important distinction, independent of
everything else.
Hm. Multis can be kept alive by fairly "young" member xids. So it may not be
removable (without creating a newer multi) until much later than its creation
time. So I don't think that's really true.
Maybe what I said above is true, even though (at the same time) I have
*also* created new problems with "young" member xids. I really don't
know right now, though.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM (FreezeLimit is still used as an XID-age based backstop there).
In non-aggressive VACUUMs (where there is still no strict guarantee that
relfrozenxid will be advanced at all), we now advance relfrozenxid by as
much as we possibly can. This exploits workload conditions that make it
easy to advance relfrozenxid by many more XIDs (for the same amount of
freezing/pruning work).
Don't we now always advance relfrozenxid as much as we can, particularly also
during aggressive vacuums?
I just meant "we hope for the best and accept what we can get". Will fix.
 * FRM_RETURN_IS_MULTI
 * The return value is a new MultiXactId to set as new Xmax.
 * (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "relfrozenxid_out" is an output value; it's used to maintain target new
+ * relfrozenxid for the relation. It can be ignored unless "flags" contains
+ * either FRM_NOOP or FRM_RETURN_IS_MULTI, because we only handle multiXacts
+ * here. This follows the general convention: only track XIDs that will still
+ * be in the table after the ongoing VACUUM finishes. Note that it's up to
+ * caller to maintain this when the Xid return value is itself an Xid.
+ *
+ * Note that we cannot depend on xmin to maintain relfrozenxid_out.
What does it mean for xmin to maintain something?
Will fix.
+ * See heap_prepare_freeze_tuple for information about the basic rules for the
+ * cutoffs used here.
+ *
+ * Maintains *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out, which
+ * are the current target relfrozenxid and relminmxid for the relation. We
+ * assume that caller will never want to freeze its tuple, even when the tuple
+ * "needs freezing" according to our return value.
I don't understand the "will never want to" bit?
I meant "even when it's a non-aggressive VACUUM, which will never want
to wait for a cleanup lock the hard way, and will therefore always
settle for these relfrozenxid_nofreeze_out and
*relminmxid_nofreeze_out values". Note the convention here, which is
relfrozenxid_nofreeze_out is not the same thing as relfrozenxid_out --
the former variable name is used for values in cases where we *don't*
freeze, the latter for values in the cases where we do.
Will try to clear that up.
Caller should make temp
+ * copies of global tracking variables before starting to process a page, so
+ * that we can only scribble on copies. That way caller can just discard the
+ * temp copies if it isn't okay with that assumption.
+ *
+ * Only aggressive VACUUM callers are expected to really care when a tuple
+ * "needs freezing" according to us. It follows that non-aggressive VACUUMs
+ * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
+ * cases.
Could it make sense to track can_freeze and need_freeze separately?
You mean to change the signature of heap_tuple_needs_freeze, so it
doesn't return a bool anymore? It just has two bool pointers as
arguments, can_freeze and need_freeze?
I suppose that could make sense. Don't feel strongly either way.
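For clarity, the alternative signature being discussed would look roughly like
this (just a sketch of the idea; this declaration exists neither in HEAD nor in
the posted patch):

/* Sketch: report "can freeze" and "needs freeze" separately, return nothing */
void
heap_tuple_needs_freeze(HeapTupleHeader tuple,
                        TransactionId backstop_cutoff_xid,
                        MultiXactId backstop_cutoff_multi,
                        TransactionId *relfrozenxid_nofreeze_out,
                        MultiXactId *relminmxid_nofreeze_out,
                        bool *can_freeze,
                        bool *need_freeze);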
I may be misreading the diff, but aren't we now continuing to use multi down
below even if !MultiXactIdIsValid()?
Will investigate.
Doesn't this mean we unpack the members even if the multi is old enough to
need freezing? Just to then do it again during freezing? Accessing multis
isn't cheap...
Will investigate.
This stanza is repeated a bunch. Perhaps put it in a small static inline
helper?
Will fix.
Struct member names starting with an upper case look profoundly ugly to
me... But this isn't the first one, so I guess... :(
I am in 100% agreement, actually. But you know how it goes...
I still suspect this will cause a very substantial increase in WAL traffic in
realistic workloads. It's common to have workloads where tuples are inserted
once, and deleted once/ partition dropped.
I agree with the principle that this kind of use case should be
accommodated in some way.
I think we'll have to make this less aggressive or tunable. Random ideas for
heuristics:
The problem that all of these heuristics have is that they will tend
to make it impossible for future non-aggressive VACUUMs to be able to
advance relfrozenxid. All that it takes is one single all-visible page
to make that impossible. As I said upthread, I think that being able
to advance relfrozenxid (and especially relminmxid) by *some* amount
in every VACUUM has non-obvious value.
Maybe you can address that by changing the behavior of non-aggressive
VACUUMs, so that they are directly sensitive to this. Maybe they don't
skip any all-visible pages when there aren't too many, that kind of
thing. That needs to be in scope IMV.
I don't know how to parse "thereby making relfrozenxid advancement impossible
... will no longer hinder relfrozenxid advancement"?
Will fix.
We now consistently avoid leaving behind all-visible (not all-frozen) pages.
This (as well as work from commit 44fa84881f) makes relfrozenxid advancement
in non-aggressive VACUUMs commonplace.
s/consistently/try to/?
Will fix.
The system accumulates freezing debt in proportion to the number of
physical heap pages with unfrozen tuples, more or less. Anything based
on XID age is likely to be a poor proxy for the eventual cost of
freezing (during the inevitable anti-wraparound autovacuum). At a high
level, freezing is now treated as one of the costs of storing tuples in
physical heap pages -- not a cost of transactions that allocate XIDs.
Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
influence what we freeze, and when, they effectively become backstops.
It may still be necessary to "freeze a page" due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff, though
that will be rare in practice -- FreezeLimit is just a backstop now.
I don't really like the "rare in practice" bit. It'll be rare in some
workloads but others will likely be much less affected.
Maybe. The first time one XID crosses FreezeLimit now will be enough
to trigger freezing the page. So it's still very different to today.
I'll change this, though. It's not important.
I think this means my concern above about increasing mxid creation rate
substantially may be warranted.
Can you think of an adversarial workload, to get a sense of the extent
of the problem?
+ * backstop_cutoff_xid must be <= cutoff_xid, and backstop_cutoff_multi must
+ * be <= cutoff_multi. When any XID/XMID from before these backstop cutoffs
+ * is encountered, we set *force_freeze to true, making caller freeze the page
+ * (freezing-eligible XIDs/XMIDs will be frozen, at least). "Backstop
+ * freezing" ensures that VACUUM won't allow XIDs/XMIDs to ever get too old.
+ * This shouldn't be necessary very often. VACUUM should prefer to freeze
+ * when it's cheap (not when it's urgent).
Hm. Does this mean that we might call heap_prepare_freeze_tuple and then
decide not to freeze?
Yes. And so heap_prepare_freeze_tuple is now a little more like its
sibling function, heap_tuple_needs_freeze.
Doesn't that mean we might create new multis over and
over, because we don't end up pulling the trigger on freezing the page?
Ah, I guess not. But it'd be nicer if I didn't have to scroll down to the body
of the function to figure it out...
Will fix.
I think this commit message needs a good amount of polishing - it's very
convoluted. It's late and I didn't sleep well, but I've tried to read it
several times without really getting a sense of what this precisely does.
It received much less polishing than the others.
Think of 0003 like this:
The logic for skipping a range of blocks using the visibility map
works by deciding the next_unskippable_block-wise range of skippable
blocks up front. Later, we actually execute the skipping of this range
of blocks (assuming it exceeds SKIP_PAGES_THRESHOLD). These are two
separate steps.
Right now, we do this:
if (skipping_blocks && blkno < nblocks - 1)
{
/*
* Tricky, tricky. If this is in aggressive vacuum, the page
* must have been all-frozen at the time we checked whether it
* was skippable, but it might not be any more. We must be
* careful to count it as a skipped all-frozen page in that
* case, or else we'll think we can't update relfrozenxid and
* relminmxid. If it's not an aggressive vacuum, we don't
* know whether it was initially all-frozen, so we have to
* recheck.
*/
if (vacrel->aggressive ||
VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
vacrel->frozenskipped_pages++;
continue;
}
The fact that this is conditioned in part on "vacrel->aggressive"
concerns me here. Why should we have a special case for this, where we
condition something on aggressive-ness that isn't actually strictly
related to that? Why not just remember that the range that we're
skipping was all-frozen up-front?
That way non-aggressive VACUUMs are not unnecessarily at a
disadvantage, when it comes to being able to advance relfrozenxid.
What if we end up not incrementing vacrel->frozenskipped_pages when we
easily could have, just because this is a non-aggressive VACUUM? I
think that it's worth avoiding stuff like that whenever possible.
Maybe this particular example isn't the most important one. For
example, it probably isn't as bad as the one that was fixed by the
lazy_scan_noprune work. But why even take a chance? Seems easier to
remove the special case -- which is what this really is.
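To make "remember that the range was all-frozen up-front" concrete, here is
a hedged standalone sketch (simplified stand-in types; the SkipRange struct
and its field names are invented, not the patch's actual code). The point
is only that the all-frozen determination is made once, when the skippable
range is established, so the per-block code never needs to consult
vacrel->aggressive or recheck the visibility map:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;   /* stand-in for the real typedef */

/* Invented, simplified model of a pre-computed skippable range */
typedef struct SkipRange
{
    BlockNumber start_block;
    BlockNumber next_unskippable_block; /* first block after the range */
    bool        all_frozen;     /* decided once, when the range was built */
} SkipRange;

int
main(void)
{
    /* pretend a helper decided blocks 10..39 can be skipped, all frozen */
    SkipRange   range = {10, 40, true};
    BlockNumber frozenskipped_pages = 0;

    for (BlockNumber blkno = 0; blkno < 50; blkno++)
    {
        if (blkno >= range.start_block &&
            blkno < range.next_unskippable_block)
        {
            /*
             * No aggressive-mode special case and no visibility map
             * recheck here: the all_frozen flag was saved when the range
             * was set up, so skipped all-frozen pages are always counted.
             */
            if (range.all_frozen)
                frozenskipped_pages++;
            continue;
        }
        /* ... non-skipped pages would be scanned here ... */
    }
    printf("frozenskipped_pages = %u\n", (unsigned) frozenskipped_pages);
    return 0;
}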
FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
causing a lot of time to be spent doing IO that we never need, completely
trashing all CPU caches, while not actually causing decent readahead IO from
what I've seen.
I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get
rid of it, we'll need to be sensitive to how that affects relfrozenxid
advancement in non-aggressive VACUUMs IMV.
Thanks again for the review!
--
Peter Geoghegan
Hi,
On 2022-02-25 14:00:12 -0800, Peter Geoghegan wrote:
On Thu, Feb 24, 2022 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
I am not a fan of the backstop terminology. It's still the reason we need to
do freezing for correctness reasons.
Thanks for the review!
I'm not wedded to that particular terminology, but I think that we
need something like it. Open to suggestions.
How about limit-based? Something like that?
freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or
s/limit/below/? I kind of like below because that answers < vs <= which I find
hard to remember around freezing.
I'm a tad concerned about replacing mxids that have some members that are
older than OldestXmin but not older than FreezeLimit. It's not too hard to
imagine that accelerating mxid consumption considerably. But we can probably,
if not already done, special case that.
Let's assume for a moment that this is a real problem. I'm not sure if
it is or not myself (it's complicated), but let's say that it is. The
problem may be more than offset by the positive impact on relminmxid
advancement. I have placed a large emphasis on enabling
relfrozenxid/relminmxid advancement in every non-aggressive VACUUM,
for a number of reasons -- this is one of the reasons. Finding a way
for every VACUUM operation to be "vacrel->scanned_pages +
vacrel->frozenskipped_pages == orig_rel_pages" (i.e. making *some*
amount of relfrozenxid/relminmxid advancement possible in every
VACUUM) has a great deal of value.
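To state the accounting plainly (a trivial sketch with made-up numbers,
just to pin down the condition being discussed): relfrozenxid/relminmxid
advancement is only safe when every page was either scanned or skipped
because it was known to be all-frozen.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

int
main(void)
{
    /* counters as they might stand at the end of one VACUUM operation */
    BlockNumber orig_rel_pages = 1000;
    BlockNumber scanned_pages = 120;
    BlockNumber frozenskipped_pages = 880;

    /*
     * Only when no page was skipped as merely all-visible (not all-frozen)
     * can the final NewRelfrozenXid/NewRelminMxid values be trusted.
     */
    bool        can_advance = (scanned_pages + frozenskipped_pages ==
                               orig_rel_pages);

    printf("can advance relfrozenxid/relminmxid: %s\n",
           can_advance ? "yes" : "no");
    return 0;
}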
That may be true, but I think working more incrementally is better in this
area. I'd rather have a smaller improvement for a release, collect some data,
get another improvement in the next, than see a bunch of reports of larger
wins and large regressions.
As I said recently on the "do only critical work during single-user
vacuum?" thread, why should the largest tables in databases that
consume too many MXIDs do so evenly, across all their tables? There
are usually one or two large tables, and many more smaller tables. I
think it's much more likely that the largest tables consume
approximately zero MultiXactIds in these databases -- actual
MultiXactId consumption is probably concentrated in just one or two
smaller tables (even when we burn through MultiXacts very quickly).
But we don't recognize these kinds of distinctions at all right now.
Recognizing those distinctions seems independent of freezing multixacts with
live members. I am happy with freezing them more aggressively if they don't
have live members. It's freezing mxids with live members that has me
concerned. The limits you're proposing are quite aggressive and can advance
quickly.
I've seen large tables with plenty of multixacts. Typically concentrated over a
value range (often changing over time).
Under these conditions, we will have many more opportunities to
advance relminmxid for most of the tables (including the larger
tables) all the way up to current-oldestMxact with the patch series.
Without needing to freeze *any* MultiXacts early (just freezing some
XIDs early) to get that benefit. The patch series is not just about
spreading the burden of freezing, so that non-aggressive VACUUMs
freeze more -- it's also making relfrozenxid and relminmxid more
recent and therefore *reliable* indicators of which tables *really* have
wraparound problems.
My concern was explicitly about the case where we have to create new
multixacts...
Does that make sense to you?
Yes.
On HEAD, FreezeMultiXactId() doesn't get passed down the VACUUM operation's
OldestXmin at all (it actually just gets FreezeLimit passed as its
cutoff_xid argument). It cannot possibly recognize any of this for itself.
It does recognize something like OldestXmin in a more precise and expensive
way - MultiXactIdIsRunning() and TransactionIdIsCurrentTransactionId().
It doesn't look that way to me.
While it's true that FreezeMultiXactId() will call MultiXactIdIsRunning(),
that's only a cross-check.
This cross-check is made at a point where we've already determined that the
MultiXact in question is < cutoff_multi. In other words, it catches cases
where a "MultiXactId < cutoff_multi" Multi contains an XID *that's still
running* -- a correctness issue. Nothing to do with being smart about
avoiding allocating new MultiXacts during freezing, or exploiting the fact
that "FreezeLimit < OldestXmin" (which is almost always true, very true).
If there's <= 1 live member in a mxact, we replace it with a plain xid
iff the xid also would get frozen. With the current freezing logic I don't see
what passing down OldestXmin would change. Or how it differs to a meaningful
degree from heap_prepare_freeze_tuple()'s logic. I don't see how it'd avoid a
single new mxact from being allocated.
Caller should make temp
+ * copies of global tracking variables before starting to process a page, so
+ * that we can only scribble on copies. That way caller can just discard the
+ * temp copies if it isn't okay with that assumption.
+ *
+ * Only aggressive VACUUM callers are expected to really care when a tuple
+ * "needs freezing" according to us. It follows that non-aggressive VACUUMs
+ * can use *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out in all
+ * cases.
Could it make sense to track can_freeze and need_freeze separately?
You mean to change the signature of heap_tuple_needs_freeze, so it
doesn't return a bool anymore? It just has two bool pointers as
arguments, can_freeze and need_freeze?
Something like that. Or return true if there's anything to do, and then rely
on can_freeze and need_freeze for finer details. But it doesn't matter that much.
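For clarity, here is a hedged sketch of the kind of signature being
discussed (hypothetical -- DemoTuple and the decision logic inside are
placeholders, and the real heap_tuple_needs_freeze() works on a
HeapTupleHeader with wraparound-aware XID comparisons): the function
returns true if there is anything to do at all, and reports the finer
details through two output flags.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for a heap tuple header */
typedef struct DemoTuple
{
    TransactionId xmin;
    TransactionId xmax;
} DemoTuple;

static bool
demo_tuple_freeze_status(const DemoTuple *tup, TransactionId limit_xid,
                         bool *can_freeze, bool *need_freeze)
{
    /* 3 is FirstNormalTransactionId in the real code */
    bool        xmin_normal = (tup->xmin >= 3);
    bool        xmax_normal = (tup->xmax >= 3);

    /* placeholder logic: something on the tuple could be frozen at all */
    *can_freeze = xmin_normal || xmax_normal;

    /* an XID is old enough that freezing can no longer be put off */
    *need_freeze = (xmin_normal && tup->xmin < limit_xid) ||
                   (xmax_normal && tup->xmax < limit_xid);

    return *can_freeze || *need_freeze;
}

int
main(void)
{
    DemoTuple   tup = {.xmin = 500, .xmax = 0};
    bool        can_freeze,
                need_freeze;

    if (demo_tuple_freeze_status(&tup, 1000, &can_freeze, &need_freeze))
        printf("can_freeze=%d need_freeze=%d\n", can_freeze, need_freeze);
    return 0;
}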
I still suspect this will cause a very substantial increase in WAL traffic in
realistic workloads. It's common to have workloads where tuples are inserted
once, and deleted once / partition dropped.
I agree with the principle that this kind of use case should be
accommodated in some way.
I think we'll have to make this less aggressive or tunable. Random ideas for
heuristics:
The problem that all of these heuristics have is that they will tend
to make it impossible for future non-aggressive VACUUMs to be able to
advance relfrozenxid. All that it takes is one single all-visible page
to make that impossible. As I said upthread, I think that being able
to advance relfrozenxid (and especially relminmxid) by *some* amount
in every VACUUM has non-obvious value.
I think that's a laudable goal. But I don't think we should go there unless we
are quite confident we've mitigated the potential downsides.
Observed horizons for "never vacuumed before" tables and for aggressive
vacuums alone would be a huge win.
Maybe you can address that by changing the behavior of non-aggressive
VACUUMs, so that they are directly sensitive to this. Maybe they don't
skip any all-visible pages when there aren't too many, that kind of
thing. That needs to be in scope IMV.
Yea. I still like my idea to have vacuum process some all-visible pages
every time and to increase that percentage based on how old the relfrozenxid
is.
We could slowly "refill" the number of all-visible pages VACUUM is allowed to
process whenever dirtying a page for other reasons.
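A hedged sketch of how such a budget might be shaped (entirely hypothetical
numbers and names, not something taken from the patch series): the share of
all-visible pages that a non-aggressive VACUUM is willing to visit grows as
relfrozenxid gets older, and the "refill" idea would simply add back to the
remaining budget each time VACUUM dirties a page for other reasons anyway.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
typedef uint32_t TransactionId;

/*
 * Hypothetical budget heuristic: start with a small fraction of the
 * table's pages, and scale the fraction up with relfrozenxid age.
 */
static BlockNumber
allvisible_scan_budget(BlockNumber rel_pages,
                       TransactionId relfrozenxid_age,
                       TransactionId freeze_table_age)
{
    double      pct = 0.01;     /* baseline: 1% of the table */

    if (freeze_table_age > 0)
        pct += 0.24 * ((double) relfrozenxid_age / freeze_table_age);
    if (pct > 0.25)
        pct = 0.25;             /* arbitrary cap, for the sketch only */

    return (BlockNumber) (pct * rel_pages);
}

int
main(void)
{
    /* the budget grows as the table's relfrozenxid gets older */
    printf("young table: %u pages\n",
           (unsigned) allvisible_scan_budget(100000, 10000000, 150000000));
    printf("old table:   %u pages\n",
           (unsigned) allvisible_scan_budget(100000, 140000000, 150000000));
    return 0;
}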
I think this means my concern above about increasing mxid creation rate
substantially may be warranted.
Can you think of an adversarial workload, to get a sense of the extent
of the problem?
I'll try to come up with something.
FWIW, I'd really like to get rid of SKIP_PAGES_THRESHOLD. It often ends up
causing a lot of time to be spent doing IO that we never need, completely
trashing all CPU caches, while not actually causing decent readahead IO from
what I've seen.
I am also suspicious of SKIP_PAGES_THRESHOLD. But if we want to get
rid of it, we'll need to be sensitive to how that affects relfrozenxid
advancement in non-aggressive VACUUMs IMV.
It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The
relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just
because there are only 30 of them in a row.
Thanks again for the review!
NP, I think we need a lot of improvements in this area.
I wish somebody would tackle merging heap_page_prune() with
vacuuming. Primarily so we only do a single WAL record. But also because the
separation has caused a *lot* of complexity. I've already got more projects
than I should, otherwise I'd start on it...
Greetings,
Andres Freund
On Fri, Feb 25, 2022 at 2:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
Hm. I guess I'll have to look at the code for it. It doesn't immediately
"feel" quite right.
I kinda think it might be. Please let me know if you see a problem
with what I've said.
Oh, wait. I have a better idea of what you meant now. The loop towards
the end of FreezeMultiXactId() will indeed "Determine whether to keep
this member or ignore it." when we need a new MultiXactId. The loop is
exact in the sense that it will only include those XIDs that are truly
needed -- those that are still running.
But why should we ever get to the FreezeMultiXactId() loop with the
stuff from 0002 in place? The whole purpose of the loop is to handle
cases where we have to remove *some* (not all) XIDs from before
cutoff_xid that appear in a MultiXact, which requires careful checking
of each XID (this is only possible when the MultiXactId is <
cutoff_multi to begin with, which is OldestMxact in the patch, which
is presumably very recent).
It's not impossible that we'll get some number of "skewed MultiXacts"
with the patch -- cases that really do necessitate allocating a new
MultiXact, just to "freeze some XIDs from a MultiXact". That is, there
will sometimes be some number of XIDs that are < OldestXmin, but
nevertheless appear in some MultiXactIds >= OldestMxact. This seems
likely to be rare with the patch, though, since VACUUM calculates its
OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi
really are in the patch) at the same point in time. Which was the
point I made in my email yesterday.
How many of these "skewed MultiXacts" can we really expect? Seems like
there might be very few in practice. But I'm really not sure about
that.
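To pin down what "skewed MultiXact" means here, a simplified standalone
sketch (invented DemoMulti type; the real code would use
TransactionIdPrecedes()/MultiXactIdPrecedes() and fetch members with
GetMultiXactIdMembers()): a multi that is itself >= OldestMxact, and so
cannot simply be removed, but that contains at least one member XID <
OldestXmin -- the case that could force allocation of a replacement multi
during freezing.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint32_t MultiXactId;

/* Simplified stand-in for a multixact and its member XIDs */
typedef struct DemoMulti
{
    MultiXactId mxid;
    int         nmembers;
    TransactionId members[4];
} DemoMulti;

static bool
is_skewed_multi(const DemoMulti *multi,
                TransactionId oldest_xmin, MultiXactId oldest_mxact)
{
    if (multi->mxid < oldest_mxact)
        return false;           /* old enough to be processed outright */

    for (int i = 0; i < multi->nmembers; i++)
    {
        if (multi->members[i] < oldest_xmin)
            return true;        /* member XID would need freezing "early" */
    }
    return false;
}

int
main(void)
{
    DemoMulti   multi = {.mxid = 5000, .nmembers = 2, .members = {900, 1500}};

    printf("skewed: %d\n", is_skewed_multi(&multi, 1000, 4000));
    return 0;
}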
--
Peter Geoghegan
Hi,
On 2022-02-25 15:28:17 -0800, Peter Geoghegan wrote:
But why should we ever get to the FreezeMultiXactId() loop with the
stuff from 0002 in place? The whole purpose of the loop is to handle
cases where we have to remove *some* (not all) XIDs from before
cutoff_xid that appear in a MultiXact, which requires careful checking
of each XID (this is only possible when the MultiXactId is <
cutoff_multi to begin with, which is OldestMxact in the patch, which
is presumably very recent).
It's not impossible that we'll get some number of "skewed MultiXacts"
with the patch -- cases that really do necessitate allocating a new
MultiXact, just to "freeze some XIDs from a MultiXact". That is, there
will sometimes be some number of XIDs that are < OldestXmin, but
nevertheless appear in some MultiXactIds >= OldestMxact. This seems
likely to be rare with the patch, though, since VACUUM calculates its
OldestXmin and OldestMxact (which are what cutoff_xid and cutoff_multi
really are in the patch) at the same point in time. Which was the
point I made in my email yesterday.
I don't see why it matters that OldestXmin and OldestMxact are computed at the
same time? It's a question of the workload, not vacuum algorithm.
OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all
members are older than OldestXmin (not quite true, but that's the bound), and
they always have more than one member.
How many of these "skewed MultiXacts" can we really expect?
I don't think they're skewed in any way. It's a fundamental aspect of
multixacts.
Greetings,
Andres Freund
On Fri, Feb 25, 2022 at 3:48 PM Andres Freund <andres@anarazel.de> wrote:
I don't see why it matters that OldestXmin and OldestMxact are computed at the
same time? It's a question of the workload, not vacuum algorithm.
I think it's both.
OldestMxact inherently lags OldestXmin. OldestMxact can only advance after all
members are older than OldestXmin (not quite true, but that's the bound), and
they always have more than one member.
How many of these "skewed MultiXacts" can we really expect?
I don't think they're skewed in any way. It's a fundamental aspect of
multixacts.
Having this happen to some degree is fundamental to MultiXacts, sure.
But it also seems like the approach of using FreezeLimit and
MultiXactCutoff in the way that we do right now might
make the problem a lot worse. Because they're completely meaningless
cutoffs. They are magic numbers that have no relationship whatsoever
to each other.
There are problems with assuming that OldestXmin and OldestMxact
"align" -- no question. But at least it's approximately true -- which
is a start. They are at least not arbitrarily, unpredictably
different, like FreezeLimit and MultiXactCutoff are, and always will
be. I think that that's a meaningful and useful distinction.
I am okay with making the most pessimistic possible assumptions about
how any changes to how we freeze might cause FreezeMultiXactId() to
allocate more MultiXacts than before. And I accept that the patch
series shouldn't "get credit" for "offsetting" any problem like that
by making relminmxid advancement occur much more frequently (even
though that does seem very valuable). All I'm really saying is this:
in general, there are probably quite a few opportunities for
FreezeMultiXactId() to avoid allocating new XMIDs (just to freeze
XIDs) by having the full context. And maybe by making the dialog
between lazy_scan_prune and heap_prepare_freeze_tuple a bit more
nuanced.
--
Peter Geoghegan
On Fri, Feb 25, 2022 at 3:26 PM Andres Freund <andres@anarazel.de> wrote:
freeze_required_limit, freeze_desired_limit? Or s/limit/cutoff/? Or
s/limit/below/? I kind of like below because that answers < vs <= which I find
hard to remember around freezing.
I like freeze_required_limit the most.
That may be true, but I think working more incrementally is better in this
area. I'd rather have a smaller improvement for a release, collect some data,
get another improvement in the next, than see a bunch of reports of larger
wins and large regressions.
I agree.
There is an important practical way in which it makes sense to treat
0001 as separate to 0002. It is true that 0001 is independently quite
useful. In practical terms, I'd be quite happy to just get 0001 into
Postgres 15, without 0002. I think that that's what you meant here, in
concrete terms, and we can agree on that now.
However, it is *also* true that there is an important practical sense
in which they *are* related. I don't want to ignore that either -- it
does matter. Most of the value to be had here comes from the synergy
between 0001 and 0002 -- or what I've been calling a "virtuous cycle",
the thing that makes it possible to advance relfrozenxid/relminmxid in
almost every VACUUM. Having both 0001 and 0002 together (or something
along the same lines) is way more valuable than having just one.
Perhaps we can even agree on this second point. I am encouraged by the
fact that you at least recognize the general validity of the key ideas
from 0002. If I am going to commit 0001 (and not 0002) ahead of
feature freeze for 15, I better be pretty sure that I have at least
roughly the right idea with 0002, too -- since that's the direction
that 0001 is going in. It almost seems dishonest to pretend that I
wasn't thinking of 0002 when I wrote 0001.
I'm glad that you seem to agree that this business of accumulating
freezing debt without any natural limit is just not okay. That is
really fundamental to me. I mean, vacuum_freeze_min_age kind of
doesn't work as designed. This is a huge problem for us.
Under these conditions, we will have many more opportunities to
advance relminmxid for most of the tables (including the larger
tables) all the way up to current-oldestMxact with the patch series.
Without needing to freeze *any* MultiXacts early (just freezing some
XIDs early) to get that benefit. The patch series is not just about
spreading the burden of freezing, so that non-aggressive VACUUMs
freeze more -- it's also making relfrozenxid and relminmxid more
recent and therefore *reliable* indicators of which tables *really* have
wraparound problems.
My concern was explicitly about the case where we have to create new
multixacts...
It was a mistake on my part to counter your point about that with this
other point about eager relminmxid advancement. As I said in the last
email, while that is very valuable, it's not something that needs to
be brought into this.
Does that make sense to you?
Yes.
Okay, great. The fact that you recognize the value in that comes as a relief.
You mean to change the signature of heap_tuple_needs_freeze, so it
doesn't return a bool anymore? It just has two bool pointers as
arguments, can_freeze and need_freeze?
Something like that. Or return true if there's anything to do, and then rely
on can_freeze and need_freeze for finer details. But it doesn't matter that much.
Got it.
The problem that all of these heuristics have is that they will tend
to make it impossible for future non-aggressive VACUUMs to be able to
advance relfrozenxid. All that it takes is one single all-visible page
to make that impossible. As I said upthread, I think that being able
to advance relfrozenxid (and especially relminmxid) by *some* amount
in every VACUUM has non-obvious value.
I think that's a laudable goal. But I don't think we should go there unless we
are quite confident we've mitigated the potential downsides.
True. But that works both ways. We also shouldn't err in the direction
of adding these kinds of heuristics (which have real downsides) until
the idea of mostly swallowing the cost of freezing whole pages (while
making it possible to disable) has lost on the merits. Overall, it looks
like the cost is acceptable in most cases.
I think that users will find it very reassuring to regularly and
reliably see confirmation that wraparound is being kept at bay, by
every VACUUM operation, with details that they can relate to their
workload. That has real value IMV -- even when it's theoretically
unnecessary for us to be so eager with advancing relfrozenxid.
I really don't like the idea of falling behind on freezing
systematically. You always run the "risk" of freezing being wasted.
But that way of looking at it can be penny wise, pound foolish --
maybe we should just accept that trying to predict what will happen in
the future (whether or not freezing will be worth it) is mostly not
helpful. Our users mostly complain about performance stability these
days. Big shocks are really something we ought to avoid. That does
have a cost. Why wouldn't it?
Maybe you can address that by changing the behavior of non-aggressive
VACUUMs, so that they are directly sensitive to this. Maybe they don't
skip any all-visible pages when there aren't too many, that kind of
thing. That needs to be in scope IMV.
Yea. I still like my idea to have vacuum process some all-visible pages
every time and to increase that percentage based on how old the relfrozenxid
is.
You can quite easily construct cases where the patch does much better
than that, though -- very believable cases. Any table like
pgbench_history. And so I lean towards quantifying the cost of
page-level freezing carefully, making sure there is nothing
pathological, and then just accepting it (with a GUC to disable). The
reality is that freezing is really a cost of storing data in Postgres,
and will be for the foreseeable future.
Can you think of an adversarial workload, to get a sense of the extent
of the problem?
I'll try to come up with something.
That would be very helpful. Thanks!
It might make sense to separate the purposes of SKIP_PAGES_THRESHOLD. The
relfrozenxid advancement doesn't benefit from visiting all-frozen pages, just
because there are only 30 of them in a row.
Right. I imagine that SKIP_PAGES_THRESHOLD actually does help with
this, but if we actually tried we'd find a much better way.
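One way to picture "separating the purposes" (purely illustrative -- the
constant and rule below are made up, not a proposal from the patch): skip
all-frozen ranges unconditionally, since scanning them gains nothing for
relfrozenxid advancement, and only apply a length threshold to ranges that
are merely all-visible, where the readahead and relfrozenxid considerations
actually apply.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define DEMO_SKIP_PAGES_THRESHOLD 32    /* stand-in for the real constant */

static bool
demo_skip_range(BlockNumber range_len, bool all_frozen)
{
    /* all-frozen pages never hold back relfrozenxid, so always skip */
    if (all_frozen)
        return true;

    /* all-visible-only ranges: fall back to a length-based threshold */
    return range_len >= DEMO_SKIP_PAGES_THRESHOLD;
}

int
main(void)
{
    printf("30 all-frozen pages:  skip=%d\n", demo_skip_range(30, true));
    printf("30 all-visible pages: skip=%d\n", demo_skip_range(30, false));
    return 0;
}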
I wish somebody would tackle merging heap_page_prune() with
vacuuming. Primarily so we only do a single WAL record. But also because the
separation has caused a *lot* of complexity. I've already got more projects
than I should, otherwise I'd start on it...
That has value, but it doesn't feel as urgent.
--
Peter Geoghegan
On Sun, Feb 20, 2022 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
I think that the idea has potential, but I don't think that I
understand yet what the *exact* algorithm is.
The algorithm seems to exploit a natural tendency that Andres once
described in a blog post about his snapshot scalability work [1]. To a
surprising extent, we can usefully bucket all tuples/pages into two
simple categories:
1. Very, very old ("infinitely old" for all practical purposes).
2. Very, very new.
There doesn't seem to be much need for a third "in-between" category
in practice. This seems to be at least approximately true all of the
time.
Perhaps Andres wouldn't agree with this very general statement -- he
actually said something more specific. I for one believe that the
point he made generalizes surprisingly well, though. I have my own
theories about why this appears to be true. (Executive summary: power
laws are weird, and it seems as if the sparsity-of-effects principle
makes it easy to bucket things at the highest level, in a way that
generalizes well across disparate workloads.)
I think that this is not really a description of an algorithm -- and I
think that it is far from clear that the third "in-between" category
does not need to exist.
Remember when I got excited about how my big TPC-C benchmark run
showed a predictable, tick/tock style pattern across VACUUM operations
against the order and order lines table [2]? It seemed very
significant to me that the OldestXmin of VACUUM operation n
consistently went on to become the new relfrozenxid for the same table
in VACUUM operation n + 1. It wasn't exactly the same XID, but very
close to it (within the range of noise). This pattern was clearly
present, even though VACUUM operation n + 1 might happen as long as 4
or 5 hours after VACUUM operation n (this was a big table).
I think findings like this are very unconvincing. TPC-C (or any
benchmark really) is so simple as to be a terrible proxy for what
vacuuming is going to look like on real-world systems. Like, it's nice
that it works, and it shows that something's working, but it doesn't
demonstrate that the patch is making the right trade-offs overall.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Mar 1, 2022 at 1:46 PM Robert Haas <robertmhaas@gmail.com> wrote:
I think that this is not really a description of an algorithm -- and I
think that it is far from clear that the third "in-between" category
does not need to exist.
But I already described the algorithm. It is very simple
mechanistically -- though that in itself means very little. As I have
said multiple times now, the hard part is assessing what the
implications are. And the even harder part is making a judgement about
whether or not those implications are what we generally want.
I think findings like this are very unconvincing.
TPC-C may be unrealistic in certain ways, but it is nevertheless
vastly more realistic than pgbench. pgbench is really more of a stress
test than a benchmark.
The main reasons why TPC-C is interesting here are *very* simple, and
would likely be equally true with TPC-E (just for example) -- even
though TPC-E is a very different kind of OLTP workload
overall. TPC-C (like TPC-E) features a diversity of transaction types,
some of which are more complicated than others -- which is strictly
more realistic than having only one highly synthetic OLTP transaction
type. Each transaction type doesn't necessarily modify the same tables
in the same way. This leads to natural diversity among tables and
among transactions, including:
* The typical or average number of distinct XIDs per heap page varies
significantly from table to table. There are way fewer distinct XIDs per
"order line" table heap page than there are per "order" table heap
page, for the obvious reason.
* Roughly speaking, there are various different ways that free space
management ought to work in a system like Postgres. For example it is
necessary to make a "fragmentation vs. space utilization" trade-off
with the new orders table.
* There are joins in some of the transactions!
Maybe TPC-C is a crude approximation of reality, but it nevertheless
exercises relevant parts of the system to a significant degree. What
else would you expect me to use, for a project like this? To a
significant degree the relfrozenxid tracking stuff is interesting
because tables tend to have natural differences like the ones I have
highlighted on this thread. How could that not be the case? Why
wouldn't we want to take advantage of that?
There might be some danger in over-optimizing for this particular
benchmark, but right now that is so far from being the main problem
that the idea seems strange to me. pgbench doesn't need the FSM, at
all. In fact pgbench doesn't even really need VACUUM (except for
antiwraparound), once heap fillfactor is lowered to 95 or so. pgbench
simply isn't relevant, *at all*, except perhaps as a way of measuring
regressions in certain synthetic cases that don't benefit.
TPC-C (or any
benchmark really) is so simple as to be a terrible proxy for what
vacuuming is going to look like on real-world systems.
Doesn't that amount to "no amount of any kind of testing or
benchmarking will convince me of anything, ever"?
There is more than one type of real-world system. I think that TPC-C
is representative of some real world systems in some regards. But even
that's not the important point for me. I find TPC-C generally
interesting for one reason: I can clearly see that Postgres does
things in a way that just doesn't make much sense, which isn't
particularly fundamental to how VACUUM works.
My only long term goal is to teach Postgres to *avoid* various
pathological cases exhibited by TPC-C (e.g., the B-Tree "split after
new tuple" mechanism from commit f21668f328 *avoids* a pathological
case from TPC-C). We don't necessarily have to agree on how important
each individual case is "in the real world" (which is impossible to
know anyway). We only have to agree that what we see is a pathological
case (because some reasonable expectation is dramatically violated),
and then work out a fix.
I don't want to teach Postgres to be clever -- I want to teach it to
avoid being stupid in cases where it exhibits behavior that really
cannot be described any other way. You seem to talk about some of this
work as if it was just as likely to have a detrimental effect
elsewhere, for some equally plausible workload, which will have a
downside that is roughly as bad as the advertised upside. I consider
that very unlikely, though. Sure, regressions are quite possible, and
a real concern -- but regressions *like that* are unlikely. Avoiding
doing what is clearly the wrong thing just seems to work out that way,
in general.
--
Peter Geoghegan
On Fri, Feb 25, 2022 at 5:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
There is an important practical way in which it makes sense to treat
0001 as separate to 0002. It is true that 0001 is independently quite
useful. In practical terms, I'd be quite happy to just get 0001 into
Postgres 15, without 0002. I think that that's what you meant here, in
concrete terms, and we can agree on that now.
Attached is v10. While this does still include the freezing patch,
it's not in scope for Postgres 15. As I've said, I still think that it
makes sense to maintain the patch series with the freezing stuff,
since it's structurally related. So, to be clear, the first two
patches from the patch series are in scope for Postgres 15. But not
the third.
Highlights:
* Changes to terminology and commit messages along the lines suggested
by Andres.
* Bug fixes to heap_tuple_needs_freeze()'s MultiXact handling. My
testing strategy here still needs work.
* Expanded refactoring by v10-0002 patch.
The v10-0002 patch (which appeared for the first time in v9) was
originally all about fixing a case where non-aggressive VACUUMs were
at a gratuitous disadvantage (relative to aggressive VACUUMs) around
advancing relfrozenxid -- very much like the lazy_scan_noprune work
from commit 44fa8488. And that is still its main purpose. But the
refactoring now seems related to Andres' idea of making non-aggressive
VACUUMs decide to scan a few extra all-visible pages in order to be
able to advance relfrozenxid.
The code that sets up skipping the visibility map is made a lot
clearer by v10-0002. That patch moves a significant amount of code
from lazy_scan_heap() into a new helper routine (so it continues the
trend started by the Postgres 14 work that added lazy_scan_prune()).
Now skipping a range of visibility map pages is fundamentally based on
setting up the range up front, and then using the same saved details
about the range thereafter -- we don't have anymore ad-hoc
VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for pages from a range that we
already decided to skip (so no calls to those routines from
lazy_scan_heap(), at least not until after we finish processing in
lazy_scan_prune()).
This is more or less what we were doing all along for one special
case: aggressive VACUUMs. We had to make sure to either increment
frozenskipped_pages or increment scanned_pages for every page from
rel_pages -- this issue is described by lazy_scan_heap() comments on
HEAD that begin with "Tricky, tricky." (these date back to the freeze
map work from 2016). Anyway, there is no reason to not go further with
that: we should make whole ranges the basic unit that we deal with
when skipping. It's a lot simpler to think in terms of entire ranges
(not individual pages) that are determined to be all-visible or
all-frozen up-front, without needing to recheck anything (regardless
of whether it's an aggressive VACUUM).
We don't need to track frozenskipped_pages this way. And it's much
more obvious that it's safe for more complicated cases, in particular
for aggressive VACUUMs.
This kind of approach seems necessary to make non-aggressive VACUUMs
do a little more work opportunistically, when they realize that they
can advance relfrozenxid relatively easily that way (which I believe
Andres favors as part of overhauling freezing). That becomes a lot
more natural when you have a clear and unambiguous separation between
deciding what range of blocks to skip, and then actually skipping. I
can imagine the new helper function added by v10-0002 (which I've
called lazy_scan_skip_range()) eventually being taught to do these
kinds of tricks.
In general I think that all of the details of what to skip need to be
decided up front. The loop in lazy_scan_heap() should execute skipping
based on the instructions it receives from the new helper function, in
the simplest way possible. The helper function can become more
intelligent about the costs and benefits of skipping in the future,
without that impacting lazy_scan_heap().
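A hedged structural sketch of that division of labor (the names and fields
below are invented for illustration; the actual lazy_scan_skip_range() in
the patch may look quite different): the helper examines the visibility map
once, decides how far the caller may skip and whether the skipped range
counts as all-frozen, and the main loop just follows those instructions.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

/* Invented "instructions" produced up front by a skip-range helper */
typedef struct SkipInstructions
{
    BlockNumber next_unskippable_block; /* first block that must be scanned */
    bool        range_all_frozen;       /* whole skipped range all-frozen? */
} SkipInstructions;

/* Stand-in for the helper; the VM is modeled as two plain arrays here */
static SkipInstructions
demo_skip_range(const bool *all_visible, const bool *all_frozen,
                BlockNumber blkno, BlockNumber rel_pages)
{
    SkipInstructions instr = {blkno, true};

    while (instr.next_unskippable_block < rel_pages &&
           all_visible[instr.next_unskippable_block])
    {
        if (!all_frozen[instr.next_unskippable_block])
            instr.range_all_frozen = false;
        instr.next_unskippable_block++;
    }
    return instr;
}

int
main(void)
{
    bool        all_visible[8] = {true, true, true, false,
                                  true, true, true, true};
    bool        all_frozen[8] = {true, true, false, false,
                                 true, true, true, true};
    BlockNumber blkno = 0,
                rel_pages = 8,
                frozenskipped_pages = 0;

    while (blkno < rel_pages)
    {
        SkipInstructions instr = demo_skip_range(all_visible, all_frozen,
                                                 blkno, rel_pages);

        /* skip the whole range exactly as decided up front */
        if (instr.next_unskippable_block > blkno && instr.range_all_frozen)
            frozenskipped_pages += instr.next_unskippable_block - blkno;
        blkno = instr.next_unskippable_block;
        if (blkno < rel_pages)
            blkno++;            /* "scan" the unskippable page */
    }
    printf("frozenskipped_pages = %u\n", (unsigned) frozenskipped_pages);
    return 0;
}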
--
Peter Geoghegan
Attachments:
v10-0003-Make-page-level-characteristics-drive-freezing.patch (application/x-patch)
From 43ab00609392ed7ad31be491834bdac348e13653 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v10 3/3] Make page-level characteristics drive freezing.
Teach VACUUM to freeze all of the tuples on a page whenever it notices
that it would otherwise mark the page all-visible, without also marking
it all-frozen. VACUUM typically won't freeze _any_ tuples on the page
unless _all_ tuples (that remain after pruning) are all-visible. This
makes the overhead of vacuuming much more predictable over time. We
avoid the need for large balloon payments during aggressive VACUUMs
(typically anti-wraparound autovacuums). Freezing is proactive, so
we're much less likely to get into "freezing debt".
The new approach to freezing also enables relfrozenxid advancement in
non-aggressive VACUUMs, which might be enough to avoid aggressive
VACUUMs altogether (with many individual tables/workloads). While the
non-aggressive case continues to skip all-visible (but not all-frozen)
pages, that will no longer hinder relfrozenxid advancement (outside of
pg_upgrade scenarios). We now try to avoid leaving behind all-visible
(not all-frozen) pages. This (as well as work from commit 44fa84881f)
makes relfrozenxid advancement in non-aggressive VACUUMs commonplace.
There is also a clear disadvantage to the new approach to freezing: more
eager freezing will impose overhead on cases that don't receive any
benefit. This is considered an acceptable trade-off. The new algorithm
tends to avoid freezing early on pages where it makes the least sense,
since frequently modified pages are unlikely to be all-visible.
The system accumulates freezing debt in proportion to the number of
physical heap pages with unfrozen tuples, more or less. Anything based
on XID age is likely to be a poor proxy for the eventual cost of
freezing (during the inevitable anti-wraparound autovacuum). At a high
level, freezing is now treated as one of the costs of storing tuples in
physical heap pages -- not a cost of transactions that allocate XIDs.
Although vacuum_freeze_min_age and vacuum_multixact_freeze_min_age still
influence what we freeze, and when, they seldom have much influence in
many important cases.
It may still be necessary to "freeze a page" due to the presence of a
particularly old XID, from before VACUUM's FreezeLimit cutoff.
FreezeLimit can only trigger page-level freezing, though -- it cannot
change how freezing is actually executed. All XIDs < OldestXmin and all
MXIDs < OldestMxact will now be frozen on any page that VACUUM decides
to freeze, regardless of the details behind its decision.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam_xlog.h | 7 +-
src/backend/access/heap/heapam.c | 92 +++++++++++++++++----
src/backend/access/heap/vacuumlazy.c | 116 ++++++++++++++++++---------
src/backend/commands/vacuum.c | 8 ++
doc/src/sgml/maintenance.sgml | 9 +--
5 files changed, 172 insertions(+), 60 deletions(-)
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 2d8a7f627..2c25e72b2 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -409,10 +409,15 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relminmxid,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
+ TransactionId limit_xid,
+ MultiXactId limit_multi,
xl_heap_freeze_tuple *frz,
bool *totally_frozen,
+ bool *force_freeze,
TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out);
+ MultiXactId *relminmxid_out,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2e859e427..3454201f3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6446,14 +6446,38 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* are older than the specified cutoff XID and cutoff MultiXactId. If so,
* setup enough state (in the *frz output argument) to later execute and
* WAL-log what we would need to do, and return true. Return false if nothing
- * is to be changed. In addition, set *totally_frozen to true if the tuple
+ * can be changed. In addition, set *totally_frozen to true if the tuple
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Although this interface is primarily tuple-based, vacuumlazy.c caller
+ * cooperates with us to decide on whether or not to freeze whole pages,
+ * together as a single group. We prepare for freezing at the level of each
+ * tuple, but the final decision is made for the page as a whole. All pages
+ * that are frozen within a given VACUUM operation are frozen according to
+ * cutoff_xid and cutoff_multi. Caller _must_ freeze the whole page when
+ * we've set *force_freeze to true!
+ *
+ * cutoff_xid must be caller's oldest xmin to ensure that any XID older than
+ * it could neither be running nor seen as running by any open transaction.
+ * This ensures that the replacement will not change anyone's idea of the
+ * tuple state. Similarly, cutoff_multi must be the smallest MultiXactId used
+ * by any open transaction (at the time that the oldest xmin was acquired).
+ *
+ * limit_xid must be <= cutoff_xid, and limit_multi must be <= cutoff_multi.
+ * When any XID/XMID from before these secondary cutoffs are encountered, we
+ * set *force_freeze to true, making caller freeze the page (freezing-eligible
+ * XIDs/XMIDs will be frozen, at least). Forcing freezing like this ensures
+ * that VACUUM won't allow XIDs/XMIDs to ever get too old. This shouldn't be
+ * necessary very often. VACUUM should prefer to freeze when it's cheap (not
+ * when it's urgent).
+ *
* Maintains *relfrozenxid_out and *relminmxid_out, which are the current
- * target relfrozenxid and relminmxid for the relation. Caller should make
- * temp copies of global tracking variables before starting to process a page,
- * so that we can only scribble on copies.
+ * target relfrozenxid and relminmxid for the relation. There are also "no
+ * freeze" variants (*relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out)
+ * that are used by caller when it decides to not freeze the page. Caller
+ * should make temp copies of global tracking variables before starting to
+ * process a page, so that we can only scribble on copies.
*
* Caller is responsible for setting the offset field, if appropriate.
*
@@ -6461,13 +6485,6 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
*
- * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
- * XID older than it could neither be running nor seen as running by any
- * open transaction. This ensures that the replacement will not change
- * anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be less than or equal to the smallest
- * MultiXactId used by any transaction currently open.
- *
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
@@ -6479,11 +6496,16 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ TransactionId limit_xid, MultiXactId limit_multi,
+ xl_heap_freeze_tuple *frz,
+ bool *totally_frozen, bool *force_freeze,
TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out)
+ MultiXactId *relminmxid_out,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out)
{
bool changed = false;
+ bool xmin_already_frozen = false;
bool xmax_already_frozen = false;
bool xmin_frozen;
bool freeze_xmax;
@@ -6504,7 +6526,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
*/
xid = HeapTupleHeaderGetXmin(tuple);
if (!TransactionIdIsNormal(xid))
+ {
+ xmin_already_frozen = true;
xmin_frozen = true;
+ }
else
{
if (TransactionIdPrecedes(xid, relfrozenxid))
@@ -6534,7 +6559,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* resolve a MultiXactId to its member Xids, in case some of them are
* below the given cutoff for Xids. In that case, those values might need
* freezing, too. Also, if a multi needs freezing, we cannot simply take
- * it out --- if there's a live updater Xid, it needs to be kept.
+ * it out --- if there's a live updater Xid, it needs to be kept. If we
+ * need to allocate a new MultiXact for that purposes, we will force
+ * caller to freeze the page.
*
* Make sure to keep heap_tuple_needs_freeze in sync with this.
*/
@@ -6580,6 +6607,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
Assert(TransactionIdIsValid(newxmax));
if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
*relfrozenxid_out = newxmax;
+
+ /*
+ * We have an opportunity to get rid of this MultiXact now, so
+ * force freezing to avoid wasting it
+ */
+ *force_freeze = true;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6616,6 +6649,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
Assert(TransactionIdPrecedesOrEquals(xmax_oldest_xid_out,
*relfrozenxid_out));
*relfrozenxid_out = xmax_oldest_xid_out;
+
+ /*
+ * We allocated a MultiXact for this, so force freezing to avoid
+ * wasting it
+ */
+ *force_freeze = true;
}
else if (flags & FRM_NOOP)
{
@@ -6734,11 +6773,27 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
+
+ /* Seems like a good idea to freeze early when this case is hit */
+ *force_freeze = true;
}
}
*totally_frozen = (xmin_frozen &&
(freeze_xmax || xmax_already_frozen));
+
+ /*
+ * Maintain alternative versions of relfrozenxid_out/relminmxid_out that
+ * leave caller with the option of *not* freezing the page. If caller has
+ * already lost that option (e.g. when the page has an old XID that we
+ * must force caller to freeze), then we don't waste time on this.
+ */
+ if (!*force_freeze && (!xmin_already_frozen || !xmax_already_frozen))
+ *force_freeze = heap_tuple_needs_freeze(tuple,
+ limit_xid, limit_multi,
+ relfrozenxid_nofreeze_out,
+ relminmxid_nofreeze_out);
+
return changed;
}
@@ -6790,15 +6845,22 @@ heap_freeze_tuple(HeapTupleHeader tuple,
{
xl_heap_freeze_tuple frz;
bool do_freeze;
+ bool force_freeze = true;
bool tuple_totally_frozen;
TransactionId relfrozenxid_out = cutoff_xid;
MultiXactId relminmxid_out = cutoff_multi;
+ TransactionId relfrozenxid_nofreeze_out = cutoff_xid;
+ MultiXactId relminmxid_nofreeze_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
+ cutoff_xid, cutoff_multi,
&frz, &tuple_totally_frozen,
- &relfrozenxid_out, &relminmxid_out);
+ &force_freeze,
+ &relfrozenxid_out, &relminmxid_out,
+ &relfrozenxid_nofreeze_out,
+ &relminmxid_nofreeze_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3bc75d401..7e2d03ba6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -169,8 +169,9 @@ typedef struct LVRelState
/* VACUUM operation's cutoffs for freezing and pruning */
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
GlobalVisState *vistest;
- /* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
+ /* Limits on the age of the oldest unfrozen XID and MXID */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
@@ -199,6 +200,7 @@ typedef struct LVRelState
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
+ BlockNumber newly_frozen_pages; /* # pages frozen by lazy_scan_prune */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
@@ -477,6 +479,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize page counters explicitly (be tidy) */
vacrel->scanned_pages = 0;
vacrel->removed_pages = 0;
+ vacrel->newly_frozen_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
vacrel->nonempty_pages = 0;
@@ -514,10 +517,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
vacrel->OldestXmin = OldestXmin;
+ vacrel->OldestMxact = OldestMxact;
vacrel->vistest = GlobalVisTestFor(rel);
- /* FreezeLimit controls XID freezing (always <= OldestXmin) */
+ /* FreezeLimit limits unfrozen XID age (always <= OldestXmin) */
vacrel->FreezeLimit = FreezeLimit;
- /* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
+ /* MultiXactCutoff limits unfrozen MXID age (always <= OldestMxact) */
vacrel->MultiXactCutoff = MultiXactCutoff;
/* Initialize state used to track oldest extant XID/XMID */
vacrel->NewRelfrozenXid = OldestXmin;
@@ -583,7 +587,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
if (vacrel->skippedallvis)
{
- /* Cannot advance relfrozenxid/relminmxid */
+ /*
+ * Skipped some all-visible pages, so definitely cannot advance
+ * relfrozenxid. This is generally only expected in pg_upgrade
+ * scenarios, since VACUUM now avoids setting a page to all-visible
+ * but not all-frozen. However, it's also possible (though quite
+ * unlikely) that we ended up here because somebody else cleared some
+ * page's all-frozen flag (without clearing its all-visible flag).
+ */
Assert(!aggressive);
frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
@@ -685,9 +696,10 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->relnamespace,
vacrel->relname,
vacrel->num_index_scans);
- appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
+ appendStringInfo(&buf, _("pages: %u removed, %u remain, %u frozen, %u scanned (%.2f%% of total)\n"),
vacrel->removed_pages,
vacrel->rel_pages,
+ vacrel->newly_frozen_pages,
vacrel->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
100.0 * vacrel->scanned_pages / orig_rel_pages);
@@ -1613,8 +1625,11 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
- TransactionId NewRelfrozenXid;
- MultiXactId NewRelminMxid;
+ bool force_freeze = false;
+ TransactionId NewRelfrozenXid,
+ NoFreezeNewRelfrozenXid;
+ MultiXactId NewRelminMxid,
+ NoFreezeNewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
@@ -1625,8 +1640,8 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
/* Initialize (or reset) page-level state */
- NewRelfrozenXid = vacrel->NewRelfrozenXid;
- NewRelminMxid = vacrel->NewRelminMxid;
+ NewRelfrozenXid = NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1679,27 +1694,23 @@ retry:
continue;
}
- /*
- * LP_DEAD items are processed outside of the loop.
- *
- * Note that we deliberately don't set hastup=true in the case of an
- * LP_DEAD item here, which is not how count_nondeletable_pages() does
- * it -- it only considers pages empty/truncatable when they have no
- * items at all (except LP_UNUSED items).
- *
- * Our assumption is that any LP_DEAD items we encounter here will
- * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
- * call count_nondeletable_pages(). In any case our opinion of
- * whether or not a page 'hastup' (which is how our caller sets its
- * vacrel->nonempty_pages value) is inherently race-prone. It must be
- * treated as advisory/unreliable, so we might as well be slightly
- * optimistic.
- */
if (ItemIdIsDead(itemid))
{
+ /*
+ * Delay unsetting all_visible until after we have decided on
+ * whether this page should be frozen. We need to test "is this
+ * page all_visible, assuming any LP_DEAD items are set LP_UNUSED
+ * in final heap pass?" to reach a decision. all_visible will be
+ * unset before we return, as required by lazy_scan_heap caller.
+ *
+ * Deliberately don't set hastup for LP_DEAD items. We make the
+ * soft assumption that any LP_DEAD items encountered here will
+ * become LP_UNUSED later on, before count_nondeletable_pages is
+ * reached. Whether the page 'hastup' is inherently race-prone.
+ * It must be treated as unreliable by caller anyway, so we might
+ * as well be slightly optimistic about it.
+ */
deadoffsets[lpdead_items++] = offnum;
- prunestate->all_visible = false;
- prunestate->has_lpdead_items = true;
continue;
}
@@ -1831,11 +1842,15 @@ retry:
if (heap_prepare_freeze_tuple(tuple.t_data,
vacrel->relfrozenxid,
vacrel->relminmxid,
+ vacrel->OldestXmin,
+ vacrel->OldestMxact,
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen,
- &NewRelfrozenXid, &NewRelminMxid))
+ &tuple_totally_frozen, &force_freeze,
+ &NewRelfrozenXid, &NewRelminMxid,
+ &NoFreezeNewRelfrozenXid,
+ &NoFreezeNewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1856,9 +1871,32 @@ retry:
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
+ *
+ * Freeze the page when it is about to become all-visible (either just
+ * after we return control to lazy_scan_heap, or later on, during the
+ * final heap pass). Also freeze when heap_prepare_freeze_tuple forces us
+ * to freeze (this is mandatory). Freezing is typically forced because
+ * there is at least one XID/XMID from before FreezeLimit/MultiXactCutoff.
*/
- vacrel->NewRelfrozenXid = NewRelfrozenXid;
- vacrel->NewRelminMxid = NewRelminMxid;
+ if (prunestate->all_visible || force_freeze)
+ {
+ /*
+ * We're freezing the page. Our final NewRelfrozenXid doesn't need to
+ * be affected by the XIDs/XMIDs that are just about to be frozen
+ * anyway.
+ */
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
+ }
+ else
+ {
+ /* This is comparable to lazy_scan_noprune's handling */
+ vacrel->NewRelfrozenXid = NoFreezeNewRelfrozenXid;
+ vacrel->NewRelminMxid = NoFreezeNewRelminMxid;
+
+ /* Forget heap_prepare_freeze_tuple's guidance on freezing */
+ nfrozen = 0;
+ }
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1866,7 +1904,7 @@ retry:
*/
if (nfrozen > 0)
{
- Assert(prunestate->hastup);
+ vacrel->newly_frozen_pages++;
/*
* At least one tuple with storage needs to be frozen -- execute that
@@ -1892,11 +1930,11 @@ retry:
}
/* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(vacrel->rel))
+ if (RelationNeedsWAL(rel))
{
XLogRecPtr recptr;
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
+ recptr = log_heap_freeze(rel, buf, NewRelfrozenXid,
frozen, nfrozen);
PageSetLSN(page, recptr);
}
@@ -1919,7 +1957,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible)
+ if (prunestate->all_visible && lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1927,7 +1965,6 @@ retry:
if (!heap_page_is_all_visible(vacrel, buf, &cutoff, &all_frozen))
Assert(false);
- Assert(lpdead_items == 0);
Assert(prunestate->all_frozen == all_frozen);
/*
@@ -1949,9 +1986,6 @@ retry:
VacDeadItems *dead_items = vacrel->dead_items;
ItemPointerData tmp;
- Assert(!prunestate->all_visible);
- Assert(prunestate->has_lpdead_items);
-
vacrel->lpdead_item_pages++;
ItemPointerSetBlockNumber(&tmp, blkno);
@@ -1965,6 +1999,10 @@ retry:
Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
dead_items->num_items);
+
+ /* lazy_scan_heap caller expects LP_DEAD item to unset all_visible */
+ prunestate->has_lpdead_items = true;
+ prunestate->all_visible = false;
}
/* Finally, add page-local counts to whole-VACUUM counts */
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 0ae3b4506..f1ea50454 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -957,6 +957,14 @@ get_all_vacuum_rels(int options)
* FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
* minimum).
*
+ * While non-aggressive VACUUMs are never required to advance relfrozenxid and
+ * relminmxid, they often do so in practice. They freeze wherever possible,
+ * based on the same criteria that aggressive VACUUMs use. FreezeLimit and
+ * multiXactCutoff still force freezing of older XIDs/XMIDs that did not get
+ * frozen based on the standard criteria, though. (Actually, these cutoffs
+ * won't force non-aggressive VACUUMs to freeze pages that cannot be cleanup
+ * locked without waiting.)
+ *
* oldestXmin and oldestMxact are the most recent values that can ever be
* passed to vac_update_relstats() as frozenxid and minmulti arguments by our
* vacuumlazy.c caller later on. These values should be passed when it turns
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 6a02d0fa8..4d585a265 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -565,11 +565,10 @@
the <structfield>relfrozenxid</structfield> column of a table's
<structname>pg_class</structname> row contains the oldest
remaining XID at the end of the most recent <command>VACUUM</command>
- that successfully advanced <structfield>relfrozenxid</structfield>
- (typically the most recent aggressive VACUUM). All rows inserted
- by transactions with XIDs older than this cutoff XID are
- guaranteed to have been frozen. Similarly,
- the <structfield>datfrozenxid</structfield> column of a database's
+ that successfully advanced <structfield>relfrozenxid</structfield>.
+ All rows inserted by transactions with XIDs older than this cutoff
+ XID are guaranteed to have been frozen. Similarly, the
+ <structfield>datfrozenxid</structfield> column of a database's
<structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
appearing in that database — it is just the minimum of the
per-table <structfield>relfrozenxid</structfield> values within the database.
--
2.30.2
v10-0001-Loosen-coupling-between-relfrozenxid-and-freezin.patch (application/x-patch)
From 19edc49f9a0f7efa5b8518285dafac620b7b8e72 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v10 1/3] Loosen coupling between relfrozenxid and freezing.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit. There is no fixed
relationship between the amount of physical work performed by VACUUM to
make it safe to advance relfrozenxid (freezing and pruning), and the
actual number of XIDs that relfrozenxid can be advanced by (at least in
principle) as a result. VACUUM might have to freeze all of the tuples
from a hundred million heap pages just to enable relfrozenxid to be
advanced by no more than one or two XIDs. On the other hand, VACUUM
might end up doing little or no work, and yet still be capable of
advancing relfrozenxid by hundreds of millions of XIDs as a result.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
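To illustrate the tracking idea in isolation, here is a minimal standalone C
sketch (not code from the patch -- the TxId type, the simplified circular
comparison, and the sample values are all made up for illustration). The
running value starts at OldestXmin and is only ever ratcheted backwards when
an older unfrozen XID is encountered:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TxId;

/* circular XID comparison, in the style of TransactionIdPrecedes() */
static int
xid_precedes(TxId a, TxId b)
{
    return (int32_t) (a - b) < 0;
}

int
main(void)
{
    TxId    oldest_xmin = 1000;     /* upper bound on the final value */
    TxId    new_relfrozenxid = oldest_xmin;
    TxId    unfrozen_xids[] = {990, 995, 987, 998};     /* XIDs left unfrozen */

    for (int i = 0; i < 4; i++)
    {
        /* ratchet the running minimum back whenever an older XID is seen */
        if (xid_precedes(unfrozen_xids[i], new_relfrozenxid))
            new_relfrozenxid = unfrozen_xids[i];
    }

    /* prints 987: the exact oldest extant XID, not a FreezeLimit-style cutoff */
    printf("new relfrozenxid: %u\n", (unsigned) new_relfrozenxid);
    return 0;
}

The patch does the analogous thing for MultiXactIds (and for the XID members
of any MultiXact that is carried forward) via the relminmxid_out tracking.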
Later work targeting PostgreSQL 16 will teach VACUUM to determine what
to freeze based on page-level characteristics (not XID/XMID based
cutoffs). But setting relfrozenxid/relminmxid to the exact oldest
extant XID/MXID is independently useful work. For example, it is
helpful with larger databases that consume many MultiXacts. If we
assume that the largest tables don't ever need to allocate any
MultiXacts, then aggressive VACUUMs targeting those tables will now
advance relminmxid right up to OldestMxact. pg_class.relminmxid becomes
a much more precise indicator of what's really going on in each table,
making autovacuums to prevent wraparound (MultiXactId wraparound) occur
less frequently.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM -- FreezeLimit still acts as a lower bound on the final value
that aggressive VACUUM can set relfrozenxid to. Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself. In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.
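Put differently, the acceptable final values form a simple range. The
following toy sketch (illustrative names and values only, not the patch's
code) spells out that invariant as assertions:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TxId;

static bool
xid_precedes(TxId a, TxId b)
{
    return (int32_t) (a - b) < 0;
}

static bool
xid_precedes_or_equals(TxId a, TxId b)
{
    return (int32_t) (a - b) <= 0;
}

static void
check_new_relfrozenxid(TxId old_relfrozenxid, TxId freeze_limit,
                       TxId oldest_xmin, TxId candidate, bool aggressive)
{
    /* every VACUUM: must move forward, and can never go past OldestXmin */
    assert(xid_precedes(old_relfrozenxid, candidate));
    assert(xid_precedes_or_equals(candidate, oldest_xmin));

    /* aggressive VACUUM: FreezeLimit is additionally a hard lower bound */
    if (aggressive)
        assert(xid_precedes_or_equals(freeze_limit, candidate));
}

int
main(void)
{
    check_new_relfrozenxid(500, 900, 1000, 950, true);  /* ok: >= FreezeLimit */
    check_new_relfrozenxid(500, 900, 1000, 850, false); /* ok: older than FreezeLimit */
    return 0;
}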
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 7 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 247 +++++++++++++++++++++------
src/backend/access/heap/vacuumlazy.c | 119 +++++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 42 +++--
doc/src/sgml/maintenance.sgml | 30 +++-
8 files changed, 344 insertions(+), 111 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b46ab7d73..6ef3c02bb 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,11 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi);
+extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple,
+ TransactionId limit_xid,
+ MultiXactId limit_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..2d8a7f627 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3746336a0..2e859e427 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6128,7 +6128,12 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* NB -- this might have the side-effect of creating a new MultiXactId!
*
* "flags" is an output value; it's used to tell caller what to do on return.
- * Possible flags are:
+ *
+ * "xmax_oldest_xid_out" is an output value; we must handle the details of
+ * tracking the oldest extant XID within Multixacts. This is part of how
+ * caller tracks relfrozenxid_out (the oldest extant XID) on behalf of VACUUM.
+ *
+ * Possible values that we can set in "flags":
* FRM_NOOP
* don't do anything -- keep existing Xmax
* FRM_INVALIDATE_XMAX
@@ -6140,12 +6145,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * Final *xmax_oldest_xid_out value should be ignored completely unless
+ * "flags" contains either FRM_NOOP or FRM_RETURN_IS_MULTI. Final value is
+ * drawn from oldest extant XID that will remain in some MultiXact (old or
+ * new) after xmax is frozen (XIDs that won't remain after freezing are
+ * ignored, per convention).
+ *
+ * Note in particular that caller must deal with FRM_RETURN_IS_XID case
+ * itself, by considering returned Xid (not using *xmax_oldest_xid_out).
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *xmax_oldest_xid_out)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6157,6 +6171,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId temp_xid_out;
*flags = 0;
@@ -6251,13 +6266,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ temp_xid_out = *xmax_oldest_xid_out; /* initialize temp_xid_out */
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
}
/*
@@ -6266,6 +6281,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *xmax_oldest_xid_out = temp_xid_out;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6275,6 +6291,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ temp_xid_out = *xmax_oldest_xid_out; /* reset temp_xid_out */
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6356,7 +6373,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
+ }
}
else
{
@@ -6366,6 +6387,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *xmax_oldest_xid_out));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6403,6 +6425,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+
+ /*
+ * Return oldest remaining XID in new multixact if it's older than
+ * caller's original xmax_oldest_xid_out (otherwise it's just the
+ * original xmax_oldest_xid_out value from caller)
+ */
+ *xmax_oldest_xid_out = temp_xid_out;
}
pfree(newmembers);
@@ -6421,6 +6450,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Maintains *relfrozenxid_out and *relminmxid_out, which are the current
+ * target relfrozenxid and relminmxid for the relation. Caller should make
+ * temp copies of global tracking variables before starting to process a page,
+ * so that we can only scribble on copies.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6445,7 +6479,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6489,6 +6525,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
}
/*
@@ -6506,16 +6544,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId xmax_oldest_xid_out = *relfrozenxid_out;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi,
+ &flags, &xmax_oldest_xid_out);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
if (flags & FRM_RETURN_IS_XID)
{
/*
+ * xmax will become an updater XID (an XID from the original
+ * MultiXact's XIDs that needs to be carried forward).
+ *
* NB -- some of these transformations are only valid because we
* know the return Xid is a tuple updater (i.e. not merely a
* locker.) Also note that the only reason we don't explicitly
@@ -6527,6 +6570,16 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+ Assert(freeze_xmax);
+
+ /*
+ * Only consider newxmax Xid to track relfrozenxid_out here, since
+ * any other XIDs from the old MultiXact won't be left behind once
+ * xmax is actually frozen.
+ */
+ Assert(TransactionIdIsValid(newxmax));
+ if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
+ *relfrozenxid_out = newxmax;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6534,6 +6587,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
uint16 newbits2;
/*
+ * xmax was an old MultiXactId which we have to replace with a new
+ * Multixact, that carries forward a subset of the XIDs from the
+ * original (those that we'll still need).
+ *
* We can't use GetMultiXactIdHintBits directly on the new multi
* here; that routine initializes the masks to all zeroes, which
* would lose other bits we need. Doing it this way ensures all
@@ -6548,6 +6605,37 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+ Assert(!freeze_xmax);
+
+ /*
+ * FreezeMultiXactId sets xmax_oldest_xid_out to any XID that it
+ * notices is older than initial relfrozenxid_out, unless the XID
+ * won't remain after freezing
+ */
+ Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+ Assert(TransactionIdPrecedesOrEquals(xmax_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = xmax_oldest_xid_out;
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * xmax is a MultiXactId, and nothing about it changes for now.
+ *
+ * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+ * both together. FreezeMultiXactId sets xmax_oldest_xid_out to
+ * any XID that it notices is older than initial relfrozenxid_out,
+ * unless the XID won't remain after freezing (or in this case
+ * after _not_ freezing).
+ */
+ Assert(MultiXactIdIsValid(xid));
+ Assert(!changed && !freeze_xmax);
+
+ if (MultiXactIdPrecedes(xid, *relminmxid_out))
+ *relminmxid_out = xid;
+ Assert(TransactionIdPrecedesOrEquals(xmax_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = xmax_oldest_xid_out;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6575,7 +6663,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6699,11 +6791,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId relfrozenxid_out = cutoff_xid;
+ MultiXactId relminmxid_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &relfrozenxid_out, &relminmxid_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7133,24 +7228,57 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * See heap_prepare_freeze_tuple for information about the basic rules for the
+ * cutoffs used here.
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
+ * The *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out arguments are
+ * input/output arguments that work just like heap_prepare_freeze_tuple's
+ * *relfrozenxid_out and *relminmxid_out input/output arguments. However,
+ * there is one important difference: we track the oldest extant XID and XMID
+ * while making a working assumption that no freezing will actually take
+ * place. On the other hand, heap_prepare_freeze_tuple assumes that freezing
+ * will take place (based on the specific instructions it also sets up for
+ * caller's tuple).
+ *
+ * Note, in particular, that we even assume that freezing won't go ahead for a
+ * tuple that we indicate "needs freezing" (by returning true). Not all
+ * callers will be okay with that. Caller should make temp copies of global
+ * tracking variables before starting to process a page, so that we only ever
+ * scribble on copies. That way caller can just discard the temp copies if it
+ * really needs to freeze (using heap_prepare_freeze_tuple interface). In
+ * practice aggressive VACUUM callers always do this and non-aggressive VACUUM
+ * callers always just accept an older final relfrozenxid value.
+ *
* NB: Cannot rely on hint bits here, they might not be set after a crash or
* on a standby.
*/
bool
-heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_tuple_needs_freeze(HeapTupleHeader tuple,
+ TransactionId limit_xid, MultiXactId limit_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out)
{
TransactionId xid;
-
- xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ bool needs_freeze = false;
/*
+ * First deal with xmin.
+ */
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ }
+
+ /*
+ * Now deal with xmax.
+ *
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
@@ -7158,57 +7286,80 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
if (!MultiXactIdIsValid(multi))
{
- /* no xmax set, ignore */
- ;
+ /* no xmax set -- but xmin might still need freezing */
+ return needs_freeze;
}
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
- return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+
+ /*
+ * Might have to ratchet back relminmxid_nofreeze_out, which we assume
+ * won't be frozen by caller (even when we return true)
+ */
+ if (MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
+ *relminmxid_nofreeze_out = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ /*
+ * pg_upgrade'd MultiXact doesn't need to have its XID members
+ * affect caller's relfrozenxid_nofreeze_out (just freeze it)
+ */
+ return true;
}
+ else if (MultiXactIdPrecedes(multi, limit_multi))
+ needs_freeze = true;
+
+ /*
+ * Need to check whether any member of the mxact is too old to
+ * determine if MultiXact needs to be frozen now. We even access the
+ * members when we know that the MultiXactId isn't eligible for
+ * freezing now -- we must still maintain relfrozenxid_nofreeze_out.
+ */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
+ {
+ xid = members[i].xid;
+
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ }
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 87ab7775a..9f5178e0a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,7 +144,7 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
- /* Aggressive VACUUM (scan all unfrozen pages)? */
+ /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
bool skipwithvm;
@@ -173,8 +173,9 @@ typedef struct LVRelState
/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
/* Error reporting state */
char *relnamespace;
@@ -328,6 +329,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -354,17 +356,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* used to determine which XIDs/MultiXactIds will be frozen.
*
* If this is an aggressive VACUUM, then we're strictly required to freeze
- * any and all XIDs from before FreezeLimit, so that we will be able to
- * safely advance relfrozenxid up to FreezeLimit below (we must be able to
- * advance relminmxid up to MultiXactCutoff, too).
+ * any and all XIDs from before FreezeLimit in order to be able to advance
+ * relfrozenxid to a value >= FreezeLimit below. There is an analogous
+ * requirement around MultiXact freezing, relminmxid, and MultiXactCutoff.
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -511,10 +513,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->vistest = GlobalVisTestFor(rel);
/* FreezeLimit controls XID freezing (always <= OldestXmin) */
vacrel->FreezeLimit = FreezeLimit;
- /* MultiXactCutoff controls MXID freezing */
+ /* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+ /* Initialize state used to track oldest extant XID/XMID */
+ vacrel->NewRelfrozenXid = OldestXmin;
+ vacrel->NewRelminMxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -568,12 +571,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might advance relfrozenxid
+ * to an XID that is either older or newer than FreezeLimit (same applies
+ * to relminmxid and MultiXactCutoff).
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
/* Cannot advance relfrozenxid/relminmxid */
Assert(!aggressive);
@@ -587,9 +589,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
orig_rel_pages);
+ Assert(!aggressive ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenXid));
+ Assert(!aggressive ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminMxid));
+
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -694,17 +703,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminMxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -896,8 +907,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* find them. But even when aggressive *is* set, it's still OK if we miss
* a page whose all-frozen marking has just been cleared. Any new XIDs
* just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they'll have no effect on the value to which we can safely set
- * relfrozenxid. A similar argument applies for MXIDs and relminmxid.
+ * they cannot invalidate NewRelfrozenXid tracking. A similar argument
+ * applies for NewRelminMxid tracking and OldestMxact.
*/
next_unskippable_block = 0;
if (vacrel->skipwithvm)
@@ -1584,6 +1595,8 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
@@ -1593,7 +1606,9 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
- /* Initialize (or reset) page-level counters */
+ /* Initialize (or reset) page-level state */
+ NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1801,7 +1816,8 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenXid, &NewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1815,13 +1831,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1972,6 +1991,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+ MultiXactId NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2017,20 +2038,40 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff))
+ vacrel->MultiXactCutoff,
+ &NoFreezeNewRelfrozenXid,
+ &NoFreezeNewRelminMxid))
{
if (vacrel->aggressive)
{
- /* Going to have to get cleanup lock for lazy_scan_prune */
+ /*
+ * heap_tuple_needs_freeze determined that it isn't going to
+ * be possible for the ongoing aggressive VACUUM operation to
+ * advance relfrozenxid to a value >= FreezeLimit without
+ * freezing one or more tuples with older XIDs from this page.
+ * (Or perhaps the issue was that MultiXactCutoff could not be
+ * respected. Might have even been both cutoffs, together.)
+ *
+ * Tell caller that it must acquire a full cleanup lock. It's
+ * possible that caller will have to wait a while for one, but
+ * that can't be helped -- full processing by lazy_scan_prune
+ * is required to freeze the older XIDs (and/or freeze older
+ * MultiXactIds).
+ */
vacrel->offnum = InvalidOffsetNumber;
return false;
}
-
- /*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
- */
- vacrel->freeze_cutoffs_valid = false;
+ else
+ {
+ /*
+ * This is a non-aggressive VACUUM, which is under no strict
+ * obligation to advance relfrozenxid at all (much less to
+ * advance it to a value >= FreezeLimit). Non-aggressive
+ * VACUUM advances relfrozenxid/relminmxid on a best-effort
+ * basis. Accept an older final relfrozenxid/relminmxid value
+ * rather than waiting for a cleanup lock.
+ */
+ }
}
ItemPointerSet(&(tuple.t_self), blkno, offnum);
@@ -2079,6 +2120,16 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * By here we know for sure that caller can tolerate having reduced
+ * processing for this particular page. Before we return to report
+ * success, update vacrel with details of how we processed the page.
+ * (lazy_scan_prune expects a clean slate, so we have to delay these steps
+ * until here.)
+ */
+ vacrel->NewRelfrozenXid = NoFreezeNewRelfrozenXid;
+ vacrel->NewRelminMxid = NoFreezeNewRelminMxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 50a4a612e..0ae3b4506 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,22 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact are the most recent values that can ever be
+ * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
+ * vacuumlazy.c caller later on. These values should be passed when it turns
+ * out that VACUUM will leave no unfrozen XIDs/XMIDs behind in the table.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +969,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +978,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1073,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1092,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
@@ -1390,14 +1400,10 @@ vac_update_relstats(Relation relation,
* Update relfrozenxid, unless caller passed InvalidTransactionId
* indicating it has no new data.
*
- * Ordinarily, we don't let relfrozenxid go backwards: if things are
- * working correctly, the only way the new frozenxid could be older would
- * be if a previous VACUUM was done with a tighter freeze_min_age, in
- * which case we don't want to forget the work it already did. However,
- * if the stored relfrozenxid is "in the future", then it must be corrupt
- * and it seems best to overwrite it with the cutoff we used this time.
- * This should match vac_update_datfrozenxid() concerning what we consider
- * to be "in the future".
+ * Ordinarily, we don't let relfrozenxid go backwards. However, if the
+ * stored relfrozenxid is "in the future", then it must be corrupt, so
+ * just overwrite it. This should match vac_update_datfrozenxid()
+ * concerning what we consider to be "in the future".
*/
if (frozenxid_updated)
*frozenxid_updated = false;
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 36f975b1e..6a02d0fa8 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -563,9 +563,11 @@
statistics in the system tables <structname>pg_class</structname> and
<structname>pg_database</structname>. In particular,
the <structfield>relfrozenxid</structfield> column of a table's
- <structname>pg_class</structname> row contains the freeze cutoff XID that was used
- by the last aggressive <command>VACUUM</command> for that table. All rows
- inserted by transactions with XIDs older than this cutoff XID are
+ <structname>pg_class</structname> row contains the oldest
+ remaining XID at the end of the most recent <command>VACUUM</command>
+ that successfully advanced <structfield>relfrozenxid</structfield>
+ (typically the most recent aggressive VACUUM). All rows inserted
+ by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</structfield> column of a database's
<structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
@@ -588,6 +590,17 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
cutoff XID to the current transaction's XID.
</para>
+ <tip>
+ <para>
+ <literal>VACUUM VERBOSE</literal> outputs information about
+ <structfield>relfrozenxid</structfield> and/or
+ <structfield>relminmxid</structfield> when either field was
+ advanced. The same details appear in the server log when <xref
+ linkend="guc-log-autovacuum-min-duration"/> reports on vacuuming
+ by autovacuum.
+ </para>
+ </tip>
+
<para>
<command>VACUUM</command> normally only scans pages that have been modified
since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
@@ -602,7 +615,11 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
set <literal>age(relfrozenxid)</literal> to a value just a little more than the
<varname>vacuum_freeze_min_age</varname> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</command> started). If no <structfield>relfrozenxid</structfield>-advancing
+ <command>VACUUM</command> started). <command>VACUUM</command>
+ will set <structfield>relfrozenxid</structfield> to the oldest XID
+ that remains in the table, so it's possible that the final value
+ will be much more recent than strictly required.
+ If no <structfield>relfrozenxid</structfield>-advancing
<command>VACUUM</command> is issued on the table until
<varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
be forced for the table.
@@ -689,8 +706,9 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- Aggressive <command>VACUUM</command> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Aggressive <command>VACUUM</command> scans, regardless of what
+ causes them, are <emphasis>guaranteed</emphasis> to be able to
+ advance the table's <structfield>relminmxid</structfield>.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
--
2.30.2
Attachment: v10-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch (application/x-patch)
From 134bd550bd7cb8c182fe3a28789705be5bf8785a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v10 2/3] Generalize how VACUUM skips all-frozen pages.
Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid before now. The
underlying issue was that lazy_scan_heap conditioned its skipping
behavior on whether or not the current VACUUM was aggressive. VACUUM
could fail to increment its frozenskipped_pages counter as a result, and
so could miss out on advancing relfrozenxid for no good reason. The
approach taken during aggressive VACUUMs avoided the problem, but that
only worked in the aggressive case.
Fix the issue by generalizing how we skip all-frozen pages: remember
whether a range of skippable pages consists only of all-frozen pages as
we're initially establishing the range of skippable pages. If we decide
to skip the range of pages, and if the range as a whole is not an
all-frozen range, remember that fact for later (this makes it unsafe to
advance relfrozenxid). We no longer need to recheck any pages using the
visibility map. We no longer directly track frozenskipped_pages at all.
And we no longer need ad-hoc VM_ALL_VISIBLE()/VM_ALL_FROZEN() calls for
pages from a range of blocks that we already decided were safe to skip.
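To show the shape of the new approach under simplified assumptions, here is a
small standalone sketch. The visibility map is stood in for by a plain array
and the flag bits are made up; the point is only that the range's all-frozen
status is decided once, while the range is being built, so no page ever needs
to be rechecked later:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ALL_VISIBLE 0x01        /* made-up stand-ins for the VM bits */
#define ALL_FROZEN  0x02

int
main(void)
{
    uint8_t vm[] = {ALL_VISIBLE | ALL_FROZEN, ALL_VISIBLE,
                    ALL_VISIBLE | ALL_FROZEN, 0};
    int     rel_pages = 4;
    bool    aggressive = false;
    bool    range_all_frozen = true;
    int     next_unskippable = 0;

    while (next_unskippable < rel_pages)
    {
        uint8_t bits = vm[next_unskippable];

        if ((bits & ALL_VISIBLE) == 0)
            break;              /* can't skip this page: must scan it */

        if ((bits & ALL_FROZEN) == 0)
        {
            if (aggressive)
                break;          /* aggressive VACUUM must scan it too */
            /* remembered here, once -- no visibility map recheck later */
            range_all_frozen = false;
        }
        next_unskippable++;
    }

    /* skipping a not-all-frozen range makes relfrozenxid advancement unsafe */
    printf("next unskippable block: %d, range all-frozen: %d\n",
           next_unskippable, (int) range_all_frozen);
    return 0;
}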
The issue is subtle. Before now, the non-aggressive case always had to
recheck the visibility map at the point of actually skipping each page.
This created a window for some other session to concurrently unset the
same heap page's bit in the visibility map. If the bit was unset at
exactly the wrong time, then the non-aggressive case would
conservatively conclude that the page was _never_ all-frozen on recheck.
And so frozenskipped_pages would not be incremented for the page.
lazy_scan_heap had already "committed" to skipping the page at that
point, though, which was enough to make it unsafe to advance
relfrozenxid/relminmxid later on.
It's possible that this issue hardly ever came up in practice. It's
hard to be sure either way. We only had to be unlucky once to lose out
on advancing relfrozenxid -- a single affected heap page was enough to
throw VACUUM off. That seems like something to avoid on general
principle. This is similar to an issue addressed by commit 44fa8488,
which taught vacuumlazy.c to not give up on non-aggressive relfrozenxid
advancement just because a cleanup lock wasn't immediately available on
some heap page.
Also refactor the mechanism that disables skipping using the visibility
map during VACUUM(DISABLE_PAGE_SKIPPING). Our old approach made VACUUM
behave as if there were no pages with VM bits set. Our new approach has
VACUUM set up a range of pages in the usual way, without actually going
through with skipping the range in the end. This has the advantage of
making VACUUM(DISABLE_PAGE_SKIPPING) apply standard cross checks that
report on visibility map corruption via WARNINGs.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 298 ++++++++++++++-------------
1 file changed, 158 insertions(+), 140 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9f5178e0a..3bc75d401 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -176,6 +176,8 @@ typedef struct LVRelState
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
+ /* Have we skipped any all-visible (not all-frozen) pages? */
+ bool skippedallvis;
/* Error reporting state */
char *relnamespace;
@@ -196,7 +198,6 @@ typedef struct LVRelState
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
@@ -247,6 +248,10 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static BlockNumber lazy_scan_skip_range(LVRelState *vacrel, Buffer *vmbuffer,
+ BlockNumber next_unskippable_block,
+ bool *all_visible_next_unskippable,
+ bool *all_frozen_skippable_range);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
bool sharelock, Buffer vmbuffer);
@@ -471,7 +476,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize page counters explicitly (be tidy) */
vacrel->scanned_pages = 0;
- vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
@@ -518,6 +522,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize state used to track oldest extant XID/XMID */
vacrel->NewRelfrozenXid = OldestXmin;
vacrel->NewRelminMxid = OldestMxact;
+ /* Cannot advance relfrozenxid when we skipped all-visible pages */
+ vacrel->skippedallvis = false;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -575,7 +581,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* to an XID that is either older or newer than FreezeLimit (same applies
* to relminmxid and MultiXactCutoff).
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ if (vacrel->skippedallvis)
{
/* Cannot advance relfrozenxid/relminmxid */
Assert(!aggressive);
@@ -587,8 +593,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
else
{
- Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
- orig_rel_pages);
Assert(!aggressive ||
TransactionIdPrecedesOrEquals(FreezeLimit,
vacrel->NewRelfrozenXid));
@@ -842,7 +846,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
next_failsafe_block,
next_fsm_block_to_vacuum;
Buffer vmbuffer = InvalidBuffer;
- bool skipping_blocks;
+ bool skipping_range,
+ all_visible_next_unskippable,
+ all_frozen_skippable_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -874,167 +880,85 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/*
- * Set things up for skipping blocks using visibility map.
- *
- * Except when vacrel->aggressive is set, we want to skip pages that are
- * all-visible according to the visibility map, but only when we can skip
- * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
- * sequentially, the OS should be doing readahead for us, so there's no
- * gain in skipping a page now and then; that's likely to disable
- * readahead and so be counterproductive. Also, skipping even a single
- * page means that we can't update relfrozenxid, so we only want to do it
- * if we can skip a goodly number of pages.
- *
- * When vacrel->aggressive is set, we can't skip pages just because they
- * are all-visible, but we can still skip pages that are all-frozen, since
- * such pages do not need freezing and do not affect the value that we can
- * safely set for relfrozenxid or relminmxid.
+ * Set up an initial range of blocks to skip via the visibility map.
*
* Before entering the main loop, establish the invariant that
* next_unskippable_block is the next block number >= blkno that we can't
- * skip based on the visibility map, either all-visible for a regular scan
- * or all-frozen for an aggressive scan. We set it to rel_pages when
- * there's no such block. We also set up the skipping_blocks flag
- * correctly at this stage.
- *
- * Note: The value returned by visibilitymap_get_status could be slightly
- * out-of-date, since we make this test before reading the corresponding
- * heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible or all-frozen when in fact the flag's just
- * been cleared, we might fail to vacuum the page. It's easy to see that
- * skipping a page when aggressive is not set is not a very big deal; we
- * might leave some dead tuples lying around, but the next vacuum will
- * find them. But even when aggressive *is* set, it's still OK if we miss
- * a page whose all-frozen marking has just been cleared. Any new XIDs
- * just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they cannot invalidate NewRelfrozenXid tracking. A similar argument
- * applies for NewRelminMxid tracking and OldestMxact.
+ * skip based on the visibility map.
*/
- next_unskippable_block = 0;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmstatus;
+ next_unskippable_block = lazy_scan_skip_range(vacrel, &vmbuffer, 0,
+ &all_visible_next_unskippable,
+ &all_frozen_skippable_range);
- vmstatus = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
- if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
+ /*
+ * Decide whether or not we'll actually skip the first skippable range.
+ *
+ * We want to skip pages that are all-visible according to the visibility
+ * map (or all-frozen in the aggressive case), but only when we can skip
+ * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
+ * sequentially, the OS should be doing readahead for us, so there's no
+ * gain in skipping a page now and then; that's likely to disable
+ * readahead and so be counterproductive.
+ */
+ skipping_range = (vacrel->skipwithvm &&
+ next_unskippable_block >= SKIP_PAGES_THRESHOLD);
for (blkno = 0; blkno < rel_pages; blkno++)
{
Buffer buf;
Page page;
- bool all_visible_according_to_vm = false;
+ bool all_visible_according_to_vm;
LVPagePruneState prunestate;
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
- blkno, InvalidOffsetNumber);
-
if (blkno == next_unskippable_block)
{
- /* Time to advance next_unskippable_block */
- next_unskippable_block++;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmskipflags;
-
- vmskipflags = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
+ /*
+ * We can't skip this block. It might still be all-visible,
+ * though. This can happen when an aggressive VACUUM cannot skip
+ * an all-visible block.
+ */
+ all_visible_according_to_vm = all_visible_next_unskippable;
/*
- * We know we can't skip the current block. But set up
- * skipping_blocks to do the right thing at the following blocks.
+ * Determine a range of blocks to skip after we scan and process
+ * this block. We pass blkno + 1 as next_unskippable_block. The
+ * final next_unskippable_block won't change when there are no
+ * blocks to skip (skippable blocks are those after blkno, but
+ * before final next_unskippable_block).
*/
- if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
+ next_unskippable_block =
+ lazy_scan_skip_range(vacrel, &vmbuffer, blkno + 1,
+ &all_visible_next_unskippable,
+ &all_frozen_skippable_range);
- /*
- * Normally, the fact that we can't skip this block must mean that
- * it's not all-visible. But in an aggressive vacuum we know only
- * that it's not all-frozen, so it might still be all-visible.
- */
- if (vacrel->aggressive &&
- VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
- all_visible_according_to_vm = true;
+ /* Decide whether or not we'll actually skip the new range */
+ skipping_range =
+ (vacrel->skipwithvm &&
+ next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD);
}
else
{
- /*
- * The current page can be skipped if we've seen a long enough run
- * of skippable blocks to justify skipping it -- provided it's not
- * the last page in the relation (according to rel_pages).
- *
- * We always scan the table's last page to determine whether it
- * has tuples or not, even if it would otherwise be skipped. This
- * avoids having lazy_truncate_heap() take access-exclusive lock
- * on the table to attempt a truncation that just fails
- * immediately because there are tuples on the last page.
- */
- if (skipping_blocks && blkno < rel_pages - 1)
+ /* Every block in the range must be safe to skip */
+ all_visible_according_to_vm = true;
+
+ Assert(blkno < next_unskippable_block);
+ Assert(blkno < rel_pages - 1); /* see lazy_scan_skip_range */
+ Assert(!vacrel->aggressive || all_frozen_skippable_range);
+
+ if (skipping_range)
{
/*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
+ * If this range of blocks is not all-frozen, then we cannot
+ * advance relfrozenxid later. This is another reason for
+ * SKIP_PAGES_THRESHOLD; it helps us to avoid losing out on
+ * advancing relfrozenxid where it makes the least sense.
*/
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
+ if (!all_frozen_skippable_range)
+ vacrel->skippedallvis = true;
continue;
}
- /*
- * SKIP_PAGES_THRESHOLD (threshold for skipping) was not
- * crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
- */
- all_visible_according_to_vm = true;
+ /* We decided to not skip this range, so scan its page */
}
vacuum_delay_point();
@@ -1046,6 +970,11 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
*/
vacrel->scanned_pages++;
+ /* Report as block scanned, update error traceback information */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+ blkno, InvalidOffsetNumber);
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1425,6 +1354,95 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
Assert(!IsInParallelMode());
}
+/*
+ * Set up a range of skippable blocks using visibility map.
+ *
+ * lazy_scan_heap() caller calls here every time it needs to set up a new
+ * range of blocks to skip via the visibility map. Caller passes the block
+ * immediately after its last next_unskippable_block to set up a new range.
+ * We return a new next_unskippable_block for this range. This is often a
+ * degenerate 0-page range (we return caller's next_unskippable_block when
+ * that happens).
+ *
+ * Sets *all_visible_next_unskippable to indicate whether the returned block
+ * can be assumed all-visible. Also sets *all_frozen_skippable_range to
+ * indicate whether every block in the skippable range is known to be
+ * all-frozen.
+ *
+ * When vacrel->aggressive is set, caller can't skip pages just because they
+ * are all-visible, but can still skip pages that are all-frozen, since such
+ * pages do not need freezing and do not affect the value that we can safely
+ * set for relfrozenxid or relminmxid. *all_frozen_skippable_range is never
+ * set 'false' for aggressive callers for this reason.
+ *
+ * Note: If caller thinks that one of the pages from the range is all-visible
+ * or all-frozen when in fact the flag's just been cleared, caller might fail
+ * to vacuum the page. It's easy to see that skipping a page in a VACUUM that
+ * ultimately cannot advance relfrozenxid or relminmxid is not a very big
+ * deal; we might leave some dead tuples lying around, but the next vacuum
+ * will find them. But even in VACUUMs that *are* capable of advancing
+ * relfrozenxid, it's still OK if we miss a page whose all-frozen marking gets
+ * concurrently cleared. Any new XIDs from such a page must be >= OldestXmin,
+ * and so cannot invalidate NewRelfrozenXid tracking. A similar argument
+ * applies for NewRelminMxid tracking and OldestMxact.
+ */
+static BlockNumber
+lazy_scan_skip_range(LVRelState *vacrel, Buffer *vmbuffer,
+ BlockNumber next_unskippable_block,
+ bool *all_visible_next_unskippable,
+ bool *all_frozen_skippable_range)
+{
+ BlockNumber rel_pages = vacrel->rel_pages;
+
+ *all_visible_next_unskippable = true;
+ *all_frozen_skippable_range = true;
+
+ while (next_unskippable_block < rel_pages)
+ {
+ uint8 vmstatus;
+
+ vmstatus = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ vmbuffer);
+ if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0);
+ *all_visible_next_unskippable = false;
+ break;
+ }
+
+ /*
+ * We always scan the table's last page later to determine whether it
+ * has tuples or not, even if it would otherwise be skipped. This
+ * avoids having lazy_truncate_heap() take access-exclusive lock on
+ * the table to attempt a truncation that just fails immediately
+ * because there are tuples on the last page.
+ */
+ if (next_unskippable_block == rel_pages - 1)
+ {
+ /* Last block case need only set all_visible_next_unskippable */
+ Assert(*all_visible_next_unskippable);
+ break;
+ }
+
+ if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * This block may be skipped too. It's not all-frozen, though, so
+ * entire skippable range will be deemed not-all-frozen.
+ */
+ *all_frozen_skippable_range = false;
+ }
+
+ vacuum_delay_point();
+ next_unskippable_block++;
+ }
+
+ return next_unskippable_block;
+}
+
/*
* lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
*
--
2.30.2
On Sun, Mar 13, 2022 at 9:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v10. While this does still include the freezing patch,
> it's not in scope for Postgres 15. As I've said, I still think that it
> makes sense to maintain the patch series with the freezing stuff,
> since it's structurally related.
Attached is v11. Changes:
* No longer includes the patch that adds page-level freezing. It was
making it harder to assess code coverage for the patches that I'm
targeting Postgres 15 with. And so including it with each new revision
no longer seems useful. I'll pick it up for Postgres 16.
* Extensive isolation tests added to v11-0001-*, exercising a lot of
hard-to-hit code paths that are reached when VACUUM is unable to
immediately acquire a cleanup lock on some heap page. In particular,
we now have test coverage for the code in heapam.c that handles
tracking the oldest extant XID and MXID in the presence of MultiXacts
(on a no-cleanup-lock heap page).
* v11-0002-* (which is the patch that avoids missing out on advancing
relfrozenxid in non-aggressive VACUUMs due to a race condition on
HEAD) now moves even more of the logic for deciding how VACUUM will
skip using the visibility map into its own helper routine. Now
lazy_scan_heap just does what the state returned by the helper routine
tells it about the current skippable range -- it doesn't make any
decisions itself anymore. This is far simpler than what we do
currently, on HEAD.
There are no behavioral changes here, but this approach could be
pushed further to improve performance. We could easily determine
*every* page that we're going to scan (not skip) up-front in even the
largest tables, very early, before we've even scanned one page. This
could enable things like I/O prefetching, or capping the size of the
dead_items array based on our final scanned_pages (not on rel_pages).
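(A small standalone sketch of this idea appears at the end of this mail.)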
* A new patch (v11-0003-*) alters the behavior of VACUUM's
DISABLE_PAGE_SKIPPING option. DISABLE_PAGE_SKIPPING no longer forces
aggressive VACUUM -- now it only forces the use of the visibility map,
since that behavior is totally independent of aggressiveness.
I don't feel too strongly about the DISABLE_PAGE_SKIPPING change. It
just seems logical to decouple no-vm-skipping from aggressiveness --
it might actually be helpful in testing the work from the patch series
in the future. Any page counted in scanned_pages has essentially been
processed by VACUUM with this work in place -- that was the idea
behind the lazy_scan_noprune stuff from commit 44fa8488. Bear in mind
that the relfrozenxid tracking stuff from v11-0001-* makes it almost
certain that a DISABLE_PAGE_SKIPPING-without-aggressiveness VACUUM
will still manage to advance relfrozenxid -- usually by the same
amount as an equivalent aggressive VACUUM would anyway. (Failing to
acquire a cleanup lock on some heap page might result in the final
relfrozenxid being appreciably older, but probably not, and we'd
still almost certainly manage to advance relfrozenxid by *some* small
amount.)
Of course, anybody that wants both an aggressive VACUUM and a VACUUM
that never skips even all-frozen pages in the visibility map will
still be able to get that behavior quite easily. For example,
VACUUM(DISABLE_PAGE_SKIPPING, FREEZE) will do that. Several of our
existing tests must already use both of these options together,
because the tests require an effective vacuum_freeze_min_age of 0 (and
vacuum_multixact_freeze_min_age of 0) -- DISABLE_PAGE_SKIPPING alone
won't do that on HEAD, which seems to confuse the issue (see commit
b700f96c for an example of that).
In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently*
force lazy_scan_noprune to refuse to process a page on HEAD (it all
depends on FreezeLimit/vacuum_freeze_min_age), it is logical for
DISABLE_PAGE_SKIPPING to totally get out of the business of caring
about that -- better to limit it to caring only about the visibility
map (by no longer making it force aggressiveness).
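Returning to the "determine every scanned page up front" idea from
earlier: below is a minimal standalone sketch of how a caller could
consume skippable ranges from a lazy_scan_skip-style helper and know its
final scanned_pages before reading a single heap page. To be clear, this
is not code from the patch -- the visibility map is simulated as a byte
array and toy_scan_skip is an illustrative stand-in. Only the
SKIP_PAGES_THRESHOLD rule, the all-visible/all-frozen distinction, and
the always-scan-the-last-page rule follow the patch. The real helper also
records whether any merely all-visible page gets skipped (which decides
whether relfrozenxid can still be advanced); that bookkeeping is omitted
here.
/*
 * Toy model of range-based skipping: the visibility map is a plain byte
 * array, and toy_scan_skip() stands in for lazy_scan_skip().  The driver
 * loop computes the complete set of pages a VACUUM would scan without
 * reading any heap pages at all.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#define VM_ALL_VISIBLE        0x01
#define VM_ALL_FROZEN         0x02
#define SKIP_PAGES_THRESHOLD  32    /* same constant name as vacuumlazy.c */
typedef unsigned int BlockNumber;
/*
 * Return the first block >= next_block that cannot be skipped according to
 * the simulated visibility map, and report whether the skippable range in
 * front of it is long enough to be worth skipping.
 */
static BlockNumber
toy_scan_skip(const unsigned char *vm, BlockNumber rel_pages,
              BlockNumber next_block, bool aggressive, bool *skipping_range)
{
    BlockNumber block = next_block;
    BlockNumber nskippable = 0;
    while (block < rel_pages)
    {
        unsigned char mapbits = vm[block];
        if ((mapbits & VM_ALL_VISIBLE) == 0)
            break;              /* must scan: not even all-visible */
        if (block == rel_pages - 1)
            break;              /* always scan the last page */
        if (aggressive && (mapbits & VM_ALL_FROZEN) == 0)
            break;              /* aggressive VACUUM only skips all-frozen */
        block++;
        nskippable++;
    }
    *skipping_range = (nskippable >= SKIP_PAGES_THRESHOLD);
    return block;
}
int
main(void)
{
    BlockNumber rel_pages = 256;
    unsigned char vm[256];
    BlockNumber scanned_pages = 0;
    /* Pretend most of the table is all-visible and all-frozen */
    memset(vm, VM_ALL_VISIBLE | VM_ALL_FROZEN, sizeof(vm));
    vm[100] = 0;                    /* one page that must be scanned */
    vm[200] = VM_ALL_VISIBLE;       /* all-visible but not all-frozen */
    /* Determine every page that would be scanned, up front */
    for (BlockNumber blkno = 0; blkno < rel_pages;)
    {
        bool        skipping_range;
        BlockNumber next_unskippable = toy_scan_skip(vm, rel_pages, blkno,
                                                     false, &skipping_range);
        if (!skipping_range)
            scanned_pages += next_unskippable - blkno;  /* range too small */
        if (next_unskippable < rel_pages)
            scanned_pages++;        /* the unskippable block itself */
        blkno = next_unskippable + 1;
    }
    printf("would scan %u of %u pages\n", scanned_pages, rel_pages);
    return 0;
}
In this toy run a non-aggressive pass reports scanning just 2 of 256
pages (block 100 plus the last block). Knowing that figure before the
first heap read is what would let us cap the dead_items array based on
scanned_pages, or issue prefetch requests against a precomputed block
list.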
--
Peter Geoghegan
Attachments:
v11-0003-Don-t-force-aggressive-mode-for-DISABLE_PAGE_SKI.patch
From db98e5f02714bea3ce5422a68403b0a48dd280a7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 17 Mar 2022 21:39:01 -0700
Subject: [PATCH v11 3/3] Don't force aggressive mode for
DISABLE_PAGE_SKIPPING.
It seems more natural to just make this option about the visibility map,
not whether or not individual pages are processed using lazy_scan_prune
or lazy_scan_noprune. The latter arguably doesn't really skip at all.
TODO Review implications for use of DISABLE_PAGE_SKIPPING in tests
changed by commits c2dc1a79 and fe246d1c11. This probably won't revive
the issues addressed in those commits.
VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) is used for some of these tests
already, presumably because it was necessary to specify FREEZE to force
vacuum_freeze_min_age=0 to get stable results (in the individual cases
that use FREEZE too).
---
src/include/commands/vacuum.h | 2 +-
src/backend/access/heap/vacuumlazy.c | 16 ++--------------
doc/src/sgml/ref/vacuum.sgml | 15 +++++----------
3 files changed, 8 insertions(+), 25 deletions(-)
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index ead88edda..e0908012d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -187,7 +187,7 @@ typedef struct VacAttrStats
#define VACOPT_FULL 0x10 /* FULL (non-concurrent) vacuum */
#define VACOPT_SKIP_LOCKED 0x20 /* skip if cannot get lock */
#define VACOPT_PROCESS_TOAST 0x40 /* process the TOAST table, if any */
-#define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip any pages */
+#define VACOPT_DISABLE_PAGE_SKIPPING 0x80 /* don't skip any pages via VM */
/*
* Values used by index_cleanup and truncate params.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 49653ae99..498c2f6ee 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -320,8 +320,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
int usecs;
double read_rate,
write_rate;
- bool aggressive,
- skipwithvm;
+ bool aggressive;
bool frozenxid_updated,
minmulti_updated;
BlockNumber orig_rel_pages;
@@ -372,17 +371,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
&OldestXmin, &OldestMxact,
&FreezeLimit, &MultiXactCutoff);
- skipwithvm = true;
- if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
- {
- /*
- * Force aggressive mode, and disable skipping blocks using the
- * visibility map (even those set all-frozen)
- */
- aggressive = true;
- skipwithvm = false;
- }
-
/*
* Setup error traceback support for ereport() first. The idea is to set
* up an error context callback to display additional information on any
@@ -445,7 +433,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(params->truncate != VACOPTVALUE_UNSPECIFIED &&
params->truncate != VACOPTVALUE_AUTO);
vacrel->aggressive = aggressive;
- vacrel->skipwithvm = skipwithvm;
+ vacrel->skipwithvm = (params->options & VACOPT_DISABLE_PAGE_SKIPPING) == 0;
vacrel->failsafe_active = false;
vacrel->consider_bypass_optimization = true;
vacrel->do_index_vacuuming = true;
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 3df32b58e..aab2d6c53 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -155,16 +155,11 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ <replaceable class="paramet
<listitem>
<para>
Normally, <command>VACUUM</command> will skip pages based on the <link
- linkend="vacuum-for-visibility-map">visibility map</link>. Pages where
- all tuples are known to be frozen can always be skipped, and those
- where all tuples are known to be visible to all transactions may be
- skipped except when performing an aggressive vacuum. Furthermore,
- except when performing an aggressive vacuum, some pages may be skipped
- in order to avoid waiting for other sessions to finish using them.
- This option disables all page-skipping behavior, and is intended to
- be used only when the contents of the visibility map are
- suspect, which should happen only if there is a hardware or software
- issue causing database corruption.
+ linkend="vacuum-for-visibility-map">visibility map</link>.
+ This option disables that behavior, and is intended to be used
+ only when the contents of the visibility map are suspect, which
+ should happen only if there is a hardware or software issue
+ causing database corruption.
</para>
</listitem>
</varlistentry>
--
2.30.2
v11-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
From c08001f3db8c279c37ca289499ccad70f524658b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v11 2/3] Generalize how VACUUM skips all-frozen pages.
Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid before now. The
underlying issue was that lazy_scan_heap conditioned its skipping
behavior on whether or not the current VACUUM was aggressive. VACUUM
could fail to increment its frozenskipped_pages counter as a result, and
so could miss out on advancing relfrozenxid (in the non-aggressive case)
for no good reason.
The issue only comes up when concurrent activity might unset a page's
visibility map bit at exactly the wrong time. The non-aggressive case
rechecked the visibility map at the point of skipping each page before
now. This created a window for some other session to concurrently unset
the same heap page's bit in the visibility map. If the bit was unset at
the wrong time, it would cause VACUUM to conservatively conclude that
the page was _never_ all-frozen on recheck. frozenskipped_pages would
not be incremented for the page as a result. lazy_scan_heap had already
committed to skipping the page/range at that point, though -- which made
it unsafe to advance relfrozenxid/relminmxid later on.
Consistently avoid the issue by generalizing how we skip frozen pages
during aggressive VACUUMs: take the same approach when skipping any
skippable page range during aggressive and non-aggressive VACUUMs alike.
The new approach makes ranges (not individual pages) the fundamental
unit of skipping using the visibility map. frozenskipped_pages is
replaced with a boolean flag that represents whether some skippable
range with one or more all-visible pages was actually skipped (making
relfrozenxid unsafe to update). The VM_ALL_VISIBLE()/VM_ALL_FROZEN()
rechecks at the top of lazy_scan_heap are no longer required, since we
now record the same information in the bookkeeping state that tracks
the range as a whole.
There is now a clean and unambiguous separation between deciding which
contiguous pages are safe to skip (and so should be treated as a range
of skippable pages), deciding if it's worth skipping a given range, and
actually executing skipping. This separation seems like it might be
useful in the future. For example, it would now be straightforward to
teach VACUUM to assemble skippable ranges up front, via a batch process.
Many unprocessed skippable ranges could be stored in a palloc'd array;
there are no dependencies to complicate things for lazy_scan_heap later.
It's possible that the issue this commit fixes hardly ever came up in
practice. But we only had to be unlucky once to lose out on advancing
relfrozenxid -- a single affected heap page was enough to throw VACUUM
off. That seems like something to avoid on general principle. This is
similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c
to not give up on non-aggressive relfrozenxid advancement just because a
cleanup lock wasn't immediately available on some heap page.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 311 +++++++++++++--------------
1 file changed, 146 insertions(+), 165 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index ae280d4f9..49653ae99 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -176,6 +176,7 @@ typedef struct LVRelState
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
+ bool skippedallvis;
/* Error reporting state */
char *relnamespace;
@@ -196,7 +197,6 @@ typedef struct LVRelState
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
@@ -247,6 +247,10 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
+ BlockNumber next_block,
+ bool *next_unskippable_allvis,
+ bool *skipping_current_range);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
bool sharelock, Buffer vmbuffer);
@@ -471,7 +475,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize page counters explicitly (be tidy) */
vacrel->scanned_pages = 0;
- vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
@@ -518,6 +521,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize state used to track oldest extant XID/XMID */
vacrel->NewRelfrozenXid = OldestXmin;
vacrel->NewRelminMxid = OldestMxact;
+ vacrel->skippedallvis = false;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -575,7 +579,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* to an XID that is either older or newer than FreezeLimit (same applies
* to relminmxid and MultiXactCutoff).
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ if (vacrel->skippedallvis)
{
/* Skipped an all-visible page, so cannot advance relfrozenxid */
Assert(!aggressive);
@@ -587,8 +591,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
else
{
- Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
- orig_rel_pages);
Assert(!aggressive ||
TransactionIdPrecedesOrEquals(FreezeLimit,
vacrel->NewRelfrozenXid));
@@ -841,7 +843,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
next_failsafe_block,
next_fsm_block_to_vacuum;
Buffer vmbuffer = InvalidBuffer;
- bool skipping_blocks;
+ bool next_unskippable_allvis,
+ skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -872,179 +875,52 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
initprog_val[2] = dead_items->max_items;
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
- /*
- * Set things up for skipping blocks using visibility map.
- *
- * Except when vacrel->aggressive is set, we want to skip pages that are
- * all-visible according to the visibility map, but only when we can skip
- * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
- * sequentially, the OS should be doing readahead for us, so there's no
- * gain in skipping a page now and then; that's likely to disable
- * readahead and so be counterproductive. Also, skipping even a single
- * page means that we can't update relfrozenxid, so we only want to do it
- * if we can skip a goodly number of pages.
- *
- * When vacrel->aggressive is set, we can't skip pages just because they
- * are all-visible, but we can still skip pages that are all-frozen, since
- * such pages do not need freezing and do not affect the value that we can
- * safely set for relfrozenxid or relminmxid.
- *
- * Before entering the main loop, establish the invariant that
- * next_unskippable_block is the next block number >= blkno that we can't
- * skip based on the visibility map, either all-visible for a regular scan
- * or all-frozen for an aggressive scan. We set it to rel_pages when
- * there's no such block. We also set up the skipping_blocks flag
- * correctly at this stage.
- *
- * Note: The value returned by visibilitymap_get_status could be slightly
- * out-of-date, since we make this test before reading the corresponding
- * heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible or all-frozen when in fact the flag's just
- * been cleared, we might fail to vacuum the page. It's easy to see that
- * skipping a page when aggressive is not set is not a very big deal; we
- * might leave some dead tuples lying around, but the next vacuum will
- * find them. But even when aggressive *is* set, it's still OK if we miss
- * a page whose all-frozen marking has just been cleared. Any new XIDs
- * just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they cannot invalidate NewRelfrozenXid tracking. A similar argument
- * applies for NewRelminMxid tracking and OldestMxact.
- */
- next_unskippable_block = 0;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmstatus;
-
- vmstatus = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
- if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
-
+ /* Set up an initial range of skippable blocks using the visibility map */
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
+ &next_unskippable_allvis,
+ &skipping_current_range);
for (blkno = 0; blkno < rel_pages; blkno++)
{
Buffer buf;
Page page;
- bool all_visible_according_to_vm = false;
+ bool all_visible_according_to_vm;
LVPagePruneState prunestate;
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
- blkno, InvalidOffsetNumber);
-
if (blkno == next_unskippable_block)
{
- /* Time to advance next_unskippable_block */
- next_unskippable_block++;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmskipflags;
-
- vmskipflags = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
/*
- * We know we can't skip the current block. But set up
- * skipping_blocks to do the right thing at the following blocks.
+ * Can't skip this page safely. Must scan the page. But
+ * determine the next skippable range after the page first.
*/
- if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
+ all_visible_according_to_vm = next_unskippable_allvis;
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
+ blkno + 1,
+ &next_unskippable_allvis,
+ &skipping_current_range);
- /*
- * Normally, the fact that we can't skip this block must mean that
- * it's not all-visible. But in an aggressive vacuum we know only
- * that it's not all-frozen, so it might still be all-visible.
- */
- if (vacrel->aggressive &&
- VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
- all_visible_according_to_vm = true;
+ Assert(next_unskippable_block >= blkno + 1);
}
else
{
- /*
- * The current page can be skipped if we've seen a long enough run
- * of skippable blocks to justify skipping it -- provided it's not
- * the last page in the relation (according to rel_pages).
- *
- * We always scan the table's last page to determine whether it
- * has tuples or not, even if it would otherwise be skipped. This
- * avoids having lazy_truncate_heap() take access-exclusive lock
- * on the table to attempt a truncation that just fails
- * immediately because there are tuples on the last page.
- */
- if (skipping_blocks && blkno < rel_pages - 1)
- {
- /*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
- */
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
- continue;
- }
+ /* Last page always scanned (may need to set nonempty_pages) */
+ Assert(blkno < rel_pages - 1);
- /*
- * SKIP_PAGES_THRESHOLD (threshold for skipping) was not
- * crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
- */
+ if (skipping_current_range)
+ continue;
+
+ /* Current range is too small to skip -- just scan the page */
all_visible_according_to_vm = true;
}
- vacuum_delay_point();
-
- /*
- * We're not skipping this page using the visibility map, and so it is
- * (by definition) a scanned page. Any tuples from this page are now
- * guaranteed to be counted below, after some preparatory checks.
- */
vacrel->scanned_pages++;
+ /* Report as block scanned, update error traceback information */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+ blkno, InvalidOffsetNumber);
+
+ vacuum_delay_point();
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1244,8 +1120,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
/*
- * Handle setting visibility map bit based on what the VM said about
- * the page before pruning started, and using prunestate
+ * Handle setting visibility map bit based on information from the VM
+ * (as of last lazy_scan_skip() call), and from prunestate
*/
if (!all_visible_according_to_vm && prunestate.all_visible)
{
@@ -1277,9 +1153,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
- * got cleared after we checked it and before we took the buffer
- * content lock, so we must recheck before jumping to the conclusion
- * that something bad has happened.
+ * got cleared after lazy_scan_skip() was called, so we must recheck
+ * with buffer lock before concluding that the VM is corrupt.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
@@ -1318,7 +1193,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* If the all-visible page is all-frozen but not marked as such yet,
* mark it as all-frozen. Note that all_frozen is only valid if
- * all_visible is true, so we must check both.
+ * all_visible is true, so we must check both prunestate fields.
*/
else if (all_visible_according_to_vm && prunestate.all_visible &&
prunestate.all_frozen &&
@@ -1424,6 +1299,112 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
Assert(!IsInParallelMode());
}
+/*
+ * lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *
+ * lazy_scan_heap() calls here every time it needs to set up a new range of
+ * blocks to skip via the visibility map. Caller passes the next block in
+ * line. We return a next_unskippable_block for this range. When there are
+ * no skippable blocks we just return caller's next_block. The all-visible
+ * status of the returned block is set in *next_unskippable_allvis for caller,
+ * too. Block usually won't be all-visible (since it's unskippable), but it
+ * can be during aggressive VACUUMs (as well as in certain edge cases).
+ *
+ * Sets *skipping_current_range to indicate if caller should skip this range.
+ * Costs and benefits drive our decision. Very small ranges won't be skipped.
+ *
+ * Note: our opinion of which blocks can be skipped can go stale immediately.
+ * It's okay if caller "misses" a page whose all-visible or all-frozen marking
+ * was concurrently cleared, though. All that matters is that caller scan all
+ * pages whose tuples might contain XIDs < OldestXmin, or XMIDs < OldestMxact.
+ * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
+ * older XIDs/MXIDs. The vacrel->skippedallvis flag will be set here when the
+ * choice to skip such a range is actually made, making everything safe.)
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
+ bool *next_unskippable_allvis, bool *skipping_current_range)
+{
+ BlockNumber rel_pages = vacrel->rel_pages,
+ next_unskippable_block = next_block,
+ nskippable_blocks = 0;
+ bool allvisinrange = false;
+
+ *next_unskippable_allvis = true;
+ while (next_unskippable_block < rel_pages)
+ {
+ uint8 mapbits = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ vmbuffer);
+
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ *next_unskippable_allvis = false;
+ break;
+ }
+
+ /*
+ * Caller must scan the last page to determine whether it has tuples
+ * (caller must have the opportunity to set vacrel->nonempty_pages).
+ * This rule avoids having lazy_truncate_heap() take access-exclusive
+ * lock on rel to attempt a truncation that fails anyway, just because
+ * there are tuples on the last page (it is likely that there will be
+ * tuples on other nearby pages as well, but those can be skipped).
+ *
+ * Implement this by always treating the last block as unsafe to skip.
+ */
+ if (next_unskippable_block == rel_pages - 1)
+ break;
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible. They may still skip all-frozen pages, which can't
+ * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ allvisinrange = true;
+ }
+
+ vacuum_delay_point();
+ next_unskippable_block++;
+ nskippable_blocks++;
+ }
+
+ /*
+ * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
+ * pages. Since we're reading sequentially, the OS should be doing
+ * readahead for us, so there's no gain in skipping a page now and then.
+ * Skipping such a range might even discourage sequential detection.
+ *
+ * This test also enables more frequent relfrozenxid advancement during
+ * non-aggressive VACUUMs. If the range has any all-visible pages then
+ * skipping makes updating relfrozenxid unsafe, which is a real downside.
+ */
+ if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
+ *skipping_current_range = false;
+ else
+ {
+ *skipping_current_range = true;
+ if (allvisinrange)
+ vacrel->skippedallvis = true;
+ }
+
+ return next_unskippable_block;
+}
+
/*
* lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
*
--
2.30.2
v11-0001-Loosen-coupling-between-relfrozenxid-and-freezin.patch
From b3b90046078fbddc8e4b2287a5d04b2cb5142cc6 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v11 1/3] Loosen coupling between relfrozenxid and freezing.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit. There is no fixed
relationship between the amount of physical work performed by VACUUM to
make it safe to advance relfrozenxid (freezing and pruning), and the
actual number of XIDs that relfrozenxid can be advanced by (at least in
principle) as a result. VACUUM might have to freeze all of the tuples
from a hundred million heap pages just to enable relfrozenxid to be
advanced by no more than one or two XIDs. On the other hand, VACUUM
might end up doing little or no work, and yet still be capable of
advancing relfrozenxid by hundreds of millions of XIDs as a result.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
Later work targeting PostgreSQL 16 will teach VACUUM to determine what
to freeze based on page-level characteristics (not XID/XMID based
cutoffs). But setting relfrozenxid/relminmxid to the exact oldest
extant XID/MXID is independently useful work. For example, it is
helpful with larger databases that consume many MultiXacts. If we
assume that the largest tables don't ever need to allocate any
MultiXacts, then aggressive VACUUMs targeting those tables will now
advance relminmxid right up to OldestMxact. pg_class.relminmxid becomes
a much more precise indicator of what's really going on in each table,
making autovacuums to prevent MultiXactId wraparound occur less
frequently.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM -- FreezeLimit still acts as a lower bound on the final value
that aggressive VACUUM can set relfrozenxid to. Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself. In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 7 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 244 ++++++++++++++----
src/backend/access/heap/vacuumlazy.c | 120 ++++++---
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 42 +--
doc/src/sgml/maintenance.sgml | 30 ++-
.../expected/vacuum-no-cleanup-lock.out | 188 ++++++++++++++
.../isolation/expected/vacuum-reltuples.out | 67 -----
src/test/isolation/isolation_schedule | 2 +-
.../specs/vacuum-no-cleanup-lock.spec | 145 +++++++++++
.../isolation/specs/vacuum-reltuples.spec | 49 ----
13 files changed, 675 insertions(+), 229 deletions(-)
create mode 100644 src/test/isolation/expected/vacuum-no-cleanup-lock.out
delete mode 100644 src/test/isolation/expected/vacuum-reltuples.out
create mode 100644 src/test/isolation/specs/vacuum-no-cleanup-lock.spec
delete mode 100644 src/test/isolation/specs/vacuum-reltuples.spec
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b46ab7d73..6ef3c02bb 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,11 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi);
+extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple,
+ TransactionId limit_xid,
+ MultiXactId limit_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..2d8a7f627 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3746336a0..5a3c18413 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6128,7 +6128,12 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* NB -- this might have the side-effect of creating a new MultiXactId!
*
* "flags" is an output value; it's used to tell caller what to do on return.
- * Possible flags are:
+ *
+ * "xmax_oldest_xid_out" is an output value; we must handle the details of
+ * tracking the oldest extant XID within Multixacts. This is part of how
+ * caller tracks relfrozenxid_out (the oldest extant XID) on behalf of VACUUM.
+ *
+ * Possible values that we can set in "flags":
* FRM_NOOP
* don't do anything -- keep existing Xmax
* FRM_INVALIDATE_XMAX
@@ -6140,12 +6145,21 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * Final *xmax_oldest_xid_out value should be ignored completely unless
+ * "flags" contains either FRM_NOOP or FRM_RETURN_IS_MULTI. Final value is
+ * drawn from oldest extant XID that will remain in some MultiXact (old or
+ * new) after xmax is frozen (XIDs that won't remain after freezing are
+ * ignored, per the general convention).
+ *
+ * Note in particular that caller must deal with FRM_RETURN_IS_XID case
+ * itself, by considering returned Xid (not using *xmax_oldest_xid_out).
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *xmax_oldest_xid_out)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6157,6 +6171,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId temp_xid_out;
*flags = 0;
@@ -6251,13 +6266,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ temp_xid_out = *xmax_oldest_xid_out; /* initialize temp_xid_out */
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
need_replace = true;
- break;
- }
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
}
/*
@@ -6266,6 +6281,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
if (!need_replace)
{
+ *xmax_oldest_xid_out = temp_xid_out;
*flags |= FRM_NOOP;
pfree(members);
return InvalidTransactionId;
@@ -6275,6 +6291,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* If the multi needs to be updated, figure out which members do we need
* to keep.
*/
+ temp_xid_out = *xmax_oldest_xid_out; /* reset temp_xid_out */
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
@@ -6356,7 +6373,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* list.)
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
+ }
}
else
{
@@ -6366,6 +6387,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
{
/* running locker cannot possibly be older than the cutoff */
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid, *xmax_oldest_xid_out));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6403,6 +6425,13 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+
+ /*
+ * Return oldest remaining XID in new multixact if it's older than
+ * caller's original xmax_oldest_xid_out (otherwise it's just the
+ * original xmax_oldest_xid_out value from caller)
+ */
+ *xmax_oldest_xid_out = temp_xid_out;
}
pfree(newmembers);
@@ -6421,6 +6450,11 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * Maintains *relfrozenxid_out and *relminmxid_out, which are the current
+ * target relfrozenxid and relminmxid for the relation. Caller should make
+ * temp copies of global tracking variables before starting to process a page,
+ * so that we can only scribble on copies.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
@@ -6445,7 +6479,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6489,6 +6525,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
}
/*
@@ -6506,16 +6544,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId xmax_oldest_xid_out = *relfrozenxid_out;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi,
+ &flags, &xmax_oldest_xid_out);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
if (flags & FRM_RETURN_IS_XID)
{
/*
+ * xmax will become an updater XID (an XID from the original
+ * MultiXact's XIDs that needs to be carried forward).
+ *
* NB -- some of these transformations are only valid because we
* know the return Xid is a tuple updater (i.e. not merely a
* locker.) Also note that the only reason we don't explicitly
@@ -6527,6 +6570,16 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
if (flags & FRM_MARK_COMMITTED)
frz->t_infomask |= HEAP_XMAX_COMMITTED;
changed = true;
+ Assert(freeze_xmax);
+
+ /*
+ * Only consider newxmax Xid to track relfrozenxid_out here, since
+ * any other XIDs from the old MultiXact won't be left behind once
+ * xmax is actually frozen.
+ */
+ Assert(TransactionIdIsValid(newxmax));
+ if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
+ *relfrozenxid_out = newxmax;
}
else if (flags & FRM_RETURN_IS_MULTI)
{
@@ -6534,6 +6587,10 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
uint16 newbits2;
/*
+ * xmax was an old MultiXactId which we have to replace with a new
+ * Multixact, that carries forward a subset of the XIDs from the
+ * original (those that we'll still need).
+ *
* We can't use GetMultiXactIdHintBits directly on the new multi
* here; that routine initializes the masks to all zeroes, which
* would lose other bits we need. Doing it this way ensures all
@@ -6548,6 +6605,37 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->xmax = newxmax;
changed = true;
+ Assert(!freeze_xmax);
+
+ /*
+ * FreezeMultiXactId sets xmax_oldest_xid_out to any XID that it
+ * notices is older than initial relfrozenxid_out, unless the XID
+ * won't remain after freezing
+ */
+ Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+ Assert(TransactionIdPrecedesOrEquals(xmax_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = xmax_oldest_xid_out;
+ }
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * xmax is a MultiXactId, and nothing about it changes for now.
+ *
+ * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+ * both together. FreezeMultiXactId sets xmax_oldest_xid_out to
+ * any XID that it notices is older than initial relfrozenxid_out,
+ * unless the XID won't remain after freezing (or in this case
+ * after _not_ freezing).
+ */
+ Assert(MultiXactIdIsValid(xid));
+ Assert(!freeze_xmax);
+
+ if (MultiXactIdPrecedes(xid, *relminmxid_out))
+ *relminmxid_out = xid;
+ Assert(TransactionIdPrecedesOrEquals(xmax_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = xmax_oldest_xid_out;
}
}
else if (TransactionIdIsNormal(xid))
@@ -6575,7 +6663,11 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
freeze_xmax = true;
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
@@ -6699,11 +6791,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId relfrozenxid_out = cutoff_xid;
+ MultiXactId relminmxid_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &relfrozenxid_out, &relminmxid_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7133,24 +7228,54 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
* are older than the specified cutoff XID or MultiXactId. If so, return true.
*
+ * See heap_prepare_freeze_tuple for information about the basic rules for the
+ * cutoffs used here.
+ *
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
+ * The *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out arguments are
+ * input/output arguments that work just like heap_prepare_freeze_tuple's
+ * *relfrozenxid_out and *relminmxid_out input/output arguments. However,
+ * there is one important difference: we track the oldest extant XID and XMID
+ * while making a working assumption that no freezing will actually take
+ * place. On the other hand, heap_prepare_freeze_tuple assumes that freezing
+ * will take place (based on the specific instructions it also sets up for
+ * caller's tuple).
+ *
+ * Note, in particular, that we even assume that freezing won't go ahead for a
+ * tuple that we indicate "needs freezing" (by returning true). Not all
+ * callers will be okay with that. Caller should make temp copies of global
+ * tracking variables, and pass us those. That way caller can back out at the
+ * last moment when it must freeze the tuple using heap_prepare_freeze_tuple.
+ *
* NB: Cannot rely on hint bits here, they might not be set after a crash or
* on a standby.
*/
bool
-heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_tuple_needs_freeze(HeapTupleHeader tuple,
+ TransactionId limit_xid, MultiXactId limit_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out)
{
TransactionId xid;
-
- xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ bool needs_freeze = false;
/*
+ * First deal with xmin.
+ */
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ }
+
+ /*
+ * Now deal with xmax.
+ *
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
@@ -7158,57 +7283,80 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
MultiXactId multi;
+ MultiXactMember *members;
+ int nmembers;
multi = HeapTupleHeaderGetRawXmax(tuple);
if (!MultiXactIdIsValid(multi))
{
- /* no xmax set, ignore */
- ;
+ /* no xmax set -- but xmin might still need freezing */
+ return needs_freeze;
}
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
- return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
+
+ /*
+ * Might have to ratchet back relminmxid_nofreeze_out, which we assume
+ * won't be frozen by caller (even when we return true)
+ */
+ if (MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
+ *relminmxid_nofreeze_out = multi;
+
+ if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
{
- MultiXactMember *members;
- int nmembers;
- int i;
-
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
+ /*
+ * pg_upgrade'd MultiXact doesn't need to have its XID members
+ * affect caller's relfrozenxid_nofreeze_out (just freeze it)
+ */
+ return true;
}
+ else if (MultiXactIdPrecedes(multi, limit_multi))
+ needs_freeze = true;
+
+ /*
+ * Need to check whether any member of the mxact is too old to
+ * determine if MultiXact needs to be frozen now. We even access the
+ * members when we know that the MultiXactId isn't eligible for
+ * freezing now -- we must still maintain relfrozenxid_nofreeze_out.
+ */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
+ {
+ xid = members[i].xid;
+
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ }
+ if (nmembers > 0)
+ pfree(members);
}
else
{
xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ }
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, limit_xid))
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 87ab7775a..ae280d4f9 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,7 +144,7 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
- /* Aggressive VACUUM (scan all unfrozen pages)? */
+ /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
bool skipwithvm;
@@ -173,8 +173,9 @@ typedef struct LVRelState
/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
/* Error reporting state */
char *relnamespace;
@@ -328,6 +329,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -354,17 +356,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* used to determine which XIDs/MultiXactIds will be frozen.
*
* If this is an aggressive VACUUM, then we're strictly required to freeze
- * any and all XIDs from before FreezeLimit, so that we will be able to
- * safely advance relfrozenxid up to FreezeLimit below (we must be able to
- * advance relminmxid up to MultiXactCutoff, too).
+ * any and all XIDs from before FreezeLimit in order to be able to advance
+ * relfrozenxid to a value >= FreezeLimit below. There is an analogous
+ * requirement around MultiXact freezing, relminmxid, and MultiXactCutoff.
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -511,10 +513,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->vistest = GlobalVisTestFor(rel);
/* FreezeLimit controls XID freezing (always <= OldestXmin) */
vacrel->FreezeLimit = FreezeLimit;
- /* MultiXactCutoff controls MXID freezing */
+ /* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+ /* Initialize state used to track oldest extant XID/XMID */
+ vacrel->NewRelfrozenXid = OldestXmin;
+ vacrel->NewRelminMxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -568,14 +571,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
* We are able to advance relfrozenxid in a non-aggressive VACUUM too,
* provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * the visibility map. A non-aggressive VACUUM might advance relfrozenxid
+ * to an XID that is either older or newer than FreezeLimit (same applies
+ * to relminmxid and MultiXactCutoff).
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
- /* Cannot advance relfrozenxid/relminmxid */
+ /* Skipped an all-visible page, so cannot advance relfrozenxid */
Assert(!aggressive);
frozenxid_updated = minmulti_updated = false;
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
@@ -587,9 +589,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
{
Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
orig_rel_pages);
+ Assert(!aggressive ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenXid));
+ Assert(!aggressive ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminMxid));
vac_update_relstats(rel, new_rel_pages, new_live_tuples,
new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
+ vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
&frozenxid_updated, &minmulti_updated, false);
}
@@ -694,17 +702,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminMxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -896,8 +906,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
* find them. But even when aggressive *is* set, it's still OK if we miss
* a page whose all-frozen marking has just been cleared. Any new XIDs
* just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they'll have no effect on the value to which we can safely set
- * relfrozenxid. A similar argument applies for MXIDs and relminmxid.
+ * they cannot invalidate NewRelfrozenXid tracking. A similar argument
+ * applies for NewRelminMxid tracking and OldestMxact.
*/
next_unskippable_block = 0;
if (vacrel->skipwithvm)
@@ -1584,6 +1594,8 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
@@ -1593,7 +1605,9 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
- /* Initialize (or reset) page-level counters */
+ /* Initialize (or reset) page-level state */
+ NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1801,7 +1815,8 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenXid, &NewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1815,13 +1830,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1972,6 +1990,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+ MultiXactId NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2017,20 +2037,40 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff))
+ vacrel->MultiXactCutoff,
+ &NoFreezeNewRelfrozenXid,
+ &NoFreezeNewRelminMxid))
{
if (vacrel->aggressive)
{
- /* Going to have to get cleanup lock for lazy_scan_prune */
+ /*
+ * heap_tuple_needs_freeze determined that it isn't going to
+ * be possible for the ongoing aggressive VACUUM operation to
+ * advance relfrozenxid to a value >= FreezeLimit without
+ * freezing one or more tuples with older XIDs from this page.
+ * (Or perhaps the issue was that MultiXactCutoff could not be
+ * respected. Might have even been both cutoffs, together.)
+ *
+ * Tell caller that it must acquire a full cleanup lock. It's
+ * possible that caller will have to wait a while for one, but
+ * that can't be helped -- full processing by lazy_scan_prune
+ * is required to freeze the older XIDs (and/or freeze older
+ * MultiXactIds).
+ */
vacrel->offnum = InvalidOffsetNumber;
return false;
}
-
- /*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
- */
- vacrel->freeze_cutoffs_valid = false;
+ else
+ {
+ /*
+ * This is a non-aggressive VACUUM, which is under no strict
+ * obligation to advance relfrozenxid at all (much less to
+ * advance it to a value >= FreezeLimit). Non-aggressive
+ * VACUUM advances relfrozenxid/relminmxid on a best-effort
+ * basis. Accept an older final relfrozenxid/relminmxid value
+ * rather than waiting for a cleanup lock.
+ */
+ }
}
ItemPointerSet(&(tuple.t_self), blkno, offnum);
@@ -2079,6 +2119,16 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
+ /*
+ * By here we know for sure that caller can tolerate having reduced
+ * processing for this particular page. Before we return to report
+ * success, update vacrel with details of how we processed the page.
+ * (lazy_scan_prune expects a clean slate, so we have to delay these steps
+ * until here.)
+ */
+ vacrel->NewRelfrozenXid = NoFreezeNewRelfrozenXid;
+ vacrel->NewRelminMxid = NoFreezeNewRelminMxid;
+
/*
* Now save details of the LP_DEAD items from the page in vacrel (though
* only when VACUUM uses two-pass strategy)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 50a4a612e..0ae3b4506 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,22 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact are the most recent values that can ever be
+ * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
+ * vacuumlazy.c caller later on. These values should be passed when it turns
+ * out that VACUUM will leave no unfrozen XIDs/XMIDs behind in the table.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +969,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +978,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1073,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1092,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
@@ -1390,14 +1400,10 @@ vac_update_relstats(Relation relation,
* Update relfrozenxid, unless caller passed InvalidTransactionId
* indicating it has no new data.
*
- * Ordinarily, we don't let relfrozenxid go backwards: if things are
- * working correctly, the only way the new frozenxid could be older would
- * be if a previous VACUUM was done with a tighter freeze_min_age, in
- * which case we don't want to forget the work it already did. However,
- * if the stored relfrozenxid is "in the future", then it must be corrupt
- * and it seems best to overwrite it with the cutoff we used this time.
- * This should match vac_update_datfrozenxid() concerning what we consider
- * to be "in the future".
+ * Ordinarily, we don't let relfrozenxid go backwards. However, if the
+ * stored relfrozenxid is "in the future", then it must be corrupt, so
+ * just overwrite it. This should match vac_update_datfrozenxid()
+ * concerning what we consider to be "in the future".
*/
if (frozenxid_updated)
*frozenxid_updated = false;
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 36f975b1e..6a02d0fa8 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -563,9 +563,11 @@
statistics in the system tables <structname>pg_class</structname> and
<structname>pg_database</structname>. In particular,
the <structfield>relfrozenxid</structfield> column of a table's
- <structname>pg_class</structname> row contains the freeze cutoff XID that was used
- by the last aggressive <command>VACUUM</command> for that table. All rows
- inserted by transactions with XIDs older than this cutoff XID are
+ <structname>pg_class</structname> row contains the oldest
+ remaining XID at the end of the most recent <command>VACUUM</command>
+ that successfully advanced <structfield>relfrozenxid</structfield>
+ (typically the most recent aggressive VACUUM). All rows inserted
+ by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</structfield> column of a database's
<structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
@@ -588,6 +590,17 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
cutoff XID to the current transaction's XID.
</para>
+ <tip>
+ <para>
+ <literal>VACUUM VERBOSE</literal> outputs information about
+ <structfield>relfrozenxid</structfield> and/or
+ <structfield>relminmxid</structfield> when either field was
+ advanced. The same details appear in the server log when <xref
+ linkend="guc-log-autovacuum-min-duration"/> reports on vacuuming
+ by autovacuum.
+ </para>
+ </tip>
+
<para>
<command>VACUUM</command> normally only scans pages that have been modified
since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
@@ -602,7 +615,11 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
set <literal>age(relfrozenxid)</literal> to a value just a little more than the
<varname>vacuum_freeze_min_age</varname> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</command> started). If no <structfield>relfrozenxid</structfield>-advancing
+ <command>VACUUM</command> started). <command>VACUUM</command>
+ will set <structfield>relfrozenxid</structfield> to the oldest XID
+ that remains in the table, so it's possible that the final value
+ will be much more recent than strictly required.
+ If no <structfield>relfrozenxid</structfield>-advancing
<command>VACUUM</command> is issued on the table until
<varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
be forced for the table.
@@ -689,8 +706,9 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- Aggressive <command>VACUUM</command> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Aggressive <command>VACUUM</command> scans, regardless of what
+ causes them, are <emphasis>guaranteed</emphasis> to be able to
+ advance the table's <structfield>relminmxid</structfield>.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
new file mode 100644
index 000000000..9b77bb5b4
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -0,0 +1,188 @@
+Parsed test spec with 4 sessions
+
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+step dml_begin: BEGIN;
+step dml_other_begin: BEGIN;
+step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step dml_other_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_commit: COMMIT;
+step dml_other_commit: COMMIT;
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_commit:
+ COMMIT;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
deleted file mode 100644
index ce55376e7..000000000
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ /dev/null
@@ -1,67 +0,0 @@
-Parsed test spec with 2 sessions
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify open fetch1 vac close stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step open:
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-
-step fetch1:
- fetch next from c1;
-
-dummy
------
- 1
-(1 row)
-
-step vac:
- vacuum smalltbl;
-
-step close:
- commit;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 0dae483e8..06436cf46 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -80,7 +80,7 @@ test: alter-table-4
test: create-trigger
test: sequence-ddl
test: async-notify
-test: vacuum-reltuples
+test: vacuum-no-cleanup-lock
test: timeouts
test: vacuum-concurrent-drop
test: vacuum-conflict
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
new file mode 100644
index 000000000..991738247
--- /dev/null
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -0,0 +1,145 @@
+# Test for vacuum's reduced processing of heap pages (used for any heap page
+# where a cleanup lock isn't immediately available)
+#
+# Debugging tip: Change VACUUM to VACUUM VERBOSE to get feedback on what's
+# really going on
+setup
+{
+ CREATE TABLE smalltbl AS SELECT i AS id FROM generate_series(1,20) i;
+ ALTER TABLE smalltbl SET (autovacuum_enabled = off);
+}
+setup
+{
+ VACUUM ANALYZE smalltbl;
+}
+
+teardown
+{
+ DROP TABLE smalltbl;
+}
+
+# This session holds a pin on smalltbl's only heap page:
+session pinholder
+step pinholder_cursor
+{
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+}
+step pinholder_commit
+{
+ COMMIT;
+}
+
+# This session inserts and deletes tuples, potentially affecting reltuples:
+session dml
+step dml_insert
+{
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+}
+step dml_delete
+{
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+}
+step dml_begin { BEGIN; }
+step dml_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_commit { COMMIT; }
+
+# Needed for Multixact test:
+session dml_other
+step dml_other_begin { BEGIN; }
+step dml_other_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_other_commit { COMMIT; }
+
+# This session runs non-aggressive VACUUM, but with maximally aggressive
+# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+session vacuumer
+setup
+{
+ SET vacuum_freeze_min_age = 0;
+ SET vacuum_multixact_freeze_min_age = 0;
+}
+step vacuumer_nonaggressive_vacuum
+{
+ VACUUM smalltbl;
+}
+step vacuumer_pg_class_stats
+{
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+}
+
+# Test VACUUM's reltuples counting mechanism.
+#
+# Final pg_class.reltuples should never be affected by VACUUM's inability to
+# get a cleanup lock on any page, except to the extent that any cleanup lock
+# contention changes the number of tuples that remain ("missed dead" tuples
+# are counted in reltuples, much like "recently dead" tuples).
+
+# Easy case:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+
+# Harder case -- count 21 tuples at the end (like last time), but with cleanup
+# lock contention this time:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ pinholder_cursor
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but vary the order, and delete an inserted row:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ pinholder_cursor
+ dml_insert
+ dml_delete
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "recently dead" tuple won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but initial insert and delete before cursor:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ dml_delete
+ pinholder_cursor
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
+ # concurrent activity held back VACUUM's OldestXmin) won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Test VACUUM's mechanism for skipping MultiXact freezing.
+#
+# This provides test coverage for code paths that are only hit when we need to
+# freeze, but inability to acquire a cleanup lock on a heap page makes
+# freezing some XIDs/XMIDs < FreezeLimit/MultiXactCutoff impossible (without
+# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+permutation
+ dml_begin
+ dml_other_begin
+ dml_key_share
+ dml_other_key_share
+ # Will get cleanup lock, can't advance relminmxid yet:
+ # (though will usually advance relfrozenxid by ~2 XIDs)
+ vacuumer_nonaggressive_vacuum
+ pinholder_cursor
+ dml_commit
+ dml_other_commit
+ # Can't cleanup lock, so still can't advance relminmxid here:
+ # (relfrozenxid held back by XIDs in MultiXact too)
+ vacuumer_nonaggressive_vacuum
+ pinholder_commit
+ # Pin was dropped, so will advance relminmxid, at long last:
+ # (ditto for relfrozenxid advancement)
+ vacuumer_nonaggressive_vacuum
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
deleted file mode 100644
index a2a461f2f..000000000
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ /dev/null
@@ -1,49 +0,0 @@
-# Test for vacuum's handling of reltuples when pages are skipped due
-# to page pins. We absolutely need to avoid setting reltuples=0 in
-# such cases, since that interferes badly with planning.
-#
-# Expected result for all three permutation is 21 tuples, including
-# the second permutation. VACUUM is able to count the concurrently
-# inserted tuple in its final reltuples, even when a cleanup lock
-# cannot be acquired on the affected heap page.
-
-setup {
- create table smalltbl
- as select i as id from generate_series(1,20) i;
- alter table smalltbl set (autovacuum_enabled = off);
-}
-setup {
- vacuum analyze smalltbl;
-}
-
-teardown {
- drop table smalltbl;
-}
-
-session worker
-step open {
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-}
-step fetch1 {
- fetch next from c1;
-}
-step close {
- commit;
-}
-step stats {
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-}
-
-session vacuumer
-step vac {
- vacuum smalltbl;
-}
-step modify {
- insert into smalltbl select max(id)+1 from smalltbl;
-}
-
-permutation modify vac stats
-permutation modify open fetch1 vac close stats
-permutation modify vac stats
--
2.30.2
On Wed, Mar 23, 2022 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
In other words, since DISABLE_PAGE_SKIPPING doesn't *consistently*
force lazy_scan_noprune to refuse to process a page on HEAD (it all
depends on FreezeLimit/vacuum_freeze_min_age), it is logical for
DISABLE_PAGE_SKIPPING to totally get out of the business of caring
about that -- better to limit it to caring only about the visibility
map (by no longer making it force aggressiveness).
It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
disable skipping pages, we have a problem.
The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
DISABLE_PAGE_SKIPPING.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
disable skipping pages, we have a problem.
It depends on how you define skipping. DISABLE_PAGE_SKIPPING was
created at a time when a broader definition of skipping made a lot
more sense.
The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
DISABLE_PAGE_SKIPPING.
VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show
that 100% of all of the pages from rel_pages are scanned. A page that
is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of
its tuples frozen. But every other aspect of processing the page
happens in just the same way as it would in the cleanup
lock/lazy_scan_prune path.
We'll even still VACUUM the page if it happens to have some existing
LP_DEAD items left behind by opportunistic pruning. We don't need a
cleanup lock in lazy_scan_noprune (a share lock is all we need), nor
do we even need one in lazy_vacuum_heap_page (a regular exclusive lock
is all we need).
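To make that concrete, here's a rough sketch of the control flow in
lazy_scan_heap (heavily simplified -- lazy_scan_noprune's output
parameters here are stand-ins, not the exact ones from the patch):

    buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno,
                             RBM_NORMAL, vacrel->bstrategy);
    page = BufferGetPage(buf);
    if (!ConditionalLockBufferForCleanup(buf))
    {
        /* Cleanup lock not immediately available -- settle for share lock */
        LockBuffer(buf, BUFFER_LOCK_SHARE);

        if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup))
        {
            /*
             * Reduced processing was enough: the page still counts as
             * scanned, and its LP_DEAD items were still collected
             */
            UnlockReleaseBuffer(buf);
            continue;
        }

        /*
         * Aggressive VACUUM, and the page has XIDs/MXIDs that must be
         * frozen -- wait for a cleanup lock after all
         */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        LockBufferForCleanup(buf);
    }

    /* Cleanup lock held by here -- full processing in lazy_scan_prune */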
--
Peter Geoghegan
On Wed, Mar 23, 2022 at 4:49 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Mar 23, 2022 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that if DISABLE_PAGE_SKIPPING doesn't completely
disable skipping pages, we have a problem.
It depends on how you define skipping. DISABLE_PAGE_SKIPPING was
created at a time when a broader definition of skipping made a lot
more sense.
The option isn't named CARE_ABOUT_VISIBILITY_MAP. It's named
DISABLE_PAGE_SKIPPING.
VACUUM(DISABLE_PAGE_SKIPPING, VERBOSE) will still consistently show
that 100% of all of the pages from rel_pages are scanned. A page that
is "skipped" by lazy_scan_noprune isn't pruned, and won't have any of
its tuples frozen. But every other aspect of processing the page
happens in just the same way as it would in the cleanup
lock/lazy_scan_prune path.
I see what you mean about it depending on how you define "skipping".
But I think that DISABLE_PAGE_SKIPPING is intended as a sort of
emergency safeguard when you really, really don't want to leave
anything out. And therefore I favor defining it to mean that we don't
skip any work at all.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
I see what you mean about it depending on how you define "skipping".
But I think that DISABLE_PAGE_SKIPPING is intended as a sort of
emergency safeguard when you really, really don't want to leave
anything out.
I agree.
And therefore I favor defining it to mean that we don't
skip any work at all.
But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot
acquire a cleanup lock on a page, unless it happens to have XIDs from
before FreezeLimit (which is probably 50 million XIDs behind
OldestXmin, the vacuum_freeze_min_age default). I don't see much
difference.
Anyway, this isn't important. I'll just drop the third patch.
--
Peter Geoghegan
On Thu, Mar 24, 2022 at 9:59 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Mar 23, 2022 at 1:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
And therefore I favor defining it to mean that we don't
skip any work at all.
But even today DISABLE_PAGE_SKIPPING won't do pruning when we cannot
acquire a cleanup lock on a page, unless it happens to have XIDs from
before FreezeLimit (which is probably 50 million XIDs behind
OldestXmin, the vacuum_freeze_min_age default). I don't see much
difference.
Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable
all page skipping, so 3414099c turned out to be not enough.
On Wed, Mar 23, 2022 at 2:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Yeah, I found it confusing that DISABLE_PAGE_SKIPPING doesn't disable
all page skipping, so 3414099c turned out to be not enough.
The proposed change to DISABLE_PAGE_SKIPPING is partly driven by that,
and partly driven by a similar concern about aggressive VACUUM.
It seems worth emphasizing the idea that an aggressive VACUUM is now
just the same as any other VACUUM except for one detail: we're
guaranteed to advance relfrozenxid to a value >= FreezeLimit at the
end. The non-aggressive case has the choice to do things that make
that impossible. But there are only two places where this can happen now:
1. Non-aggressive VACUUMs might decide to skip some all-visible pages in
the new lazy_scan_skip() helper routine for skipping with the VM (see
v11-0002-*).
2. A non-aggressive VACUUM can *always* decide to ratchet back its
target relfrozenxid in lazy_scan_noprune, to avoid waiting for a
cleanup lock -- a final value from before FreezeLimit is usually still
pretty good.
The first scenario is the only one where it becomes impossible for a
non-aggressive VACUUM to advance relfrozenxid by any amount (with
v11-0001-* in place). Even that's a choice, made by weighing costs
against benefits.
There is no behavioral change in v11-0002-* (we're still using the
old SKIP_PAGES_THRESHOLD strategy), but the lazy_scan_skip()
helper routine could fairly easily be taught a lot more about the
downside of skipping all-visible pages (namely how that makes it
impossible to advance relfrozenxid).
Maybe it's worth skipping all-visible pages (say, when there are lots
of them and age(relfrozenxid) is still low), and maybe it isn't. We
should get to decide, without implementation details making
relfrozenxid advancement unsafe.
It would be great if you could take a look v11-0002-*, Robert. Does it
make sense to you?
Thanks
--
Peter Geoghegan
On Wed, Mar 23, 2022 at 6:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
It would be great if you could take a look v11-0002-*, Robert. Does it
make sense to you?
You're probably not going to love hearing this, but I think you're
still explaining things here in ways that are too baroque and hard to
follow. I do think it's probably better. But, for example, in the
commit message for 0001, I think you could change the subject line to
"Allow non-aggressive vacuums to advance relfrozenxid" and it would be
clearer. And then I think you could eliminate about half of the first
paragraph, starting with "There is no fixed relationship", and all of
the third paragraph (which starts with "Later work..."), and I think
removing all that material would make it strictly more clear than it
is currently. I don't think it's the place of a commit message to
speculate too much on future directions or to wax eloquent on
theoretical points. If that belongs anywhere, it's in a mailing list
discussion.
It seems to me that 0002 mixes code movement with functional changes.
I'm completely on board with moving the code that decides how much to
skip into a function. That seems like a great idea, and probably
overdue. But it is not easy for me to see what has changed
functionally between the old and new code organization, and I bet it
would be possible to split this into two patches, one of which creates
a function, and the other of which fixes the problem, and I think that
would be a useful service to future readers of the code. I have a hard
time believing that if someone in the future bisects a problem back to
this commit, they're going to have an easy time finding the behavior
change in here. In fact I can't see it myself. I think the actual
functional change is to fix what is described in the second paragraph
of the commit message, but I haven't been able to figure out where the
logic is actually changing to address that. Note that I would be happy
with the behavior change happening either before or after the code
reorganization.
I also think that the commit message for 0002 is probably longer and
more complex than is really helpful, and that the subject line is too
vague, but since I don't yet understand exactly what's happening here,
I cannot comment on how I think it should be revised at this point,
except to say that the second paragraph of that commit message looks
like the most useful part.
I would also like to mention a few things that I do like about 0002.
One is that it seems to collapse two different pieces of logic for
page skipping into one. That seems good. As mentioned, it's especially
good because that logic is abstracted into a function. Also, it looks
like it is making a pretty localized change to one (1) aspect of what
VACUUM does -- and I definitely prefer patches that change only one
thing at a time.
Hope that's helpful.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Mar 24, 2022 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
You're probably not going to love hearing this, but I think you're
still explaining things here in ways that are too baroque and hard to
follow. I do think it's probably better.
There are a lot of dimensions to this work. It's hard to know which to
emphasize here.
But, for example, in the
commit message for 0001, I think you could change the subject line to
"Allow non-aggressive vacuums to advance relfrozenxid" and it would be
clearer.
But non-aggressive VACUUMs have always been able to do that.
How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"
And then I think you could eliminate about half of the first
paragraph, starting with "There is no fixed relationship", and all of
the third paragraph (which starts with "Later work..."), and I think
removing all that material would make it strictly more clear than it
is currently. I don't think it's the place of a commit message to
speculate too much on future directions or to wax eloquent on
theoretical points. If that belongs anywhere, it's in a mailing list
discussion.
Okay, I'll do that.
It seems to me that 0002 mixes code movement with functional changes.
Believe it or not, I avoided functional changes in 0002 -- at least in
one important sense. That's why you had difficulty spotting any. This
must sound peculiar, since the commit message very clearly says that
the commit avoids a problem seen only in the non-aggressive case. It's
really quite subtle.
You wrote this comment and code block (which I propose to remove in
0002), so clearly you already understand the race condition that I'm
concerned with here:
- if (skipping_blocks && blkno < rel_pages - 1)
- {
- /*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
- */
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
- continue;
- }
What you're saying here boils down to this: it doesn't matter what the
visibility map would say right this microsecond (in the aggressive
case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
said that this page was all-frozen *in the recent past*. That's good
enough; we will never fail to scan a page that might have an XID <
OldestXmin (ditto for XMIDs) this way, which is all that really
matters.
This is absolutely mandatory in the aggressive case, because otherwise
relfrozenxid advancement might be seen as unsafe. My observation is:
Why should we accept the same race in the non-aggressive case? Why not
do essentially the same thing in every VACUUM?
In 0002 we now track if each range that we actually chose to skip had
any all-visible (not all-frozen) pages -- if that happens then
relfrozenxid advancement becomes unsafe. The existing code uses
"vacrel->aggressive" as a proxy for the same condition -- the existing
code reasons based on what the visibility map must have said about the
page in the recent past. Which makes sense, but only works in the
aggressive case. The approach taken in 0002 also makes the code
simpler, which is what enabled putting the VM skipping code into its
own helper function, but that was just a bonus.
And so you could almost say that there is no behavioral change at
all. We're skipping pages in the same way, based on the same
information (from the visibility map) as before. We're just being a
bit more careful than before about how that information is tracked, to
avoid this race. A race that we always avoided in the aggressive case
is now consistently avoided.
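Concretely, the extra bookkeeping amounts to remembering one more fact
about each range that we choose to skip -- roughly along these lines
(the field name is illustrative):

    /*
     * Range includes a page that is all-visible but not all-frozen.
     * Skipping it makes relfrozenxid/relminmxid advancement unsafe for
     * this VACUUM operation, no matter what else happens.
     */
    if (!VM_ALL_FROZEN(vacrel->rel, next_block, &vmbuffer))
        vacrel->skippedallvis = true;

At the end of the scan, relfrozenxid/relminmxid are only advanced when
that flag was never set.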
I'm completely on board with moving the code that decides how much to
skip into a function. That seems like a great idea, and probably
overdue. But it is not easy for me to see what has changed
functionally between the old and new code organization, and I bet it
would be possible to split this into two patches, one of which creates
a function, and the other of which fixes the problem, and I think that
would be a useful service to future readers of the code.
It seems kinda tricky to split up 0002 like that. It's possible, but
I'm not sure if it's possible to split it in a way that highlights the
issue that I just described. Because we already avoided the race in
the aggressive case.
I also think that the commit message for 0002 is probably longer and
more complex than is really helpful, and that the subject line is too
vague, but since I don't yet understand exactly what's happening here,
I cannot comment on how I think it should be revised at this point,
except to say that the second paragraph of that commit message looks
like the most useful part.
I'll work on that.
I would also like to mention a few things that I do like about 0002.
One is that it seems to collapse two different pieces of logic for
page skipping into one. That seems good. As mentioned, it's especially
good because that logic is abstracted into a function. Also, it looks
like it is making a pretty localized change to one (1) aspect of what
VACUUM does -- and I definitely prefer patches that change only one
thing at a time.
Totally embracing the idea that we don't necessarily need very recent
information from the visibility map (it just has to be after
OldestXmin was established) has a lot of advantages, architecturally.
It could in principle be hours out of date in the longest VACUUM
operations -- that should be fine. This is exactly the same principle
that makes it okay to stick with our original rel_pages, even when the
table has grown during a VACUUM operation (I documented this in commit
73f6ec3d3c recently).
We could build on the approach taken by 0002 to create a totally
comprehensive picture of the ranges we're skipping up-front, before we
actually scan any pages, even with very large tables. We could in
principle cache a very large number of skippable ranges up-front,
without ever going back to the visibility map again later (unless we
need to set a bit). It really doesn't matter if somebody else unsets a
page's VM bit concurrently, at all.
I see a lot of advantage to knowing our final scanned_pages almost
immediately. Things like prefetching, capping the size of the
dead_items array more intelligently (use final scanned_pages instead
of rel_pages in dead_items_max_items()), improvements to progress
reporting...not to mention more intelligent choices about whether we
should try to advance relfrozenxid a bit earlier during non-aggressive
VACUUMs.
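As one small example of the sort of thing I mean, the cap applied by
dead_items_max_items() could be based on final scanned_pages rather
than rel_pages -- a sketch of the idea only, nothing that's in the
current patch series:

    /*
     * Never allocate space for more dead items than the pages we will
     * actually scan could possibly contain
     */
    if ((int64) vacrel->scanned_pages * MaxHeapTuplesPerPage < max_items)
        max_items = (int64) vacrel->scanned_pages * MaxHeapTuplesPerPage;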
Hope that's helpful.
Very helpful -- thanks!
--
Peter Geoghegan
On Thu, Mar 24, 2022 at 3:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
But non-aggressive VACUUMs have always been able to do that.
How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"
Sure, that sounds nice.
Believe it or not, I avoided functional changes in 0002 -- at least in
one important sense. That's why you had difficulty spotting any. This
must sound peculiar, since the commit message very clearly says that
the commit avoids a problem seen only in the non-aggressive case. It's
really quite subtle.
Well, I think the goal in revising the code is to be as un-subtle as
possible. Commits that people can't easily understand breed future
bugs.
What you're saying here boils down to this: it doesn't matter what the
visibility map would say right this microsecond (in the aggressive
case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
said that this page was all-frozen *in the recent past*. That's good
enough; we will never fail to scan a page that might have an XID <
OldestXmin (ditto for XMIDs) this way, which is all that really
matters.
Makes sense. So maybe the commit message should try to emphasize this
point e.g. "If a page is all-frozen at the time we check whether it
can be skipped, don't allow it to affect the relfrozenxid and
relminmxid which we set for the relation. This was previously true for
aggressive vacuums, but not for non-aggressive vacuums, which was
inconsistent. (The reason this is a safe thing to do is that any new
XIDs or MXIDs that appear on the page after we initially observe it to
be frozen must be newer than any relfrozenxid or relminmxid the
current vacuum could possibly consider storing into pg_class.)"
This is absolutely mandatory in the aggressive case, because otherwise
relfrozenxid advancement might be seen as unsafe. My observation is:
Why should we accept the same race in the non-aggressive case? Why not
do essentially the same thing in every VACUUM?
Sure, that seems like a good idea. I think I basically agree with the
goals of the patch. My concern is just about making the changes
understandable to future readers. This area is notoriously subtle, and
people are going to introduce more bugs even if the comments and code
organization are fantastic.
And so you could almost say that there is no behavioral change at
all.
I vigorously object to this part, though. We should always err on the
side of saying that commits *do* have behavioral changes. We should go
out of our way to call out in the commit message any possible way that
someone might notice the difference between the post-commit situation
and the pre-commit situation. It is fine, even good, to also be clear
about how we're maintaining continuity and why we don't think it's a
problem, but the only commits that should be described as not having
any behavioral change are ones that do mechanical code movement, or
are just changing comments, or something like that.
It seems kinda tricky to split up 0002 like that. It's possible, but
I'm not sure if it's possible to split it in a way that highlights the
issue that I just described. Because we already avoided the race in
the aggressive case.
I do see that there are some difficulties there. I'm not sure what to
do about that. I think a sufficiently clear commit message could
possibly be enough, rather than trying to split the patch. But I also
think splitting the patch should be considered, if that can reasonably
be done.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Mar 24, 2022 at 1:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
How about: "Set relfrozenxid to oldest extant XID seen by VACUUM"
Sure, that sounds nice.
Cool.
What you're saying here boils down to this: it doesn't matter what the
visibility map would say right this microsecond (in the aggressive
case) were we to call VM_ALL_FROZEN(): we know for sure that the VM
said that this page was all-frozen *in the recent past*. That's good
enough; we will never fail to scan a page that might have an XID <
OldestXmin (ditto for XMIDs) this way, which is all that really
matters.
Makes sense. So maybe the commit message should try to emphasize this
point e.g. "If a page is all-frozen at the time we check whether it
can be skipped, don't allow it to affect the relfrozenxid and
relminmxid which we set for the relation. This was previously true for
aggressive vacuums, but not for non-aggressive vacuums, which was
inconsistent. (The reason this is a safe thing to do is that any new
XIDs or MXIDs that appear on the page after we initially observe it to
be frozen must be newer than any relfrozenxid or relminmxid the
current vacuum could possibly consider storing into pg_class.)"
Okay, I'll add something more like that.
Almost every aspect of relfrozenxid advancement by VACUUM seems
simpler when thought about in these terms IMV. Every VACUUM now scans
all pages that might have XIDs < OldestXmin, and so every VACUUM can
advance relfrozenxid to the oldest extant XID (barring non-aggressive
VACUUMs that *choose* to skip some all-visible pages).
There are a lot more important details, of course. My "Every
VACUUM..." statement works well as an axiom because all of those other
details don't create any awkward exceptions.
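Expressed as assertions (roughly the kind of documenting assertions I
have in mind for the point where we update pg_class, using the patch's
field names):

    /* Never move relfrozenxid backwards */
    Assert(TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
                                         vacrel->NewRelfrozenXid));
    /* Never claim a value beyond the oldest XID that could remain */
    Assert(TransactionIdPrecedesOrEquals(vacrel->NewRelfrozenXid,
                                         vacrel->OldestXmin));
    /* Aggressive VACUUM must reach FreezeLimit, at a minimum */
    Assert(!vacrel->aggressive ||
           TransactionIdPrecedesOrEquals(vacrel->FreezeLimit,
                                         vacrel->NewRelfrozenXid));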
This is absolutely mandatory in the aggressive case, because otherwise
relfrozenxid advancement might be seen as unsafe. My observation is:
Why should we accept the same race in the non-aggressive case? Why not
do essentially the same thing in every VACUUM?
Sure, that seems like a good idea. I think I basically agree with the
goals of the patch.
Great.
My concern is just about making the changes
understandable to future readers. This area is notoriously subtle, and
people are going to introduce more bugs even if the comments and code
organization are fantastic.
Makes sense.
And so you could almost say that there is no behavioral change at
all.
I vigorously object to this part, though. We should always err on the
side of saying that commits *do* have behavioral changes.
I think that you've taken my words too literally here. I would never
conceal the intent of a piece of work like that. I thought that it
would clarify matters to point out that I could in theory "get away
with it if I wanted to" in this instance. This was only a means of
conveying a subtle point about the behavioral changes from 0002 --
since you couldn't initially see them yourself (even with my commit
message).
Kind of like Tom Lane's 2011 talk on the query planner. The one where
he lied to the audience several times.
It seems kinda tricky to split up 0002 like that. It's possible, but
I'm not sure if it's possible to split it in a way that highlights the
issue that I just described. Because we already avoided the race in
the aggressive case.
I do see that there are some difficulties there. I'm not sure what to
do about that. I think a sufficiently clear commit message could
possibly be enough, rather than trying to split the patch. But I also
think splitting the patch should be considered, if that can reasonably
be done.
I'll see if I can come up with something. It's hard to be sure about
that kind of thing when you're this close to the code.
--
Peter Geoghegan
On Thu, Mar 24, 2022 at 2:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
This is absolutely mandatory in the aggressive case, because otherwise
relfrozenxid advancement might be seen as unsafe. My observation is:
Why should we accept the same race in the non-aggressive case? Why not
do essentially the same thing in every VACUUM?
Sure, that seems like a good idea. I think I basically agree with the
goals of the patch.
Great.
Attached is v12. My current goal is to commit all 3 patches before
feature freeze. Note that this does not include the more complicated
patch included with previous revisions of the patch series (the
page-level freezing work that appeared in versions before v11).
Changes that appear in this new revision, v12:
* Reworking of the commit messages based on feedback from Robert.
* General cleanup of the changes to heapam.c from 0001 (the changes to
heap_prepare_freeze_tuple and related functions). New and existing
code now fits together a bit better. I also added a couple of new
documenting assertions, to make the flow a bit easier to understand.
* Added new assertions that document
OldestXmin/FreezeLimit/relfrozenxid invariants, right at the point we
update pg_class within vacuumlazy.c.
These assertions would have a decent chance of failing if there were
any bugs in the code.
* Removed patch that made DISABLE_PAGE_SKIPPING not force aggressive
VACUUM, limiting the underlying mechanism to forcing scanning of all
pages in lazy_scan_heap (v11 was the first and last revision that
included this patch).
* Added a new small patch, 0003. This just moves the last piece of
resource allocation that still took place at the top of
lazy_scan_heap() back into its caller, heap_vacuum_rel().
The work in 0003 probably should have happened as part of the patch
that became commit 73f6ec3d -- same idea. It's totally mechanical
stuff. With 0002 and 0003, there is hardly any lazy_scan_heap code
before the main loop that iterates through blocks in rel_pages (and
the code that's still there is related to the loop in a direct and
obvious way). This seems like a big overall improvement in
maintainability.
Didn't see a way to split up 0002, per Robert's suggestion 3 days ago.
As I said at the time, it's possible to split it up, but not in a way
that highlights the underlying issue (since the issue 0002 fixes was
always limited to non-aggressive VACUUMs). The commit message may have
to suffice.
--
Peter Geoghegan
Attachments:
v12-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch (application/octet-stream)
From 8bac0453c9414f2b888cb916559d1909cd07be64 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v12 1/3] Set relfrozenxid to oldest extant XID seen by VACUUM.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM -- FreezeLimit still acts as a lower bound on the final value
that aggressive VACUUM can set relfrozenxid to. Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself. In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 4 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 306 ++++++++++++++----
src/backend/access/heap/vacuumlazy.c | 175 ++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 42 +--
doc/src/sgml/maintenance.sgml | 30 +-
.../expected/vacuum-no-cleanup-lock.out | 189 +++++++++++
.../isolation/expected/vacuum-reltuples.out | 67 ----
src/test/isolation/isolation_schedule | 2 +-
.../specs/vacuum-no-cleanup-lock.spec | 150 +++++++++
.../isolation/specs/vacuum-reltuples.spec | 49 ---
13 files changed, 744 insertions(+), 280 deletions(-)
create mode 100644 src/test/isolation/expected/vacuum-no-cleanup-lock.out
delete mode 100644 src/test/isolation/expected/vacuum-reltuples.out
create mode 100644 src/test/isolation/specs/vacuum-no-cleanup-lock.spec
delete mode 100644 src/test/isolation/specs/vacuum-reltuples.spec
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b46ab7d73..df5b31700 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -168,7 +168,9 @@ extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi);
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..2d8a7f627 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3746336a0..55670f507 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6128,7 +6128,12 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* NB -- this might have the side-effect of creating a new MultiXactId!
*
* "flags" is an output value; it's used to tell caller what to do on return.
- * Possible flags are:
+ *
+ * "xmax_oldest_xid_out" is an output value; we must handle the details of
+ * tracking the oldest extant member Xid within any Multixact that will
+ * remain. This is one component used by caller to track relfrozenxid_out.
+ *
+ * Possible values that we can set in "flags":
* FRM_NOOP
* don't do anything -- keep existing Xmax
* FRM_INVALIDATE_XMAX
@@ -6140,12 +6145,18 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * Final "xmax_oldest_xid_out" value should be ignored completely unless
+ * "flags" contains either FRM_NOOP or FRM_RETURN_IS_MULTI. Final value is
+ * drawn from oldest extant Xid that will remain in some MultiXact (old or
+ * new) after xmax is processed. Xids that won't remain after processing will
+ * never affect final "xmax_oldest_xid_out" set here, per general convention.
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *xmax_oldest_xid_out)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6157,6 +6168,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId temp_xid_out;
*flags = 0;
@@ -6228,6 +6240,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
}
+ /*
+ * Don't push back xmax_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
+ * when no Xids will remain
+ */
return xid;
}
@@ -6251,6 +6267,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ temp_xid_out = *xmax_oldest_xid_out; /* init for FRM_NOOP */
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
@@ -6258,28 +6275,38 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
need_replace = true;
break;
}
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
}
/*
* In the simplest case, there is no member older than the cutoff; we can
- * keep the existing MultiXactId as is.
+ * keep the existing MultiXactId as-is, avoiding a more expensive second
+ * pass over the multi
*/
if (!need_replace)
{
+ /*
+ * When xmax_oldest_xid_out gets pushed back here it's likely that the
+ * update Xid was the oldest member, but we don't rely on that
+ */
*flags |= FRM_NOOP;
+ *xmax_oldest_xid_out = temp_xid_out;
pfree(members);
- return InvalidTransactionId;
+ return multi;
}
/*
- * If the multi needs to be updated, figure out which members do we need
- * to keep.
+ * Do a more thorough second pass over the multi to figure out which
+ * member XIDs actually need to be kept. Checking the precise status of
+ * individual members might even show that we don't need to keep anything.
*/
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
update_xid = InvalidTransactionId;
update_committed = false;
+ temp_xid_out = *xmax_oldest_xid_out; /* init for FRM_RETURN_IS_MULTI */
for (i = 0; i < nmembers; i++)
{
@@ -6335,7 +6362,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
/*
- * Since the tuple wasn't marked HEAPTUPLE_DEAD by vacuum, the
+ * Since the tuple wasn't totally removed when vacuum pruned, the
* update Xid cannot possibly be older than the xid cutoff. The
* presence of such a tuple would cause corruption, so be paranoid
* and check.
@@ -6348,15 +6375,20 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
update_xid, cutoff_xid)));
/*
- * If we determined that it's an Xid corresponding to an update
- * that must be retained, additionally add it to the list of
- * members of the new Multi, in case we end up using that. (We
- * might still decide to use only an update Xid and not a multi,
- * but it's easier to maintain the list as we walk the old members
- * list.)
+ * We determined that this is an Xid corresponding to an update
+ * that must be retained -- add it to new members list for later.
+ *
+ * Also consider pushing back temp_xid_out, which is needed when
+ * we later conclude that a new multi is required (i.e. when we go
+ * on to set FRM_RETURN_IS_MULTI for our caller because we also
+ * need to retain a locker that's still running).
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
+ }
}
else
{
@@ -6374,11 +6406,17 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
pfree(members);
+ /*
+ * Determine what to do with caller's multi based on information gathered
+ * during our second pass
+ */
if (nnewmembers == 0)
{
/* nothing worth keeping!? Tell caller to remove the whole thing */
*flags |= FRM_INVALIDATE_XMAX;
xid = InvalidTransactionId;
+
+ /* Don't push back xmax_oldest_xid_out -- no Xids will remain */
}
else if (TransactionIdIsValid(update_xid) && !has_lockers)
{
@@ -6394,6 +6432,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+
+ /* Don't push back xmax_oldest_xid_out using FRM_RETURN_IS_XID Xid */
}
else
{
@@ -6403,6 +6443,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+
+ /*
+ * The oldest Xid we're transferring from the old multixact over to
+ * the new one might push back xmax_oldest_xid_out
+ */
+ *xmax_oldest_xid_out = temp_xid_out;
}
pfree(newmembers);
@@ -6421,21 +6467,30 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * The *relfrozenxid_out and *relminmxid_out arguments are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel. Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
+ * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
*
- * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
+ * NB: This function has side effects: it might allocate a new MultiXactId.
+ * It will be set as tuple's new xmax when our *frz output is processed within
+ * heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
+ * then caller had better have an exclusive lock on it already.
+ *
+ * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
* anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be less than or equal to the smallest
- * MultiXactId used by any transaction currently open.
- *
- * If the tuple is in a shared buffer, caller must hold an exclusive lock on
- * that buffer.
+ * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
*
* NB: It is not enough to set hint bits to indicate something is
* committed/invalid -- they might not be set on a standby, or after crash
@@ -6445,7 +6500,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6464,7 +6521,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* already a permanent value), while in the block below it is set true to
* mean "xmin won't need freezing after what we do to it here" (false
* otherwise). In both cases we're allowed to set totally_frozen, as far
- * as xmin is concerned.
+ * as xmin is concerned. Neither case requires relfrozenxid_out
+ * handling, since either way the tuple's xmin will be a permanent value
+ * once we're done with it.
*/
xid = HeapTupleHeaderGetXmin(tuple);
if (!TransactionIdIsNormal(xid))
@@ -6489,6 +6548,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else
+ {
+ /* xmin to remain unfrozen. Could push back relfrozenxid_out. */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
/*
@@ -6506,15 +6571,29 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId xmax_oldest_xid_out = *relfrozenxid_out;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi,
+ &flags, &xmax_oldest_xid_out);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
if (flags & FRM_RETURN_IS_XID)
{
+ /*
+ * xmax will become an updater Xid (original MultiXact's updater
+ * member Xid will be carried forward as a simple Xid in Xmax).
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(TransactionIdIsValid(newxmax));
+ if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
+ *relfrozenxid_out = newxmax;
+ /* Note: xmax_oldest_xid_out isn't valid here */
+
/*
* NB -- some of these transformations are only valid because we
* know the return Xid is a tuple updater (i.e. not merely a
@@ -6533,6 +6612,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
uint16 newbits;
uint16 newbits2;
+ /*
+ * xmax is an old MultiXactId which we have to replace with a new
+ * MultiXact that carries forward some of the original's Xids.
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax));
+ Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+ Assert(TransactionIdPrecedesOrEquals(xmax_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = xmax_oldest_xid_out;
+
/*
* We can't use GetMultiXactIdHintBits directly on the new multi
* here; that routine initializes the masks to all zeroes, which
@@ -6549,6 +6641,30 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
changed = true;
}
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * xmax is a MultiXactId, and nothing about it changes for now.
+ * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+ * both together.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+ Assert(TransactionIdPrecedesOrEquals(xmax_oldest_xid_out,
+ *relfrozenxid_out));
+ if (MultiXactIdPrecedes(xid, *relminmxid_out))
+ *relminmxid_out = xid;
+ *relfrozenxid_out = xmax_oldest_xid_out;
+ }
+ else
+ {
+ /*
+ * Keeping neither an Xid nor a MultiXactId for xmax (freezing it).
+ * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+ */
+ Assert(freeze_xmax);
+ Assert(!TransactionIdIsValid(newxmax));
+ }
}
else if (TransactionIdIsNormal(xid))
{
@@ -6573,15 +6689,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
errmsg_internal("cannot freeze committed xmax %u",
xid)));
freeze_xmax = true;
+ /* No need for relfrozenxid_out handling, since we'll freeze xmax */
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
{
freeze_xmax = false;
xmax_already_frozen = true;
+ /* No need for relfrozenxid_out handling for already-frozen xmax */
}
else
ereport(ERROR,
@@ -6622,6 +6744,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make sure to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * No need for relfrozenxid_out handling, since we always freeze xvac.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6699,11 +6823,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId relfrozenxid_out = cutoff_xid;
+ MultiXactId relminmxid_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &relfrozenxid_out, &relminmxid_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7136,79 +7263,122 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
* It doesn't matter whether the tuple is alive or dead, we are checking
* to see if a tuple needs to be removed or frozen to avoid wraparound.
*
+ * The *relfrozenxid_nofreeze_out and *relminmxid_nofreeze_out arguments are
+ * input/output arguments that are similar to heap_prepare_freeze_tuple's
+ * *relfrozenxid_out and *relminmxid_out input/output arguments. There is one
+ * big difference: we track the oldest extant XID and XMID while making a
+ * working assumption that freezing won't go ahead. heap_prepare_freeze_tuple
+ * assumes that freezing will go ahead (based on the specific instructions it
+ * provides for its caller's tuple).
+ *
+ * Note, in particular, that we even assume that freezing won't go ahead for a
+ * tuple that we indicate "needs freezing" (by returning true). Not all
+ * callers will be okay with that. Caller should make temp copies of global
+ * tracking variables, and pass us those. That way caller can back out at the
+ * last moment when it must freeze the tuple using heap_prepare_freeze_tuple.
+ *
* NB: Cannot rely on hint bits here, they might not be set after a crash or
* on a standby.
*/
bool
heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_nofreeze_out,
+ MultiXactId *relminmxid_nofreeze_out)
{
+ bool needs_freeze = false;
TransactionId xid;
+ MultiXactId multi;
+ /* First deal with xmin */
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
/*
+ * Now deal with xmax.
+ *
* The considerations for multixacts are complicated; look at
* heap_prepare_freeze_tuple for justifications. This routine had better
* be in sync with that one!
*/
+ xid = InvalidTransactionId;
+ multi = InvalidMultiXactId;
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
- {
- MultiXactId multi;
-
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
- return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
- {
- MultiXactMember *members;
- int nmembers;
- int i;
+ else
+ xid = HeapTupleHeaderGetRawXmax(tuple);
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
- }
+ if (TransactionIdIsNormal(xid))
+ {
+ /* xmax is a non-permanent XID */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
+ else if (!MultiXactIdIsValid(multi))
+ {
+ /* xmax is a permanent XID or invalid MultiXactId/XID */
+ }
+ else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ {
+ /* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
+ if (MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
+ *relminmxid_nofreeze_out = multi;
+ /* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
+ needs_freeze = true;
}
else
{
- xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ /* xmax is a MultiXactId that may have an updater XID */
+ MultiXactMember *members;
+ int nmembers;
+
+ if (MultiXactIdPrecedes(multi, *relminmxid_nofreeze_out))
+ *relminmxid_nofreeze_out = multi;
+ if (MultiXactIdPrecedes(multi, cutoff_multi))
+ needs_freeze = true;
+
+ /*
+ * relfrozenxid_nofreeze_out might need to be pushed back by the
+ * oldest member XID from the mxact. Need to check its members now.
+ * (Might also affect whether we advise caller to freeze tuple.)
+ */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
+ {
+ xid = members[i].xid;
+ Assert(TransactionIdIsNormal(xid));
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ needs_freeze = true;
+ }
+ if (nmembers > 0)
+ pfree(members);
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_nofreeze_out))
+ *relfrozenxid_nofreeze_out = xid;
+ /* heap_prepare_freeze_tuple always freezes xvac */
+ needs_freeze = true;
+ }
}
- return false;
+ return needs_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 87ab7775a..723408744 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,7 +144,7 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
- /* Aggressive VACUUM (scan all unfrozen pages)? */
+ /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
bool skipwithvm;
@@ -173,8 +173,9 @@ typedef struct LVRelState
/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
/* Error reporting state */
char *relnamespace;
@@ -319,15 +320,15 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
skipwithvm;
bool frozenxid_updated,
minmulti_updated;
- BlockNumber orig_rel_pages;
+ BlockNumber orig_rel_pages,
+ new_rel_pages,
+ new_rel_allvisible;
char **indnames = NULL;
- BlockNumber new_rel_pages;
- BlockNumber new_rel_allvisible;
- double new_live_tuples;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
TransactionId OldestXmin;
+ MultiXactId OldestMxact;
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
@@ -351,20 +352,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Get OldestXmin cutoff, which is used to determine which deleted tuples
* are considered DEAD, not just RECENTLY_DEAD. Also get related cutoffs
- * used to determine which XIDs/MultiXactIds will be frozen.
- *
- * If this is an aggressive VACUUM, then we're strictly required to freeze
- * any and all XIDs from before FreezeLimit, so that we will be able to
- * safely advance relfrozenxid up to FreezeLimit below (we must be able to
- * advance relminmxid up to MultiXactCutoff, too).
+ * used to determine which XIDs/MultiXactIds will be frozen. If this is
+ * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
+ * XIDs < FreezeLimit (or unfrozen MXIDs < MultiXactCutoff).
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -511,10 +509,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->vistest = GlobalVisTestFor(rel);
/* FreezeLimit controls XID freezing (always <= OldestXmin) */
vacrel->FreezeLimit = FreezeLimit;
- /* MultiXactCutoff controls MXID freezing */
+ /* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+ /* Initialize state used to track oldest extant XID/XMID */
+ vacrel->NewRelfrozenXid = OldestXmin;
+ vacrel->NewRelminMxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -548,51 +547,57 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Prepare to update rel's pg_class entry.
*
- * In principle new_live_tuples could be -1 indicating that we (still)
- * don't know the tuple count. In practice that probably can't happen,
- * since we'd surely have scanned some pages if the table is new and
- * nonempty.
- *
* For safety, clamp relallvisible to be not more than what we're setting
* relpages to.
*/
new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
- new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
/*
- * Now actually update rel's pg_class entry.
- *
- * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
- * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
- * provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * Aggressive VACUUMs must advance relfrozenxid to a value >= FreezeLimit,
+ * and advance relminmxid to a value >= MultiXactCutoff.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
+ Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenXid));
+ Assert(!aggressive || vacrel->NewRelminMxid == OldestMxact ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminMxid));
+
+ /*
+ * Non-aggressive VACUUMs might advance relfrozenxid to an XID that is
+ * either older or newer than FreezeLimit (same applies to relminmxid and
+ * MultiXactCutoff). But the state that tracks the oldest remaining XID
+ * and MXID cannot be trusted when any all-visible pages were skipped.
+ */
+ Assert(vacrel->NewRelfrozenXid == OldestXmin ||
+ TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
+ vacrel->NewRelfrozenXid));
+ Assert(vacrel->NewRelminMxid == OldestMxact ||
+ MultiXactIdPrecedesOrEquals(vacrel->relminmxid,
+ vacrel->NewRelminMxid));
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
{
- /* Cannot advance relfrozenxid/relminmxid */
+ /* Keep existing relfrozenxid and relminmxid (can't trust trackers) */
Assert(!aggressive);
- frozenxid_updated = minmulti_updated = false;
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId,
- NULL, NULL, false);
- }
- else
- {
- Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
- orig_rel_pages);
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
- &frozenxid_updated, &minmulti_updated, false);
+ vacrel->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->NewRelminMxid = InvalidMultiXactId;
}
+ /*
+ * Now actually update rel's pg_class entry
+ *
+ * In principle new_live_tuples could be -1 indicating that we (still)
+ * don't know the tuple count. In practice that can't happen, since we
+ * scan every page that isn't skipped using the visibility map.
+ */
+ vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ &frozenxid_updated, &minmulti_updated, false);
+
/*
* Report results to the stats collector, too.
*
@@ -605,7 +610,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
- Max(new_live_tuples, 0),
+ Max(vacrel->new_live_tuples, 0),
vacrel->recently_dead_tuples +
vacrel->missed_dead_tuples);
pgstat_progress_end_command();
@@ -694,17 +699,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminMxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1584,6 +1591,8 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
@@ -1593,7 +1602,9 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
- /* Initialize (or reset) page-level counters */
+ /* Initialize (or reset) page-level state */
+ NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1801,7 +1812,8 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenXid, &NewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1815,13 +1827,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1972,6 +1987,8 @@ lazy_scan_noprune(LVRelState *vacrel,
missed_dead_tuples;
HeapTupleHeader tupleheader;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ TransactionId NoFreezeNewRelfrozenXid = vacrel->NewRelfrozenXid;
+ MultiXactId NoFreezeNewRelminMxid = vacrel->NewRelminMxid;
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2017,20 +2034,39 @@ lazy_scan_noprune(LVRelState *vacrel,
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
if (heap_tuple_needs_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff))
+ vacrel->MultiXactCutoff,
+ &NoFreezeNewRelfrozenXid,
+ &NoFreezeNewRelminMxid))
{
+ /* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
+
if (vacrel->aggressive)
{
- /* Going to have to get cleanup lock for lazy_scan_prune */
+ /*
+ * Aggressive VACUUMs must always be able to advance rel's
+ * relfrozenxid to a value >= FreezeLimit (and to advance
+ * rel's relminmxid to a value >= MultiXactCutoff). The
+ * ongoing aggressive VACUUM cannot satisfy these requirements
+ * without freezing an XID (or XMID) from this tuple.
+ *
+ * The only safe option is to have caller perform processing
+ * of this page using lazy_scan_prune. Caller might have to
+ * wait a while for a cleanup lock, but it can't be helped.
+ */
vacrel->offnum = InvalidOffsetNumber;
return false;
}
-
- /*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
- */
- vacrel->freeze_cutoffs_valid = false;
+ else
+ {
+ /*
+ * Standard VACUUMs are not obligated to advance relfrozenxid
+ * or relminmxid by any amount, so we can be much laxer here.
+ *
+ * Currently we always just accept an older final relfrozenxid
+ * and/or relminmxid value. We never make caller wait or work
+ * a little harder, even when it likely makes sense to do so.
+ */
+ }
}
ItemPointerSet(&(tuple.t_self), blkno, offnum);
@@ -2080,9 +2116,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
/*
- * Now save details of the LP_DEAD items from the page in vacrel (though
- * only when VACUUM uses two-pass strategy)
+ * By here we know for sure that caller can tolerate reduced processing
+ * for this particular page. Save all of the details in vacrel now.
+ * (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
+ vacrel->NewRelfrozenXid = NoFreezeNewRelfrozenXid;
+ vacrel->NewRelminMxid = NoFreezeNewRelminMxid;
+
+ /* Save details of the LP_DEAD items from the page */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 50a4a612e..0ae3b4506 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,22 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact are the most recent values that can ever be
+ * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
+ * vacuumlazy.c caller later on. These values should be passed when it turns
+ * out that VACUUM will leave no unfrozen XIDs/XMIDs behind in the table.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +969,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +978,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1073,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1092,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
@@ -1390,14 +1400,10 @@ vac_update_relstats(Relation relation,
* Update relfrozenxid, unless caller passed InvalidTransactionId
* indicating it has no new data.
*
- * Ordinarily, we don't let relfrozenxid go backwards: if things are
- * working correctly, the only way the new frozenxid could be older would
- * be if a previous VACUUM was done with a tighter freeze_min_age, in
- * which case we don't want to forget the work it already did. However,
- * if the stored relfrozenxid is "in the future", then it must be corrupt
- * and it seems best to overwrite it with the cutoff we used this time.
- * This should match vac_update_datfrozenxid() concerning what we consider
- * to be "in the future".
+ * Ordinarily, we don't let relfrozenxid go backwards. However, if the
+ * stored relfrozenxid is "in the future", then it must be corrupt, so
+ * just overwrite it. This should match vac_update_datfrozenxid()
+ * concerning what we consider to be "in the future".
*/
if (frozenxid_updated)
*frozenxid_updated = false;
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 36f975b1e..6a02d0fa8 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -563,9 +563,11 @@
statistics in the system tables <structname>pg_class</structname> and
<structname>pg_database</structname>. In particular,
the <structfield>relfrozenxid</structfield> column of a table's
- <structname>pg_class</structname> row contains the freeze cutoff XID that was used
- by the last aggressive <command>VACUUM</command> for that table. All rows
- inserted by transactions with XIDs older than this cutoff XID are
+ <structname>pg_class</structname> row contains the oldest
+ remaining XID at the end of the most recent <command>VACUUM</command>
+ that successfully advanced <structfield>relfrozenxid</structfield>
+ (typically the most recent aggressive <command>VACUUM</command>). All rows inserted
+ by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</structfield> column of a database's
<structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
@@ -588,6 +590,17 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
cutoff XID to the current transaction's XID.
</para>
+ <tip>
+ <para>
+ <literal>VACUUM VERBOSE</literal> outputs information about
+ <structfield>relfrozenxid</structfield> and/or
+ <structfield>relminmxid</structfield> whenever either field is
+ advanced. The same details appear in the server log when autovacuum
+ logging (controlled by <xref linkend="guc-log-autovacuum-min-duration"/>)
+ reports on a <command>VACUUM</command> performed by autovacuum.
+ </para>
+ </tip>
+
<para>
<command>VACUUM</command> normally only scans pages that have been modified
since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
@@ -602,7 +615,11 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
set <literal>age(relfrozenxid)</literal> to a value just a little more than the
<varname>vacuum_freeze_min_age</varname> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</command> started). If no <structfield>relfrozenxid</structfield>-advancing
+ <command>VACUUM</command> started). <command>VACUUM</command>
+ will set <structfield>relfrozenxid</structfield> to the oldest XID
+ that remains in the table, so it's possible that the final value
+ will be much more recent than strictly required.
+ If no <structfield>relfrozenxid</structfield>-advancing
<command>VACUUM</command> is issued on the table until
<varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
be forced for the table.
@@ -689,8 +706,9 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- Aggressive <command>VACUUM</command> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Aggressive <command>VACUUM</command> scans, regardless of what
+ causes them, are <emphasis>guaranteed</emphasis> to be able to
+ advance the table's <structfield>relminmxid</structfield>.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
new file mode 100644
index 000000000..f7bc93e8f
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -0,0 +1,189 @@
+Parsed test spec with 4 sessions
+
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+step dml_begin: BEGIN;
+step dml_other_begin: BEGIN;
+step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step dml_other_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
+step dml_commit: COMMIT;
+step dml_other_commit: COMMIT;
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_commit:
+ COMMIT;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
deleted file mode 100644
index ce55376e7..000000000
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ /dev/null
@@ -1,67 +0,0 @@
-Parsed test spec with 2 sessions
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify open fetch1 vac close stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step open:
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-
-step fetch1:
- fetch next from c1;
-
-dummy
------
- 1
-(1 row)
-
-step vac:
- vacuum smalltbl;
-
-step close:
- commit;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 8e8709815..35e0d1ee4 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -80,7 +80,7 @@ test: alter-table-4
test: create-trigger
test: sequence-ddl
test: async-notify
-test: vacuum-reltuples
+test: vacuum-no-cleanup-lock
test: timeouts
test: vacuum-concurrent-drop
test: vacuum-conflict
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
new file mode 100644
index 000000000..a88be66de
--- /dev/null
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -0,0 +1,150 @@
+# Test for vacuum's reduced processing of heap pages (used for any heap page
+# where a cleanup lock isn't immediately available)
+#
+# Debugging tip: Change VACUUM to VACUUM VERBOSE to get feedback on what's
+# really going on
+
+# Use name type here to avoid TOAST table:
+setup
+{
+ CREATE TABLE smalltbl AS SELECT i AS id, 't'::name AS t FROM generate_series(1,20) i;
+ ALTER TABLE smalltbl SET (autovacuum_enabled = off);
+ ALTER TABLE smalltbl ADD PRIMARY KEY (id);
+}
+setup
+{
+ VACUUM ANALYZE smalltbl;
+}
+
+teardown
+{
+ DROP TABLE smalltbl;
+}
+
+# This session holds a pin on smalltbl's only heap page:
+session pinholder
+step pinholder_cursor
+{
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+}
+step pinholder_commit
+{
+ COMMIT;
+}
+
+# This session inserts and deletes tuples, potentially affecting reltuples:
+session dml
+step dml_insert
+{
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+}
+step dml_delete
+{
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+}
+step dml_begin { BEGIN; }
+step dml_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_commit { COMMIT; }
+
+# Needed for Multixact test:
+session dml_other
+step dml_other_begin { BEGIN; }
+step dml_other_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_other_update { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
+step dml_other_commit { COMMIT; }
+
+# This session runs non-aggressive VACUUM, but with maximally aggressive
+# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+session vacuumer
+setup
+{
+ SET vacuum_freeze_min_age = 0;
+ SET vacuum_multixact_freeze_min_age = 0;
+}
+step vacuumer_nonaggressive_vacuum
+{
+ VACUUM smalltbl;
+}
+step vacuumer_pg_class_stats
+{
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+}
+
+# Test VACUUM's reltuples counting mechanism.
+#
+# Final pg_class.reltuples should never be affected by VACUUM's inability to
+# get a cleanup lock on any page, except to the extent that any cleanup lock
+# contention changes the number of tuples that remain ("missed dead" tuples
+# are counted in reltuples, much like "recently dead" tuples).
+
+# Easy case:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+
+# Harder case -- count 21 tuples at the end (like last time), but with cleanup
+# lock contention this time:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ pinholder_cursor
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but vary the order, and delete an inserted row:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ pinholder_cursor
+ dml_insert
+ dml_delete
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "recently dead" tuple won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but initial insert and delete before cursor:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ dml_delete
+ pinholder_cursor
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
+ # concurrent activity held back VACUUM's OldestXmin) won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Test VACUUM's mechanism for skipping MultiXact freezing.
+#
+# This provides test coverage for code paths that are only hit when we need to
+# freeze, but inability to acquire a cleanup lock on a heap page makes
+# freezing some XIDs/XMIDs < FreezeLimit/MultiXactCutoff impossible (without
+# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+permutation
+ dml_begin
+ dml_other_begin
+ dml_key_share
+ dml_other_key_share
+ # Will get cleanup lock, can't advance relminmxid yet:
+ # (though will usually advance relfrozenxid by ~2 XIDs)
+ vacuumer_nonaggressive_vacuum
+ pinholder_cursor
+ dml_other_update
+ dml_commit
+ dml_other_commit
+ # Can't cleanup lock, so still can't advance relminmxid here:
+ # (relfrozenxid held back by XIDs in MultiXact too)
+ vacuumer_nonaggressive_vacuum
+ pinholder_commit
+ # Pin was dropped, so will advance relminmxid, at long last:
+ # (ditto for relfrozenxid advancement)
+ vacuumer_nonaggressive_vacuum
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
deleted file mode 100644
index a2a461f2f..000000000
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ /dev/null
@@ -1,49 +0,0 @@
-# Test for vacuum's handling of reltuples when pages are skipped due
-# to page pins. We absolutely need to avoid setting reltuples=0 in
-# such cases, since that interferes badly with planning.
-#
-# Expected result for all three permutation is 21 tuples, including
-# the second permutation. VACUUM is able to count the concurrently
-# inserted tuple in its final reltuples, even when a cleanup lock
-# cannot be acquired on the affected heap page.
-
-setup {
- create table smalltbl
- as select i as id from generate_series(1,20) i;
- alter table smalltbl set (autovacuum_enabled = off);
-}
-setup {
- vacuum analyze smalltbl;
-}
-
-teardown {
- drop table smalltbl;
-}
-
-session worker
-step open {
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-}
-step fetch1 {
- fetch next from c1;
-}
-step close {
- commit;
-}
-step stats {
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-}
-
-session vacuumer
-step vac {
- vacuum smalltbl;
-}
-step modify {
- insert into smalltbl select max(id)+1 from smalltbl;
-}
-
-permutation modify vac stats
-permutation modify open fetch1 vac close stats
-permutation modify vac stats
--
2.32.0
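
(An aside for reviewers of 0001: every new relfrozenxid_out/relminmxid_out
argument follows the same ratcheting rule. The sketch below is not part of
the patch -- the struct and helper names are invented for illustration --
but it shows the invariant that callers rely on: the trackers start out at
VACUUM's OldestXmin/OldestMxact and are only ever pushed back, to the
oldest XID/MXID that will remain unfrozen in the table.)

/*
 * Hypothetical illustration only -- not from the patch.  Assumes the
 * usual PostgreSQL headers for the XID/MultiXactId helpers used here.
 */
#include "postgres.h"
#include "access/multixact.h"
#include "access/transam.h"

typedef struct OldestExtantTrackers
{
    TransactionId NewRelfrozenXid;  /* initialized to VACUUM's OldestXmin */
    MultiXactId   NewRelminMxid;    /* initialized to VACUUM's OldestMxact */
} OldestExtantTrackers;

/* A normal XID that will remain unfrozen can hold back relfrozenxid */
static inline void
track_unfrozen_xid(OldestExtantTrackers *track, TransactionId xid)
{
    if (TransactionIdIsNormal(xid) &&
        TransactionIdPrecedes(xid, track->NewRelfrozenXid))
        track->NewRelfrozenXid = xid;
}

/* A MultiXactId that will remain in some xmax can hold back relminmxid */
static inline void
track_unfrozen_multi(OldestExtantTrackers *track, MultiXactId multi)
{
    if (MultiXactIdIsValid(multi) &&
        MultiXactIdPrecedes(multi, track->NewRelminMxid))
        track->NewRelminMxid = multi;
}

Keeping the rule this small is what lets lazy_scan_prune and
lazy_scan_noprune share it; the only real difference between them is
whether the tracked values assume that freezing will or won't go ahead.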
Attachment: v12-0003-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch (application/octet-stream)
From 7aabff81998ad3f500aa8c6da354ea2e81cee10c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 25 Mar 2022 12:51:05 -0700
Subject: [PATCH v12 3/3] vacuumlazy.c: Move resource allocation to
heap_vacuum_rel().
Finish off work started by commit 73f6ec3d: move remaining resource
allocation and deallocation code from lazy_scan_heap() to its caller,
heap_vacuum_rel().
---
src/backend/access/heap/vacuumlazy.c | 68 ++++++++++++----------------
1 file changed, 28 insertions(+), 40 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c647d60bc..7e641036e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -246,7 +246,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static void lazy_scan_heap(LVRelState *vacrel);
static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
BlockNumber next_block,
bool *next_unskippable_allvis,
@@ -519,11 +519,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->NewRelminMxid = OldestMxact;
vacrel->skippedallvis = false;
+ /*
+ * Allocate dead_items array memory using dead_items_alloc. This handles
+ * parallel VACUUM initialization as part of allocating shared memory
+ * space used for dead_items. (But do a failsafe precheck first, to
+ * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
+ * is already dangerously old.)
+ */
+ lazy_check_wraparound_failsafe(vacrel);
+ dead_items_alloc(vacrel, params->nworkers);
+
/*
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params->nworkers);
+ lazy_scan_heap(vacrel);
+
+ /*
+ * Free resources managed by dead_items_alloc. This ends parallel mode in
+ * passing when necessary.
+ */
+ dead_items_cleanup(vacrel);
+ Assert(!IsInParallelMode());
/*
* Update pg_class entries for each of rel's indexes where appropriate.
@@ -833,14 +850,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, int nworkers)
+lazy_scan_heap(LVRelState *vacrel)
{
- VacDeadItems *dead_items;
BlockNumber rel_pages = vacrel->rel_pages,
blkno,
next_unskippable_block,
- next_failsafe_block,
- next_fsm_block_to_vacuum;
+ next_failsafe_block = 0,
+ next_fsm_block_to_vacuum = 0;
+ VacDeadItems *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -851,23 +868,6 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
};
int64 initprog_val[3];
- /*
- * Do failsafe precheck before calling dead_items_alloc. This ensures
- * that parallel VACUUM won't be attempted when relfrozenxid is already
- * dangerously old.
- */
- lazy_check_wraparound_failsafe(vacrel);
- next_failsafe_block = 0;
-
- /*
- * Allocate the space for dead_items. Note that this handles parallel
- * VACUUM initialization as part of allocating shared memory space used
- * for dead_items.
- */
- dead_items_alloc(vacrel, nworkers);
- dead_items = vacrel->dead_items;
- next_fsm_block_to_vacuum = 0;
-
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
@@ -1244,12 +1244,11 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
}
+ vacrel->blkno = InvalidBlockNumber;
+
/* report that everything is now scanned */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
- /* Clear the block number information */
- vacrel->blkno = InvalidBlockNumber;
-
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
vacrel->scanned_pages,
@@ -1264,15 +1263,11 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->missed_dead_tuples;
/*
- * Release any remaining pin on visibility map page.
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
*/
if (BufferIsValid(vmbuffer))
- {
ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a final round of index and heap vacuuming */
if (dead_items->num_items > 0)
lazy_vacuum(vacrel);
@@ -1286,16 +1281,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/* report all blocks vacuumed */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
- /* Do post-vacuum cleanup */
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
lazy_cleanup_all_indexes(vacrel);
-
- /*
- * Free resources managed by dead_items_alloc. This ends parallel mode in
- * passing when necessary.
- */
- dead_items_cleanup(vacrel);
- Assert(!IsInParallelMode());
}
/*
--
2.32.0
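
(For anyone skimming 0003: the net effect on heap_vacuum_rel()'s structure
is roughly the outline below. This is a simplified, hypothetical sketch --
the wrapper name is invented and everything the patch doesn't touch is
elided -- but the call order matches the hunks above.)

/*
 * Hypothetical outline only -- assumes vacuumlazy.c's existing includes
 * and types.  dead_items allocation/cleanup now brackets the call to
 * lazy_scan_heap() at the top level instead of living inside it.
 */
static void
heap_vacuum_rel_outline(LVRelState *vacrel, VacuumParams *params)
{
    /*
     * Failsafe precheck, so parallel VACUUM is never attempted when
     * relfrozenxid is already dangerously old
     */
    lazy_check_wraparound_failsafe(vacrel);

    /* Allocate dead_items (sets up parallel VACUUM when it's used) */
    dead_items_alloc(vacrel, params->nworkers);

    /* All required heap pruning, index vacuuming, and heap vacuuming */
    lazy_scan_heap(vacrel);

    /* Free dead_items resources; ends parallel mode in passing */
    dead_items_cleanup(vacrel);
    Assert(!IsInParallelMode());
}

Doing the failsafe precheck before dead_items_alloc() is what ensures that
parallel workers are never launched for a table whose relfrozenxid is
already dangerously old.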
Attachment: v12-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch (application/octet-stream)
From d359145664f8efe54376ff3bb0b26a07ebbba965 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v12 2/3] Generalize how VACUUM skips all-frozen pages.
Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid before now. The
underlying issue was that lazy_scan_heap conditioned its skipping
behavior on whether or not the current VACUUM was aggressive. VACUUM
could fail to increment its frozenskipped_pages counter as a result, and
so could miss out on advancing relfrozenxid (in the non-aggressive case)
for no good reason.
The issue only comes up when concurrent activity might unset a page's
visibility map bit at exactly the wrong time. The non-aggressive case
rechecked the visibility map at the point of skipping each page before
now. This created a window for some other session to concurrently unset
the same heap page's bit in the visibility map. If the bit was unset at
the wrong time, it would cause VACUUM to conservatively conclude that
the page was _never_ all-frozen on recheck. frozenskipped_pages would
not be incremented for the page as a result. lazy_scan_heap had already
committed to skipping the page/range at that point, though -- which made
it unsafe to advance relfrozenxid/relminmxid later on.
Consistently avoid the issue by generalizing how we skip frozen pages
during aggressive VACUUMs: take the same approach when skipping any
skippable page range during aggressive and non-aggressive VACUUMs alike.
The new approach makes ranges (not individual pages) the fundamental
unit of skipping using the visibility map. frozenskipped_pages is
replaced with a boolean flag that represents whether some skippable
range with one or more all-visible pages was actually skipped (making
relfrozenxid unsafe to update).
It is safe for VACUUM to treat a page as all-frozen provided it at least
had its all-frozen bit set after the OldestXmin cutoff was established.
VACUUM is only required to scan pages that might have XIDs < OldestXmin
that are not yet frozen to be able to safely advance relfrozenxid.
Tuples concurrently inserted on skipped pages are equivalent to tuples
concurrently inserted on a block >= rel_pages from the same table.
It's possible that the issue this commit fixes hardly ever came up in
practice. But we only had to be unlucky once to lose out on advancing
relfrozenxid -- a single affected heap page was enough to throw VACUUM
off. That seems like something to avoid on general principle. This is
similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c
to not give up on non-aggressive relfrozenxid advancement just because a
cleanup lock wasn't immediately available on some heap page.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
Discussion: https://postgr.es/m/CA+TgmobhuzSR442_cfpgxidmiRdL-GdaFSc8SD=GJcpLTx_BAw@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 309 +++++++++++++--------------
1 file changed, 146 insertions(+), 163 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 723408744..c647d60bc 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -176,6 +176,7 @@ typedef struct LVRelState
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
+ bool skippedallvis;
/* Error reporting state */
char *relnamespace;
@@ -196,7 +197,6 @@ typedef struct LVRelState
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
@@ -247,6 +247,10 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
+ BlockNumber next_block,
+ bool *next_unskippable_allvis,
+ bool *skipping_current_range);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
bool sharelock, Buffer vmbuffer);
@@ -467,7 +471,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize page counters explicitly (be tidy) */
vacrel->scanned_pages = 0;
- vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
@@ -514,6 +517,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize state used to track oldest extant XID/XMID */
vacrel->NewRelfrozenXid = OldestXmin;
vacrel->NewRelminMxid = OldestMxact;
+ vacrel->skippedallvis = false;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -578,7 +582,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(vacrel->NewRelminMxid == OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->relminmxid,
vacrel->NewRelminMxid));
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ if (vacrel->skippedallvis)
{
/* Keep existing relfrozenxid and relminmxid (can't trust trackers) */
Assert(!aggressive);
@@ -838,7 +842,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
next_failsafe_block,
next_fsm_block_to_vacuum;
Buffer vmbuffer = InvalidBuffer;
- bool skipping_blocks;
+ bool next_unskippable_allvis,
+ skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -869,179 +874,52 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
initprog_val[2] = dead_items->max_items;
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
- /*
- * Set things up for skipping blocks using visibility map.
- *
- * Except when vacrel->aggressive is set, we want to skip pages that are
- * all-visible according to the visibility map, but only when we can skip
- * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
- * sequentially, the OS should be doing readahead for us, so there's no
- * gain in skipping a page now and then; that's likely to disable
- * readahead and so be counterproductive. Also, skipping even a single
- * page means that we can't update relfrozenxid, so we only want to do it
- * if we can skip a goodly number of pages.
- *
- * When vacrel->aggressive is set, we can't skip pages just because they
- * are all-visible, but we can still skip pages that are all-frozen, since
- * such pages do not need freezing and do not affect the value that we can
- * safely set for relfrozenxid or relminmxid.
- *
- * Before entering the main loop, establish the invariant that
- * next_unskippable_block is the next block number >= blkno that we can't
- * skip based on the visibility map, either all-visible for a regular scan
- * or all-frozen for an aggressive scan. We set it to rel_pages when
- * there's no such block. We also set up the skipping_blocks flag
- * correctly at this stage.
- *
- * Note: The value returned by visibilitymap_get_status could be slightly
- * out-of-date, since we make this test before reading the corresponding
- * heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible or all-frozen when in fact the flag's just
- * been cleared, we might fail to vacuum the page. It's easy to see that
- * skipping a page when aggressive is not set is not a very big deal; we
- * might leave some dead tuples lying around, but the next vacuum will
- * find them. But even when aggressive *is* set, it's still OK if we miss
- * a page whose all-frozen marking has just been cleared. Any new XIDs
- * just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they'll have no effect on the value to which we can safely set
- * relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- */
- next_unskippable_block = 0;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmstatus;
-
- vmstatus = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
- if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
-
+ /* Set up an initial range of skippable blocks using the visibility map */
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
+ &next_unskippable_allvis,
+ &skipping_current_range);
for (blkno = 0; blkno < rel_pages; blkno++)
{
Buffer buf;
Page page;
- bool all_visible_according_to_vm = false;
+ bool all_visible_according_to_vm;
LVPagePruneState prunestate;
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
- blkno, InvalidOffsetNumber);
-
if (blkno == next_unskippable_block)
{
- /* Time to advance next_unskippable_block */
- next_unskippable_block++;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmskipflags;
-
- vmskipflags = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
/*
- * We know we can't skip the current block. But set up
- * skipping_blocks to do the right thing at the following blocks.
+ * Can't skip this page safely. Must scan the page. But
+ * determine the next skippable range after the page first.
*/
- if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
+ all_visible_according_to_vm = next_unskippable_allvis;
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
+ blkno + 1,
+ &next_unskippable_allvis,
+ &skipping_current_range);
- /*
- * Normally, the fact that we can't skip this block must mean that
- * it's not all-visible. But in an aggressive vacuum we know only
- * that it's not all-frozen, so it might still be all-visible.
- */
- if (vacrel->aggressive &&
- VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
- all_visible_according_to_vm = true;
+ Assert(next_unskippable_block >= blkno + 1);
}
else
{
- /*
- * The current page can be skipped if we've seen a long enough run
- * of skippable blocks to justify skipping it -- provided it's not
- * the last page in the relation (according to rel_pages).
- *
- * We always scan the table's last page to determine whether it
- * has tuples or not, even if it would otherwise be skipped. This
- * avoids having lazy_truncate_heap() take access-exclusive lock
- * on the table to attempt a truncation that just fails
- * immediately because there are tuples on the last page.
- */
- if (skipping_blocks && blkno < rel_pages - 1)
- {
- /*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
- */
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
- continue;
- }
+ /* Last page always scanned (may need to set nonempty_pages) */
+ Assert(blkno < rel_pages - 1);
- /*
- * SKIP_PAGES_THRESHOLD (threshold for skipping) was not
- * crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
- */
+ if (skipping_current_range)
+ continue;
+
+ /* Current range is too small to skip -- just scan the page */
all_visible_according_to_vm = true;
}
- vacuum_delay_point();
-
- /*
- * We're not skipping this page using the visibility map, and so it is
- * (by definition) a scanned page. Any tuples from this page are now
- * guaranteed to be counted below, after some preparatory checks.
- */
vacrel->scanned_pages++;
+ /* Report as block scanned, update error traceback information */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+ blkno, InvalidOffsetNumber);
+
+ vacuum_delay_point();
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1241,8 +1119,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
/*
- * Handle setting visibility map bit based on what the VM said about
- * the page before pruning started, and using prunestate
+ * Handle setting visibility map bit based on information from the VM
+ * (as of last lazy_scan_skip() call), and from prunestate
*/
if (!all_visible_according_to_vm && prunestate.all_visible)
{
@@ -1274,9 +1152,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
- * got cleared after we checked it and before we took the buffer
- * content lock, so we must recheck before jumping to the conclusion
- * that something bad has happened.
+ * got cleared after lazy_scan_skip() was called, so we must recheck
+ * with buffer lock before concluding that the VM is corrupt.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
@@ -1315,7 +1192,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* If the all-visible page is all-frozen but not marked as such yet,
* mark it as all-frozen. Note that all_frozen is only valid if
- * all_visible is true, so we must check both.
+ * all_visible is true, so we must check both prunestate fields.
*/
else if (all_visible_according_to_vm && prunestate.all_visible &&
prunestate.all_frozen &&
@@ -1421,6 +1298,112 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
Assert(!IsInParallelMode());
}
+/*
+ * lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *
+ * lazy_scan_heap() calls here every time it needs to set up a new range of
+ * blocks to skip via the visibility map. Caller passes the next block in
+ * line. We return a next_unskippable_block for this range. When there are
+ * no skippable blocks we just return caller's next_block. The all-visible
+ * status of the returned block is set in *next_unskippable_allvis for caller,
+ * too. Block usually won't be all-visible (since it's unskippable), but it
+ * can be during aggressive VACUUMs (as well as in certain edge cases).
+ *
+ * Sets *skipping_current_range to indicate if caller should skip this range.
+ * Costs and benefits drive our decision. Very small ranges won't be skipped.
+ *
+ * Note: our opinion of which blocks can be skipped can go stale immediately.
+ * It's okay if caller "misses" a page whose all-visible or all-frozen marking
+ * was concurrently cleared, though. All that matters is that caller scan all
+ * pages whose tuples might contain XIDs < OldestXmin, or XMIDs < OldestMxact.
+ * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
+ * older XIDs/MXIDs. The vacrel->skippedallvis flag will be set here when the
+ * choice to skip such a range is actually made, making everything safe.)
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
+ bool *next_unskippable_allvis, bool *skipping_current_range)
+{
+ BlockNumber rel_pages = vacrel->rel_pages,
+ next_unskippable_block = next_block,
+ nskippable_blocks = 0;
+ bool allvisinrange = false;
+
+ *next_unskippable_allvis = true;
+ while (next_unskippable_block < rel_pages)
+ {
+ uint8 mapbits = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ vmbuffer);
+
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ *next_unskippable_allvis = false;
+ break;
+ }
+
+ /*
+ * Caller must scan the last page to determine whether it has tuples
+ * (caller must have the opportunity to set vacrel->nonempty_pages).
+ * This rule avoids having lazy_truncate_heap() take access-exclusive
+ * lock on rel to attempt a truncation that fails anyway, just because
+ * there are tuples on the last page (it is likely that there will be
+ * tuples on other nearby pages as well, but those can be skipped).
+ *
+ * Implement this by always treating the last block as unsafe to skip.
+ */
+ if (next_unskippable_block == rel_pages - 1)
+ break;
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible. They may still skip all-frozen pages, which can't
+ * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ allvisinrange = true;
+ }
+
+ vacuum_delay_point();
+ next_unskippable_block++;
+ nskippable_blocks++;
+ }
+
+ /*
+ * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
+ * pages. Since we're reading sequentially, the OS should be doing
+ * readahead for us, so there's no gain in skipping a page now and then.
+ * Skipping such a range might even discourage sequential detection.
+ *
+ * This test also enables more frequent relfrozenxid advancement during
+ * non-aggressive VACUUMs. If the range has any all-visible pages then
+ * skipping makes updating relfrozenxid unsafe, which is a real downside.
+ */
+ if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
+ *skipping_current_range = false;
+ else
+ {
+ *skipping_current_range = true;
+ if (allvisinrange)
+ vacrel->skippedallvis = true;
+ }
+
+ return next_unskippable_block;
+}
+
/*
* lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
*
--
2.32.0
On Sun, Mar 27, 2022 at 11:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v12. My current goal is to commit all 3 patches before
feature freeze. Note that this does not include the more complicated
patch included with previous revisions of the patch series (the
page-level freezing work that appeared in versions before v11).
Reviewing 0001, focusing on the words in the patch file much more than the code:
I can understand this version of the commit message. Woohoo! I like
understanding things.
I think the header comments for FreezeMultiXactId() focus way too much
on what the caller is supposed to do and not nearly enough on what
FreezeMultiXactId() itself does. I think to some extent this also
applies to the comments within the function body.
On the other hand, the header comments for heap_prepare_freeze_tuple()
seem good to me. If I were thinking of calling this function, I would
know how to use the new arguments. If I were looking for bugs in it, I
could compare the logic in the function to what these comments say it
should be doing. Yay.
I think I understand what the first paragraph of the header comment
for heap_tuple_needs_freeze() is trying to say, but the second one is
quite confusing. I think this is again because it veers into talking
about what the caller should do rather than explaining what the
function itself does.
I don't like the statement-free else block in lazy_scan_noprune(). I
think you could delete the else{} and just put that same comment there
with one less level of indentation. There's a clear "return false"
just above so it shouldn't be confusing what's happening.
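Roughly this kind of restructuring, in other words (a generic sketch,
not the actual lazy_scan_noprune() code):

	static bool
	check_example(int value)
	{
		if (value < 0)
		{
			/* cannot proceed without doing more work */
			return false;
		}

		/*
		 * This comment used to sit inside a statement-free else {} block.
		 * With the "return false" directly above, it reads just as clearly
		 * here, with one less level of indentation.
		 */
		return true;
	}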
The comment hunk at the end of lazy_scan_noprune() would probably be
better if it said something more specific than "caller can tolerate
reduced processing." My guess is that it would be something like
"caller does not need to do something or other."
I have my doubts about whether the overwrite-a-future-relfrozenxid
behavior is any good, but that's a topic for another day. I suggest
keeping the words "it seems best to", though, because they convey a
level of tentativeness, which seems appropriate.
I am surprised to see you write in maintenance.sgml that the VACUUM
which most recently advanced relfrozenxid will typically be the most
recent aggressive VACUUM. I would have expected something like "(often
the most recent VACUUM)".
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Mar 29, 2022 at 10:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
I can understand this version of the commit message. Woohoo! I like
understanding things.
That's good news.
I think the header comments for FreezeMultiXactId() focus way too much
on what the caller is supposed to do and not nearly enough on what
FreezeMultiXactId() itself does. I think to some extent this also
applies to the comments within the function body.
To some extent this is a legitimate difference in style. I myself
don't think that it's intrinsically good to have these sorts of
comments. I just think that it can be the least worst thing when a
function is intrinsically written with one caller and one very
specific set of requirements in mind. That is pretty much a matter of
taste, though.
I think I understand what the first paragraph of the header comment
for heap_tuple_needs_freeze() is trying to say, but the second one is
quite confusing. I think this is again because it veers into talking
about what the caller should do rather than explaining what the
function itself does.
I wouldn't have done it that way if the function wasn't called
heap_tuple_needs_freeze().
I would be okay with removing this paragraph if the function was
renamed to reflect the fact it now tells the caller something about
the tuple having an old XID/MXID relative to the caller's own XID/MXID
cutoffs. Maybe the function name should be heap_tuple_would_freeze(),
making it clear that the function merely tells caller what
heap_prepare_freeze_tuple() *would* do, without presuming to tell the
vacuumlazy.c caller what it *should* do about any of the information
it is provided.
Then it becomes natural to see the boolean return value and the
changes the function makes to caller's relfrozenxid/relminmxid tracker
variables as independent.
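To make that concrete, a caller might end up looking roughly like this
(an illustrative sketch built around the v13 signature -- the
surrounding variable names are assumptions, not code from the patch):

	bool		would_freeze;

	would_freeze = heap_tuple_would_freeze(tuple.t_data,
										   FreezeLimit, MultiXactCutoff,
										   &NewRelfrozenXid, &NewRelminMxid);

	/*
	 * The relfrozenxid/relminmxid trackers were already ratcheted back
	 * above, whatever the return value turned out to be.  Acting on
	 * would_freeze is a separate decision for the caller to make.
	 */
	if (would_freeze)
	{
		/* e.g. give up on processing this page without a cleanup lock */
	}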
I don't like the statement-free else block in lazy_scan_noprune(). I
think you could delete the else{} and just put that same comment there
with one less level of indentation. There's a clear "return false"
just above so it shouldn't be confusing what's happening.
Okay, will fix.
The comment hunk at the end of lazy_scan_noprune() would probably be
better if it said something more specific than "caller can tolerate
reduced processing." My guess is that it would be something like
"caller does not need to do something or other."
I meant "caller can tolerate not pruning or freezing this particular
page". Will fix.
I have my doubts about whether the overwrite-a-future-relfrozenxid
behavior is any good, but that's a topic for another day. I suggest
keeping the words "it seems best to", though, because they convey a
level of tentativeness, which seems appropriate.
I agree that it's best to keep a tentative tone here. That code was
written following a very specific bug in pg_upgrade several years
back. There was a very recent bug fixed only last year, by commit
74cf7d46.
FWIW I tend to think that we'd have a much better chance of catching
that sort of thing if we'd had better relfrozenxid instrumentation
before now. Now you'd see a negative value in the "new relfrozenxid:
%u, which is %d xids ahead of previous value" part of the autovacuum
log message in the event of such a bug. That's weird enough that I bet
somebody would notice and report it.
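With made-up numbers plugged into that format string, the tell-tale
output from such a bug would look something like:

	new relfrozenxid: 4238501, which is -913406 xids ahead of previous value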
I am surprised to see you write in maintenance.sgml that the VACUUM
which most recently advanced relfrozenxid will typically be the most
recent aggressive VACUUM. I would have expected something like "(often
the most recent VACUUM)".
That's always been true, and will only be slightly less true in
Postgres 15 -- the fact is that we only need to skip one all-visible
page to lose out, and that's not unlikely for tables that aren't
quite small, even with all the patches from v12 applied (we're still
much too naive).
valuable as a basis for future improvements, but not all that valuable
to users (improved instrumentation might be the biggest benefit in 15,
or maybe relminmxid advancement for certain types of applications).
I still think that we need to do more proactive page-level freezing to
make relfrozenxid advancement happen in almost every VACUUM, but even
that won't quite be enough. There are still cases where we need to
make a choice about giving up on relfrozenxid advancement in a
non-aggressive VACUUM -- all-visible pages won't completely go away
with page-level freezing. At a minimum we'll still have edge cases
like the case where heap_lock_tuple() unsets the all-frozen bit. And
pg_upgrade'd databases, too.
0002 structures the logic for skipping using the VM in a way that will
make the choice to skip or not skip all-visible pages in
non-aggressive VACUUMs quite natural. I suspect that
SKIP_PAGES_THRESHOLD was mostly just about relfrozenxid
advancement in non-aggressive VACUUMs all along. We can do much better
than SKIP_PAGES_THRESHOLD, especially if we preprocess the entire
visibility map up-front -- we'll know the costs and benefits up-front,
before committing to early relfrozenxid advancement.
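For example, a preprocessing pass might do no more than tally what the
VM says about every page before any skipping decisions are made (a
hypothetical sketch only -- the function and variable names are
invented, and nothing like this appears in the posted patches):

	static void
	vm_count_skippable_pages(LVRelState *vacrel, Buffer *vmbuffer,
							 BlockNumber *all_visible, BlockNumber *all_frozen)
	{
		BlockNumber blkno;

		*all_visible = 0;
		*all_frozen = 0;

		for (blkno = 0; blkno < vacrel->rel_pages; blkno++)
		{
			uint8		mapbits = visibilitymap_get_status(vacrel->rel, blkno,
														   vmbuffer);

			if (mapbits & VISIBILITYMAP_ALL_VISIBLE)
				(*all_visible)++;
			if (mapbits & VISIBILITYMAP_ALL_FROZEN)
				(*all_frozen)++;

			vacuum_delay_point();
		}
	}

Armed with those totals, lazy_scan_heap could weigh the pages it would
have to scan against the chance of advancing relfrozenxid before it
commits to skipping anything.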
Overall, aggressive vs non-aggressive VACUUM seems like a false
dichotomy to me. ISTM that it should be a totally dynamic set of
behaviors. There should probably be several different "aggressive
gradations". Most VACUUMs start out completely non-aggressive
(including even anti-wraparound autovacuums), but can escalate from
there. The non-cancellable autovacuum behavior (technically an
anti-wraparound thing, but really an aggressiveness thing) should be
something we escalate to, as with the failsafe.
Dynamic behavior works a lot better. And it makes scheduling of
autovacuum workers a lot more straightforward -- the discontinuities
seem to make that much harder, which is one more reason to avoid them
altogether.
--
Peter Geoghegan
On Tue, Mar 29, 2022 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
I think I understand what the first paragraph of the header comment
for heap_tuple_needs_freeze() is trying to say, but the second one is
quite confusing. I think this is again because it veers into talking
about what the caller should do rather than explaining what the
function itself does.
I wouldn't have done it that way if the function wasn't called
heap_tuple_needs_freeze().
I would be okay with removing this paragraph if the function was
renamed to reflect the fact it now tells the caller something about
the tuple having an old XID/MXID relative to the caller's own XID/MXID
cutoffs. Maybe the function name should be heap_tuple_would_freeze(),
making it clear that the function merely tells caller what
heap_prepare_freeze_tuple() *would* do, without presuming to tell the
vacuumlazy.c caller what it *should* do about any of the information
it is provided.
Attached is v13, which does it that way. This does seem like a real
increase in clarity, albeit one that comes at the cost of renaming
heap_tuple_needs_freeze().
v13 also addresses all of the other items from Robert's most recent
round of feedback.
I would like to commit something close to v13 on Friday or Saturday.
Thanks
--
Peter Geoghegan
Attachments:
v13-0003-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch
From 0114ad047d7c513705421b1fcf3d2ff7fa8a001c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 25 Mar 2022 12:51:05 -0700
Subject: [PATCH v13 3/3] vacuumlazy.c: Move resource allocation to
heap_vacuum_rel().
Finish off work started by commit 73f6ec3d: move remaining resource
allocation and deallocation code from lazy_scan_heap() to its caller,
heap_vacuum_rel().
---
src/backend/access/heap/vacuumlazy.c | 68 ++++++++++++----------------
1 file changed, 28 insertions(+), 40 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e5c08166a..3562decf8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -246,7 +246,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static void lazy_scan_heap(LVRelState *vacrel);
static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
BlockNumber next_block,
bool *next_unskippable_allvis,
@@ -519,11 +519,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->NewRelminMxid = OldestMxact;
vacrel->skippedallvis = false;
+ /*
+ * Allocate dead_items array memory using dead_items_alloc. This handles
+ * parallel VACUUM initialization as part of allocating shared memory
+ * space used for dead_items. (But do a failsafe precheck first, to
+ * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
+ * is already dangerously old.)
+ */
+ lazy_check_wraparound_failsafe(vacrel);
+ dead_items_alloc(vacrel, params->nworkers);
+
/*
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params->nworkers);
+ lazy_scan_heap(vacrel);
+
+ /*
+ * Free resources managed by dead_items_alloc. This ends parallel mode in
+ * passing when necessary.
+ */
+ dead_items_cleanup(vacrel);
+ Assert(!IsInParallelMode());
/*
* Update pg_class entries for each of rel's indexes where appropriate.
@@ -833,14 +850,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, int nworkers)
+lazy_scan_heap(LVRelState *vacrel)
{
- VacDeadItems *dead_items;
BlockNumber rel_pages = vacrel->rel_pages,
blkno,
next_unskippable_block,
- next_failsafe_block,
- next_fsm_block_to_vacuum;
+ next_failsafe_block = 0,
+ next_fsm_block_to_vacuum = 0;
+ VacDeadItems *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -851,23 +868,6 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
};
int64 initprog_val[3];
- /*
- * Do failsafe precheck before calling dead_items_alloc. This ensures
- * that parallel VACUUM won't be attempted when relfrozenxid is already
- * dangerously old.
- */
- lazy_check_wraparound_failsafe(vacrel);
- next_failsafe_block = 0;
-
- /*
- * Allocate the space for dead_items. Note that this handles parallel
- * VACUUM initialization as part of allocating shared memory space used
- * for dead_items.
- */
- dead_items_alloc(vacrel, nworkers);
- dead_items = vacrel->dead_items;
- next_fsm_block_to_vacuum = 0;
-
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
@@ -1244,12 +1244,11 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
}
+ vacrel->blkno = InvalidBlockNumber;
+
/* report that everything is now scanned */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
- /* Clear the block number information */
- vacrel->blkno = InvalidBlockNumber;
-
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
vacrel->scanned_pages,
@@ -1264,15 +1263,11 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->missed_dead_tuples;
/*
- * Release any remaining pin on visibility map page.
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
*/
if (BufferIsValid(vmbuffer))
- {
ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a final round of index and heap vacuuming */
if (dead_items->num_items > 0)
lazy_vacuum(vacrel);
@@ -1286,16 +1281,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/* report all blocks vacuumed */
pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
- /* Do post-vacuum cleanup */
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
lazy_cleanup_all_indexes(vacrel);
-
- /*
- * Free resources managed by dead_items_alloc. This ends parallel mode in
- * passing when necessary.
- */
- dead_items_cleanup(vacrel);
- Assert(!IsInParallelMode());
}
/*
--
2.32.0
v13-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch
From 61ad7689a0b68d1fddec3e3b7840395b7bd85a72 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v13 2/3] Generalize how VACUUM skips all-frozen pages.
Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid before now. The
underlying issue was that lazy_scan_heap conditioned its skipping
behavior on whether or not the current VACUUM was aggressive. VACUUM
could fail to increment its frozenskipped_pages counter as a result, and
so could miss out on advancing relfrozenxid (in the non-aggressive case)
for no good reason.
The issue only comes up when concurrent activity might unset a page's
visibility map bit at exactly the wrong time. The non-aggressive case
rechecked the visibility map at the point of skipping each page before
now. This created a window for some other session to concurrently unset
the same heap page's bit in the visibility map. If the bit was unset at
the wrong time, it would cause VACUUM to conservatively conclude that
the page was _never_ all-frozen on recheck. frozenskipped_pages would
not be incremented for the page as a result. lazy_scan_heap had already
committed to skipping the page/range at that point, though -- which made
it unsafe to advance relfrozenxid/relminmxid later on.
Consistently avoid the issue by generalizing how we skip frozen pages
during aggressive VACUUMs: take the same approach when skipping any
skippable page range during aggressive and non-aggressive VACUUMs alike.
The new approach makes ranges (not individual pages) the fundamental
unit of skipping using the visibility map. frozenskipped_pages is
replaced with a boolean flag that represents whether some skippable
range with one or more all-visible pages was actually skipped (making
relfrozenxid unsafe to update).
It is safe for VACUUM to treat a page as all-frozen provided it at least
had its all-frozen bit set after the OldestXmin cutoff was established.
VACUUM is only required to scan pages that might have XIDs < OldestXmin
that are not yet frozen to be able to safely advance relfrozenxid.
Tuples concurrently inserted on skipped pages are equivalent to tuples
concurrently inserted on a block >= rel_pages from the same table.
It's possible that the issue this commit fixes hardly ever came up in
practice. But we only had to be unlucky once to lose out on advancing
relfrozenxid -- a single affected heap page was enough to throw VACUUM
off. That seems like something to avoid on general principle. This is
similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c
to not give up on non-aggressive relfrozenxid advancement just because a
cleanup lock wasn't immediately available on some heap page.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
Discussion: https://postgr.es/m/CA+TgmobhuzSR442_cfpgxidmiRdL-GdaFSc8SD=GJcpLTx_BAw@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 309 +++++++++++++--------------
1 file changed, 146 insertions(+), 163 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6cb688efc..e5c08166a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -176,6 +176,7 @@ typedef struct LVRelState
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
+ bool skippedallvis;
/* Error reporting state */
char *relnamespace;
@@ -196,7 +197,6 @@ typedef struct LVRelState
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
@@ -247,6 +247,10 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
+ BlockNumber next_block,
+ bool *next_unskippable_allvis,
+ bool *skipping_current_range);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
bool sharelock, Buffer vmbuffer);
@@ -467,7 +471,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize page counters explicitly (be tidy) */
vacrel->scanned_pages = 0;
- vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
@@ -514,6 +517,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize state used to track oldest extant XID/XMID */
vacrel->NewRelfrozenXid = OldestXmin;
vacrel->NewRelminMxid = OldestMxact;
+ vacrel->skippedallvis = false;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -569,7 +573,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(vacrel->NewRelminMxid == OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->relminmxid,
vacrel->NewRelminMxid));
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ if (vacrel->skippedallvis)
{
/* Keep existing relfrozenxid and relminmxid (can't trust trackers) */
Assert(!aggressive);
@@ -838,7 +842,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
next_failsafe_block,
next_fsm_block_to_vacuum;
Buffer vmbuffer = InvalidBuffer;
- bool skipping_blocks;
+ bool next_unskippable_allvis,
+ skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -869,179 +874,52 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
initprog_val[2] = dead_items->max_items;
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
- /*
- * Set things up for skipping blocks using visibility map.
- *
- * Except when vacrel->aggressive is set, we want to skip pages that are
- * all-visible according to the visibility map, but only when we can skip
- * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
- * sequentially, the OS should be doing readahead for us, so there's no
- * gain in skipping a page now and then; that's likely to disable
- * readahead and so be counterproductive. Also, skipping even a single
- * page means that we can't update relfrozenxid, so we only want to do it
- * if we can skip a goodly number of pages.
- *
- * When vacrel->aggressive is set, we can't skip pages just because they
- * are all-visible, but we can still skip pages that are all-frozen, since
- * such pages do not need freezing and do not affect the value that we can
- * safely set for relfrozenxid or relminmxid.
- *
- * Before entering the main loop, establish the invariant that
- * next_unskippable_block is the next block number >= blkno that we can't
- * skip based on the visibility map, either all-visible for a regular scan
- * or all-frozen for an aggressive scan. We set it to rel_pages when
- * there's no such block. We also set up the skipping_blocks flag
- * correctly at this stage.
- *
- * Note: The value returned by visibilitymap_get_status could be slightly
- * out-of-date, since we make this test before reading the corresponding
- * heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible or all-frozen when in fact the flag's just
- * been cleared, we might fail to vacuum the page. It's easy to see that
- * skipping a page when aggressive is not set is not a very big deal; we
- * might leave some dead tuples lying around, but the next vacuum will
- * find them. But even when aggressive *is* set, it's still OK if we miss
- * a page whose all-frozen marking has just been cleared. Any new XIDs
- * just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they'll have no effect on the value to which we can safely set
- * relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- */
- next_unskippable_block = 0;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmstatus;
-
- vmstatus = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
- if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
-
+ /* Set up an initial range of skippable blocks using the visibility map */
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
+ &next_unskippable_allvis,
+ &skipping_current_range);
for (blkno = 0; blkno < rel_pages; blkno++)
{
Buffer buf;
Page page;
- bool all_visible_according_to_vm = false;
+ bool all_visible_according_to_vm;
LVPagePruneState prunestate;
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
- blkno, InvalidOffsetNumber);
-
if (blkno == next_unskippable_block)
{
- /* Time to advance next_unskippable_block */
- next_unskippable_block++;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmskipflags;
-
- vmskipflags = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
/*
- * We know we can't skip the current block. But set up
- * skipping_blocks to do the right thing at the following blocks.
+ * Can't skip this page safely. Must scan the page. But
+ * determine the next skippable range after the page first.
*/
- if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
+ all_visible_according_to_vm = next_unskippable_allvis;
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
+ blkno + 1,
+ &next_unskippable_allvis,
+ &skipping_current_range);
- /*
- * Normally, the fact that we can't skip this block must mean that
- * it's not all-visible. But in an aggressive vacuum we know only
- * that it's not all-frozen, so it might still be all-visible.
- */
- if (vacrel->aggressive &&
- VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
- all_visible_according_to_vm = true;
+ Assert(next_unskippable_block >= blkno + 1);
}
else
{
- /*
- * The current page can be skipped if we've seen a long enough run
- * of skippable blocks to justify skipping it -- provided it's not
- * the last page in the relation (according to rel_pages).
- *
- * We always scan the table's last page to determine whether it
- * has tuples or not, even if it would otherwise be skipped. This
- * avoids having lazy_truncate_heap() take access-exclusive lock
- * on the table to attempt a truncation that just fails
- * immediately because there are tuples on the last page.
- */
- if (skipping_blocks && blkno < rel_pages - 1)
- {
- /*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
- */
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
- continue;
- }
+ /* Last page always scanned (may need to set nonempty_pages) */
+ Assert(blkno < rel_pages - 1);
- /*
- * SKIP_PAGES_THRESHOLD (threshold for skipping) was not
- * crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
- */
+ if (skipping_current_range)
+ continue;
+
+ /* Current range is too small to skip -- just scan the page */
all_visible_according_to_vm = true;
}
- vacuum_delay_point();
-
- /*
- * We're not skipping this page using the visibility map, and so it is
- * (by definition) a scanned page. Any tuples from this page are now
- * guaranteed to be counted below, after some preparatory checks.
- */
vacrel->scanned_pages++;
+ /* Report as block scanned, update error traceback information */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+ blkno, InvalidOffsetNumber);
+
+ vacuum_delay_point();
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1241,8 +1119,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
/*
- * Handle setting visibility map bit based on what the VM said about
- * the page before pruning started, and using prunestate
+ * Handle setting visibility map bit based on information from the VM
+ * (as of last lazy_scan_skip() call), and from prunestate
*/
if (!all_visible_according_to_vm && prunestate.all_visible)
{
@@ -1274,9 +1152,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
- * got cleared after we checked it and before we took the buffer
- * content lock, so we must recheck before jumping to the conclusion
- * that something bad has happened.
+ * got cleared after lazy_scan_skip() was called, so we must recheck
+ * with buffer lock before concluding that the VM is corrupt.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
@@ -1315,7 +1192,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* If the all-visible page is all-frozen but not marked as such yet,
* mark it as all-frozen. Note that all_frozen is only valid if
- * all_visible is true, so we must check both.
+ * all_visible is true, so we must check both prunestate fields.
*/
else if (all_visible_according_to_vm && prunestate.all_visible &&
prunestate.all_frozen &&
@@ -1421,6 +1298,112 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
Assert(!IsInParallelMode());
}
+/*
+ * lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *
+ * lazy_scan_heap() calls here every time it needs to set up a new range of
+ * blocks to skip via the visibility map. Caller passes the next block in
+ * line. We return a next_unskippable_block for this range. When there are
+ * no skippable blocks we just return caller's next_block. The all-visible
+ * status of the returned block is set in *next_unskippable_allvis for caller,
+ * too. Block usually won't be all-visible (since it's unskippable), but it
+ * can be during aggressive VACUUMs (as well as in certain edge cases).
+ *
+ * Sets *skipping_current_range to indicate if caller should skip this range.
+ * Costs and benefits drive our decision. Very small ranges won't be skipped.
+ *
+ * Note: our opinion of which blocks can be skipped can go stale immediately.
+ * It's okay if caller "misses" a page whose all-visible or all-frozen marking
+ * was concurrently cleared, though. All that matters is that caller scan all
+ * pages whose tuples might contain XIDs < OldestXmin, or XMIDs < OldestMxact.
+ * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
+ * older XIDs/MXIDs. The vacrel->skippedallvis flag will be set here when the
+ * choice to skip such a range is actually made, making everything safe.)
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
+ bool *next_unskippable_allvis, bool *skipping_current_range)
+{
+ BlockNumber rel_pages = vacrel->rel_pages,
+ next_unskippable_block = next_block,
+ nskippable_blocks = 0;
+ bool skipsallvis = false;
+
+ *next_unskippable_allvis = true;
+ while (next_unskippable_block < rel_pages)
+ {
+ uint8 mapbits = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ vmbuffer);
+
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ *next_unskippable_allvis = false;
+ break;
+ }
+
+ /*
+ * Caller must scan the last page to determine whether it has tuples
+ * (caller must have the opportunity to set vacrel->nonempty_pages).
+ * This rule avoids having lazy_truncate_heap() take access-exclusive
+ * lock on rel to attempt a truncation that fails anyway, just because
+ * there are tuples on the last page (it is likely that there will be
+ * tuples on other nearby pages as well, but those can be skipped).
+ *
+ * Implement this by always treating the last block as unsafe to skip.
+ */
+ if (next_unskippable_block == rel_pages - 1)
+ break;
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible. They may still skip all-frozen pages, which can't
+ * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ skipsallvis = true;
+ }
+
+ vacuum_delay_point();
+ next_unskippable_block++;
+ nskippable_blocks++;
+ }
+
+ /*
+ * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
+ * pages. Since we're reading sequentially, the OS should be doing
+ * readahead for us, so there's no gain in skipping a page now and then.
+ * Skipping such a range might even discourage sequential detection.
+ *
+ * This test also enables more frequent relfrozenxid advancement during
+ * non-aggressive VACUUMs. If the range has any all-visible pages then
+ * skipping makes updating relfrozenxid unsafe, which is a real downside.
+ */
+ if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
+ *skipping_current_range = false;
+ else
+ {
+ *skipping_current_range = true;
+ if (skipsallvis)
+ vacrel->skippedallvis = true;
+ }
+
+ return next_unskippable_block;
+}
+
/*
* lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
*
--
2.32.0
v13-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
From b62cff8f7da337be812d66623c5f6bb7e9047ba5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v13 1/3] Set relfrozenxid to oldest extant XID seen by VACUUM.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM -- FreezeLimit still acts as a lower bound on the final value
that aggressive VACUUM can set relfrozenxid to. Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself. In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 6 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 324 +++++++++++++-----
src/backend/access/heap/vacuumlazy.c | 177 ++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 39 ++-
doc/src/sgml/maintenance.sgml | 30 +-
.../expected/vacuum-no-cleanup-lock.out | 189 ++++++++++
.../isolation/expected/vacuum-reltuples.out | 67 ----
src/test/isolation/isolation_schedule | 2 +-
.../specs/vacuum-no-cleanup-lock.spec | 150 ++++++++
.../isolation/specs/vacuum-reltuples.spec | 49 ---
13 files changed, 740 insertions(+), 303 deletions(-)
create mode 100644 src/test/isolation/expected/vacuum-no-cleanup-lock.out
delete mode 100644 src/test/isolation/expected/vacuum-reltuples.out
create mode 100644 src/test/isolation/specs/vacuum-no-cleanup-lock.spec
delete mode 100644 src/test/isolation/specs/vacuum-reltuples.spec
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b46ab7d73..4403f01e1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,10 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi);
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..2d8a7f627 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 74ad445e5..c012a07ac 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6079,10 +6079,12 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* Determine what to do during freezing when a tuple is marked by a
* MultiXactId.
*
- * NB -- this might have the side-effect of creating a new MultiXactId!
- *
* "flags" is an output value; it's used to tell caller what to do on return.
- * Possible flags are:
+ *
+ * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
+ * extant Xid within any Multixact that will remain after freezing executes.
+ *
+ * Possible values that we can set in "flags":
* FRM_NOOP
* don't do anything -- keep existing Xmax
* FRM_INVALIDATE_XMAX
@@ -6094,12 +6096,17 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
+ * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *mxid_oldest_xid_out)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6111,6 +6118,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId temp_xid_out;
*flags = 0;
@@ -6147,7 +6155,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
{
*flags |= FRM_INVALIDATE_XMAX;
- xid = InvalidTransactionId; /* not strictly necessary */
+ xid = InvalidTransactionId;
}
else
{
@@ -6174,7 +6182,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("cannot freeze committed update xid %u", xid)));
*flags |= FRM_INVALIDATE_XMAX;
- xid = InvalidTransactionId; /* not strictly necessary */
+ xid = InvalidTransactionId;
}
else
{
@@ -6182,6 +6190,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
}
+ /*
+ * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
+ * when no Xids will remain
+ */
return xid;
}
@@ -6205,6 +6217,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ temp_xid_out = *mxid_oldest_xid_out; /* init for FRM_NOOP */
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
@@ -6212,28 +6225,38 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
need_replace = true;
break;
}
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
}
/*
* In the simplest case, there is no member older than the cutoff; we can
- * keep the existing MultiXactId as is.
+ * keep the existing MultiXactId as-is, avoiding a more expensive second
+ * pass over the multi
*/
if (!need_replace)
{
+ /*
+ * When mxid_oldest_xid_out gets pushed back here it's likely that the
+ * update Xid was the oldest member, but we don't rely on that
+ */
*flags |= FRM_NOOP;
+ *mxid_oldest_xid_out = temp_xid_out;
pfree(members);
- return InvalidTransactionId;
+ return multi;
}
/*
- * If the multi needs to be updated, figure out which members do we need
- * to keep.
+ * Do a more thorough second pass over the multi to figure out which
+ * member XIDs actually need to be kept. Checking the precise status of
+ * individual members might even show that we don't need to keep anything.
*/
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
update_xid = InvalidTransactionId;
update_committed = false;
+ temp_xid_out = *mxid_oldest_xid_out; /* init for FRM_RETURN_IS_MULTI */
for (i = 0; i < nmembers; i++)
{
@@ -6289,7 +6312,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
/*
- * Since the tuple wasn't marked HEAPTUPLE_DEAD by vacuum, the
+ * Since the tuple wasn't totally removed when vacuum pruned, the
* update Xid cannot possibly be older than the xid cutoff. The
* presence of such a tuple would cause corruption, so be paranoid
* and check.
@@ -6302,15 +6325,20 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
update_xid, cutoff_xid)));
/*
- * If we determined that it's an Xid corresponding to an update
- * that must be retained, additionally add it to the list of
- * members of the new Multi, in case we end up using that. (We
- * might still decide to use only an update Xid and not a multi,
- * but it's easier to maintain the list as we walk the old members
- * list.)
+ * We determined that this is an Xid corresponding to an update
+ * that must be retained -- add it to new members list for later.
+ *
+ * Also consider pushing back temp_xid_out, which is needed when
+ * we later conclude that a new multi is required (i.e. when we go
+ * on to set FRM_RETURN_IS_MULTI for our caller because we also
+ * need to retain a locker that's still running).
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
+ }
}
else
{
@@ -6318,8 +6346,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
TransactionIdIsInProgress(members[i].xid))
{
- /* running locker cannot possibly be older than the cutoff */
+ /*
+ * Running locker cannot possibly be older than the cutoff.
+ *
+ * The cutoff is <= VACUUM's OldestXmin, which is also the
+ * initial value used for top-level relfrozenxid_out tracking
+ * state. A running locker cannot be older than VACUUM's
+ * OldestXmin, either, so we don't need a temp_xid_out step.
+ */
+ Assert(TransactionIdIsNormal(members[i].xid));
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid,
+ *mxid_oldest_xid_out));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6328,11 +6366,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
pfree(members);
+ /*
+ * Determine what to do with caller's multi based on information gathered
+ * during our second pass
+ */
if (nnewmembers == 0)
{
/* nothing worth keeping!? Tell caller to remove the whole thing */
*flags |= FRM_INVALIDATE_XMAX;
xid = InvalidTransactionId;
+ /* Don't push back mxid_oldest_xid_out -- no Xids will remain */
}
else if (TransactionIdIsValid(update_xid) && !has_lockers)
{
@@ -6348,15 +6391,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
}
else
{
/*
* Create a new multixact with the surviving members of the previous
- * one, to set as new Xmax in the tuple.
+ * one, to set as new Xmax in the tuple. The oldest surviving member
+ * might push back mxid_oldest_xid_out.
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *mxid_oldest_xid_out = temp_xid_out;
}
pfree(newmembers);
@@ -6375,21 +6421,30 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
+ * The *relfrozenxid_out and *relminmxid_out arguments are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel. Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
+ * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
+ *
* Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
*
- * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
+ * NB: This function has side effects: it might allocate a new MultiXactId.
+ * It will be set as tuple's new xmax when our *frz output is processed within
+ * heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
+ * then caller had better have an exclusive lock on it already.
+ *
+ * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
* anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be less than or equal to the smallest
- * MultiXactId used by any transaction currently open.
- *
- * If the tuple is in a shared buffer, caller must hold an exclusive lock on
- * that buffer.
+ * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
*
* NB: It is not enough to set hint bits to indicate something is
* committed/invalid -- they might not be set on a standby, or after crash
@@ -6399,7 +6454,9 @@ bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6418,7 +6475,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* already a permanent value), while in the block below it is set true to
* mean "xmin won't need freezing after what we do to it here" (false
* otherwise). In both cases we're allowed to set totally_frozen, as far
- * as xmin is concerned.
+ * as xmin is concerned. Both cases also don't require relfrozenxid_out
+ * handling, since either way the tuple's xmin will be a permanent value
+ * once we're done with it.
*/
xid = HeapTupleHeaderGetXmin(tuple);
if (!TransactionIdIsNormal(xid))
@@ -6443,6 +6502,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else
+ {
+ /* xmin to remain unfrozen. Could push back relfrozenxid_out. */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
/*
@@ -6452,7 +6517,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* freezing, too. Also, if a multi needs freezing, we cannot simply take
* it out --- if there's a live updater Xid, it needs to be kept.
*
- * Make sure to keep heap_tuple_needs_freeze in sync with this.
+ * Make sure to keep heap_tuple_would_freeze in sync with this.
*/
xid = HeapTupleHeaderGetRawXmax(tuple);
@@ -6460,15 +6525,28 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi,
+ &flags, &mxid_oldest_xid_out);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
if (flags & FRM_RETURN_IS_XID)
{
+ /*
+ * xmax will become an updater Xid (original MultiXact's updater
+ * member Xid will be carried forward as a simple Xid in Xmax).
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(TransactionIdIsValid(newxmax));
+ if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
+ *relfrozenxid_out = newxmax;
+
/*
* NB -- some of these transformations are only valid because we
* know the return Xid is a tuple updater (i.e. not merely a
@@ -6487,6 +6565,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
uint16 newbits;
uint16 newbits2;
+ /*
+ * xmax is an old MultiXactId that we have to replace with a new
+ * MultiXactId, to carry forward two or more original member XIDs.
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax));
+ Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+ Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = mxid_oldest_xid_out;
+
/*
* We can't use GetMultiXactIdHintBits directly on the new multi
* here; that routine initializes the masks to all zeroes, which
@@ -6503,6 +6594,30 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
changed = true;
}
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * xmax is a MultiXactId, and nothing about it changes for now.
+ * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+ * both together.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+ Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
+ *relfrozenxid_out));
+ if (MultiXactIdPrecedes(xid, *relminmxid_out))
+ *relminmxid_out = xid;
+ *relfrozenxid_out = mxid_oldest_xid_out;
+ }
+ else
+ {
+ /*
+ * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
+ * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+ */
+ Assert(freeze_xmax);
+ Assert(!TransactionIdIsValid(newxmax));
+ }
}
else if (TransactionIdIsNormal(xid))
{
@@ -6527,15 +6642,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
errmsg_internal("cannot freeze committed xmax %u",
xid)));
freeze_xmax = true;
+ /* No need for relfrozenxid_out handling, since we'll freeze xmax */
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
{
freeze_xmax = false;
xmax_already_frozen = true;
+ /* No need for relfrozenxid_out handling for already-frozen xmax */
}
else
ereport(ERROR,
@@ -6576,6 +6697,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * No need for relfrozenxid_out handling, since we always freeze xvac.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6653,11 +6776,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId relfrozenxid_out = cutoff_xid;
+ MultiXactId relminmxid_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &relfrozenxid_out, &relminmxid_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7036,9 +7162,7 @@ ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status,
* heap_tuple_needs_eventual_freeze
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * will eventually require freezing. Similar to heap_tuple_needs_freeze,
- * but there's no cutoff, since we're trying to figure out whether freezing
- * will ever be needed, not whether it's needed now.
+ * will eventually require freezing (if tuple isn't removed before then).
*/
bool
heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
@@ -7082,87 +7206,109 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
}
/*
- * heap_tuple_needs_freeze
+ * heap_tuple_would_freeze
*
- * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID or MultiXactId. If so, return true.
+ * Return value indicates if heap_prepare_freeze_tuple sibling function would
+ * freeze any of the XID/XMID fields from the tuple, given the same cutoffs.
+ * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
+ * could be processed by pruning away the whole tuple (instead of freezing).
*
- * It doesn't matter whether the tuple is alive or dead, we are checking
- * to see if a tuple needs to be removed or frozen to avoid wraparound.
+ * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
+ * like the heap_prepare_freeze_tuple arguments that they're based on. We
+ * never freeze here, which makes tracking the oldest extant XID/MXID simple.
*
* NB: Cannot rely on hint bits here, they might not be set after a crash or
* on a standby.
*/
bool
-heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
+ bool would_freeze = false;
TransactionId xid;
+ MultiXactId multi;
+ /* First deal with xmin */
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
-
- /*
- * The considerations for multixacts are complicated; look at
- * heap_prepare_freeze_tuple for justifications. This routine had better
- * be in sync with that one!
- */
- if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ if (TransactionIdIsNormal(xid))
{
- MultiXactId multi;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ /* Now deal with xmax */
+ xid = InvalidTransactionId;
+ multi = InvalidMultiXactId;
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
- return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
- {
- MultiXactMember *members;
- int nmembers;
- int i;
+ else
+ xid = HeapTupleHeaderGetRawXmax(tuple);
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
- }
+ if (TransactionIdIsNormal(xid))
+ {
+ /* xmax is a non-permanent XID */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ else if (!MultiXactIdIsValid(multi))
+ {
+ /* xmax is a permanent XID or invalid MultiXactId/XID */
+ }
+ else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ {
+ /* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
+ if (MultiXactIdPrecedes(multi, *relminmxid_out))
+ *relminmxid_out = multi;
+ /* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
+ would_freeze = true;
}
else
{
- xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ /* xmax is a MultiXactId that may have an updater XID */
+ MultiXactMember *members;
+ int nmembers;
+
+ if (MultiXactIdPrecedes(multi, *relminmxid_out))
+ *relminmxid_out = multi;
+ if (MultiXactIdPrecedes(multi, cutoff_multi))
+ would_freeze = true;
+
+ /* need to check whether any member of the mxact is old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
+ {
+ xid = members[i].xid;
+ Assert(TransactionIdIsNormal(xid));
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ if (nmembers > 0)
+ pfree(members);
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ /* heap_prepare_freeze_tuple always freezes xvac */
+ would_freeze = true;
+ }
}
- return false;
+ return would_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 87ab7775a..6cb688efc 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,7 +144,7 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
- /* Aggressive VACUUM (scan all unfrozen pages)? */
+ /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
bool skipwithvm;
@@ -173,8 +173,9 @@ typedef struct LVRelState
/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
/* Error reporting state */
char *relnamespace;
@@ -319,17 +320,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
skipwithvm;
bool frozenxid_updated,
minmulti_updated;
- BlockNumber orig_rel_pages;
+ BlockNumber orig_rel_pages,
+ new_rel_pages,
+ new_rel_allvisible;
char **indnames = NULL;
- BlockNumber new_rel_pages;
- BlockNumber new_rel_allvisible;
- double new_live_tuples;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
- TransactionId OldestXmin;
- TransactionId FreezeLimit;
- MultiXactId MultiXactCutoff;
+ TransactionId OldestXmin,
+ FreezeLimit;
+ MultiXactId OldestMxact,
+ MultiXactCutoff;
verbose = (params->options & VACOPT_VERBOSE) != 0;
instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
@@ -351,20 +352,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Get OldestXmin cutoff, which is used to determine which deleted tuples
* are considered DEAD, not just RECENTLY_DEAD. Also get related cutoffs
- * used to determine which XIDs/MultiXactIds will be frozen.
- *
- * If this is an aggressive VACUUM, then we're strictly required to freeze
- * any and all XIDs from before FreezeLimit, so that we will be able to
- * safely advance relfrozenxid up to FreezeLimit below (we must be able to
- * advance relminmxid up to MultiXactCutoff, too).
+ * used to determine which XIDs/MultiXactIds will be frozen. If this is
+ * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
+ * XIDs < FreezeLimit (or unfrozen MXIDs < MultiXactCutoff).
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -511,10 +509,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->vistest = GlobalVisTestFor(rel);
/* FreezeLimit controls XID freezing (always <= OldestXmin) */
vacrel->FreezeLimit = FreezeLimit;
- /* MultiXactCutoff controls MXID freezing */
+ /* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+ /* Initialize state used to track oldest extant XID/XMID */
+ vacrel->NewRelfrozenXid = OldestXmin;
+ vacrel->NewRelminMxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -548,16 +547,41 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Prepare to update rel's pg_class entry.
*
- * In principle new_live_tuples could be -1 indicating that we (still)
- * don't know the tuple count. In practice that probably can't happen,
- * since we'd surely have scanned some pages if the table is new and
- * nonempty.
- *
+ * Aggressive VACUUMs must advance relfrozenxid to a value >= FreezeLimit,
+ * and advance relminmxid to a value >= MultiXactCutoff.
+ */
+ Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenXid));
+ Assert(!aggressive || vacrel->NewRelminMxid == OldestMxact ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminMxid));
+
+ /*
+ * Non-aggressive VACUUMs might advance relfrozenxid to an XID that is
+ * either older or newer than FreezeLimit (same applies to relminmxid and
+ * MultiXactCutoff). But the state that tracks the oldest remaining XID
+ * and MXID cannot be trusted when any all-visible pages were skipped.
+ */
+ Assert(vacrel->NewRelfrozenXid == OldestXmin ||
+ TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
+ vacrel->NewRelfrozenXid));
+ Assert(vacrel->NewRelminMxid == OldestMxact ||
+ MultiXactIdPrecedesOrEquals(vacrel->relminmxid,
+ vacrel->NewRelminMxid));
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ {
+ /* Keep existing relfrozenxid and relminmxid (can't trust trackers) */
+ Assert(!aggressive);
+ vacrel->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->NewRelminMxid = InvalidMultiXactId;
+ }
+
+ /*
* For safety, clamp relallvisible to be not more than what we're setting
- * relpages to.
+ * pg_class.relpages to
*/
new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
- new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
@@ -565,33 +589,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Now actually update rel's pg_class entry.
*
- * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
- * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
- * provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * In principle new_live_tuples could be -1 indicating that we (still)
+ * don't know the tuple count. In practice that can't happen, since we
+ * scan every page that isn't skipped using the visibility map.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
- {
- /* Cannot advance relfrozenxid/relminmxid */
- Assert(!aggressive);
- frozenxid_updated = minmulti_updated = false;
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId,
- NULL, NULL, false);
- }
- else
- {
- Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
- orig_rel_pages);
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
- &frozenxid_updated, &minmulti_updated, false);
- }
+ vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ &frozenxid_updated, &minmulti_updated, false);
/*
* Report results to the stats collector, too.
@@ -605,7 +610,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
- Max(new_live_tuples, 0),
+ Max(vacrel->new_live_tuples, 0),
vacrel->recently_dead_tuples +
vacrel->missed_dead_tuples);
pgstat_progress_end_command();
@@ -694,17 +699,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminMxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1584,6 +1591,8 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
@@ -1593,7 +1602,9 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
- /* Initialize (or reset) page-level counters */
+ /* Initialize (or reset) page-level state */
+ NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1801,7 +1812,8 @@ retry:
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
&frozen[nfrozen],
- &tuple_totally_frozen))
+ &tuple_totally_frozen,
+ &NewRelfrozenXid, &NewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1815,13 +1827,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1971,6 +1986,8 @@ lazy_scan_noprune(LVRelState *vacrel,
recently_dead_tuples,
missed_dead_tuples;
HeapTupleHeader tupleheader;
+ TransactionId NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ MultiXactId NewRelminMxid = vacrel->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2015,22 +2032,37 @@ lazy_scan_noprune(LVRelState *vacrel,
*hastup = true; /* page prevents rel truncation */
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
- if (heap_tuple_needs_freeze(tupleheader,
+ if (heap_tuple_would_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenXid, &NewRelminMxid))
{
+ /* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
if (vacrel->aggressive)
{
- /* Going to have to get cleanup lock for lazy_scan_prune */
+ /*
+ * Aggressive VACUUMs must always be able to advance rel's
+ * relfrozenxid to a value >= FreezeLimit (and be able to
+ * advance rel's relminmxid to a value >= MultiXactCutoff).
+ * The ongoing aggressive VACUUM won't be able to do that
+ * unless it can freeze an XID (or XMID) from this tuple now.
+ *
+ * The only safe option is to have caller perform processing
+ * of this page using lazy_scan_prune. Caller might have to
+ * wait a while for a cleanup lock, but it can't be helped.
+ */
vacrel->offnum = InvalidOffsetNumber;
return false;
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * Non-aggressive VACUUMs are under no strict obligation to advance
+ * relfrozenxid (not even by one XID). We can be much laxer here.
+ *
+ * Currently we always just accept an older final relfrozenxid
+ * and/or relminmxid value. We never make caller wait or work a
+ * little harder, even when it likely makes sense to do so.
*/
- vacrel->freeze_cutoffs_valid = false;
}
ItemPointerSet(&(tuple.t_self), blkno, offnum);
@@ -2080,9 +2112,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
/*
- * Now save details of the LP_DEAD items from the page in vacrel (though
- * only when VACUUM uses two-pass strategy)
+ * By here we know for sure that caller can put off freezing and pruning
+ * this particular page until the next VACUUM. Remember its details now.
+ * (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
+
+ /* Save details of the LP_DEAD items from the page */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 50a4a612e..3ff08a2a1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,22 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact are the most recent values that can ever be
+ * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
+ * vacuumlazy.c caller later on. These values should be passed when it turns
+ * out that VACUUM will leave no unfrozen XIDs/XMIDs behind in the table.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +969,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +978,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1073,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1092,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
@@ -1390,12 +1400,9 @@ vac_update_relstats(Relation relation,
* Update relfrozenxid, unless caller passed InvalidTransactionId
* indicating it has no new data.
*
- * Ordinarily, we don't let relfrozenxid go backwards: if things are
- * working correctly, the only way the new frozenxid could be older would
- * be if a previous VACUUM was done with a tighter freeze_min_age, in
- * which case we don't want to forget the work it already did. However,
- * if the stored relfrozenxid is "in the future", then it must be corrupt
- * and it seems best to overwrite it with the cutoff we used this time.
+ * Ordinarily, we don't let relfrozenxid go backwards. However, if the
+ * stored relfrozenxid is "in the future" then it must be corrupt. Seems
+ * best to overwrite it with the oldest extant XID left behind by VACUUM.
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 34d72dba7..0a7b38c17 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -585,9 +585,11 @@
statistics in the system tables <structname>pg_class</structname> and
<structname>pg_database</structname>. In particular,
the <structfield>relfrozenxid</structfield> column of a table's
- <structname>pg_class</structname> row contains the freeze cutoff XID that was used
- by the last aggressive <command>VACUUM</command> for that table. All rows
- inserted by transactions with XIDs older than this cutoff XID are
+ <structname>pg_class</structname> row contains the oldest
+ remaining XID at the end of the most recent <command>VACUUM</command>
+ that successfully advanced <structfield>relfrozenxid</structfield>
+ (typically the most recent aggressive VACUUM). All rows inserted
+ by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</structfield> column of a database's
<structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
@@ -610,6 +612,17 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
cutoff XID to the current transaction's XID.
</para>
+ <tip>
+ <para>
+ <literal>VACUUM VERBOSE</literal> outputs information about
+ <structfield>relfrozenxid</structfield> and/or
+ <structfield>relminmxid</structfield> when either field was
+ advanced. The same details appear in the server log when <xref
+ linkend="guc-log-autovacuum-min-duration"/> reports on vacuuming
+ by autovacuum.
+ </para>
+ </tip>
+
<para>
<command>VACUUM</command> normally only scans pages that have been modified
since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
@@ -624,7 +637,11 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
set <literal>age(relfrozenxid)</literal> to a value just a little more than the
<varname>vacuum_freeze_min_age</varname> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</command> started). If no <structfield>relfrozenxid</structfield>-advancing
+ <command>VACUUM</command> started). <command>VACUUM</command>
+ will set <structfield>relfrozenxid</structfield> to the oldest XID
+ that remains in the table, so it's possible that the final value
+ will be much more recent than strictly required.
+ If no <structfield>relfrozenxid</structfield>-advancing
<command>VACUUM</command> is issued on the table until
<varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
be forced for the table.
@@ -711,8 +728,9 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- Aggressive <command>VACUUM</command> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Aggressive <command>VACUUM</command> scans, regardless of what
+ causes them, are <emphasis>guaranteed</emphasis> to be able to
+ advance the table's <structfield>relminmxid</structfield>.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
new file mode 100644
index 000000000..f7bc93e8f
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -0,0 +1,189 @@
+Parsed test spec with 4 sessions
+
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+step dml_begin: BEGIN;
+step dml_other_begin: BEGIN;
+step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step dml_other_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
+step dml_commit: COMMIT;
+step dml_other_commit: COMMIT;
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_commit:
+ COMMIT;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
deleted file mode 100644
index ce55376e7..000000000
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ /dev/null
@@ -1,67 +0,0 @@
-Parsed test spec with 2 sessions
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify open fetch1 vac close stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step open:
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-
-step fetch1:
- fetch next from c1;
-
-dummy
------
- 1
-(1 row)
-
-step vac:
- vacuum smalltbl;
-
-step close:
- commit;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 00749a40b..a48caae22 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -84,7 +84,7 @@ test: alter-table-4
test: create-trigger
test: sequence-ddl
test: async-notify
-test: vacuum-reltuples
+test: vacuum-no-cleanup-lock
test: timeouts
test: vacuum-concurrent-drop
test: vacuum-conflict
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
new file mode 100644
index 000000000..a88be66de
--- /dev/null
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -0,0 +1,150 @@
+# Test for vacuum's reduced processing of heap pages (used for any heap page
+# where a cleanup lock isn't immediately available)
+#
+# Debugging tip: Change VACUUM to VACUUM VERBOSE to get feedback on what's
+# really going on
+
+# Use name type here to avoid TOAST table:
+setup
+{
+ CREATE TABLE smalltbl AS SELECT i AS id, 't'::name AS t FROM generate_series(1,20) i;
+ ALTER TABLE smalltbl SET (autovacuum_enabled = off);
+ ALTER TABLE smalltbl ADD PRIMARY KEY (id);
+}
+setup
+{
+ VACUUM ANALYZE smalltbl;
+}
+
+teardown
+{
+ DROP TABLE smalltbl;
+}
+
+# This session holds a pin on smalltbl's only heap page:
+session pinholder
+step pinholder_cursor
+{
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+}
+step pinholder_commit
+{
+ COMMIT;
+}
+
+# This session inserts and deletes tuples, potentially affecting reltuples:
+session dml
+step dml_insert
+{
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+}
+step dml_delete
+{
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+}
+step dml_begin { BEGIN; }
+step dml_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_commit { COMMIT; }
+
+# Needed for Multixact test:
+session dml_other
+step dml_other_begin { BEGIN; }
+step dml_other_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_other_update { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
+step dml_other_commit { COMMIT; }
+
+# This session runs non-aggressive VACUUM, but with maximally aggressive
+# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+session vacuumer
+setup
+{
+ SET vacuum_freeze_min_age = 0;
+ SET vacuum_multixact_freeze_min_age = 0;
+}
+step vacuumer_nonaggressive_vacuum
+{
+ VACUUM smalltbl;
+}
+step vacuumer_pg_class_stats
+{
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+}
+
+# Test VACUUM's reltuples counting mechanism.
+#
+# Final pg_class.reltuples should never be affected by VACUUM's inability to
+# get a cleanup lock on any page, except to the extent that any cleanup lock
+# contention changes the number of tuples that remain ("missed dead" tuples
+# are counted in reltuples, much like "recently dead" tuples).
+
+# Easy case:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+
+# Harder case -- count 21 tuples at the end (like last time), but with cleanup
+# lock contention this time:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ pinholder_cursor
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but vary the order, and delete an inserted row:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ pinholder_cursor
+ dml_insert
+ dml_delete
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "recently dead" tuple won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but initial insert and delete before cursor:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ dml_delete
+ pinholder_cursor
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
+ # concurrent activity held back VACUUM's OldestXmin) won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Test VACUUM's mechanism for skipping MultiXact freezing.
+#
+# This provides test coverage for code paths that are only hit when we need to
+# freeze, but inability to acquire a cleanup lock on a heap page makes
+# freezing some XIDs/XMIDs < FreezeLimit/MultiXactCutoff impossible (without
+# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+permutation
+ dml_begin
+ dml_other_begin
+ dml_key_share
+ dml_other_key_share
+ # Will get cleanup lock, can't advance relminmxid yet:
+ # (though will usually advance relfrozenxid by ~2 XIDs)
+ vacuumer_nonaggressive_vacuum
+ pinholder_cursor
+ dml_other_update
+ dml_commit
+ dml_other_commit
+ # Can't cleanup lock, so still can't advance relminmxid here:
+ # (relfrozenxid held back by XIDs in MultiXact too)
+ vacuumer_nonaggressive_vacuum
+ pinholder_commit
+ # Pin was dropped, so will advance relminmxid, at long last:
+ # (ditto for relfrozenxid advancement)
+ vacuumer_nonaggressive_vacuum
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
deleted file mode 100644
index a2a461f2f..000000000
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ /dev/null
@@ -1,49 +0,0 @@
-# Test for vacuum's handling of reltuples when pages are skipped due
-# to page pins. We absolutely need to avoid setting reltuples=0 in
-# such cases, since that interferes badly with planning.
-#
-# Expected result for all three permutation is 21 tuples, including
-# the second permutation. VACUUM is able to count the concurrently
-# inserted tuple in its final reltuples, even when a cleanup lock
-# cannot be acquired on the affected heap page.
-
-setup {
- create table smalltbl
- as select i as id from generate_series(1,20) i;
- alter table smalltbl set (autovacuum_enabled = off);
-}
-setup {
- vacuum analyze smalltbl;
-}
-
-teardown {
- drop table smalltbl;
-}
-
-session worker
-step open {
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-}
-step fetch1 {
- fetch next from c1;
-}
-step close {
- commit;
-}
-step stats {
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-}
-
-session vacuumer
-step vac {
- vacuum smalltbl;
-}
-step modify {
- insert into smalltbl select max(id)+1 from smalltbl;
-}
-
-permutation modify vac stats
-permutation modify open fetch1 vac close stats
-permutation modify vac stats
--
2.32.0
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
Did you see that this crashed on windows cfbot?
https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log
TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)
abort() has been called
2022-03-30 03:48:30.267 GMT [5316][client backend] [pg_regress/tablefunc][3/15389:0] ERROR: infinite recursion detected
2022-03-30 03:48:38.031 GMT [5592][postmaster] LOG: server process (PID 5984) was terminated by exception 0xC0000354
2022-03-30 03:48:38.031 GMT [5592][postmaster] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2022-03-30 03:48:38.031 GMT [5592][postmaster] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
https://cirrus-ci.com/task/4592929254670336
00000000`007ff130 00000001`400b4ef8 postgres!ExceptionalCondition(
char * conditionName = 0x00000001`40a915d8 "diff > 0",
char * errorType = 0x00000001`40a915c8 "FailedAssertion",
char * fileName = 0x00000001`40a91598 "c:\cirrus\src\backend\access\heap\vacuumlazy.c",
int lineNumber = 0n724)+0x8d [c:\cirrus\src\backend\utils\error\assert.c @ 70]
00000000`007ff170 00000001`402a0914 postgres!heap_vacuum_rel(
struct RelationData * rel = 0x00000000`00a51088,
struct VacuumParams * params = 0x00000000`00a8420c,
struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x1038 [c:\cirrus\src\backend\access\heap\vacuumlazy.c @ 724]
00000000`007ff350 00000001`402a4686 postgres!table_relation_vacuum(
struct RelationData * rel = 0x00000000`00a51088,
struct VacuumParams * params = 0x00000000`00a8420c,
struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x34 [c:\cirrus\src\include\access\tableam.h @ 1681]
00000000`007ff380 00000001`402a1a2d postgres!vacuum_rel(
unsigned int relid = 0x4ee,
struct RangeVar * relation = 0x00000000`01799ae0,
struct VacuumParams * params = 0x00000000`00a8420c)+0x5a6 [c:\cirrus\src\backend\commands\vacuum.c @ 2068]
00000000`007ff400 00000001`4050f1ef postgres!vacuum(
struct List * relations = 0x00000000`0179df58,
struct VacuumParams * params = 0x00000000`00a8420c,
struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0,
bool isTopLevel = true)+0x69d [c:\cirrus\src\backend\commands\vacuum.c @ 482]
00000000`007ff5f0 00000001`4050dc95 postgres!autovacuum_do_vac_analyze(
struct autovac_table * tab = 0x00000000`00a84208,
struct BufferAccessStrategyData * bstrategy = 0x00000000`00a842a0)+0x8f [c:\cirrus\src\backend\postmaster\autovacuum.c @ 3248]
00000000`007ff640 00000001`4050b4e3 postgres!do_autovacuum(void)+0xef5 [c:\cirrus\src\backend\postmaster\autovacuum.c @ 2503]
It seems like there should be even more logs, especially since it says:
[03:48:43.119] Uploading 3 artifacts for c:\cirrus\**\*.diffs
[03:48:43.122] Uploaded c:\cirrus\contrib\tsm_system_rows\regression.diffs
[03:48:43.125] Uploaded c:\cirrus\contrib\tsm_system_time\regression.diffs
On Tue, Mar 29, 2022 at 11:10 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
Did you see that this crashed on windows cfbot?
https://api.cirrus-ci.com/v1/artifact/task/4592929254670336/log/tmp_check/postmaster.log
TRAP: FailedAssertion("diff > 0", File: "c:\cirrus\src\backend\access\heap\vacuumlazy.c", Line: 724, PID: 5984)
That's weird. There are very similar assertions a little earlier, that
must have *not* failed here, before the call to vac_update_relstats().
I was actually thinking of removing this assertion for that reason --
I thought that it was redundant.
Perhaps something is amiss inside vac_update_relstats(), where the
boolean flag that indicates that pg_class.relfrozenxid was advanced is
set:
if (frozenxid_updated)
*frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
if (frozenxid_updated)
*frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
Maybe the "existing relfrozenxid is in the future, silently update
relfrozenxid" part of the condition (which involves
ReadNextTransactionId()) somehow does the wrong thing here. But how?
The other assertions take into account the fact that OldestXmin can
itself "go backwards" across VACUUM operations against the same table:
Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin ||
TransactionIdPrecedesOrEquals(FreezeLimit,
vacrel->NewRelfrozenXid));
Note the "vacrel->NewRelfrozenXid == OldestXmin", without which the
assertion will fail pretty easily when the regression tests are run.
Perhaps I need to do something like that with the other assertion as
well (or more likely just get rid of it). Will figure it out tomorrow.
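To be concrete, "something like that" might look roughly like the sketch
below. To be clear, this isn't from the patch and I haven't tested it -- it
just adds a disjunct for the "stored relfrozenxid was in the future" case,
which (going by the vac_update_relstats() code shown above) is the only way
frozenxid_updated can be set with the new value not strictly ahead of the
old one:
if (frozenxid_updated)
{
    diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
    /* Tolerate an old relfrozenxid that was apparently "in the future" */
    Assert(diff > 0 ||
           TransactionIdPrecedes(ReadNextTransactionId(),
                                 vacrel->relfrozenxid));
    appendStringInfo(&buf,
                     _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
                     vacrel->NewRelfrozenXid, diff);
}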
--
Peter Geoghegan
On Wed, Mar 30, 2022 at 12:01 AM Peter Geoghegan <pg@bowt.ie> wrote:
Perhaps something is amiss inside vac_update_relstats(), where the
boolean flag that indicates that pg_class.relfrozenxid was advanced is
set:
if (frozenxid_updated)
*frozenxid_updated = false;
if (TransactionIdIsNormal(frozenxid) &&
pgcform->relfrozenxid != frozenxid &&
(TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
TransactionIdPrecedes(ReadNextTransactionId(),
pgcform->relfrozenxid)))
{
if (frozenxid_updated)
*frozenxid_updated = true;
pgcform->relfrozenxid = frozenxid;
dirty = true;
}
Maybe the "existing relfrozenxid is in the future, silently update
relfrozenxid" part of the condition (which involves
ReadNextTransactionId()) somehow does the wrong thing here. But how?
I tried several times to recreate this issue on CI. No luck with that,
though -- can't get it to fail again after 4 attempts.
This was a VACUUM of pg_database, run from an autovacuum worker. I am
vaguely reminded of the two bugs fixed by Andres in commit a54e1f15.
Both were issues with the shared relcache init file affecting shared
and nailed catalog relations. Those bugs had symptoms like " ERROR:
found xmin ... from before relfrozenxid ..." for various system
catalogs.
We know that this particular assertion did not fail during the same VACUUM:
Assert(vacrel->NewRelfrozenXid == OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
vacrel->NewRelfrozenXid));
So it's hard to see how this could be a bug in the patch -- the final
new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
problem scenario seen on the CI Windows instance yesterday (that's why
this earlier assertion didn't fail). The assertion I'm showing here
needs the "vacrel->NewRelfrozenXid == OldestXmin" part of the
condition to account for the fact that
OldestXmin/GetOldestNonRemovableTransactionId() is known to "go
backwards". Without that the regression tests will fail quite easily.
The surprising part of the CI failure must have taken place just after
this assertion, when VACUUM's call to vac_update_relstats() actually
updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
presumably because the existing relfrozenxid appeared to be "in the
future" when we examine it in pg_class again. We see evidence that
this must have happened afterwards, when the closely related assertion
(used only in instrumentation code) fails:
From my patch:
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
Does anybody have any ideas about what might be going on here?
--
Peter Geoghegan
Hi,
On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote:
I tried several times to recreate this issue on CI. No luck with that,
though -- can't get it to fail again after 4 attempts.
It's really annoying that we don't have Assert variants that show the compared
values; that might make it easier to interpret what's going on.
Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
AssertCmp(type, a, op, b).
Then the assertion could have been something like
AssertCmp(int32, diff, >, 0)
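Something like this rough sketch, say (not an existing macro, and as written
it only handles integer types, since it just casts both sides to int64 for
the message; a real version would presumably also compile to a no-op in
non-assert builds):
#define AssertCmp(type, a, op, b) \
    do { \
        type a_ = (a); \
        type b_ = (b); \
        if (!(a_ op b_)) \
            elog(PANIC, "assertion failed: %s %s %s (" INT64_FORMAT " vs " INT64_FORMAT ")", \
                 #a, #op, #b, (int64) a_, (int64) b_); \
    } while (0)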
Does the line number in the failed run actually correspond to the xid, rather
than the mxid case? I didn't check.
You could try to increase the likelihood of reproducing the failure by
duplicating the invocation that led to the crash a few times in the
.cirrus.yml file in your dev branch. That might allow hitting the problem more
quickly.
Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf?
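For instance (just an illustration, the exact values would need some
experimenting):
# hypothetical additions to src/tools/ci/pg_ci_base.conf, to make autovacuum
# fire far more often while trying to reproduce this
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1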
Or locally - one thing that windows CI does differently from the other platforms
is that it runs isolation, contrib and a bunch of other tests using the same
cluster. Which of course increases the likelihood of autovacuum having stuff
to do, *particularly* on shared relations - normally there's probably not
enough changes for that.
You can do something similar locally on linux with
make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
(the prove_installcheck=true is there to prevent tap tests from running; we don't
seem to have another way to do that)
I don't think windows uses USE_MODULE_DB=1, but it lets you generate a lot more
load concurrently than running tests serially...
We know that this particular assertion did not fail during the same VACUUM:
Assert(vacrel->NewRelfrozenXid == OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
vacrel->NewRelfrozenXid));
The comment in your patch says "is either older or newer than FreezeLimit" - I
assume that's some rephrasing damage?
So it's hard to see how this could be a bug in the patch -- the final
new relfrozenxid is presumably equal to VACUUM's OldestXmin in the
problem scenario seen on the CI Windows instance yesterday (that's why
this earlier assertion didn't fail).
Perhaps it's worth committing improved assertions on master? If this is indeed
a pre-existing bug, and we're just missing it due to slightly less stringent
asserts, we could rectify that separately.
The surprising part of the CI failure must have taken place just after
this assertion, when VACUUM's call to vac_update_relstats() actually
updated pg_class.relfrozenxid with vacrel->NewRelfrozenXid --
presumably because the existing relfrozenxid appeared to be "in the
future" when we examined it in pg_class again. We see evidence that
this must have happened afterwards, when the closely related assertion
(used only in instrumentation code) fails:
Hm. This triggers some vague memories. There are some oddities around shared
relations being vacuumed separately in all the databases and thus having
separate horizons.
After "remembering" that, I looked in the cirrus log for the failed run, and
the worker was processing a shared relation last:
2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid"
Obviously that's not a guarantee that the next table processed also is a
shared catalog, but ...
Oh, the relid is actually in the stack trace. 0x4ee = 1262 =
pg_database. Which makes sense, the test ends up with a high percentage of
dead rows in pg_database, due to all the different contrib tests
creating/dropping a database.
From my patch:
 		if (frozenxid_updated)
 		{
-			diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+			diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+			Assert(diff > 0);
 			appendStringInfo(&buf,
 							 _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
-							 FreezeLimit, diff);
+							 vacrel->NewRelfrozenXid, diff);
 		}
Perhaps this ought to be an elog() instead of an Assert()? Something has gone
pear shaped if we get here... It's a bit annoying though, because it'd have to
be a PANIC to be visible on the bf / CI :(.
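Very roughly, something like this in the instrumentation code (just a
sketch to illustrate the idea, not the exact wording I'd commit):

	if (frozenxid_updated)
	{
		diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
		if (diff <= 0)
			elog(PANIC, "new relfrozenxid %u went backwards from previous value %u",
				 vacrel->NewRelfrozenXid, vacrel->relfrozenxid);
		appendStringInfo(&buf,
						 _("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
						 vacrel->NewRelfrozenXid, diff);
	}

That at least makes the broken values visible on the buildfarm, at the
cost of taking the cluster down when it fires.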
Greetings,
Andres Freund
On Wed, Mar 30, 2022 at 7:00 PM Andres Freund <andres@anarazel.de> wrote:
Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe
AssertCmp(type, a, op, b).
Then the assertion could have been something like
AssertCmp(int32, diff, >, 0)
I'd definitely use them if they were there.
Does the line number in the failed run actually correspond to the xid, rather
than the mxid case? I didn't check.
Yes, I verified -- definitely relfrozenxid.
You can do something similar locally on linux with
make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true
(the prove_installcheck=true is there to prevent tap tests from running; we don't
seem to have another way to do that)
I don't think windows uses USE_MODULE_DB=1, but it lets you generate a lot more
load concurrently than running tests serially...
Can't get it to fail locally with that recipe.
Assert(vacrel->NewRelfrozenXid == OldestXmin ||
TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
vacrel->NewRelfrozenXid));
The comment in your patch says "is either older or newer than FreezeLimit" - I
assume that's some rephrasing damage?
Both the comment and the assertion are correct. I see what you mean, though.
Perhaps it's worth committing improved assertions on master? If this is indeed
a pre-existing bug, and we're just missing it due to slightly less stringent
asserts, we could rectify that separately.
I don't think there's much chance of the assertion actually hitting
without the rest of the patch series. The new relfrozenxid value is
always going to be OldestXmin - vacuum_freeze_min_age on HEAD, while
with the patch it's sometimes close to OldestXmin. Especially when you
have lots of dead tuples that you churn through constantly (like
pgbench_tellers, or like these system catalogs on the CI test
machine).
Hm. This triggers some vague memories. There are some oddities around shared
relations being vacuumed separately in all the databases and thus having
separate horizons.
That's what I was thinking of, obviously.
After "remembering" that, I looked in the cirrus log for the failed run, and
the worker was processing a shared relation last:
2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid"
I noticed the same thing myself. Should have said sooner.
Perhaps this ought to be an elog() instead of an Assert()? Something has gone
pear shaped if we get here... It's a bit annoying though, because it'd have to
be a PANIC to be visible on the bf / CI :(.
Yeah, a WARNING would be good here. I can write a new version of my
patch series with a separate patch for that this evening. Actually,
better make it a PANIC for now...
--
Peter Geoghegan
On Wed, Mar 30, 2022 at 7:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
Yeah, a WARNING would be good here. I can write a new version of my
patch series with a separate patch for that this evening. Actually,
better make it a PANIC for now...
Attached is v14, which includes a new patch that PANICs like that in
vac_update_relstats() --- 0003.
This approach also covers manual VACUUMs, unlike the failing
assertion, which lives in instrumentation code (though VACUUM VERBOSE
might still hit it).
I definitely think that something like this should be committed.
Silently ignoring system catalog corruption isn't okay.
--
Peter Geoghegan
Attachments:
v14-0003-PANIC-on-relfrozenxid-from-the-future.patch
From 7d2c63423d18f16fd5d4c21e49a4980f78cde69a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 30 Mar 2022 18:59:41 -0700
Subject: [PATCH v14 3/4] PANIC on relfrozenxid from the future.
This should be made into a WARNING later on.
---
src/backend/commands/vacuum.c | 78 +++++++++++++++++++++++++++--------
1 file changed, 60 insertions(+), 18 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index deec4887b..40b6a723b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1340,7 +1340,11 @@ vac_update_relstats(Relation relation,
Relation rd;
HeapTuple ctup;
Form_pg_class pgcform;
- bool dirty;
+ bool dirty,
+ relfrozenxid_warn,
+ relminmxid_warn;
+ TransactionId oldrelfrozenxid;
+ MultiXactId oldrelminmxid;
rd = table_open(RelationRelationId, RowExclusiveLock);
@@ -1406,32 +1410,57 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ relfrozenxid_warn = false;
if (frozenxid_updated)
*frozenxid_updated = false;
- if (TransactionIdIsNormal(frozenxid) &&
- pgcform->relfrozenxid != frozenxid &&
- (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
- TransactionIdPrecedes(ReadNextTransactionId(),
- pgcform->relfrozenxid)))
+ if (TransactionIdIsNormal(frozenxid) && pgcform->relfrozenxid != frozenxid)
{
- if (frozenxid_updated)
- *frozenxid_updated = true;
- pgcform->relfrozenxid = frozenxid;
- dirty = true;
+ bool update = false;
+
+ if (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid))
+ update = true;
+ else if (TransactionIdPrecedes(ReadNextTransactionId(),
+ pgcform->relfrozenxid))
+ {
+ relfrozenxid_warn = true;
+ oldrelfrozenxid = pgcform->relfrozenxid;
+ update = true;
+ }
+
+ if (update)
+ {
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
+ pgcform->relfrozenxid = frozenxid;
+ dirty = true;
+ }
}
/* Similarly for relminmxid */
+ relminmxid_warn = false;
if (minmulti_updated)
*minmulti_updated = false;
- if (MultiXactIdIsValid(minmulti) &&
- pgcform->relminmxid != minmulti &&
- (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
- MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
+ if (MultiXactIdIsValid(minmulti) && pgcform->relminmxid != minmulti)
{
- if (minmulti_updated)
- *minmulti_updated = true;
- pgcform->relminmxid = minmulti;
- dirty = true;
+ bool update = false;
+
+ if (MultiXactIdPrecedes(pgcform->relminmxid, minmulti))
+ update = true;
+ else if (MultiXactIdPrecedes(ReadNextMultiXactId(),
+ pgcform->relminmxid))
+ {
+ relminmxid_warn = true;
+ oldrelminmxid = pgcform->relminmxid;
+ update = true;
+ }
+
+ if (update)
+ {
+ if (minmulti_updated)
+ *minmulti_updated = true;
+ pgcform->relminmxid = minmulti;
+ dirty = true;
+ }
}
/* If anything changed, write out the tuple. */
@@ -1439,6 +1468,19 @@ vac_update_relstats(Relation relation,
heap_inplace_update(rd, ctup);
table_close(rd, RowExclusiveLock);
+
+ if (relfrozenxid_warn)
+ ereport(PANIC,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("overwrote invalid pg_class.relfrozenxid value %u with new value %u in table \"%s\"",
+ oldrelfrozenxid, frozenxid,
+ RelationGetRelationName(relation))));
+ if (relminmxid_warn)
+ ereport(PANIC,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("overwrote invalid pg_class.relminmxid value %u with new value %u in table \"%s\"",
+ oldrelminmxid, minmulti,
+ RelationGetRelationName(relation))));
}
--
2.32.0
v14-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch
From 61938823ce68c38aa6f35016a7781d1a4186618d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v14 1/4] Set relfrozenxid to oldest extant XID seen by VACUUM.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM -- FreezeLimit still acts as a lower bound on the final value
that aggressive VACUUM can set relfrozenxid to. Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself. In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 6 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 332 +++++++++++++-----
src/backend/access/heap/vacuumlazy.c | 180 ++++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 39 +-
doc/src/sgml/maintenance.sgml | 30 +-
.../expected/vacuum-no-cleanup-lock.out | 189 ++++++++++
.../isolation/expected/vacuum-reltuples.out | 67 ----
src/test/isolation/isolation_schedule | 2 +-
.../specs/vacuum-no-cleanup-lock.spec | 150 ++++++++
.../isolation/specs/vacuum-reltuples.spec | 49 ---
13 files changed, 743 insertions(+), 311 deletions(-)
create mode 100644 src/test/isolation/expected/vacuum-no-cleanup-lock.out
delete mode 100644 src/test/isolation/expected/vacuum-reltuples.out
create mode 100644 src/test/isolation/specs/vacuum-no-cleanup-lock.spec
delete mode 100644 src/test/isolation/specs/vacuum-reltuples.spec
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b46ab7d73..4403f01e1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,10 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi);
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..2d8a7f627 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 74ad445e5..1ee985f63 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6079,10 +6079,12 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* Determine what to do during freezing when a tuple is marked by a
* MultiXactId.
*
- * NB -- this might have the side-effect of creating a new MultiXactId!
- *
* "flags" is an output value; it's used to tell caller what to do on return.
- * Possible flags are:
+ *
+ * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
+ * extant Xid within any Multixact that will remain after freezing executes.
+ *
+ * Possible values that we can set in "flags":
* FRM_NOOP
* don't do anything -- keep existing Xmax
* FRM_INVALIDATE_XMAX
@@ -6094,12 +6096,17 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
+ * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *mxid_oldest_xid_out)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6111,6 +6118,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId temp_xid_out;
*flags = 0;
@@ -6147,7 +6155,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
{
*flags |= FRM_INVALIDATE_XMAX;
- xid = InvalidTransactionId; /* not strictly necessary */
+ xid = InvalidTransactionId;
}
else
{
@@ -6174,7 +6182,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("cannot freeze committed update xid %u", xid)));
*flags |= FRM_INVALIDATE_XMAX;
- xid = InvalidTransactionId; /* not strictly necessary */
+ xid = InvalidTransactionId;
}
else
{
@@ -6182,6 +6190,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
}
+ /*
+ * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
+ * when no Xids will remain
+ */
return xid;
}
@@ -6205,6 +6217,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ temp_xid_out = *mxid_oldest_xid_out; /* init for FRM_NOOP */
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
@@ -6212,28 +6225,38 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
need_replace = true;
break;
}
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
}
/*
* In the simplest case, there is no member older than the cutoff; we can
- * keep the existing MultiXactId as is.
+ * keep the existing MultiXactId as-is, avoiding a more expensive second
+ * pass over the multi
*/
if (!need_replace)
{
+ /*
+ * When mxid_oldest_xid_out gets pushed back here it's likely that the
+ * update Xid was the oldest member, but we don't rely on that
+ */
*flags |= FRM_NOOP;
+ *mxid_oldest_xid_out = temp_xid_out;
pfree(members);
- return InvalidTransactionId;
+ return multi;
}
/*
- * If the multi needs to be updated, figure out which members do we need
- * to keep.
+ * Do a more thorough second pass over the multi to figure out which
+ * member XIDs actually need to be kept. Checking the precise status of
+ * individual members might even show that we don't need to keep anything.
*/
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
update_xid = InvalidTransactionId;
update_committed = false;
+ temp_xid_out = *mxid_oldest_xid_out; /* init for FRM_RETURN_IS_MULTI */
for (i = 0; i < nmembers; i++)
{
@@ -6289,7 +6312,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
/*
- * Since the tuple wasn't marked HEAPTUPLE_DEAD by vacuum, the
+ * Since the tuple wasn't totally removed when vacuum pruned, the
* update Xid cannot possibly be older than the xid cutoff. The
* presence of such a tuple would cause corruption, so be paranoid
* and check.
@@ -6302,15 +6325,20 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
update_xid, cutoff_xid)));
/*
- * If we determined that it's an Xid corresponding to an update
- * that must be retained, additionally add it to the list of
- * members of the new Multi, in case we end up using that. (We
- * might still decide to use only an update Xid and not a multi,
- * but it's easier to maintain the list as we walk the old members
- * list.)
+ * We determined that this is an Xid corresponding to an update
+ * that must be retained -- add it to new members list for later.
+ *
+ * Also consider pushing back temp_xid_out, which is needed when
+ * we later conclude that a new multi is required (i.e. when we go
+ * on to set FRM_RETURN_IS_MULTI for our caller because we also
+ * need to retain a locker that's still running).
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
+ }
}
else
{
@@ -6318,8 +6346,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
TransactionIdIsInProgress(members[i].xid))
{
- /* running locker cannot possibly be older than the cutoff */
+ /*
+ * Running locker cannot possibly be older than the cutoff.
+ *
+ * The cutoff is <= VACUUM's OldestXmin, which is also the
+ * initial value used for top-level relfrozenxid_out tracking
+ * state. A running locker cannot be older than VACUUM's
+ * OldestXmin, either, so we don't need a temp_xid_out step.
+ */
+ Assert(TransactionIdIsNormal(members[i].xid));
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid,
+ *mxid_oldest_xid_out));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6328,11 +6366,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
pfree(members);
+ /*
+ * Determine what to do with caller's multi based on information gathered
+ * during our second pass
+ */
if (nnewmembers == 0)
{
/* nothing worth keeping!? Tell caller to remove the whole thing */
*flags |= FRM_INVALIDATE_XMAX;
xid = InvalidTransactionId;
+ /* Don't push back mxid_oldest_xid_out -- no Xids will remain */
}
else if (TransactionIdIsValid(update_xid) && !has_lockers)
{
@@ -6348,15 +6391,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
}
else
{
/*
* Create a new multixact with the surviving members of the previous
- * one, to set as new Xmax in the tuple.
+ * one, to set as new Xmax in the tuple. The oldest surviving member
+ * might push back mxid_oldest_xid_out.
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *mxid_oldest_xid_out = temp_xid_out;
}
pfree(newmembers);
@@ -6375,31 +6421,41 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
- * Caller is responsible for setting the offset field, if appropriate.
+ * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
*
- * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
+ * The *relfrozenxid_out and *relminmxid_out arguments are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel. Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
+ * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
+ *
+ * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
* anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be less than or equal to the smallest
- * MultiXactId used by any transaction currently open.
+ * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
*
- * If the tuple is in a shared buffer, caller must hold an exclusive lock on
- * that buffer.
+ * NB: This function has side effects: it might allocate a new MultiXactId.
+ * It will be set as tuple's new xmax when our *frz output is processed within
+ * heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
+ * then caller had better have an exclusive lock on it already.
*
- * NB: It is not enough to set hint bits to indicate something is
- * committed/invalid -- they might not be set on a standby, or after crash
- * recovery. We really need to remove old xids.
+ * NB: It is not enough to set hint bits to indicate an XID committed/aborted.
+ * The *frz WAL record we output completely removes all old XIDs during REDO.
*/
bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6418,7 +6474,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* already a permanent value), while in the block below it is set true to
* mean "xmin won't need freezing after what we do to it here" (false
* otherwise). In both cases we're allowed to set totally_frozen, as far
- * as xmin is concerned.
+ * as xmin is concerned. Both cases also don't require relfrozenxid_out
+ * handling, since either way the tuple's xmin will be a permanent value
+ * once we're done with it.
*/
xid = HeapTupleHeaderGetXmin(tuple);
if (!TransactionIdIsNormal(xid))
@@ -6443,6 +6501,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else
+ {
+ /* xmin to remain unfrozen. Could push back relfrozenxid_out. */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
/*
@@ -6452,7 +6516,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* freezing, too. Also, if a multi needs freezing, we cannot simply take
* it out --- if there's a live updater Xid, it needs to be kept.
*
- * Make sure to keep heap_tuple_needs_freeze in sync with this.
+ * Make sure to keep heap_tuple_would_freeze in sync with this.
*/
xid = HeapTupleHeaderGetRawXmax(tuple);
@@ -6460,15 +6524,28 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi,
+ &flags, &mxid_oldest_xid_out);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
if (flags & FRM_RETURN_IS_XID)
{
+ /*
+ * xmax will become an updater Xid (original MultiXact's updater
+ * member Xid will be carried forward as a simple Xid in Xmax).
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(TransactionIdIsValid(newxmax));
+ if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
+ *relfrozenxid_out = newxmax;
+
/*
* NB -- some of these transformations are only valid because we
* know the return Xid is a tuple updater (i.e. not merely a
@@ -6487,6 +6564,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
uint16 newbits;
uint16 newbits2;
+ /*
+ * xmax is an old MultiXactId that we have to replace with a new
+ * MultiXactId, to carry forward two or more original member XIDs.
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax));
+ Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+ Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = mxid_oldest_xid_out;
+
/*
* We can't use GetMultiXactIdHintBits directly on the new multi
* here; that routine initializes the masks to all zeroes, which
@@ -6503,6 +6593,30 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
changed = true;
}
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * xmax is a MultiXactId, and nothing about it changes for now.
+ * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+ * both together.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+ Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
+ *relfrozenxid_out));
+ if (MultiXactIdPrecedes(xid, *relminmxid_out))
+ *relminmxid_out = xid;
+ *relfrozenxid_out = mxid_oldest_xid_out;
+ }
+ else
+ {
+ /*
+ * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
+ * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+ */
+ Assert(freeze_xmax);
+ Assert(!TransactionIdIsValid(newxmax));
+ }
}
else if (TransactionIdIsNormal(xid))
{
@@ -6527,15 +6641,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
errmsg_internal("cannot freeze committed xmax %u",
xid)));
freeze_xmax = true;
+ /* No need for relfrozenxid_out handling, since we'll freeze xmax */
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
{
freeze_xmax = false;
xmax_already_frozen = true;
+ /* No need for relfrozenxid_out handling for already-frozen xmax */
}
else
ereport(ERROR,
@@ -6576,6 +6696,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * No need for relfrozenxid_out handling, since we always freeze xvac.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6653,11 +6775,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId relfrozenxid_out = cutoff_xid;
+ MultiXactId relminmxid_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &relfrozenxid_out, &relminmxid_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7036,9 +7161,7 @@ ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status,
* heap_tuple_needs_eventual_freeze
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * will eventually require freezing. Similar to heap_tuple_needs_freeze,
- * but there's no cutoff, since we're trying to figure out whether freezing
- * will ever be needed, not whether it's needed now.
+ * will eventually require freezing (if tuple isn't removed by pruning first).
*/
bool
heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
@@ -7082,87 +7205,106 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
}
/*
- * heap_tuple_needs_freeze
+ * heap_tuple_would_freeze
*
- * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID or MultiXactId. If so, return true.
+ * Return value indicates if heap_prepare_freeze_tuple sibling function would
+ * freeze any of the XID/XMID fields from the tuple, given the same cutoffs.
+ * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
+ * could be processed by pruning away the whole tuple instead of freezing.
*
- * It doesn't matter whether the tuple is alive or dead, we are checking
- * to see if a tuple needs to be removed or frozen to avoid wraparound.
- *
- * NB: Cannot rely on hint bits here, they might not be set after a crash or
- * on a standby.
+ * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
+ * like the heap_prepare_freeze_tuple arguments that they're based on. We
+ * never freeze here, which makes tracking the oldest extant XID/MXID simple.
*/
bool
-heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
TransactionId xid;
+ MultiXactId multi;
+ bool would_freeze = false;
+ /* First deal with xmin */
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
-
- /*
- * The considerations for multixacts are complicated; look at
- * heap_prepare_freeze_tuple for justifications. This routine had better
- * be in sync with that one!
- */
- if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ if (TransactionIdIsNormal(xid))
{
- MultiXactId multi;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ /* Now deal with xmax */
+ xid = InvalidTransactionId;
+ multi = InvalidMultiXactId;
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
- return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
- {
- MultiXactMember *members;
- int nmembers;
- int i;
+ else
+ xid = HeapTupleHeaderGetRawXmax(tuple);
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
- }
+ if (TransactionIdIsNormal(xid))
+ {
+ /* xmax is a non-permanent XID */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ else if (!MultiXactIdIsValid(multi))
+ {
+ /* xmax is a permanent XID or invalid MultiXactId/XID */
+ }
+ else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ {
+ /* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
+ if (MultiXactIdPrecedes(multi, *relminmxid_out))
+ *relminmxid_out = multi;
+ /* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
+ would_freeze = true;
}
else
{
- xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ /* xmax is a MultiXactId that may have an updater XID */
+ MultiXactMember *members;
+ int nmembers;
+
+ if (MultiXactIdPrecedes(multi, *relminmxid_out))
+ *relminmxid_out = multi;
+ if (MultiXactIdPrecedes(multi, cutoff_multi))
+ would_freeze = true;
+
+ /* need to check whether any member of the mxact is old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
+ {
+ xid = members[i].xid;
+ Assert(TransactionIdIsNormal(xid));
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ if (nmembers > 0)
+ pfree(members);
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ /* heap_prepare_freeze_tuple always freezes xvac */
+ would_freeze = true;
+ }
}
- return false;
+ return would_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 87ab7775a..110bbfb56 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,7 +144,7 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
- /* Aggressive VACUUM (scan all unfrozen pages)? */
+ /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
bool skipwithvm;
@@ -173,8 +173,9 @@ typedef struct LVRelState
/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
/* Error reporting state */
char *relnamespace;
@@ -319,17 +320,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
skipwithvm;
bool frozenxid_updated,
minmulti_updated;
- BlockNumber orig_rel_pages;
+ BlockNumber orig_rel_pages,
+ new_rel_pages,
+ new_rel_allvisible;
char **indnames = NULL;
- BlockNumber new_rel_pages;
- BlockNumber new_rel_allvisible;
- double new_live_tuples;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
- TransactionId OldestXmin;
- TransactionId FreezeLimit;
- MultiXactId MultiXactCutoff;
+ TransactionId OldestXmin,
+ FreezeLimit;
+ MultiXactId OldestMxact,
+ MultiXactCutoff;
verbose = (params->options & VACOPT_VERBOSE) != 0;
instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
@@ -351,20 +352,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Get OldestXmin cutoff, which is used to determine which deleted tuples
* are considered DEAD, not just RECENTLY_DEAD. Also get related cutoffs
- * used to determine which XIDs/MultiXactIds will be frozen.
- *
- * If this is an aggressive VACUUM, then we're strictly required to freeze
- * any and all XIDs from before FreezeLimit, so that we will be able to
- * safely advance relfrozenxid up to FreezeLimit below (we must be able to
- * advance relminmxid up to MultiXactCutoff, too).
+ * used to determine which XIDs/MultiXactIds will be frozen. If this is
+ * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
+ * XIDs < FreezeLimit (or unfrozen MXIDs < MultiXactCutoff).
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -511,10 +509,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->vistest = GlobalVisTestFor(rel);
/* FreezeLimit controls XID freezing (always <= OldestXmin) */
vacrel->FreezeLimit = FreezeLimit;
- /* MultiXactCutoff controls MXID freezing */
+ /* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+ /* Initialize state used to track oldest extant XID/XMID */
+ vacrel->NewRelfrozenXid = OldestXmin;
+ vacrel->NewRelminMxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -548,16 +547,41 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Prepare to update rel's pg_class entry.
*
- * In principle new_live_tuples could be -1 indicating that we (still)
- * don't know the tuple count. In practice that probably can't happen,
- * since we'd surely have scanned some pages if the table is new and
- * nonempty.
- *
+ * Aggressive VACUUMs must advance relfrozenxid to a value >= FreezeLimit,
+ * and advance relminmxid to a value >= MultiXactCutoff.
+ */
+ Assert(!aggressive || vacrel->NewRelfrozenXid == OldestXmin ||
+ TransactionIdPrecedesOrEquals(FreezeLimit,
+ vacrel->NewRelfrozenXid));
+ Assert(!aggressive || vacrel->NewRelminMxid == OldestMxact ||
+ MultiXactIdPrecedesOrEquals(MultiXactCutoff,
+ vacrel->NewRelminMxid));
+
+ /*
+ * Non-aggressive VACUUMs might advance relfrozenxid to an XID that is
+ * either older or newer than FreezeLimit (same applies to relminmxid and
+ * MultiXactCutoff). But the state that tracks the oldest remaining XID
+ * and MXID cannot be trusted when any all-visible pages were skipped.
+ */
+ Assert(vacrel->NewRelfrozenXid == OldestXmin ||
+ TransactionIdPrecedesOrEquals(vacrel->relfrozenxid,
+ vacrel->NewRelfrozenXid));
+ Assert(vacrel->NewRelminMxid == OldestMxact ||
+ MultiXactIdPrecedesOrEquals(vacrel->relminmxid,
+ vacrel->NewRelminMxid));
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ {
+ /* Keep existing relfrozenxid and relminmxid (can't trust trackers) */
+ Assert(!aggressive);
+ vacrel->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->NewRelminMxid = InvalidMultiXactId;
+ }
+
+ /*
* For safety, clamp relallvisible to be not more than what we're setting
- * relpages to.
+ * pg_class.relpages to
*/
new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
- new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
@@ -565,33 +589,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Now actually update rel's pg_class entry.
*
- * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
- * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
- * provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * In principle new_live_tuples could be -1 indicating that we (still)
+ * don't know the tuple count. In practice that can't happen, since we
+ * scan every page that isn't skipped using the visibility map.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
- {
- /* Cannot advance relfrozenxid/relminmxid */
- Assert(!aggressive);
- frozenxid_updated = minmulti_updated = false;
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId,
- NULL, NULL, false);
- }
- else
- {
- Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
- orig_rel_pages);
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
- &frozenxid_updated, &minmulti_updated, false);
- }
+ vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ &frozenxid_updated, &minmulti_updated, false);
/*
* Report results to the stats collector, too.
@@ -605,7 +610,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
- Max(new_live_tuples, 0),
+ Max(vacrel->new_live_tuples, 0),
vacrel->recently_dead_tuples +
vacrel->missed_dead_tuples);
pgstat_progress_end_command();
@@ -674,7 +679,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
vacrel->removed_pages,
- vacrel->rel_pages,
+ new_rel_pages,
vacrel->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
100.0 * vacrel->scanned_pages / orig_rel_pages);
@@ -694,17 +699,19 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
+ Assert(diff > 0);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminMxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1584,6 +1591,8 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
@@ -1593,7 +1602,9 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
- /* Initialize (or reset) page-level counters */
+ /* Initialize (or reset) page-level state */
+ NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1800,8 +1811,8 @@ retry:
vacrel->relminmxid,
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
- &frozen[nfrozen],
- &tuple_totally_frozen))
+ &frozen[nfrozen], &tuple_totally_frozen,
+ &NewRelfrozenXid, &NewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1815,13 +1826,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1971,6 +1985,8 @@ lazy_scan_noprune(LVRelState *vacrel,
recently_dead_tuples,
missed_dead_tuples;
HeapTupleHeader tupleheader;
+ TransactionId NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ MultiXactId NewRelminMxid = vacrel->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2015,22 +2031,37 @@ lazy_scan_noprune(LVRelState *vacrel,
*hastup = true; /* page prevents rel truncation */
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
- if (heap_tuple_needs_freeze(tupleheader,
+ if (heap_tuple_would_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenXid, &NewRelminMxid))
{
+ /* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
if (vacrel->aggressive)
{
- /* Going to have to get cleanup lock for lazy_scan_prune */
+ /*
+ * Aggressive VACUUMs must always be able to advance rel's
+ * relfrozenxid to a value >= FreezeLimit (and be able to
+ * advance rel's relminmxid to a value >= MultiXactCutoff).
+ * The ongoing aggressive VACUUM won't be able to do that
+ * unless it can freeze an XID (or XMID) from this tuple now.
+ *
+ * The only safe option is to have caller perform processing
+ * of this page using lazy_scan_prune. Caller might have to
+ * wait a while for a cleanup lock, but it can't be helped.
+ */
vacrel->offnum = InvalidOffsetNumber;
return false;
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * Non-aggressive VACUUMs are under no obligation to advance
+ * relfrozenxid (even by one XID). We can be much laxer here.
+ *
+ * Currently we always just accept an older final relfrozenxid
+ * and/or relminmxid value. We never make caller wait or work a
+ * little harder, even when it likely makes sense to do so.
*/
- vacrel->freeze_cutoffs_valid = false;
}
ItemPointerSet(&(tuple.t_self), blkno, offnum);
@@ -2080,9 +2111,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
/*
- * Now save details of the LP_DEAD items from the page in vacrel (though
- * only when VACUUM uses two-pass strategy)
+ * By here we know for sure that caller can put off freezing and pruning
+ * this particular page until the next VACUUM. Remember its details now.
+ * (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
+
+ /* Save any LP_DEAD items found on the page in dead_items array */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 50a4a612e..deec4887b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,22 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact are the most recent values that can ever be
+ * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
+ * vacuumlazy.c caller later on. These values should be passed when it turns
+ * out that VACUUM will leave no unfrozen XIDs/XMIDs behind in the table.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +969,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +978,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1073,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1092,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
@@ -1390,12 +1400,9 @@ vac_update_relstats(Relation relation,
* Update relfrozenxid, unless caller passed InvalidTransactionId
* indicating it has no new data.
*
- * Ordinarily, we don't let relfrozenxid go backwards: if things are
- * working correctly, the only way the new frozenxid could be older would
- * be if a previous VACUUM was done with a tighter freeze_min_age, in
- * which case we don't want to forget the work it already did. However,
- * if the stored relfrozenxid is "in the future", then it must be corrupt
- * and it seems best to overwrite it with the cutoff we used this time.
+ * Ordinarily, we don't let relfrozenxid go backwards. However, if the
+ * stored relfrozenxid is "in the future" then it seems best to assume
+ * it's corrupt, and overwrite with the oldest remaining XID in the table.
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 34d72dba7..0a7b38c17 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -585,9 +585,11 @@
statistics in the system tables <structname>pg_class</structname> and
<structname>pg_database</structname>. In particular,
the <structfield>relfrozenxid</structfield> column of a table's
- <structname>pg_class</structname> row contains the freeze cutoff XID that was used
- by the last aggressive <command>VACUUM</command> for that table. All rows
- inserted by transactions with XIDs older than this cutoff XID are
+ <structname>pg_class</structname> row contains the oldest
+ remaining XID at the end of the most recent <command>VACUUM</command>
+ that successfully advanced <structfield>relfrozenxid</structfield>
+ (typically the most recent aggressive VACUUM). All rows inserted
+ by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</structfield> column of a database's
<structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
@@ -610,6 +612,17 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
cutoff XID to the current transaction's XID.
</para>
+ <tip>
+ <para>
+ <literal>VACUUM VERBOSE</literal> outputs information about
+ <structfield>relfrozenxid</structfield> and/or
+ <structfield>relminmxid</structfield> when either field was
+ advanced. The same details appear in the server log when <xref
+ linkend="guc-log-autovacuum-min-duration"/> reports on vacuuming
+ by autovacuum.
+ </para>
+ </tip>
+
<para>
<command>VACUUM</command> normally only scans pages that have been modified
since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
@@ -624,7 +637,11 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
set <literal>age(relfrozenxid)</literal> to a value just a little more than the
<varname>vacuum_freeze_min_age</varname> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</command> started). If no <structfield>relfrozenxid</structfield>-advancing
+ <command>VACUUM</command> started). <command>VACUUM</command>
+ will set <structfield>relfrozenxid</structfield> to the oldest XID
+ that remains in the table, so it's possible that the final value
+ will be much more recent than strictly required.
+ If no <structfield>relfrozenxid</structfield>-advancing
<command>VACUUM</command> is issued on the table until
<varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
be forced for the table.
@@ -711,8 +728,9 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- Aggressive <command>VACUUM</command> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Aggressive <command>VACUUM</command> scans, regardless of what
+ causes them, are <emphasis>guaranteed</emphasis> to be able to
+ advance the table's <structfield>relminmxid</structfield>.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
new file mode 100644
index 000000000..f7bc93e8f
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -0,0 +1,189 @@
+Parsed test spec with 4 sessions
+
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+step dml_begin: BEGIN;
+step dml_other_begin: BEGIN;
+step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step dml_other_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
+step dml_commit: COMMIT;
+step dml_other_commit: COMMIT;
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_commit:
+ COMMIT;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
deleted file mode 100644
index ce55376e7..000000000
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ /dev/null
@@ -1,67 +0,0 @@
-Parsed test spec with 2 sessions
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify open fetch1 vac close stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step open:
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-
-step fetch1:
- fetch next from c1;
-
-dummy
------
- 1
-(1 row)
-
-step vac:
- vacuum smalltbl;
-
-step close:
- commit;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 00749a40b..a48caae22 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -84,7 +84,7 @@ test: alter-table-4
test: create-trigger
test: sequence-ddl
test: async-notify
-test: vacuum-reltuples
+test: vacuum-no-cleanup-lock
test: timeouts
test: vacuum-concurrent-drop
test: vacuum-conflict
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
new file mode 100644
index 000000000..a88be66de
--- /dev/null
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -0,0 +1,150 @@
+# Test for vacuum's reduced processing of heap pages (used for any heap page
+# where a cleanup lock isn't immediately available)
+#
+# Debugging tip: Change VACUUM to VACUUM VERBOSE to get feedback on what's
+# really going on
+
+# Use name type here to avoid TOAST table:
+setup
+{
+ CREATE TABLE smalltbl AS SELECT i AS id, 't'::name AS t FROM generate_series(1,20) i;
+ ALTER TABLE smalltbl SET (autovacuum_enabled = off);
+ ALTER TABLE smalltbl ADD PRIMARY KEY (id);
+}
+setup
+{
+ VACUUM ANALYZE smalltbl;
+}
+
+teardown
+{
+ DROP TABLE smalltbl;
+}
+
+# This session holds a pin on smalltbl's only heap page:
+session pinholder
+step pinholder_cursor
+{
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+}
+step pinholder_commit
+{
+ COMMIT;
+}
+
+# This session inserts and deletes tuples, potentially affecting reltuples:
+session dml
+step dml_insert
+{
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+}
+step dml_delete
+{
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+}
+step dml_begin { BEGIN; }
+step dml_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_commit { COMMIT; }
+
+# Needed for Multixact test:
+session dml_other
+step dml_other_begin { BEGIN; }
+step dml_other_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_other_update { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
+step dml_other_commit { COMMIT; }
+
+# This session runs non-aggressive VACUUM, but with maximally aggressive
+# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+session vacuumer
+setup
+{
+ SET vacuum_freeze_min_age = 0;
+ SET vacuum_multixact_freeze_min_age = 0;
+}
+step vacuumer_nonaggressive_vacuum
+{
+ VACUUM smalltbl;
+}
+step vacuumer_pg_class_stats
+{
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+}
+
+# Test VACUUM's reltuples counting mechanism.
+#
+# Final pg_class.reltuples should never be affected by VACUUM's inability to
+# get a cleanup lock on any page, except to the extent that any cleanup lock
+# contention changes the number of tuples that remain ("missed dead" tuples
+# are counted in reltuples, much like "recently dead" tuples).
+
+# Easy case:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+
+# Harder case -- count 21 tuples at the end (like last time), but with cleanup
+# lock contention this time:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ pinholder_cursor
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but vary the order, and delete an inserted row:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ pinholder_cursor
+ dml_insert
+ dml_delete
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "recently dead" tuple won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but initial insert and delete before cursor:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ dml_delete
+ pinholder_cursor
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
+ # concurrent activity held back VACUUM's OldestXmin) won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Test VACUUM's mechanism for skipping MultiXact freezing.
+#
+# This provides test coverage for code paths that are only hit when we need to
+# freeze, but inability to acquire a cleanup lock on a heap page makes
+# freezing some XIDs/XMIDs < FreezeLimit/MultiXactCutoff impossible (without
+# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+permutation
+ dml_begin
+ dml_other_begin
+ dml_key_share
+ dml_other_key_share
+ # Will get cleanup lock, can't advance relminmxid yet:
+ # (though will usually advance relfrozenxid by ~2 XIDs)
+ vacuumer_nonaggressive_vacuum
+ pinholder_cursor
+ dml_other_update
+ dml_commit
+ dml_other_commit
+ # Can't cleanup lock, so still can't advance relminmxid here:
+ # (relfrozenxid held back by XIDs in MultiXact too)
+ vacuumer_nonaggressive_vacuum
+ pinholder_commit
+ # Pin was dropped, so will advance relminmxid, at long last:
+ # (ditto for relfrozenxid advancement)
+ vacuumer_nonaggressive_vacuum
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
deleted file mode 100644
index a2a461f2f..000000000
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ /dev/null
@@ -1,49 +0,0 @@
-# Test for vacuum's handling of reltuples when pages are skipped due
-# to page pins. We absolutely need to avoid setting reltuples=0 in
-# such cases, since that interferes badly with planning.
-#
-# Expected result for all three permutation is 21 tuples, including
-# the second permutation. VACUUM is able to count the concurrently
-# inserted tuple in its final reltuples, even when a cleanup lock
-# cannot be acquired on the affected heap page.
-
-setup {
- create table smalltbl
- as select i as id from generate_series(1,20) i;
- alter table smalltbl set (autovacuum_enabled = off);
-}
-setup {
- vacuum analyze smalltbl;
-}
-
-teardown {
- drop table smalltbl;
-}
-
-session worker
-step open {
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-}
-step fetch1 {
- fetch next from c1;
-}
-step close {
- commit;
-}
-step stats {
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-}
-
-session vacuumer
-step vac {
- vacuum smalltbl;
-}
-step modify {
- insert into smalltbl select max(id)+1 from smalltbl;
-}
-
-permutation modify vac stats
-permutation modify open fetch1 vac close stats
-permutation modify vac stats
--
2.32.0
Attachment: v14-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch (application/octet-stream)
From 8e6e859dc2ba235b4c41d712e1bdbb4884692725 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v14 2/4] Generalize how VACUUM skips all-frozen pages.
Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid before now. The
underlying issue was that lazy_scan_heap conditioned its skipping
behavior on whether or not the current VACUUM was aggressive. VACUUM
could fail to increment its frozenskipped_pages counter as a result, and
so could miss out on advancing relfrozenxid (in the non-aggressive case)
for no good reason.
The issue only comes up when concurrent activity might unset a page's
visibility map bit at exactly the wrong time. The non-aggressive case
rechecked the visibility map at the point of skipping each page before
now. This created a window for some other session to concurrently unset
the same heap page's bit in the visibility map. If the bit was unset at
the wrong time, it would cause VACUUM to conservatively conclude that
the page was _never_ all-frozen on recheck. frozenskipped_pages would
not be incremented for the page as a result. lazy_scan_heap had already
committed to skipping the page/range at that point, though -- which made
it unsafe to advance relfrozenxid/relminmxid later on.
Consistently avoid the issue by generalizing how we skip frozen pages
during aggressive VACUUMs: take the same approach when skipping any
skippable page range during aggressive and non-aggressive VACUUMs alike.
The new approach makes ranges (not individual pages) the fundamental
unit of skipping using the visibility map. frozenskipped_pages is
replaced with a boolean flag that represents whether some skippable
range with one or more all-visible pages was actually skipped (making
relfrozenxid unsafe to update).
It is safe for VACUUM to treat a page as all-frozen provided it at least
had its all-frozen bit set after the OldestXmin cutoff was established.
VACUUM is only required to scan pages that might have XIDs < OldestXmin
that are not yet frozen to be able to safely advance relfrozenxid.
Tuples concurrently inserted on skipped pages are equivalent to tuples
concurrently inserted on a block >= rel_pages from the same table.
It's possible that the issue this commit fixes hardly ever came up in
practice. But we only had to be unlucky once to lose out on advancing
relfrozenxid -- a single affected heap page was enough to throw VACUUM
off. That seems like something to avoid on general principle. This is
similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c
to not give up on non-aggressive relfrozenxid advancement just because a
cleanup lock wasn't immediately available on some heap page.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
Discussion: https://postgr.es/m/CA+TgmobhuzSR442_cfpgxidmiRdL-GdaFSc8SD=GJcpLTx_BAw@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 309 +++++++++++++--------------
1 file changed, 146 insertions(+), 163 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 110bbfb56..b0d70a0b1 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -176,6 +176,7 @@ typedef struct LVRelState
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
+ bool skippedallvis;
/* Error reporting state */
char *relnamespace;
@@ -196,7 +197,6 @@ typedef struct LVRelState
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
@@ -247,6 +247,10 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
+ BlockNumber next_block,
+ bool *next_unskippable_allvis,
+ bool *skipping_current_range);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
bool sharelock, Buffer vmbuffer);
@@ -467,7 +471,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize page counters explicitly (be tidy) */
vacrel->scanned_pages = 0;
- vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
@@ -514,6 +517,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize state used to track oldest extant XID/XMID */
vacrel->NewRelfrozenXid = OldestXmin;
vacrel->NewRelminMxid = OldestMxact;
+ vacrel->skippedallvis = false;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -569,7 +573,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
Assert(vacrel->NewRelminMxid == OldestMxact ||
MultiXactIdPrecedesOrEquals(vacrel->relminmxid,
vacrel->NewRelminMxid));
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ if (vacrel->skippedallvis)
{
/* Keep existing relfrozenxid and relminmxid (can't trust trackers) */
Assert(!aggressive);
@@ -838,7 +842,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
next_failsafe_block,
next_fsm_block_to_vacuum;
Buffer vmbuffer = InvalidBuffer;
- bool skipping_blocks;
+ bool next_unskippable_allvis,
+ skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -869,179 +874,52 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
initprog_val[2] = dead_items->max_items;
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
- /*
- * Set things up for skipping blocks using visibility map.
- *
- * Except when vacrel->aggressive is set, we want to skip pages that are
- * all-visible according to the visibility map, but only when we can skip
- * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
- * sequentially, the OS should be doing readahead for us, so there's no
- * gain in skipping a page now and then; that's likely to disable
- * readahead and so be counterproductive. Also, skipping even a single
- * page means that we can't update relfrozenxid, so we only want to do it
- * if we can skip a goodly number of pages.
- *
- * When vacrel->aggressive is set, we can't skip pages just because they
- * are all-visible, but we can still skip pages that are all-frozen, since
- * such pages do not need freezing and do not affect the value that we can
- * safely set for relfrozenxid or relminmxid.
- *
- * Before entering the main loop, establish the invariant that
- * next_unskippable_block is the next block number >= blkno that we can't
- * skip based on the visibility map, either all-visible for a regular scan
- * or all-frozen for an aggressive scan. We set it to rel_pages when
- * there's no such block. We also set up the skipping_blocks flag
- * correctly at this stage.
- *
- * Note: The value returned by visibilitymap_get_status could be slightly
- * out-of-date, since we make this test before reading the corresponding
- * heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible or all-frozen when in fact the flag's just
- * been cleared, we might fail to vacuum the page. It's easy to see that
- * skipping a page when aggressive is not set is not a very big deal; we
- * might leave some dead tuples lying around, but the next vacuum will
- * find them. But even when aggressive *is* set, it's still OK if we miss
- * a page whose all-frozen marking has just been cleared. Any new XIDs
- * just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they'll have no effect on the value to which we can safely set
- * relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- */
- next_unskippable_block = 0;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmstatus;
-
- vmstatus = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
- if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
-
+ /* Set up an initial range of skippable blocks using the visibility map */
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
+ &next_unskippable_allvis,
+ &skipping_current_range);
for (blkno = 0; blkno < rel_pages; blkno++)
{
Buffer buf;
Page page;
- bool all_visible_according_to_vm = false;
+ bool all_visible_according_to_vm;
LVPagePruneState prunestate;
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
- blkno, InvalidOffsetNumber);
-
if (blkno == next_unskippable_block)
{
- /* Time to advance next_unskippable_block */
- next_unskippable_block++;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmskipflags;
-
- vmskipflags = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
/*
- * We know we can't skip the current block. But set up
- * skipping_blocks to do the right thing at the following blocks.
+ * Can't skip this page safely. Must scan the page. But
+ * determine the next skippable range after the page first.
*/
- if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
+ all_visible_according_to_vm = next_unskippable_allvis;
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
+ blkno + 1,
+ &next_unskippable_allvis,
+ &skipping_current_range);
- /*
- * Normally, the fact that we can't skip this block must mean that
- * it's not all-visible. But in an aggressive vacuum we know only
- * that it's not all-frozen, so it might still be all-visible.
- */
- if (vacrel->aggressive &&
- VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
- all_visible_according_to_vm = true;
+ Assert(next_unskippable_block >= blkno + 1);
}
else
{
- /*
- * The current page can be skipped if we've seen a long enough run
- * of skippable blocks to justify skipping it -- provided it's not
- * the last page in the relation (according to rel_pages).
- *
- * We always scan the table's last page to determine whether it
- * has tuples or not, even if it would otherwise be skipped. This
- * avoids having lazy_truncate_heap() take access-exclusive lock
- * on the table to attempt a truncation that just fails
- * immediately because there are tuples on the last page.
- */
- if (skipping_blocks && blkno < rel_pages - 1)
- {
- /*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
- */
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
- continue;
- }
+ /* Last page always scanned (may need to set nonempty_pages) */
+ Assert(blkno < rel_pages - 1);
- /*
- * SKIP_PAGES_THRESHOLD (threshold for skipping) was not
- * crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
- */
+ if (skipping_current_range)
+ continue;
+
+ /* Current range is too small to skip -- just scan the page */
all_visible_according_to_vm = true;
}
- vacuum_delay_point();
-
- /*
- * We're not skipping this page using the visibility map, and so it is
- * (by definition) a scanned page. Any tuples from this page are now
- * guaranteed to be counted below, after some preparatory checks.
- */
vacrel->scanned_pages++;
+ /* Report as block scanned, update error traceback information */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+ blkno, InvalidOffsetNumber);
+
+ vacuum_delay_point();
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1241,8 +1119,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
/*
- * Handle setting visibility map bit based on what the VM said about
- * the page before pruning started, and using prunestate
+ * Handle setting visibility map bit based on information from the VM
+ * (as of last lazy_scan_skip() call), and from prunestate
*/
if (!all_visible_according_to_vm && prunestate.all_visible)
{
@@ -1274,9 +1152,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
- * got cleared after we checked it and before we took the buffer
- * content lock, so we must recheck before jumping to the conclusion
- * that something bad has happened.
+ * got cleared after lazy_scan_skip() was called, so we must recheck
+ * with buffer lock before concluding that the VM is corrupt.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
@@ -1315,7 +1192,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* If the all-visible page is all-frozen but not marked as such yet,
* mark it as all-frozen. Note that all_frozen is only valid if
- * all_visible is true, so we must check both.
+ * all_visible is true, so we must check both prunestate fields.
*/
else if (all_visible_according_to_vm && prunestate.all_visible &&
prunestate.all_frozen &&
@@ -1421,6 +1298,112 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
Assert(!IsInParallelMode());
}
+/*
+ * lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *
+ * lazy_scan_heap() calls here every time it needs to set up a new range of
+ * blocks to skip via the visibility map. Caller passes the next block in
+ * line. We return a next_unskippable_block for this range. When there are
+ * no skippable blocks we just return caller's next_block. The all-visible
+ * status of the returned block is set in *next_unskippable_allvis for caller,
+ * too. Block usually won't be all-visible (since it's unskippable), but it
+ * can be during aggressive VACUUMs (as well as in certain edge cases).
+ *
+ * Sets *skipping_current_range to indicate if caller should skip this range.
+ * Costs and benefits drive our decision. Very small ranges won't be skipped.
+ *
+ * Note: our opinion of which blocks can be skipped can go stale immediately.
+ * It's okay if caller "misses" a page whose all-visible or all-frozen marking
+ * was concurrently cleared, though. All that matters is that caller scan all
+ * pages whose tuples might contain XIDs < OldestXmin, or XMIDs < OldestMxact.
+ * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
+ * older XIDs/MXIDs. The vacrel->skippedallvis flag will be set here when the
+ * choice to skip such a range is actually made, making everything safe.)
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
+ bool *next_unskippable_allvis, bool *skipping_current_range)
+{
+ BlockNumber rel_pages = vacrel->rel_pages,
+ next_unskippable_block = next_block,
+ nskippable_blocks = 0;
+ bool skipsallvis = false;
+
+ *next_unskippable_allvis = true;
+ while (next_unskippable_block < rel_pages)
+ {
+ uint8 mapbits = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ vmbuffer);
+
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ *next_unskippable_allvis = false;
+ break;
+ }
+
+ /*
+ * Caller must scan the last page to determine whether it has tuples
+ * (caller must have the opportunity to set vacrel->nonempty_pages).
+ * This rule avoids having lazy_truncate_heap() take access-exclusive
+ * lock on rel to attempt a truncation that fails anyway, just because
+ * there are tuples on the last page (it is likely that there will be
+ * tuples on other nearby pages as well, but those can be skipped).
+ *
+ * Implement this by always treating the last block as unsafe to skip.
+ */
+ if (next_unskippable_block == rel_pages - 1)
+ break;
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible. They may still skip all-frozen pages, which can't
+ * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ skipsallvis = true;
+ }
+
+ vacuum_delay_point();
+ next_unskippable_block++;
+ nskippable_blocks++;
+ }
+
+ /*
+ * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
+ * pages. Since we're reading sequentially, the OS should be doing
+ * readahead for us, so there's no gain in skipping a page now and then.
+ * Skipping such a range might even discourage sequential detection.
+ *
+ * This test also enables more frequent relfrozenxid advancement during
+ * non-aggressive VACUUMs. If the range has any all-visible pages then
+ * skipping makes updating relfrozenxid unsafe, which is a real downside.
+ */
+ if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
+ *skipping_current_range = false;
+ else
+ {
+ *skipping_current_range = true;
+ if (skipsallvis)
+ vacrel->skippedallvis = true;
+ }
+
+ return next_unskippable_block;
+}
+
/*
* lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
*
--
2.32.0
Attachment: v14-0004-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch (application/octet-stream)
From 4cd96ce355a72e9d504c9d88bd1bc3923c08f397 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 25 Mar 2022 12:51:05 -0700
Subject: [PATCH v14 4/4] vacuumlazy.c: Move resource allocation to
heap_vacuum_rel().
Finish off work started by commit 73f6ec3d: move remaining resource
allocation and deallocation code from lazy_scan_heap() to its caller,
heap_vacuum_rel().
Also remove unnecessary progress report calls for the last block number.
---
src/backend/access/heap/vacuumlazy.c | 74 +++++++++++-----------------
1 file changed, 28 insertions(+), 46 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b0d70a0b1..bf82c98fb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -246,7 +246,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static void lazy_scan_heap(LVRelState *vacrel);
static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
BlockNumber next_block,
bool *next_unskippable_allvis,
@@ -519,11 +519,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->NewRelminMxid = OldestMxact;
vacrel->skippedallvis = false;
+ /*
+ * Allocate dead_items array memory using dead_items_alloc. This handles
+ * parallel VACUUM initialization as part of allocating shared memory
+ * space used for dead_items. (But do a failsafe precheck first, to
+ * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
+ * is already dangerously old.)
+ */
+ lazy_check_wraparound_failsafe(vacrel);
+ dead_items_alloc(vacrel, params->nworkers);
+
/*
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params->nworkers);
+ lazy_scan_heap(vacrel);
+
+ /*
+ * Free resources managed by dead_items_alloc. This ends parallel mode in
+ * passing when necessary.
+ */
+ dead_items_cleanup(vacrel);
+ Assert(!IsInParallelMode());
/*
* Update pg_class entries for each of rel's indexes where appropriate.
@@ -833,14 +850,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, int nworkers)
+lazy_scan_heap(LVRelState *vacrel)
{
- VacDeadItems *dead_items;
BlockNumber rel_pages = vacrel->rel_pages,
blkno,
next_unskippable_block,
- next_failsafe_block,
- next_fsm_block_to_vacuum;
+ next_failsafe_block = 0,
+ next_fsm_block_to_vacuum = 0;
+ VacDeadItems *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -851,23 +868,6 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
};
int64 initprog_val[3];
- /*
- * Do failsafe precheck before calling dead_items_alloc. This ensures
- * that parallel VACUUM won't be attempted when relfrozenxid is already
- * dangerously old.
- */
- lazy_check_wraparound_failsafe(vacrel);
- next_failsafe_block = 0;
-
- /*
- * Allocate the space for dead_items. Note that this handles parallel
- * VACUUM initialization as part of allocating shared memory space used
- * for dead_items.
- */
- dead_items_alloc(vacrel, nworkers);
- dead_items = vacrel->dead_items;
- next_fsm_block_to_vacuum = 0;
-
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
@@ -1244,11 +1244,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
}
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- /* Clear the block number information */
vacrel->blkno = InvalidBlockNumber;
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1264,15 +1262,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->missed_dead_tuples;
/*
- * Release any remaining pin on visibility map page.
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
*/
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a final round of index and heap vacuuming */
if (dead_items->num_items > 0)
lazy_vacuum(vacrel);
@@ -1283,19 +1275,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
if (blkno > next_fsm_block_to_vacuum)
FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
-
- /* Do post-vacuum cleanup */
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
lazy_cleanup_all_indexes(vacrel);
-
- /*
- * Free resources managed by dead_items_alloc. This ends parallel mode in
- * passing when necessary.
- */
- dead_items_cleanup(vacrel);
- Assert(!IsInParallelMode());
}
/*
--
2.32.0
Hi,
I was able to trigger the crash.
cat ~/tmp/pgbench-createdb.sql
CREATE DATABASE pgb_:client_id;
DROP DATABASE pgb_:client_id;
pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql
while I was also running
for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
I triggered it twice now, but it took a while longer the second time.
(gdb) bt full
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
set = {__val = {4194304, 0, 0, 0, 0, 0, 216172782113783808, 2, 2377909399344644096, 18446497967838863616, 0, 0, 0, 0, 0, 0}}
pid = <optimized out>
tid = <optimized out>
ret = <optimized out>
#1 0x00007fe49a2db546 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}},
sa_flags = 0, sa_restorer = 0x107e0}
sigs = {__val = {32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
#2 0x00007fe49b9706f1 in ExceptionalCondition (conditionName=0x7fe49ba0618d "diff > 0", errorType=0x7fe49ba05bd1 "FailedAssertion",
fileName=0x7fe49ba05b90 "/home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c", lineNumber=724)
at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
No locals.
#3 0x00007fe49b2fc739 in heap_vacuum_rel (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10)
at /home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:724
buf = {
data = 0x7fe49c17e238 "automatic vacuum of table \"contrib_regression_dict_int.pg_catalog.pg_database\": index scans: 1\npages: 0 removed, 3 remain, 3 scanned (100.00% of total)\ntuples: 49 removed, 53 remain, 9 are dead but no"..., len = 279, maxlen = 1024, cursor = 0}
msgfmt = 0x7fe49ba06038 "automatic vacuum of table \"%s.%s.%s\": index scans: %d\n"
diff = 0
endtime = 702011687982080
vacrel = 0x7fe49c19b5b8
verbose = false
instrument = true
ru0 = {tv = {tv_sec = 1648696487, tv_usec = 975963}, ru = {ru_utime = {tv_sec = 0, tv_usec = 0}, ru_stime = {tv_sec = 0, tv_usec = 3086}, {
ru_maxrss = 10824, __ru_maxrss_word = 10824}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 449, __ru_minflt_word = 449}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 0, __ru_oublock_word = 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 2, __ru_nvcsw_word = 2}, {ru_nivcsw = 0, __ru_nivcsw_word = 0}}}
starttime = 702011687975964
walusage_start = {wal_records = 0, wal_fpi = 0, wal_bytes = 0}
walusage = {wal_records = 11, wal_fpi = 7, wal_bytes = 30847}
secs = 0
usecs = 6116
read_rate = 16.606033355134073
write_rate = 7.6643230869849575
aggressive = false
skipwithvm = true
frozenxid_updated = true
minmulti_updated = true
orig_rel_pages = 3
new_rel_pages = 3
new_rel_allvisible = 0
indnames = 0x7fe49c19bb28
errcallback = {previous = 0x0, callback = 0x7fe49b3012fd <vacuum_error_callback>, arg = 0x7fe49c19b5b8}
startreadtime = 180
startwritetime = 0
OldestXmin = 67552
FreezeLimit = 4245034848
OldestMxact = 224
MultiXactCutoff = 4289967520
__func__ = "heap_vacuum_rel"
#4 0x00007fe49b523d92 in table_relation_vacuum (rel=0x7fe497a8d148, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/include/access/tableam.h:1680
No locals.
#5 0x00007fe49b527032 in vacuum_rel (relid=1262, relation=0x7fe49c1ae360, params=0x7fe49c130d7c) at /home/andres/src/postgresql/src/backend/commands/vacuum.c:2065
lmode = 4
rel = 0x7fe497a8d148
lockrelid = {relId = 1262, dbId = 0}
toast_relid = 0
save_userid = 10
save_sec_context = 0
save_nestlevel = 2
__func__ = "vacuum_rel"
#6 0x00007fe49b524c3b in vacuum (relations=0x7fe49c1b03a8, params=0x7fe49c130d7c, bstrategy=0x7fe49c130e10, isTopLevel=true) at /home/andres/src/postgresql/src/backend/commands/vacuum.c:482
vrel = 0x7fe49c1ae3b8
cur__state = {l = 0x7fe49c1b03a8, i = 0}
cur = 0x7fe49c1b03c0
_save_exception_stack = 0x7fff97e35a10
_save_context_stack = 0x0
_local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318940970843, 9223372036854775747, 0, 0, 0, 6126579318957748059, 6139499258682879835}, __mask_was_saved = 0, __saved_mask = {__val = {32, 140619848279000, 8590910454, 140619848278592, 32, 140619848278944, 7784, 140619848278592, 140619848278816, 140735741647200, 140619839915137, 8458711686435861857, 32, 4869, 140619848278592, 140619848279024}}}}
_do_rethrow = false
in_vacuum = true
stmttype = 0x7fe49baff1a7 "VACUUM"
in_outer_xact = false
use_own_xacts = true
__func__ = "vacuum"
#7 0x00007fe49b6d483d in autovacuum_do_vac_analyze (tab=0x7fe49c130d78, bstrategy=0x7fe49c130e10) at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:3247
rangevar = 0x7fe49c1ae360
rel = 0x7fe49c1ae3b8
rel_list = 0x7fe49c1ae3f0
#8 0x00007fe49b6d34bc in do_autovacuum () at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:2495
_save_exception_stack = 0x7fff97e35d70
_save_context_stack = 0x0
_local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318779490139, 9223372036854775747, 0, 0, 0, 6126579319014371163, 6139499700101525339}, __mask_was_saved = 0, __saved_mask = {__val = {140619840139982, 140735741647712, 140619841923928, 957, 140619847223443, 140735741647656, 140619847312112, 140619847223451, 140619847223443, 140619847224399, 0, 139637976727552, 140619817480714, 140735741647616, 140619839856340, 1024}}}}
_do_rethrow = false
tab = 0x7fe49c130d78
skipit = false
stdVacuumCostDelay = 0
stdVacuumCostLimit = 200
iter = {cur = 0x7fe497668da0, end = 0x7fe497668da0}
relid = 1262
classTup = 0x7fe497a6c568
isshared = true
cell__state = {l = 0x7fe49c130d40, i = 0}
classRel = 0x7fe497a5ae18
tuple = 0x0
relScan = 0x7fe49c130928
dbForm = 0x7fe497a64fb8
table_oids = 0x7fe49c130d40
orphan_oids = 0x0
ctl = {num_partitions = 0, ssize = 0, dsize = 1296236544, max_dsize = 140619847224424, keysize = 4, entrysize = 96, hash = 0x0, match = 0x0, keycopy = 0x0, alloc = 0x0, hcxt = 0x7fff97e35c50, hctl = 0x7fe49b9a787e <AllocSetFree+670>}
table_toast_map = 0x7fe49c19d2f0
cell = 0x7fe49c130d58
shared = 0x7fe49c17c360
dbentry = 0x7fe49c18d7a0
bstrategy = 0x7fe49c130e10
key = {sk_flags = 0, sk_attno = 17, sk_strategy = 3, sk_subtype = 0, sk_collation = 950, sk_func = {fn_addr = 0x7fe49b809a6a <chareq>, fn_oid = 61, fn_nargs = 2, fn_strict = true, fn_retset = false, fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x7fe49c12f7f0, fn_expr = 0x0}, sk_argument = 116}
pg_class_desc = 0x7fe49c12f910
effective_multixact_freeze_max_age = 400000000
did_vacuum = false
found_concurrent_worker = false
i = 32740
__func__ = "do_autovacuum"
#9 0x00007fe49b6d21c4 in AutoVacWorkerMain (argc=0, argv=0x0) at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1719
dbname = "contrib_regression_dict_int\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
local_sigjmp_buf = {{__jmpbuf = {140735741652128, 6126579318890639195, 9223372036854775747, 0, 0, 0, 6126579318785781595, 6139499699353759579}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964099, 8, 140735741648416, 140735741648352, 3156423108750738944, 0, 30, 140735741647888, 140619835812981, 140735741648080, 32666874400, 140735741648448, 140619836964693, 140735741652128, 2586778441, 140735741648448}}}}
dbid = 205328
__func__ = "AutoVacWorkerMain"
#10 0x00007fe49b6d1d5b in StartAutoVacWorker () at /home/andres/src/postgresql/src/backend/postmaster/autovacuum.c:1504
worker_pid = 0
__func__ = "StartAutoVacWorker"
#11 0x00007fe49b6e79af in StartAutovacuumWorker () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5635
bn = 0x7fe49c0da920
__func__ = "StartAutovacuumWorker"
#12 0x00007fe49b6e745d in sigusr1_handler (postgres_signal_arg=10) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5340
save_errno = 4
__func__ = "sigusr1_handler"
#13 <signal handler called>
No locals.
#14 0x00007fe49a3a9fc4 in __GI___select (nfds=8, readfds=0x7fff97e36c20, writefds=0x0, exceptfds=0x0, timeout=0x7fff97e36ca0) at ../sysdeps/unix/sysv/linux/select.c:71
sc_ret = -4
sc_ret = <optimized out>
s = <optimized out>
us = <optimized out>
ns = <optimized out>
ts64 = {tv_sec = 59, tv_nsec = 765565741}
pts64 = <optimized out>
r = <optimized out>
#15 0x00007fe49b6e26c7 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1765
timeout = {tv_sec = 60, tv_usec = 0}
rmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
selres = -1
now = 1648696487
readmask = {fds_bits = {224, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}}
nSockets = 8
last_lockfile_recheck_time = 1648696432
last_touch_time = 1648696072
__func__ = "ServerLoop"
#16 0x00007fe49b6e2031 in PostmasterMain (argc=55, argv=0x7fe49c0aa2d0) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1473
opt = -1
status = 0
userDoption = 0x7fe49c0951d0 "/srv/dev/pgdev-dev/"
listen_addr_saved = true
i = 64
output_config_variable = 0x0
__func__ = "PostmasterMain"
#17 0x00007fe49b5d2808 in main (argc=55, argv=0x7fe49c0aa2d0) at /home/andres/src/postgresql/src/backend/main/main.c:202
do_check_root = true
Greetings,
Andres Freund
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
I triggered it twice now, but it took a while longer the second time.
Great.
I wonder if you can get an RR recording...
--
Peter Geoghegan
Hi,
On 2022-03-30 20:28:44 -0700, Andres Freund wrote:
I was able to trigger the crash.
cat ~/tmp/pgbench-createdb.sql
CREATE DATABASE pgb_:client_id;
DROP DATABASE pgb_:client_id;
pgbench -n -P1 -c 10 -j10 -T100 -f ~/tmp/pgbench-createdb.sql
while I was also running
for i in $(seq 1 100); do echo iteration $i; make -Otarget -C contrib/ -s installcheck -j48 -s prove_installcheck=true USE_MODULE_DB=1 > /tmp/ci-$i.log 2>&1; done
I triggered it twice now, but it took a while longer the second time.
Forgot to say how postgres was started. Via my usual devenv script, which
results in:
+ /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres -c hba_file=/home/andres/tmp/pgdev/pg_hba.conf -D /srv/dev/pgdev-dev/ -p 5440 -c shared_buffers=2GB -c wal_level=hot_standby -c max_wal_senders=10 -c track_io_timing=on -c restart_after_crash=false -c max_prepared_transactions=20 -c log_checkpoints=on -c min_wal_size=48MB -c max_wal_size=150GB -c 'cluster_name=dev assert' -c ssl_cert_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.pem -c ssl_key_file=/home/andres/tmp/pgdev/ssl-cert-snakeoil.key -c 'log_line_prefix=%m [%p][%b][%v:%x][%a] ' -c shared_buffers=16MB -c log_min_messages=debug1 -c log_connections=on -c allow_in_place_tablespaces=1 -c log_autovacuum_min_duration=0 -c log_lock_waits=true -c autovacuum_naptime=10s -c fsync=off
Greetings,
Andres Freund
Hi,
On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
I triggered it twice now, but it took a while longer the second time.
Great.
I wonder if you can get an RR recording...
Started it, but looks like it's too slow.
(gdb) p MyProcPid
$1 = 2172500
(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false
There was another autovacuum of pg_database 10s before:
2022-03-30 20:35:17.622 PDT [2165344][autovacuum worker][5/3:0][] LOG: automatic vacuum of table "postgres.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 3 remain, 3 scanned (100.00% of total)
tuples: 61 removed, 4 remain, 1 are dead but not yet removable
removable cutoff: 1921, older by 3 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
index scan needed: 3 pages from table (100.00% of total) had 599 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 4 in total, 0 newly deleted, 0 currently deleted, 0 reusable
I/O timings: read: 0.029 ms, write: 0.034 ms
avg read rate: 134.120 MB/s, avg write rate: 89.413 MB/s
buffer usage: 35 hits, 12 misses, 8 dirtied
WAL usage: 12 records, 5 full page images, 27218 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
The dying backend:
2022-03-30 20:35:27.668 PDT [2172500][autovacuum worker][7/0:0][] DEBUG: autovacuum: processing database "contrib_regression_hstore"
...
2022-03-30 20:35:27.690 PDT [2172500][autovacuum worker][7/674:0][] CONTEXT: while cleaning up index "pg_database_oid_index" of relation "pg_catalog.pg_database"
Greetings,
Andres Freund
On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:
(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false
Does this OldestXmin seem reasonable at this point in execution, based
on context? Does it look too high? Something else?
--
Peter Geoghegan
Hi,
On 2022-03-30 21:04:07 -0700, Andres Freund wrote:
On 2022-03-30 20:35:25 -0700, Peter Geoghegan wrote:
On Wed, Mar 30, 2022 at 8:28 PM Andres Freund <andres@anarazel.de> wrote:
I triggered it twice now, but it took a while longer the second time.
Great.
I wonder if you can get an RR recording...
Started it, but looks like it's too slow.
(gdb) p MyProcPid
$1 = 2172500
(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false
I added a bunch of debug elogs to see what sets *frozenxid_updated to true.
(gdb) p *vacrel
$1 = {rel = 0x7fe24f3e0148, indrels = 0x7fe255c17ef8, nindexes = 2, aggressive = false, skipwithvm = true, failsafe_active = false,
consider_bypass_optimization = true, do_index_vacuuming = true, do_index_cleanup = true, do_rel_truncate = true, bstrategy = 0x7fe255bb0e28, pvs = 0x0,
relfrozenxid = 717, relminmxid = 6, old_live_tuples = 42, OldestXmin = 20751, vistest = 0x7fe255058970 <GlobalVisSharedRels>, FreezeLimit = 4244988047,
MultiXactCutoff = 4289967302, NewRelfrozenXid = 717, NewRelminMxid = 6, skippedallvis = false, relnamespace = 0x7fe255c17bf8 "pg_catalog",
relname = 0x7fe255c17cb8 "pg_database", indname = 0x0, blkno = 4294967295, offnum = 0, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, verbose = false,
dead_items = 0x7fe255c131d0, rel_pages = 8, scanned_pages = 8, removed_pages = 0, lpdead_item_pages = 0, missed_dead_pages = 0, nonempty_pages = 8,
new_rel_tuples = 124, new_live_tuples = 42, indstats = 0x7fe255c18320, num_index_scans = 0, tuples_deleted = 0, lpdead_items = 0, live_tuples = 42,
recently_dead_tuples = 82, missed_dead_tuples = 0}
But the debug elog reports that
relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6
The problem is that the crashing backend reads the relfrozenxid/relminmxid
from the shared relcache init file written by another backend:
2022-03-30 21:10:47.626 PDT [2625038][autovacuum worker][6/433:0][] LOG: automatic vacuum of table "contrib_regression_postgres_fdw.pg_catalog.pg_database": index scans: 1
pages: 0 removed, 8 remain, 8 scanned (100.00% of total)
tuples: 4 removed, 114 remain, 72 are dead but not yet removable
removable cutoff: 20751, older by 596 xids when operation ended
new relfrozenxid: 717, which is 3 xids ahead of previous value
new relminmxid: 6, which is 5 mxids ahead of previous value
index scan needed: 3 pages from table (37.50% of total) had 8 dead item identifiers removed
index "pg_database_datname_index": pages: 2 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "pg_database_oid_index": pages: 6 in total, 0 newly deleted, 2 currently deleted, 2 reusable
I/O timings: read: 0.050 ms, write: 0.102 ms
avg read rate: 209.860 MB/s, avg write rate: 76.313 MB/s
buffer usage: 42 hits, 22 misses, 8 dirtied
WAL usage: 13 records, 5 full page images, 33950 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
...
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][:0][] DEBUG: InitPostgres
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] DEBUG: my backend ID is 6
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/0:0][] LOG: reading shared init file
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] DEBUG: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/sub>
2022-03-30 21:10:47.772 PDT [2625043][autovacuum worker][6/443:0][] LOG: reading non-shared init file
This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
normally fairly harmless - I think.
Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?
Greetings,
Andres Freund
Hi,
On 2022-03-30 21:11:48 -0700, Peter Geoghegan wrote:
On Wed, Mar 30, 2022 at 9:04 PM Andres Freund <andres@anarazel.de> wrote:
(gdb) p vacrel->NewRelfrozenXid
$3 = 717
(gdb) p vacrel->relfrozenxid
$4 = 717
(gdb) p OldestXmin
$5 = 5112
(gdb) p aggressive
$6 = false
Does this OldestXmin seem reasonable at this point in execution, based
on context? Does it look too high? Something else?
Reasonable:
(gdb) p *ShmemVariableCache
$1 = {nextOid = 78969, oidCount = 2951, nextXid = {value = 21411}, oldestXid = 714, xidVacLimit = 200000714, xidWarnLimit = 2107484361,
xidStopLimit = 2144484361, xidWrapLimit = 2147484361, oldestXidDB = 1, oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = {value = 21408},
xactCompletionCount = 1635, oldestClogXid = 714}
I think the explanation I just sent explains the problem, without "in-memory"
confusion about what's running and what's not.
Greetings,
Andres Freund
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
But the debug elog reports that
relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6
The problem is that the crashing backend reads the relfrozenxid/relminmxid
from the shared relcache init file written by another backend:
We should have added logging of relfrozenxid and relminmxid a long time ago.
This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
normally fairly harmless - I think.
Is this one pretty old?
Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?
Not sure what you mean.
--
Peter Geoghegan
On Wed, Mar 30, 2022 at 9:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?
Not sure what you mean.
Wait, you mean use vacrel->relfrozenxid directly? Seems kind of ugly...
--
Peter Geoghegan
Hi,
On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
But the debug elog reports that
relfrozenxid updated 714 -> 717
relminmxid updated 1 -> 6
The problem is that the crashing backend reads the relfrozenxid/relminmxid
from the shared relcache init file written by another backend:
We should have added logging of relfrozenxid and relminmxid a long time ago.
At least at DEBUG1 or such.
This is basically the inverse of a54e1f15 - we read a *newer* horizon. That's
normally fairly harmless - I think.
Is this one pretty old?
What do you mean by "this one"? The cause for the assert failure?
I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
the horizon increasing a bunch, by falsely not using an aggressive vacuum when
we should have - might even be limited to a single autovacuum cycle.
Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?
Not sure what you mean.
Basically, instead of relying on the relcache, which for shared relations is
vulnerable to seeing "too new" horizons due to the shared relcache init file,
explicitly load relfrozenxid / relminmxid from the catalog / syscache.
I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
etc. Whereas right now we only fetch the pg_class row in
vac_update_relstats(), but use the relcache before.
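
Roughly like this, as an untested sketch (the syscache id would be RELOID;
assumes it runs early in heap_vacuum_rel() with vacuumlazy.c's existing
includes, and the error handling is only illustrative):

    HeapTuple   classTup;
    Form_pg_class classForm;

    /*
     * Read relfrozenxid/relminmxid from the pg_class row itself, not from
     * rel->rd_rel, which for a shared relation may have been populated from
     * a shared relcache init file written by another backend.
     */
    classTup = SearchSysCache1(RELOID,
                               ObjectIdGetDatum(RelationGetRelid(rel)));
    if (!HeapTupleIsValid(classTup))
        elog(ERROR, "cache lookup failed for relation %u",
             RelationGetRelid(rel));
    classForm = (Form_pg_class) GETSTRUCT(classTup);
    vacrel->relfrozenxid = classForm->relfrozenxid;
    vacrel->relminmxid = classForm->relminmxid;
    ReleaseSysCache(classTup);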
Greetings,
Andres Freund
Hi,
On 2022-03-30 21:59:15 -0700, Andres Freund wrote:
On 2022-03-30 21:29:16 -0700, Peter Geoghegan wrote:
On Wed, Mar 30, 2022 at 9:20 PM Andres Freund <andres@anarazel.de> wrote:
Perhaps we should just fetch the horizons from the "local" catalog for shared
rels?
Not sure what you mean.
Basically, instead of relying on the relcache, which for shared relations is
vulnerable to seeing "too new" horizons due to the shared relcache init file,
explicitly load relfrozenxid / relminmxid from the catalog / syscache.
I.e. fetch the relevant pg_class row in heap_vacuum_rel() (using
SearchSysCache[Copy1](RELID)). And use that to set vacrel->relfrozenxid
etc. Whereas right now we only fetch the pg_class row in
vac_update_relstats(), but use the relcache before.
Perhaps we should explicitly mask out parts of relcache entries in the shared
init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to
Invalid* or such.
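[A hedged sketch of the "mask out" idea -- presumably somewhere in
relcache.c's shared-init-file path; the exact location and the relform
variable are assumptions:]

    /* Hypothetical: invalidate horizons that the shared init file can't
     * be trusted to carry, forcing readers back to the catalog for them. */
    Form_pg_class relform = reldesc->rd_rel;

    relform->relfrozenxid = InvalidTransactionId;
    relform->relminmxid = InvalidMultiXactId;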
I even wonder if we should just generally move those out of the fields we have
in the relcache, not just for shared rels loaded from the init
file. Presumably by just moving them into the CATALOG_VARLEN ifdef.
The only place that appears to access rd_rel->relfrozenxid outside of DDL is
heap_abort_speculative().
Greetings,
Andres Freund
On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
Perhaps we should explicitly mask out parts of relcache entries in the shared
init file that we know to be unreliable. I.e. set relfrozenxid, relminmxid to
Invalid* or such.
That has the advantage of being more honest. If you're going to break
the abstraction, then it seems best to break it in an obvious way,
that leaves no doubts about what you're supposed to be relying on.
This bug doesn't seem like the kind of thing that should be left
as-is. If only because it makes it hard to add something like a
WARNING when we make relfrozenxid go backwards (on the basis of the
existing value apparently being in the future), which we really should
have been doing all along.
The whole reason why we overwrite pg_class.relfrozenxid values from
the future is to ameliorate the effects of more serious bugs like the
pg_upgrade/pg_resetwal one fixed in commit 74cf7d46 not so long ago
(mid last year). We had essentially the same pg_upgrade "from the
future" bug twice (once for relminmxid in the MultiXact bug era,
another more recent version affecting relfrozenxid).
The only place that appears to access rd_rel->relfrozenxid outside of DDL is
heap_abort_speculative().
I wonder how necessary that really is. Even if the XID is before
relfrozenxid, does that in itself really make it "in the future"?
Obviously it's often necessary to make the assumption that allowing
wraparound amounts to allowing XIDs "from the future" to exist, which
is dangerous. But why here? Won't pruning by VACUUM eventually correct
the issue anyway?
--
Peter Geoghegan
Hi,
On 2022-03-31 09:58:18 -0700, Peter Geoghegan wrote:
On Thu, Mar 31, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
The only place that appears to access rd_rel->relfrozenxid outside of DDL is
heap_abort_speculative().
I wonder how necessary that really is. Even if the XID is before
relfrozenxid, does that in itself really make it "in the future"?
Obviously it's often necessary to make the assumption that allowing
wraparound amounts to allowing XIDs "from the future" to exist, which
is dangerous. But why here? Won't pruning by VACUUM eventually correct
the issue anyway?
I don't think we should weaken defenses against xids from before relfrozenxid
in vacuum / amcheck / .... If anything we should strengthen them.
Isn't it also just plainly required for correctness? We'd not necessarily
trigger a vacuum in time to remove the xid before approaching wraparound if we
put in an xid before relfrozenxid? That happening in prune_xid is obviously
less bad than on actual data, but still.
ISTM we should just use our own xid. Yes, it might delay cleanup a bit
longer. But unless there's already crud on the page (with prune_xid already
set), the abort of the speculative insertion isn't likely to make the
difference?
Greetings,
Andres Freund
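[To make the "use our own xid" idea concrete, a hedged one-line sketch of
how heap_abort_speculative()'s prune hint might be set instead; the existing
code consults rd_rel->relfrozenxid, and this replacement is only an
illustration of the suggestion above, not a tested change:]

    /*
     * Hypothetical alternative: advertise our own XID as the prune hint,
     * rather than clamping TransactionXmin against rd_rel->relfrozenxid.
     * Our XID can't be older than relfrozenxid; pruning just happens a
     * little later than it otherwise might.
     */
    PageSetPrunable(page, GetCurrentTransactionId());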
On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote:
I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
the horizon increasing a bunch, by falsely not using an aggressive vacuum when
we should have - might even be limited to a single autovacuum cycle.
So, to be clear: vac_update_relstats() never actually considered the
new relfrozenxid value from its vacuumlazy.c caller to be "in the
future"? It just looked that way to the failing assertion in
vacuumlazy.c, because its own version of the original relfrozenxid was
stale from the beginning? And so the worst problem is probably just
that we don't use aggressive VACUUM when we really should in rare
cases?
--
Peter Geoghegan
On Thu, Mar 31, 2022 at 10:11 AM Andres Freund <andres@anarazel.de> wrote:
I don't think we should weaken defenses against xids from before relfrozenxid
in vacuum / amcheck / .... If anything we should strengthen them.
Isn't it also just plainly required for correctness? We'd not necessarily
trigger a vacuum in time to remove the xid before approaching wraparound if we
put in an xid before relfrozenxid? That happening in prune_xid is obviously
less bad than on actual data, but still.
Yeah, you're right. Ambiguity about stuff like this should be avoided
on general principle.
ISTM we should just use our own xid. Yes, it might delay cleanup a bit
longer. But unless there's already crud on the page (with prune_xid already
set), the abort of the speculative insertion isn't likely to make the
difference?
Speculative insertion abort is pretty rare in the real world, I bet.
The speculative insertion precheck is very likely to work almost
always with real workloads.
--
Peter Geoghegan
Hi,
On 2022-03-31 10:12:49 -0700, Peter Geoghegan wrote:
On Wed, Mar 30, 2022 at 9:59 PM Andres Freund <andres@anarazel.de> wrote:
I'm not sure there's a proper bug on HEAD here. I think at worst it can delay
the horizon increasing a bunch, by falsely not using an aggressive vacuum when
we should have - might even be limited to a single autovacuum cycle.
So, to be clear: vac_update_relstats() never actually considered the
new relfrozenxid value from its vacuumlazy.c caller to be "in the
future"?
No, I added separate debug messages for those, and also applied your patch,
and it didn't trigger.
I don't immediately see how we could end up computing a frozenxid value that
would be problematic? The pgcform->relfrozenxid value will always be the
"local" value, which afaics can be behind the other database's value (and thus
behind the value from the relcache init file). But it can't be ahead; we have
the proper invalidations for that (I think).
I do think we should apply a version of the warnings you have (with a WARNING
instead of PANIC obviously). I think it's bordering on insanity that we have
so many paths to just silently fix stuff up around vacuum. It's like we want
things to be undebuggable, and to give users no warnings about something being
up.
It just looked that way to the failing assertion in
vacuumlazy.c, because its own version of the original relfrozenxid was
stale from the beginning? And so the worst problem is probably just
that we don't use aggressive VACUUM when we really should in rare
cases?
Yes, I think that's right.
Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5
and fsync=off made the crash trigger more quickly.
Greetings,
Andres Freund
On Thu, Mar 31, 2022 at 10:50 AM Andres Freund <andres@anarazel.de> wrote:
So, to be clear: vac_update_relstats() never actually considered the
new relfrozenxid value from its vacuumlazy.c caller to be "in the
future"?No, I added separate debug messages for those, and also applied your patch,
and it didn't trigger.
The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)". Plus
the other related assert I mentioned did not trigger. So when this
"diff" assert did trigger, the value of "diff" must have been 0 (not a
negative value). While this state does technically indicate that the
"existing" relfrozenxid value (actually a stale version) appears to be
"in the future" (because the OldestXmin XID might still never have
been allocated), it won't ever be in the future according to
vac_update_relstats() (even if it used that version).
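[For readers following along, roughly what that assertion amounts to -- a
reconstruction only, since the real check lived in a WIP version of the
patch and the exact expression is an assumption:]

    int32       diff;

    /* Hypothetical form of the debugging assertion being discussed */
    diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
    Assert(diff > 0);       /* diff == 0 trips this; diff < 0 was never seen */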
I suppose that I might be wrong about that, somehow -- anything is
possible. The important point is that there is currently no evidence
that this bug (or any very recent bug) could ever allow
vac_update_relstats() to actually believe that it needs to update
relfrozenxid/relminmxid, purely because the existing value is in the
future.
The fact that vac_update_relstats() doesn't log/warn when this happens
is very unfortunate, but there is nevertheless no evidence that that
would have informed us of any bug on HEAD, even including the actual
bug here, which is a bug in vacuumlazy.c (not in vac_update_relstats).
I do think we should apply a version of the warnings you have (with a WARNING
instead of PANIC obviously). I think it's bordering on insanity that we have
so many paths to just silently fix stuff up around vacuum. It's like we want
things to be undebuggable, and to give users no warnings about something being
up.
Yeah, it's just totally self-defeating to not at least log it. I mean
this is a code path that is only hit once per VACUUM, so there is
practically no risk of that causing any new problems.
Can you repro the issue with my recipe? FWIW, adding log_min_messages=debug5
and fsync=off made the crash trigger more quickly.
I'll try to do that today. I'm not feeling the most energetic right
now, to be honest.
--
Peter Geoghegan
On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".
Attached is v15. I plan to commit the first two patches (the most
substantial two patches by far) in the next couple of days, barring
objections.
v15 removes this "Assert(diff > 0)" assertion from 0001. It's not
adding any value, now that the underlying issue that it accidentally
brought to light is well understood (there are still more robust
assertions to the relfrozenxid/relminmxid invariants). "Assert(diff >
0)" is liable to fail until the underlying bug on HEAD is fixed, which
can be treated as separate work.
I also refined the WARNING patch in v15. It now actually issues
WARNINGs (rather than PANICs, which were just a temporary debugging
measure in v14). Also fixed a compiler warning in this patch, based on
a complaint from CFBot's CompilerWarnings task. I can delay committing
this WARNING patch until right before feature freeze. Seems best to
give others more opportunity for comments.
--
Peter Geoghegan
Attachments:
v15-0003-Have-VACUUM-warn-on-relfrozenxid-from-the-future.patch (application/octet-stream)
From 11c535c5a789778355d3205cb2effe4fe8c10a8b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 30 Mar 2022 18:59:41 -0700
Subject: [PATCH v15 3/4] Have VACUUM warn on relfrozenxid "from the future".
Commits 74cf7d46 and a61daa14 fixed pg_upgrade bugs where a table's
relfrozenxid or relminmxid value wasn't initialized to something
reasonable, or carried forward correctly.
Problems like these were ameliorated by commit 78db307bb2, which taught
VACUUM to always overwrite existing invalid relfrozenxid or relminmxid
values that are apparently "from the future". Extend that work by
adding WARNINGs about invalid preexisting relfrozenxid and relminmxid
values "from the future" when VACUUM encounters any in pg_class.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzmRZEzeGvLv8yDW0AbFmSvJjTziORqjVUrf74mL4GL0Ww@mail.gmail.com
---
src/backend/commands/vacuum.c | 70 ++++++++++++++++++++++++++---------
1 file changed, 52 insertions(+), 18 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index deec4887b..fb33953e3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1340,7 +1340,11 @@ vac_update_relstats(Relation relation,
Relation rd;
HeapTuple ctup;
Form_pg_class pgcform;
- bool dirty;
+ bool dirty,
+ futurexid,
+ futuremxid;
+ TransactionId oldfrozenxid;
+ MultiXactId oldminmulti;
rd = table_open(RelationRelationId, RowExclusiveLock);
@@ -1406,32 +1410,49 @@ vac_update_relstats(Relation relation,
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
+ oldfrozenxid = pgcform->relfrozenxid;
+ futurexid = false;
if (frozenxid_updated)
*frozenxid_updated = false;
- if (TransactionIdIsNormal(frozenxid) &&
- pgcform->relfrozenxid != frozenxid &&
- (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
- TransactionIdPrecedes(ReadNextTransactionId(),
- pgcform->relfrozenxid)))
+ if (TransactionIdIsNormal(frozenxid) && oldfrozenxid != frozenxid)
{
- if (frozenxid_updated)
- *frozenxid_updated = true;
- pgcform->relfrozenxid = frozenxid;
- dirty = true;
+ bool update = false;
+
+ if (TransactionIdPrecedes(oldfrozenxid, frozenxid))
+ update = true;
+ else if (TransactionIdPrecedes(ReadNextTransactionId(), oldfrozenxid))
+ futurexid = update = true;
+
+ if (update)
+ {
+ pgcform->relfrozenxid = frozenxid;
+ dirty = true;
+ if (frozenxid_updated)
+ *frozenxid_updated = true;
+ }
}
/* Similarly for relminmxid */
+ oldminmulti = pgcform->relminmxid;
+ futuremxid = false;
if (minmulti_updated)
*minmulti_updated = false;
- if (MultiXactIdIsValid(minmulti) &&
- pgcform->relminmxid != minmulti &&
- (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
- MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
+ if (MultiXactIdIsValid(minmulti) && oldminmulti != minmulti)
{
- if (minmulti_updated)
- *minmulti_updated = true;
- pgcform->relminmxid = minmulti;
- dirty = true;
+ bool update = false;
+
+ if (MultiXactIdPrecedes(oldminmulti, minmulti))
+ update = true;
+ else if (MultiXactIdPrecedes(ReadNextMultiXactId(), oldminmulti))
+ futuremxid = update = true;
+
+ if (update)
+ {
+ pgcform->relminmxid = minmulti;
+ dirty = true;
+ if (minmulti_updated)
+ *minmulti_updated = true;
+ }
}
/* If anything changed, write out the tuple. */
@@ -1439,6 +1460,19 @@ vac_update_relstats(Relation relation,
heap_inplace_update(rd, ctup);
table_close(rd, RowExclusiveLock);
+
+ if (futurexid)
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("overwrote invalid relfrozenxid value %u with new value %u for table \"%s\"",
+ oldfrozenxid, frozenxid,
+ RelationGetRelationName(relation))));
+ if (futuremxid)
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg_internal("overwrote invalid relminmxid value %u with new value %u for table \"%s\"",
+ oldminmulti, minmulti,
+ RelationGetRelationName(relation))));
}
--
2.32.0
v15-0004-vacuumlazy.c-Move-resource-allocation-to-heap_va.patch (application/octet-stream)
From ff13b46f159e69edd747d974d049f2e25c87e546 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 25 Mar 2022 12:51:05 -0700
Subject: [PATCH v15 4/4] vacuumlazy.c: Move resource allocation to
heap_vacuum_rel().
Finish off work started by commit 73f6ec3d: move remaining resource
allocation and deallocation code from lazy_scan_heap() to its caller,
heap_vacuum_rel().
Also remove unnecessary progress report calls for the last block number.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk3fNBa_S3Ngi+16GQiyJ=AmUu3oUY99syMDTMRxitfyQ@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 74 +++++++++++-----------------
1 file changed, 28 insertions(+), 46 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 7717f35ca..37bdb58e0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -246,7 +246,7 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
-static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static void lazy_scan_heap(LVRelState *vacrel);
static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
BlockNumber next_block,
bool *next_unskippable_allvis,
@@ -519,11 +519,28 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->NewRelminMxid = OldestMxact;
vacrel->skippedallvis = false;
+ /*
+ * Allocate dead_items array memory using dead_items_alloc. This handles
+ * parallel VACUUM initialization as part of allocating shared memory
+ * space used for dead_items. (But do a failsafe precheck first, to
+ * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
+ * is already dangerously old.)
+ */
+ lazy_check_wraparound_failsafe(vacrel);
+ dead_items_alloc(vacrel, params->nworkers);
+
/*
* Call lazy_scan_heap to perform all required heap pruning, index
* vacuuming, and heap vacuuming (plus related processing)
*/
- lazy_scan_heap(vacrel, params->nworkers);
+ lazy_scan_heap(vacrel);
+
+ /*
+ * Free resources managed by dead_items_alloc. This ends parallel mode in
+ * passing when necessary.
+ */
+ dead_items_cleanup(vacrel);
+ Assert(!IsInParallelMode());
/*
* Update pg_class entries for each of rel's indexes where appropriate.
@@ -827,14 +844,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* supply.
*/
static void
-lazy_scan_heap(LVRelState *vacrel, int nworkers)
+lazy_scan_heap(LVRelState *vacrel)
{
- VacDeadItems *dead_items;
BlockNumber rel_pages = vacrel->rel_pages,
blkno,
next_unskippable_block,
- next_failsafe_block,
- next_fsm_block_to_vacuum;
+ next_failsafe_block = 0,
+ next_fsm_block_to_vacuum = 0;
+ VacDeadItems *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -845,23 +862,6 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
};
int64 initprog_val[3];
- /*
- * Do failsafe precheck before calling dead_items_alloc. This ensures
- * that parallel VACUUM won't be attempted when relfrozenxid is already
- * dangerously old.
- */
- lazy_check_wraparound_failsafe(vacrel);
- next_failsafe_block = 0;
-
- /*
- * Allocate the space for dead_items. Note that this handles parallel
- * VACUUM initialization as part of allocating shared memory space used
- * for dead_items.
- */
- dead_items_alloc(vacrel, nworkers);
- dead_items = vacrel->dead_items;
- next_fsm_block_to_vacuum = 0;
-
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
@@ -1238,11 +1238,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
}
- /* report that everything is now scanned */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- /* Clear the block number information */
vacrel->blkno = InvalidBlockNumber;
+ if (BufferIsValid(vmbuffer))
+ ReleaseBuffer(vmbuffer);
/* now we can compute the new value for pg_class.reltuples */
vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1258,15 +1256,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
vacrel->missed_dead_tuples;
/*
- * Release any remaining pin on visibility map page.
+ * Do index vacuuming (call each index's ambulkdelete routine), then do
+ * related heap vacuuming
*/
- if (BufferIsValid(vmbuffer))
- {
- ReleaseBuffer(vmbuffer);
- vmbuffer = InvalidBuffer;
- }
-
- /* Perform a final round of index and heap vacuuming */
if (dead_items->num_items > 0)
lazy_vacuum(vacrel);
@@ -1277,19 +1269,9 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
if (blkno > next_fsm_block_to_vacuum)
FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
- /* report all blocks vacuumed */
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
-
- /* Do post-vacuum cleanup */
+ /* Do final index cleanup (call each index's amvacuumcleanup routine) */
if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
lazy_cleanup_all_indexes(vacrel);
-
- /*
- * Free resources managed by dead_items_alloc. This ends parallel mode in
- * passing when necessary.
- */
- dead_items_cleanup(vacrel);
- Assert(!IsInParallelMode());
}
/*
--
2.32.0
v15-0002-Generalize-how-VACUUM-skips-all-frozen-pages.patch (application/octet-stream)
From 728c2acd3c5b497aaf17990eb2e111089c30dfcc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v15 2/4] Generalize how VACUUM skips all-frozen pages.
Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid before now. The
underlying issue was that lazy_scan_heap conditioned its skipping
behavior on whether or not the current VACUUM was aggressive. VACUUM
could fail to increment its frozenskipped_pages counter as a result, and
so could miss out on advancing relfrozenxid (in the non-aggressive case)
for no good reason.
The issue only comes up when concurrent activity might unset a page's
visibility map bit at exactly the wrong time. The non-aggressive case
rechecked the visibility map at the point of skipping each page before
now. This created a window for some other session to concurrently unset
the same heap page's bit in the visibility map. If the bit was unset at
the wrong time, it would cause VACUUM to conservatively conclude that
the page was _never_ all-frozen on recheck. frozenskipped_pages would
not be incremented for the page as a result. lazy_scan_heap had already
committed to skipping the page/range at that point, though -- which made
it unsafe to advance relfrozenxid/relminmxid later on.
Consistently avoid the issue by generalizing how we skip frozen pages
during aggressive VACUUMs: take the same approach when skipping any
skippable page range during aggressive and non-aggressive VACUUMs alike.
The new approach makes ranges (not individual pages) the fundamental
unit of skipping using the visibility map. frozenskipped_pages is
replaced with a boolean flag that represents whether some skippable
range with one or more all-visible pages was actually skipped (making
relfrozenxid unsafe to update).
It is safe for VACUUM to treat a page as all-frozen provided it at least
had its all-frozen bit set after the OldestXmin cutoff was established.
VACUUM is only required to scan pages that might have XIDs < OldestXmin
that are not yet frozen to be able to safely advance relfrozenxid.
Tuples concurrently inserted on skipped pages are equivalent to tuples
concurrently inserted on a block >= rel_pages from the same table.
It's possible that the issue this commit fixes hardly ever came up in
practice. But we only had to be unlucky once to lose out on advancing
relfrozenxid -- a single affected heap page was enough to throw VACUUM
off. That seems like something to avoid on general principle. This is
similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c
to not give up on non-aggressive relfrozenxid advancement just because a
cleanup lock wasn't immediately available on some heap page.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
Discussion: https://postgr.es/m/CA+TgmobhuzSR442_cfpgxidmiRdL-GdaFSc8SD=GJcpLTx_BAw@mail.gmail.com
---
src/backend/access/heap/vacuumlazy.c | 311 +++++++++++++--------------
1 file changed, 147 insertions(+), 164 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3b9f3b6af..7717f35ca 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -176,6 +176,7 @@ typedef struct LVRelState
/* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
+ bool skippedallvis;
/* Error reporting state */
char *relnamespace;
@@ -196,7 +197,6 @@ typedef struct LVRelState
VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
- BlockNumber frozenskipped_pages; /* # frozen pages skipped via VM */
BlockNumber removed_pages; /* # pages removed by relation truncation */
BlockNumber lpdead_item_pages; /* # pages with LP_DEAD items */
BlockNumber missed_dead_pages; /* # pages with missed dead tuples */
@@ -247,6 +247,10 @@ typedef struct LVSavedErrInfo
/* non-export function prototypes */
static void lazy_scan_heap(LVRelState *vacrel, int nworkers);
+static BlockNumber lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer,
+ BlockNumber next_block,
+ bool *next_unskippable_allvis,
+ bool *skipping_current_range);
static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
BlockNumber blkno, Page page,
bool sharelock, Buffer vmbuffer);
@@ -467,7 +471,6 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize page counters explicitly (be tidy) */
vacrel->scanned_pages = 0;
- vacrel->frozenskipped_pages = 0;
vacrel->removed_pages = 0;
vacrel->lpdead_item_pages = 0;
vacrel->missed_dead_pages = 0;
@@ -514,6 +517,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/* Initialize state used to track oldest extant XID/XMID */
vacrel->NewRelfrozenXid = OldestXmin;
vacrel->NewRelminMxid = OldestMxact;
+ vacrel->skippedallvis = false;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -559,11 +563,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
vacrel->relminmxid,
vacrel->NewRelminMxid));
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ if (vacrel->skippedallvis)
{
/*
* Must keep original relfrozenxid in a non-aggressive VACUUM that
- * had to skip an all-visible page. The state that tracks new
+ * chose to skip an all-visible page range. The state that tracks new
* values will have missed unfrozen XIDs from the pages we skipped.
* (Even if we knew the true oldest XID it likely wouldn't help us,
* since it'll usually be very close to rel's original relfrozenxid.)
@@ -832,7 +836,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
next_failsafe_block,
next_fsm_block_to_vacuum;
Buffer vmbuffer = InvalidBuffer;
- bool skipping_blocks;
+ bool next_unskippable_allvis,
+ skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
@@ -863,179 +868,52 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
initprog_val[2] = dead_items->max_items;
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
- /*
- * Set things up for skipping blocks using visibility map.
- *
- * Except when vacrel->aggressive is set, we want to skip pages that are
- * all-visible according to the visibility map, but only when we can skip
- * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading
- * sequentially, the OS should be doing readahead for us, so there's no
- * gain in skipping a page now and then; that's likely to disable
- * readahead and so be counterproductive. Also, skipping even a single
- * page means that we can't update relfrozenxid, so we only want to do it
- * if we can skip a goodly number of pages.
- *
- * When vacrel->aggressive is set, we can't skip pages just because they
- * are all-visible, but we can still skip pages that are all-frozen, since
- * such pages do not need freezing and do not affect the value that we can
- * safely set for relfrozenxid or relminmxid.
- *
- * Before entering the main loop, establish the invariant that
- * next_unskippable_block is the next block number >= blkno that we can't
- * skip based on the visibility map, either all-visible for a regular scan
- * or all-frozen for an aggressive scan. We set it to rel_pages when
- * there's no such block. We also set up the skipping_blocks flag
- * correctly at this stage.
- *
- * Note: The value returned by visibilitymap_get_status could be slightly
- * out-of-date, since we make this test before reading the corresponding
- * heap page or locking the buffer. This is OK. If we mistakenly think
- * that the page is all-visible or all-frozen when in fact the flag's just
- * been cleared, we might fail to vacuum the page. It's easy to see that
- * skipping a page when aggressive is not set is not a very big deal; we
- * might leave some dead tuples lying around, but the next vacuum will
- * find them. But even when aggressive *is* set, it's still OK if we miss
- * a page whose all-frozen marking has just been cleared. Any new XIDs
- * just added to that page are necessarily >= vacrel->OldestXmin, and so
- * they'll have no effect on the value to which we can safely set
- * relfrozenxid. A similar argument applies for MXIDs and relminmxid.
- */
- next_unskippable_block = 0;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmstatus;
-
- vmstatus = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
- if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
-
+ /* Set up an initial range of skippable blocks using the visibility map */
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
+ &next_unskippable_allvis,
+ &skipping_current_range);
for (blkno = 0; blkno < rel_pages; blkno++)
{
Buffer buf;
Page page;
- bool all_visible_according_to_vm = false;
+ bool all_visible_according_to_vm;
LVPagePruneState prunestate;
- pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
-
- update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
- blkno, InvalidOffsetNumber);
-
if (blkno == next_unskippable_block)
{
- /* Time to advance next_unskippable_block */
- next_unskippable_block++;
- if (vacrel->skipwithvm)
- {
- while (next_unskippable_block < rel_pages)
- {
- uint8 vmskipflags;
-
- vmskipflags = visibilitymap_get_status(vacrel->rel,
- next_unskippable_block,
- &vmbuffer);
- if (vacrel->aggressive)
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0)
- break;
- }
- else
- {
- if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
- break;
- }
- vacuum_delay_point();
- next_unskippable_block++;
- }
- }
-
/*
- * We know we can't skip the current block. But set up
- * skipping_blocks to do the right thing at the following blocks.
+ * Can't skip this page safely. Must scan the page. But
+ * determine the next skippable range after the page first.
*/
- if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
- skipping_blocks = true;
- else
- skipping_blocks = false;
+ all_visible_according_to_vm = next_unskippable_allvis;
+ next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
+ blkno + 1,
+ &next_unskippable_allvis,
+ &skipping_current_range);
- /*
- * Normally, the fact that we can't skip this block must mean that
- * it's not all-visible. But in an aggressive vacuum we know only
- * that it's not all-frozen, so it might still be all-visible.
- */
- if (vacrel->aggressive &&
- VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
- all_visible_according_to_vm = true;
+ Assert(next_unskippable_block >= blkno + 1);
}
else
{
- /*
- * The current page can be skipped if we've seen a long enough run
- * of skippable blocks to justify skipping it -- provided it's not
- * the last page in the relation (according to rel_pages).
- *
- * We always scan the table's last page to determine whether it
- * has tuples or not, even if it would otherwise be skipped. This
- * avoids having lazy_truncate_heap() take access-exclusive lock
- * on the table to attempt a truncation that just fails
- * immediately because there are tuples on the last page.
- */
- if (skipping_blocks && blkno < rel_pages - 1)
- {
- /*
- * Tricky, tricky. If this is in aggressive vacuum, the page
- * must have been all-frozen at the time we checked whether it
- * was skippable, but it might not be any more. We must be
- * careful to count it as a skipped all-frozen page in that
- * case, or else we'll think we can't update relfrozenxid and
- * relminmxid. If it's not an aggressive vacuum, we don't
- * know whether it was initially all-frozen, so we have to
- * recheck.
- */
- if (vacrel->aggressive ||
- VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
- vacrel->frozenskipped_pages++;
- continue;
- }
+ /* Last page always scanned (may need to set nonempty_pages) */
+ Assert(blkno < rel_pages - 1);
- /*
- * SKIP_PAGES_THRESHOLD (threshold for skipping) was not
- * crossed, or this is the last page. Scan the page, even
- * though it's all-visible (and possibly even all-frozen).
- */
+ if (skipping_current_range)
+ continue;
+
+ /* Current range is too small to skip -- just scan the page */
all_visible_according_to_vm = true;
}
- vacuum_delay_point();
-
- /*
- * We're not skipping this page using the visibility map, and so it is
- * (by definition) a scanned page. Any tuples from this page are now
- * guaranteed to be counted below, after some preparatory checks.
- */
vacrel->scanned_pages++;
+ /* Report as block scanned, update error traceback information */
+ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+ update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+ blkno, InvalidOffsetNumber);
+
+ vacuum_delay_point();
+
/*
* Regularly check if wraparound failsafe should trigger.
*
@@ -1235,8 +1113,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
}
/*
- * Handle setting visibility map bit based on what the VM said about
- * the page before pruning started, and using prunestate
+ * Handle setting visibility map bit based on information from the VM
+ * (as of last lazy_scan_skip() call), and from prunestate
*/
if (!all_visible_according_to_vm && prunestate.all_visible)
{
@@ -1268,9 +1146,8 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* As of PostgreSQL 9.2, the visibility map bit should never be set if
* the page-level bit is clear. However, it's possible that the bit
- * got cleared after we checked it and before we took the buffer
- * content lock, so we must recheck before jumping to the conclusion
- * that something bad has happened.
+ * got cleared after lazy_scan_skip() was called, so we must recheck
+ * with buffer lock before concluding that the VM is corrupt.
*/
else if (all_visible_according_to_vm && !PageIsAllVisible(page)
&& VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
@@ -1309,7 +1186,7 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
/*
* If the all-visible page is all-frozen but not marked as such yet,
* mark it as all-frozen. Note that all_frozen is only valid if
- * all_visible is true, so we must check both.
+ * all_visible is true, so we must check both prunestate fields.
*/
else if (all_visible_according_to_vm && prunestate.all_visible &&
prunestate.all_frozen &&
@@ -1415,6 +1292,112 @@ lazy_scan_heap(LVRelState *vacrel, int nworkers)
Assert(!IsInParallelMode());
}
+/*
+ * lazy_scan_skip() -- set up range of skippable blocks using visibility map.
+ *
+ * lazy_scan_heap() calls here every time it needs to set up a new range of
+ * blocks to skip via the visibility map. Caller passes the next block in
+ * line. We return a next_unskippable_block for this range. When there are
+ * no skippable blocks we just return caller's next_block. The all-visible
+ * status of the returned block is set in *next_unskippable_allvis for caller,
+ * too. Block usually won't be all-visible (since it's unskippable), but it
+ * can be during aggressive VACUUMs (as well as in certain edge cases).
+ *
+ * Sets *skipping_current_range to indicate if caller should skip this range.
+ * Costs and benefits drive our decision. Very small ranges won't be skipped.
+ *
+ * Note: our opinion of which blocks can be skipped can go stale immediately.
+ * It's okay if caller "misses" a page whose all-visible or all-frozen marking
+ * was concurrently cleared, though. All that matters is that caller scan all
+ * pages whose tuples might contain XIDs < OldestXmin, or XMIDs < OldestMxact.
+ * (Actually, non-aggressive VACUUMs can choose to skip all-visible pages with
+ * older XIDs/MXIDs. The vacrel->skippedallvis flag will be set here when the
+ * choice to skip such a range is actually made, making everything safe.)
+ */
+static BlockNumber
+lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
+ bool *next_unskippable_allvis, bool *skipping_current_range)
+{
+ BlockNumber rel_pages = vacrel->rel_pages,
+ next_unskippable_block = next_block,
+ nskippable_blocks = 0;
+ bool skipsallvis = false;
+
+ *next_unskippable_allvis = true;
+ while (next_unskippable_block < rel_pages)
+ {
+ uint8 mapbits = visibilitymap_get_status(vacrel->rel,
+ next_unskippable_block,
+ vmbuffer);
+
+ if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
+ {
+ Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
+ *next_unskippable_allvis = false;
+ break;
+ }
+
+ /*
+ * Caller must scan the last page to determine whether it has tuples
+ * (caller must have the opportunity to set vacrel->nonempty_pages).
+ * This rule avoids having lazy_truncate_heap() take access-exclusive
+ * lock on rel to attempt a truncation that fails anyway, just because
+ * there are tuples on the last page (it is likely that there will be
+ * tuples on other nearby pages as well, but those can be skipped).
+ *
+ * Implement this by always treating the last block as unsafe to skip.
+ */
+ if (next_unskippable_block == rel_pages - 1)
+ break;
+
+ /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
+ if (!vacrel->skipwithvm)
+ break;
+
+ /*
+ * Aggressive VACUUM caller can't skip pages just because they are
+ * all-visible. They may still skip all-frozen pages, which can't
+ * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
+ */
+ if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
+ {
+ if (vacrel->aggressive)
+ break;
+
+ /*
+ * All-visible block is safe to skip in non-aggressive case. But
+ * remember that the final range contains such a block for later.
+ */
+ skipsallvis = true;
+ }
+
+ vacuum_delay_point();
+ next_unskippable_block++;
+ nskippable_blocks++;
+ }
+
+ /*
+ * We only skip a range with at least SKIP_PAGES_THRESHOLD consecutive
+ * pages. Since we're reading sequentially, the OS should be doing
+ * readahead for us, so there's no gain in skipping a page now and then.
+ * Skipping such a range might even discourage sequential detection.
+ *
+ * This test also enables more frequent relfrozenxid advancement during
+ * non-aggressive VACUUMs. If the range has any all-visible pages then
+ * skipping makes updating relfrozenxid unsafe, which is a real downside.
+ */
+ if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
+ *skipping_current_range = false;
+ else
+ {
+ *skipping_current_range = true;
+ if (skipsallvis)
+ vacrel->skippedallvis = true;
+ }
+
+ return next_unskippable_block;
+}
+
/*
* lazy_scan_new_or_empty() -- lazy_scan_heap() new/empty page handling.
*
--
2.32.0
v15-0001-Set-relfrozenxid-to-oldest-extant-XID-seen-by-VA.patch (application/octet-stream)
From b99f2c415f8385cbbc2f96344117113ca4f8b1d4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 11 Mar 2022 19:16:02 -0800
Subject: [PATCH v15 1/4] Set relfrozenxid to oldest extant XID seen by VACUUM.
When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive: the relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit.
VACUUM now sets relfrozenxid (and relminmxid) using the exact oldest
extant XID (and oldest extant MultiXactId) from the table, including
XIDs from the table's remaining/unfrozen MultiXacts. This requires that
VACUUM carefully track the oldest unfrozen XID/MultiXactId as it goes.
This optimization doesn't require any changes to the definition of
relfrozenxid, nor does it require changes to the core design of
freezing.
Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM -- FreezeLimit still acts as a lower bound on the final value
that aggressive VACUUM can set relfrozenxid to. Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself. In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.
Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com
---
src/include/access/heapam.h | 6 +-
src/include/access/heapam_xlog.h | 4 +-
src/include/commands/vacuum.h | 1 +
src/backend/access/heap/heapam.c | 332 +++++++++++++-----
src/backend/access/heap/vacuumlazy.c | 174 +++++----
src/backend/commands/cluster.c | 5 +-
src/backend/commands/vacuum.c | 39 +-
doc/src/sgml/maintenance.sgml | 30 +-
.../expected/vacuum-no-cleanup-lock.out | 189 ++++++++++
.../isolation/expected/vacuum-reltuples.out | 67 ----
src/test/isolation/isolation_schedule | 2 +-
.../specs/vacuum-no-cleanup-lock.spec | 150 ++++++++
.../isolation/specs/vacuum-reltuples.spec | 49 ---
13 files changed, 737 insertions(+), 311 deletions(-)
create mode 100644 src/test/isolation/expected/vacuum-no-cleanup-lock.out
delete mode 100644 src/test/isolation/expected/vacuum-reltuples.out
create mode 100644 src/test/isolation/specs/vacuum-no-cleanup-lock.spec
delete mode 100644 src/test/isolation/specs/vacuum-reltuples.spec
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b46ab7d73..4403f01e1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -167,8 +167,10 @@ extern void heap_inplace_update(Relation relation, HeapTuple tuple);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
-extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi);
+extern bool heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern bool heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple);
extern void simple_heap_insert(Relation relation, HeapTuple tup);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 5c47fdcec..2d8a7f627 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -410,7 +410,9 @@ extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId cutoff_xid,
TransactionId cutoff_multi,
xl_heap_freeze_tuple *frz,
- bool *totally_frozen);
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d64f6268f..ead88edda 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -291,6 +291,7 @@ extern bool vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff);
extern bool vacuum_xid_failsafe_check(TransactionId relfrozenxid,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 74ad445e5..1ee985f63 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6079,10 +6079,12 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* Determine what to do during freezing when a tuple is marked by a
* MultiXactId.
*
- * NB -- this might have the side-effect of creating a new MultiXactId!
- *
* "flags" is an output value; it's used to tell caller what to do on return.
- * Possible flags are:
+ *
+ * "mxid_oldest_xid_out" is an output value; it's used to track the oldest
+ * extant Xid within any Multixact that will remain after freezing executes.
+ *
+ * Possible values that we can set in "flags":
* FRM_NOOP
* don't do anything -- keep existing Xmax
* FRM_INVALIDATE_XMAX
@@ -6094,12 +6096,17 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* FRM_RETURN_IS_MULTI
* The return value is a new MultiXactId to set as new Xmax.
* (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ *
+ * "mxid_oldest_xid_out" is only set when "flags" contains either FRM_NOOP or
+ * FRM_RETURN_IS_MULTI, since we only leave behind a MultiXactId for these.
+ *
+ * NB: Creates a _new_ MultiXactId when FRM_RETURN_IS_MULTI is set in "flags".
*/
static TransactionId
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, MultiXactId cutoff_multi,
- uint16 *flags)
+ uint16 *flags, TransactionId *mxid_oldest_xid_out)
{
TransactionId xid = InvalidTransactionId;
int i;
@@ -6111,6 +6118,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
bool has_lockers;
TransactionId update_xid;
bool update_committed;
+ TransactionId temp_xid_out;
*flags = 0;
@@ -6147,7 +6155,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
{
*flags |= FRM_INVALIDATE_XMAX;
- xid = InvalidTransactionId; /* not strictly necessary */
+ xid = InvalidTransactionId;
}
else
{
@@ -6174,7 +6182,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("cannot freeze committed update xid %u", xid)));
*flags |= FRM_INVALIDATE_XMAX;
- xid = InvalidTransactionId; /* not strictly necessary */
+ xid = InvalidTransactionId;
}
else
{
@@ -6182,6 +6190,10 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
}
+ /*
+ * Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid, or
+ * when no Xids will remain
+ */
return xid;
}
@@ -6205,6 +6217,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
/* is there anything older than the cutoff? */
need_replace = false;
+ temp_xid_out = *mxid_oldest_xid_out; /* init for FRM_NOOP */
for (i = 0; i < nmembers; i++)
{
if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
@@ -6212,28 +6225,38 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
need_replace = true;
break;
}
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
}
/*
* In the simplest case, there is no member older than the cutoff; we can
- * keep the existing MultiXactId as is.
+ * keep the existing MultiXactId as-is, avoiding a more expensive second
+ * pass over the multi
*/
if (!need_replace)
{
+ /*
+ * When mxid_oldest_xid_out gets pushed back here it's likely that the
+ * update Xid was the oldest member, but we don't rely on that
+ */
*flags |= FRM_NOOP;
+ *mxid_oldest_xid_out = temp_xid_out;
pfree(members);
- return InvalidTransactionId;
+ return multi;
}
/*
- * If the multi needs to be updated, figure out which members do we need
- * to keep.
+ * Do a more thorough second pass over the multi to figure out which
+ * member XIDs actually need to be kept. Checking the precise status of
+ * individual members might even show that we don't need to keep anything.
*/
nnewmembers = 0;
newmembers = palloc(sizeof(MultiXactMember) * nmembers);
has_lockers = false;
update_xid = InvalidTransactionId;
update_committed = false;
+ temp_xid_out = *mxid_oldest_xid_out; /* init for FRM_RETURN_IS_MULTI */
for (i = 0; i < nmembers; i++)
{
@@ -6289,7 +6312,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
}
/*
- * Since the tuple wasn't marked HEAPTUPLE_DEAD by vacuum, the
+ * Since the tuple wasn't totally removed when vacuum pruned, the
* update Xid cannot possibly be older than the xid cutoff. The
* presence of such a tuple would cause corruption, so be paranoid
* and check.
@@ -6302,15 +6325,20 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
update_xid, cutoff_xid)));
/*
- * If we determined that it's an Xid corresponding to an update
- * that must be retained, additionally add it to the list of
- * members of the new Multi, in case we end up using that. (We
- * might still decide to use only an update Xid and not a multi,
- * but it's easier to maintain the list as we walk the old members
- * list.)
+ * We determined that this is an Xid corresponding to an update
+ * that must be retained -- add it to new members list for later.
+ *
+ * Also consider pushing back temp_xid_out, which is needed when
+ * we later conclude that a new multi is required (i.e. when we go
+ * on to set FRM_RETURN_IS_MULTI for our caller because we also
+ * need to retain a locker that's still running).
*/
if (TransactionIdIsValid(update_xid))
+ {
newmembers[nnewmembers++] = members[i];
+ if (TransactionIdPrecedes(members[i].xid, temp_xid_out))
+ temp_xid_out = members[i].xid;
+ }
}
else
{
@@ -6318,8 +6346,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
TransactionIdIsInProgress(members[i].xid))
{
- /* running locker cannot possibly be older than the cutoff */
+ /*
+ * Running locker cannot possibly be older than the cutoff.
+ *
+ * The cutoff is <= VACUUM's OldestXmin, which is also the
+ * initial value used for top-level relfrozenxid_out tracking
+ * state. A running locker cannot be older than VACUUM's
+ * OldestXmin, either, so we don't need a temp_xid_out step.
+ */
+ Assert(TransactionIdIsNormal(members[i].xid));
Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ Assert(!TransactionIdPrecedes(members[i].xid,
+ *mxid_oldest_xid_out));
newmembers[nnewmembers++] = members[i];
has_lockers = true;
}
@@ -6328,11 +6366,16 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
pfree(members);
+ /*
+ * Determine what to do with caller's multi based on information gathered
+ * during our second pass
+ */
if (nnewmembers == 0)
{
/* nothing worth keeping!? Tell caller to remove the whole thing */
*flags |= FRM_INVALIDATE_XMAX;
xid = InvalidTransactionId;
+ /* Don't push back mxid_oldest_xid_out -- no Xids will remain */
}
else if (TransactionIdIsValid(update_xid) && !has_lockers)
{
@@ -6348,15 +6391,18 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
if (update_committed)
*flags |= FRM_MARK_COMMITTED;
xid = update_xid;
+ /* Don't push back mxid_oldest_xid_out using FRM_RETURN_IS_XID Xid */
}
else
{
/*
* Create a new multixact with the surviving members of the previous
- * one, to set as new Xmax in the tuple.
+ * one, to set as new Xmax in the tuple. The oldest surviving member
+ * might push back mxid_oldest_xid_out.
*/
xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
*flags |= FRM_RETURN_IS_MULTI;
+ *mxid_oldest_xid_out = temp_xid_out;
}
pfree(newmembers);
@@ -6375,31 +6421,41 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
- * Caller is responsible for setting the offset field, if appropriate.
+ * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
*
- * NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
+ * The *relfrozenxid_out and *relminmxid_out arguments are the current target
+ * relfrozenxid and relminmxid for VACUUM caller's heap rel. Any and all
+ * unfrozen XIDs or MXIDs that remain in caller's rel after VACUUM finishes
+ * _must_ have values >= the final relfrozenxid/relminmxid values in pg_class.
+ * This includes XIDs that remain as MultiXact members from any tuple's xmax.
+ * Each call here pushes back *relfrozenxid_out and/or *relminmxid_out as
+ * needed to avoid unsafe final values in rel's authoritative pg_class tuple.
+ *
+ * NB: cutoff_xid *must* be <= VACUUM's OldestXmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
* anyone's idea of the tuple state.
- * Similarly, cutoff_multi must be less than or equal to the smallest
- * MultiXactId used by any transaction currently open.
+ * Similarly, cutoff_multi must be <= VACUUM's OldestMxact.
*
- * If the tuple is in a shared buffer, caller must hold an exclusive lock on
- * that buffer.
+ * NB: This function has side effects: it might allocate a new MultiXactId.
+ * It will be set as tuple's new xmax when our *frz output is processed within
+ * heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
+ * then caller had better have an exclusive lock on it already.
*
- * NB: It is not enough to set hint bits to indicate something is
- * committed/invalid -- they might not be set on a standby, or after crash
- * recovery. We really need to remove old xids.
+ * NB: It is not enough to set hint bits to indicate an XID committed/aborted.
+ * The *frz WAL record we output completely removes all old XIDs during REDO.
*/
bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen)
+ xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
bool changed = false;
bool xmax_already_frozen = false;
@@ -6418,7 +6474,9 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* already a permanent value), while in the block below it is set true to
* mean "xmin won't need freezing after what we do to it here" (false
* otherwise). In both cases we're allowed to set totally_frozen, as far
- * as xmin is concerned.
+ * as xmin is concerned. Both cases also don't require relfrozenxid_out
+ * handling, since either way the tuple's xmin will be a permanent value
+ * once we're done with it.
*/
xid = HeapTupleHeaderGetXmin(tuple);
if (!TransactionIdIsNormal(xid))
@@ -6443,6 +6501,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
frz->t_infomask |= HEAP_XMIN_FROZEN;
changed = true;
}
+ else
+ {
+ /* xmin to remain unfrozen. Could push back relfrozenxid_out. */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
/*
@@ -6452,7 +6516,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* freezing, too. Also, if a multi needs freezing, we cannot simply take
* it out --- if there's a live updater Xid, it needs to be kept.
*
- * Make sure to keep heap_tuple_needs_freeze in sync with this.
+ * Make sure to keep heap_tuple_would_freeze in sync with this.
*/
xid = HeapTupleHeaderGetRawXmax(tuple);
@@ -6460,15 +6524,28 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
{
TransactionId newxmax;
uint16 flags;
+ TransactionId mxid_oldest_xid_out = *relfrozenxid_out;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
relfrozenxid, relminmxid,
- cutoff_xid, cutoff_multi, &flags);
+ cutoff_xid, cutoff_multi,
+ &flags, &mxid_oldest_xid_out);
freeze_xmax = (flags & FRM_INVALIDATE_XMAX);
if (flags & FRM_RETURN_IS_XID)
{
+ /*
+ * xmax will become an updater Xid (original MultiXact's updater
+ * member Xid will be carried forward as a simple Xid in Xmax).
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(TransactionIdIsValid(newxmax));
+ if (TransactionIdPrecedes(newxmax, *relfrozenxid_out))
+ *relfrozenxid_out = newxmax;
+
/*
* NB -- some of these transformations are only valid because we
* know the return Xid is a tuple updater (i.e. not merely a
@@ -6487,6 +6564,19 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
uint16 newbits;
uint16 newbits2;
+ /*
+ * xmax is an old MultiXactId that we have to replace with a new
+ * MultiXactId, to carry forward two or more original member XIDs.
+ * Might have to ratchet back relfrozenxid_out here, though never
+ * relminmxid_out.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax));
+ Assert(!MultiXactIdPrecedes(newxmax, *relminmxid_out));
+ Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
+ *relfrozenxid_out));
+ *relfrozenxid_out = mxid_oldest_xid_out;
+
/*
* We can't use GetMultiXactIdHintBits directly on the new multi
* here; that routine initializes the masks to all zeroes, which
@@ -6503,6 +6593,30 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
changed = true;
}
+ else if (flags & FRM_NOOP)
+ {
+ /*
+ * xmax is a MultiXactId, and nothing about it changes for now.
+ * Might have to ratchet back relminmxid_out, relfrozenxid_out, or
+ * both together.
+ */
+ Assert(!freeze_xmax);
+ Assert(MultiXactIdIsValid(newxmax) && xid == newxmax);
+ Assert(TransactionIdPrecedesOrEquals(mxid_oldest_xid_out,
+ *relfrozenxid_out));
+ if (MultiXactIdPrecedes(xid, *relminmxid_out))
+ *relminmxid_out = xid;
+ *relfrozenxid_out = mxid_oldest_xid_out;
+ }
+ else
+ {
+ /*
+ * Keeping nothing (neither an Xid nor a MultiXactId) in xmax.
+ * Won't have to ratchet back relminmxid_out or relfrozenxid_out.
+ */
+ Assert(freeze_xmax);
+ Assert(!TransactionIdIsValid(newxmax));
+ }
}
else if (TransactionIdIsNormal(xid))
{
@@ -6527,15 +6641,21 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
errmsg_internal("cannot freeze committed xmax %u",
xid)));
freeze_xmax = true;
+ /* No need for relfrozenxid_out handling, since we'll freeze xmax */
}
else
+ {
freeze_xmax = false;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ }
}
else if ((tuple->t_infomask & HEAP_XMAX_INVALID) ||
!TransactionIdIsValid(HeapTupleHeaderGetRawXmax(tuple)))
{
freeze_xmax = false;
xmax_already_frozen = true;
+ /* No need for relfrozenxid_out handling for already-frozen xmax */
}
else
ereport(ERROR,
@@ -6576,6 +6696,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* was removed in PostgreSQL 9.0. Note that if we were to respect
* cutoff_xid here, we'd need to make surely to clear totally_frozen
* when we skipped freezing on that basis.
+ *
+ * No need for relfrozenxid_out handling, since we always freeze xvac.
*/
if (TransactionIdIsNormal(xid))
{
@@ -6653,11 +6775,14 @@ heap_freeze_tuple(HeapTupleHeader tuple,
xl_heap_freeze_tuple frz;
bool do_freeze;
bool tuple_totally_frozen;
+ TransactionId relfrozenxid_out = cutoff_xid;
+ MultiXactId relminmxid_out = cutoff_multi;
do_freeze = heap_prepare_freeze_tuple(tuple,
relfrozenxid, relminmxid,
cutoff_xid, cutoff_multi,
- &frz, &tuple_totally_frozen);
+ &frz, &tuple_totally_frozen,
+ &relfrozenxid_out, &relminmxid_out);
/*
* Note that because this is not a WAL-logged operation, we don't need to
@@ -7036,9 +7161,7 @@ ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status,
* heap_tuple_needs_eventual_freeze
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * will eventually require freezing. Similar to heap_tuple_needs_freeze,
- * but there's no cutoff, since we're trying to figure out whether freezing
- * will ever be needed, not whether it's needed now.
+ * will eventually require freezing (if tuple isn't removed by pruning first).
*/
bool
heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
@@ -7082,87 +7205,106 @@ heap_tuple_needs_eventual_freeze(HeapTupleHeader tuple)
}
/*
- * heap_tuple_needs_freeze
+ * heap_tuple_would_freeze
*
- * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID or MultiXactId. If so, return true.
+ * Return value indicates if heap_prepare_freeze_tuple sibling function would
+ * freeze any of the XID/XMID fields from the tuple, given the same cutoffs.
+ * We must also deal with dead tuples here, since (xmin, xmax, xvac) fields
+ * could be processed by pruning away the whole tuple instead of freezing.
*
- * It doesn't matter whether the tuple is alive or dead, we are checking
- * to see if a tuple needs to be removed or frozen to avoid wraparound.
- *
- * NB: Cannot rely on hint bits here, they might not be set after a crash or
- * on a standby.
+ * The *relfrozenxid_out and *relminmxid_out input/output arguments work just
+ * like the heap_prepare_freeze_tuple arguments that they're based on. We
+ * never freeze here, which makes tracking the oldest extant XID/MXID simple.
*/
bool
-heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_tuple_would_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out)
{
TransactionId xid;
+ MultiXactId multi;
+ bool would_freeze = false;
+ /* First deal with xmin */
xid = HeapTupleHeaderGetXmin(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
-
- /*
- * The considerations for multixacts are complicated; look at
- * heap_prepare_freeze_tuple for justifications. This routine had better
- * be in sync with that one!
- */
- if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
+ if (TransactionIdIsNormal(xid))
{
- MultiXactId multi;
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ /* Now deal with xmax */
+ xid = InvalidTransactionId;
+ multi = InvalidMultiXactId;
+ if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
multi = HeapTupleHeaderGetRawXmax(tuple);
- if (!MultiXactIdIsValid(multi))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
- return true;
- else if (MultiXactIdPrecedes(multi, cutoff_multi))
- return true;
- else
- {
- MultiXactMember *members;
- int nmembers;
- int i;
+ else
+ xid = HeapTupleHeaderGetRawXmax(tuple);
- /* need to check whether any member of the mxact is too old */
-
- nmembers = GetMultiXactIdMembers(multi, &members, false,
- HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
-
- for (i = 0; i < nmembers; i++)
- {
- if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
- {
- pfree(members);
- return true;
- }
- }
- if (nmembers > 0)
- pfree(members);
- }
+ if (TransactionIdIsNormal(xid))
+ {
+ /* xmax is a non-permanent XID */
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ else if (!MultiXactIdIsValid(multi))
+ {
+ /* xmax is a permanent XID or invalid MultiXactId/XID */
+ }
+ else if (HEAP_LOCKED_UPGRADED(tuple->t_infomask))
+ {
+ /* xmax is a pg_upgrade'd MultiXact, which can't have updater XID */
+ if (MultiXactIdPrecedes(multi, *relminmxid_out))
+ *relminmxid_out = multi;
+ /* heap_prepare_freeze_tuple always freezes pg_upgrade'd xmax */
+ would_freeze = true;
}
else
{
- xid = HeapTupleHeaderGetRawXmax(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ /* xmax is a MultiXactId that may have an updater XID */
+ MultiXactMember *members;
+ int nmembers;
+
+ if (MultiXactIdPrecedes(multi, *relminmxid_out))
+ *relminmxid_out = multi;
+ if (MultiXactIdPrecedes(multi, cutoff_multi))
+ would_freeze = true;
+
+ /* need to check whether any member of the mxact is old */
+ nmembers = GetMultiXactIdMembers(multi, &members, false,
+ HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));
+
+ for (int i = 0; i < nmembers; i++)
+ {
+ xid = members[i].xid;
+ Assert(TransactionIdIsNormal(xid));
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ would_freeze = true;
+ }
+ if (nmembers > 0)
+ pfree(members);
}
if (tuple->t_infomask & HEAP_MOVED)
{
xid = HeapTupleHeaderGetXvac(tuple);
- if (TransactionIdIsNormal(xid) &&
- TransactionIdPrecedes(xid, cutoff_xid))
- return true;
+ if (TransactionIdIsNormal(xid))
+ {
+ if (TransactionIdPrecedes(xid, *relfrozenxid_out))
+ *relfrozenxid_out = xid;
+ /* heap_prepare_freeze_tuple always freezes xvac */
+ would_freeze = true;
+ }
}
- return false;
+ return would_freeze;
}
/*
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 87ab7775a..3b9f3b6af 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,7 +144,7 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
- /* Aggressive VACUUM (scan all unfrozen pages)? */
+ /* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
bool skipwithvm;
@@ -173,8 +173,9 @@ typedef struct LVRelState
/* VACUUM operation's target cutoffs for freezing XIDs and MultiXactIds */
TransactionId FreezeLimit;
MultiXactId MultiXactCutoff;
- /* Are FreezeLimit/MultiXactCutoff still valid? */
- bool freeze_cutoffs_valid;
+ /* Tracks oldest extant XID/MXID for setting relfrozenxid/relminmxid */
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
/* Error reporting state */
char *relnamespace;
@@ -319,17 +320,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
skipwithvm;
bool frozenxid_updated,
minmulti_updated;
- BlockNumber orig_rel_pages;
+ BlockNumber orig_rel_pages,
+ new_rel_pages,
+ new_rel_allvisible;
char **indnames = NULL;
- BlockNumber new_rel_pages;
- BlockNumber new_rel_allvisible;
- double new_live_tuples;
ErrorContextCallback errcallback;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
- TransactionId OldestXmin;
- TransactionId FreezeLimit;
- MultiXactId MultiXactCutoff;
+ TransactionId OldestXmin,
+ FreezeLimit;
+ MultiXactId OldestMxact,
+ MultiXactCutoff;
verbose = (params->options & VACOPT_VERBOSE) != 0;
instrument = (verbose || (IsAutoVacuumWorkerProcess() &&
@@ -351,20 +352,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Get OldestXmin cutoff, which is used to determine which deleted tuples
* are considered DEAD, not just RECENTLY_DEAD. Also get related cutoffs
- * used to determine which XIDs/MultiXactIds will be frozen.
- *
- * If this is an aggressive VACUUM, then we're strictly required to freeze
- * any and all XIDs from before FreezeLimit, so that we will be able to
- * safely advance relfrozenxid up to FreezeLimit below (we must be able to
- * advance relminmxid up to MultiXactCutoff, too).
+ * used to determine which XIDs/MultiXactIds will be frozen. If this is
+ * an aggressive VACUUM then lazy_scan_heap cannot leave behind unfrozen
+ * XIDs < FreezeLimit (or unfrozen MXIDs < MultiXactCutoff).
*/
aggressive = vacuum_set_xid_limits(rel,
params->freeze_min_age,
params->freeze_table_age,
params->multixact_freeze_min_age,
params->multixact_freeze_table_age,
- &OldestXmin, &FreezeLimit,
- &MultiXactCutoff);
+ &OldestXmin, &OldestMxact,
+ &FreezeLimit, &MultiXactCutoff);
skipwithvm = true;
if (params->options & VACOPT_DISABLE_PAGE_SKIPPING)
@@ -511,10 +509,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->vistest = GlobalVisTestFor(rel);
/* FreezeLimit controls XID freezing (always <= OldestXmin) */
vacrel->FreezeLimit = FreezeLimit;
- /* MultiXactCutoff controls MXID freezing */
+ /* MultiXactCutoff controls MXID freezing (always <= OldestMxact) */
vacrel->MultiXactCutoff = MultiXactCutoff;
- /* Track if cutoffs became invalid (possible in !aggressive case only) */
- vacrel->freeze_cutoffs_valid = true;
+ /* Initialize state used to track oldest extant XID/XMID */
+ vacrel->NewRelfrozenXid = OldestXmin;
+ vacrel->NewRelminMxid = OldestMxact;
/*
* Call lazy_scan_heap to perform all required heap pruning, index
@@ -548,16 +547,37 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Prepare to update rel's pg_class entry.
*
- * In principle new_live_tuples could be -1 indicating that we (still)
- * don't know the tuple count. In practice that probably can't happen,
- * since we'd surely have scanned some pages if the table is new and
- * nonempty.
- *
+ * Aggressive VACUUMs must always be able to advance relfrozenxid to a
+ * value >= FreezeLimit, and relminmxid to a value >= MultiXactCutoff.
+ * Non-aggressive VACUUMs may advance them by any amount, or not at all.
+ */
+ Assert(vacrel->NewRelfrozenXid == OldestXmin ||
+ TransactionIdPrecedesOrEquals(aggressive ? FreezeLimit :
+ vacrel->relfrozenxid,
+ vacrel->NewRelfrozenXid));
+ Assert(vacrel->NewRelminMxid == OldestMxact ||
+ MultiXactIdPrecedesOrEquals(aggressive ? MultiXactCutoff :
+ vacrel->relminmxid,
+ vacrel->NewRelminMxid));
+ if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages)
+ {
+ /*
+ * Must keep original relfrozenxid in a non-aggressive VACUUM that
+ * had to skip an all-visible page. The state that tracks new
+ * values will have missed unfrozen XIDs from the pages we skipped.
+ * (Even if we knew the true oldest XID it likely wouldn't help us,
+ * since it'll usually be very close to rel's original relfrozenxid.)
+ */
+ Assert(!aggressive);
+ vacrel->NewRelfrozenXid = InvalidTransactionId;
+ vacrel->NewRelminMxid = InvalidMultiXactId;
+ }
+
+ /*
* For safety, clamp relallvisible to be not more than what we're setting
- * relpages to.
+ * pg_class.relpages to
*/
new_rel_pages = vacrel->rel_pages; /* After possible rel truncation */
- new_live_tuples = vacrel->new_live_tuples;
visibilitymap_count(rel, &new_rel_allvisible, NULL);
if (new_rel_allvisible > new_rel_pages)
new_rel_allvisible = new_rel_pages;
@@ -565,33 +585,14 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
/*
* Now actually update rel's pg_class entry.
*
- * Aggressive VACUUM must reliably advance relfrozenxid (and relminmxid).
- * We are able to advance relfrozenxid in a non-aggressive VACUUM too,
- * provided we didn't skip any all-visible (not all-frozen) pages using
- * the visibility map, and assuming that we didn't fail to get a cleanup
- * lock that made it unsafe with respect to FreezeLimit (or perhaps our
- * MultiXactCutoff) established for VACUUM operation.
+ * In principle new_live_tuples could be -1 indicating that we (still)
+ * don't know the tuple count. In practice that can't happen, since we
+ * scan every page that isn't skipped using the visibility map.
*/
- if (vacrel->scanned_pages + vacrel->frozenskipped_pages < orig_rel_pages ||
- !vacrel->freeze_cutoffs_valid)
- {
- /* Cannot advance relfrozenxid/relminmxid */
- Assert(!aggressive);
- frozenxid_updated = minmulti_updated = false;
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- InvalidTransactionId, InvalidMultiXactId,
- NULL, NULL, false);
- }
- else
- {
- Assert(vacrel->scanned_pages + vacrel->frozenskipped_pages ==
- orig_rel_pages);
- vac_update_relstats(rel, new_rel_pages, new_live_tuples,
- new_rel_allvisible, vacrel->nindexes > 0,
- FreezeLimit, MultiXactCutoff,
- &frozenxid_updated, &minmulti_updated, false);
- }
+ vac_update_relstats(rel, new_rel_pages, vacrel->new_live_tuples,
+ new_rel_allvisible, vacrel->nindexes > 0,
+ vacrel->NewRelfrozenXid, vacrel->NewRelminMxid,
+ &frozenxid_updated, &minmulti_updated, false);
/*
* Report results to the stats collector, too.
@@ -605,7 +606,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
*/
pgstat_report_vacuum(RelationGetRelid(rel),
rel->rd_rel->relisshared,
- Max(new_live_tuples, 0),
+ Max(vacrel->new_live_tuples, 0),
vacrel->recently_dead_tuples +
vacrel->missed_dead_tuples);
pgstat_progress_end_command();
@@ -674,7 +675,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->num_index_scans);
appendStringInfo(&buf, _("pages: %u removed, %u remain, %u scanned (%.2f%% of total)\n"),
vacrel->removed_pages,
- vacrel->rel_pages,
+ new_rel_pages,
vacrel->scanned_pages,
orig_rel_pages == 0 ? 100.0 :
100.0 * vacrel->scanned_pages / orig_rel_pages);
@@ -694,17 +695,17 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
OldestXmin, diff);
if (frozenxid_updated)
{
- diff = (int32) (FreezeLimit - vacrel->relfrozenxid);
+ diff = (int32) (vacrel->NewRelfrozenXid - vacrel->relfrozenxid);
appendStringInfo(&buf,
_("new relfrozenxid: %u, which is %d xids ahead of previous value\n"),
- FreezeLimit, diff);
+ vacrel->NewRelfrozenXid, diff);
}
if (minmulti_updated)
{
- diff = (int32) (MultiXactCutoff - vacrel->relminmxid);
+ diff = (int32) (vacrel->NewRelminMxid - vacrel->relminmxid);
appendStringInfo(&buf,
_("new relminmxid: %u, which is %d mxids ahead of previous value\n"),
- MultiXactCutoff, diff);
+ vacrel->NewRelminMxid, diff);
}
if (orig_rel_pages > 0)
{
@@ -1584,6 +1585,8 @@ lazy_scan_prune(LVRelState *vacrel,
recently_dead_tuples;
int nnewlpdead;
int nfrozen;
+ TransactionId NewRelfrozenXid;
+ MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
@@ -1593,7 +1596,9 @@ lazy_scan_prune(LVRelState *vacrel,
retry:
- /* Initialize (or reset) page-level counters */
+ /* Initialize (or reset) page-level state */
+ NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ NewRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
lpdead_items = 0;
live_tuples = 0;
@@ -1800,8 +1805,8 @@ retry:
vacrel->relminmxid,
vacrel->FreezeLimit,
vacrel->MultiXactCutoff,
- &frozen[nfrozen],
- &tuple_totally_frozen))
+ &frozen[nfrozen], &tuple_totally_frozen,
+ &NewRelfrozenXid, &NewRelminMxid))
{
/* Will execute freeze below */
frozen[nfrozen++].offset = offnum;
@@ -1815,13 +1820,16 @@ retry:
prunestate->all_frozen = false;
}
+ vacrel->offnum = InvalidOffsetNumber;
+
/*
* We have now divided every item on the page into either an LP_DEAD item
* that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
* that remains and needs to be considered for freezing now (LP_UNUSED and
* LP_REDIRECT items also remain, but are of no further interest to us).
*/
- vacrel->offnum = InvalidOffsetNumber;
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
/*
* Consider the need to freeze any items with tuple storage from the page
@@ -1971,6 +1979,8 @@ lazy_scan_noprune(LVRelState *vacrel,
recently_dead_tuples,
missed_dead_tuples;
HeapTupleHeader tupleheader;
+ TransactionId NewRelfrozenXid = vacrel->NewRelfrozenXid;
+ MultiXactId NewRelminMxid = vacrel->NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -2015,22 +2025,37 @@ lazy_scan_noprune(LVRelState *vacrel,
*hastup = true; /* page prevents rel truncation */
tupleheader = (HeapTupleHeader) PageGetItem(page, itemid);
- if (heap_tuple_needs_freeze(tupleheader,
+ if (heap_tuple_would_freeze(tupleheader,
vacrel->FreezeLimit,
- vacrel->MultiXactCutoff))
+ vacrel->MultiXactCutoff,
+ &NewRelfrozenXid, &NewRelminMxid))
{
+ /* Tuple with XID < FreezeLimit (or MXID < MultiXactCutoff) */
if (vacrel->aggressive)
{
- /* Going to have to get cleanup lock for lazy_scan_prune */
+ /*
+ * Aggressive VACUUMs must always be able to advance rel's
+ * relfrozenxid to a value >= FreezeLimit (and be able to
+ * advance rel's relminmxid to a value >= MultiXactCutoff).
+ * The ongoing aggressive VACUUM won't be able to do that
+ * unless it can freeze an XID (or XMID) from this tuple now.
+ *
+ * The only safe option is to have caller perform processing
+ * of this page using lazy_scan_prune. Caller might have to
+ * wait a while for a cleanup lock, but it can't be helped.
+ */
vacrel->offnum = InvalidOffsetNumber;
return false;
}
/*
- * Current non-aggressive VACUUM operation definitely won't be
- * able to advance relfrozenxid or relminmxid
+ * Non-aggressive VACUUMs are under no obligation to advance
+ * relfrozenxid (even by one XID). We can be much laxer here.
+ *
+ * Currently we always just accept an older final relfrozenxid
+ * and/or relminmxid value. We never make caller wait or work a
+ * little harder, even when it likely makes sense to do so.
*/
- vacrel->freeze_cutoffs_valid = false;
}
ItemPointerSet(&(tuple.t_self), blkno, offnum);
@@ -2080,9 +2105,14 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->offnum = InvalidOffsetNumber;
/*
- * Now save details of the LP_DEAD items from the page in vacrel (though
- * only when VACUUM uses two-pass strategy)
+ * By here we know for sure that caller can put off freezing and pruning
+ * this particular page until the next VACUUM. Remember its details now.
+ * (lazy_scan_prune expects a clean slate, so we have to do this last.)
*/
+ vacrel->NewRelfrozenXid = NewRelfrozenXid;
+ vacrel->NewRelminMxid = NewRelminMxid;
+
+ /* Save any LP_DEAD items found on the page in dead_items array */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 02a7e94bf..a7e988298 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -767,6 +767,7 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
TupleDesc oldTupDesc PG_USED_FOR_ASSERTS_ONLY;
TupleDesc newTupDesc PG_USED_FOR_ASSERTS_ONLY;
TransactionId OldestXmin;
+ MultiXactId oldestMxact;
TransactionId FreezeXid;
MultiXactId MultiXactCutoff;
bool use_sort;
@@ -856,8 +857,8 @@ copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
* Since we're going to rewrite the whole table anyway, there's no reason
* not to be aggressive about this.
*/
- vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0,
- &OldestXmin, &FreezeXid, &MultiXactCutoff);
+ vacuum_set_xid_limits(OldHeap, 0, 0, 0, 0, &OldestXmin, &oldestMxact,
+ &FreezeXid, &MultiXactCutoff);
/*
* FreezeXid will become the table's new relfrozenxid, and that mustn't go
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 50a4a612e..deec4887b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -945,14 +945,22 @@ get_all_vacuum_rels(int options)
* The output parameters are:
* - oldestXmin is the Xid below which tuples deleted by any xact (that
* committed) should be considered DEAD, not just RECENTLY_DEAD.
- * - freezeLimit is the Xid below which all Xids are replaced by
- * FrozenTransactionId during vacuum.
- * - multiXactCutoff is the value below which all MultiXactIds are removed
- * from Xmax.
+ * - oldestMxact is the Mxid below which MultiXacts are definitely not
+ * seen as visible by any running transaction.
+ * - freezeLimit is the Xid below which all Xids are definitely replaced by
+ * FrozenTransactionId during aggressive vacuums.
+ * - multiXactCutoff is the value below which all MultiXactIds are definitely
+ * removed from Xmax during aggressive vacuums.
*
* Return value indicates if vacuumlazy.c caller should make its VACUUM
* operation aggressive. An aggressive VACUUM must advance relfrozenxid up to
- * FreezeLimit, and relminmxid up to multiXactCutoff.
+ * FreezeLimit (at a minimum), and relminmxid up to multiXactCutoff (at a
+ * minimum).
+ *
+ * oldestXmin and oldestMxact are the most recent values that can ever be
+ * passed to vac_update_relstats() as frozenxid and minmulti arguments by our
+ * vacuumlazy.c caller later on. These values should be passed when it turns
+ * out that VACUUM will leave no unfrozen XIDs/XMIDs behind in the table.
*/
bool
vacuum_set_xid_limits(Relation rel,
@@ -961,6 +969,7 @@ vacuum_set_xid_limits(Relation rel,
int multixact_freeze_min_age,
int multixact_freeze_table_age,
TransactionId *oldestXmin,
+ MultiXactId *oldestMxact,
TransactionId *freezeLimit,
MultiXactId *multiXactCutoff)
{
@@ -969,7 +978,6 @@ vacuum_set_xid_limits(Relation rel,
int effective_multixact_freeze_max_age;
TransactionId limit;
TransactionId safeLimit;
- MultiXactId oldestMxact;
MultiXactId mxactLimit;
MultiXactId safeMxactLimit;
int freezetable;
@@ -1065,9 +1073,11 @@ vacuum_set_xid_limits(Relation rel,
effective_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
+ /* Remember for caller */
+ *oldestMxact = GetOldestMultiXactId();
+
/* compute the cutoff multi, being careful to generate a valid value */
- oldestMxact = GetOldestMultiXactId();
- mxactLimit = oldestMxact - mxid_freezemin;
+ mxactLimit = *oldestMxact - mxid_freezemin;
if (mxactLimit < FirstMultiXactId)
mxactLimit = FirstMultiXactId;
@@ -1082,8 +1092,8 @@ vacuum_set_xid_limits(Relation rel,
(errmsg("oldest multixact is far in the past"),
errhint("Close open transactions with multixacts soon to avoid wraparound problems.")));
/* Use the safe limit, unless an older mxact is still running */
- if (MultiXactIdPrecedes(oldestMxact, safeMxactLimit))
- mxactLimit = oldestMxact;
+ if (MultiXactIdPrecedes(*oldestMxact, safeMxactLimit))
+ mxactLimit = *oldestMxact;
else
mxactLimit = safeMxactLimit;
}
@@ -1390,12 +1400,9 @@ vac_update_relstats(Relation relation,
* Update relfrozenxid, unless caller passed InvalidTransactionId
* indicating it has no new data.
*
- * Ordinarily, we don't let relfrozenxid go backwards: if things are
- * working correctly, the only way the new frozenxid could be older would
- * be if a previous VACUUM was done with a tighter freeze_min_age, in
- * which case we don't want to forget the work it already did. However,
- * if the stored relfrozenxid is "in the future", then it must be corrupt
- * and it seems best to overwrite it with the cutoff we used this time.
+ * Ordinarily, we don't let relfrozenxid go backwards. However, if the
+ * stored relfrozenxid is "in the future" then it seems best to assume
+ * it's corrupt, and overwrite with the oldest remaining XID in the table.
* This should match vac_update_datfrozenxid() concerning what we consider
* to be "in the future".
*/
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 34d72dba7..0a7b38c17 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -585,9 +585,11 @@
statistics in the system tables <structname>pg_class</structname> and
<structname>pg_database</structname>. In particular,
the <structfield>relfrozenxid</structfield> column of a table's
- <structname>pg_class</structname> row contains the freeze cutoff XID that was used
- by the last aggressive <command>VACUUM</command> for that table. All rows
- inserted by transactions with XIDs older than this cutoff XID are
+ <structname>pg_class</structname> row contains the oldest
+ remaining XID at the end of the most recent <command>VACUUM</command>
+ that successfully advanced <structfield>relfrozenxid</structfield>
+ (typically the most recent aggressive VACUUM). All rows inserted
+ by transactions with XIDs older than this cutoff XID are
guaranteed to have been frozen. Similarly,
the <structfield>datfrozenxid</structfield> column of a database's
<structname>pg_database</structname> row is a lower bound on the unfrozen XIDs
@@ -610,6 +612,17 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
cutoff XID to the current transaction's XID.
</para>
+ <tip>
+ <para>
+ <literal>VACUUM VERBOSE</literal> outputs information about
+ <structfield>relfrozenxid</structfield> and/or
+ <structfield>relminmxid</structfield> when either field was
+ advanced. The same details appear in the server log when <xref
+ linkend="guc-log-autovacuum-min-duration"/> reports on vacuuming
+ by autovacuum.
+ </para>
+ </tip>
+
<para>
<command>VACUUM</command> normally only scans pages that have been modified
since the last vacuum, but <structfield>relfrozenxid</structfield> can only be
@@ -624,7 +637,11 @@ SELECT datname, age(datfrozenxid) FROM pg_database;
set <literal>age(relfrozenxid)</literal> to a value just a little more than the
<varname>vacuum_freeze_min_age</varname> setting
that was used (more by the number of transactions started since the
- <command>VACUUM</command> started). If no <structfield>relfrozenxid</structfield>-advancing
+ <command>VACUUM</command> started). <command>VACUUM</command>
+ will set <structfield>relfrozenxid</structfield> to the oldest XID
+ that remains in the table, so it's possible that the final value
+ will be much more recent than strictly required.
+ If no <structfield>relfrozenxid</structfield>-advancing
<command>VACUUM</command> is issued on the table until
<varname>autovacuum_freeze_max_age</varname> is reached, an autovacuum will soon
be forced for the table.
@@ -711,8 +728,9 @@ HINT: Stop the postmaster and vacuum that database in single-user mode.
</para>
<para>
- Aggressive <command>VACUUM</command> scans, regardless of
- what causes them, enable advancing the value for that table.
+ Aggressive <command>VACUUM</command> scans, regardless of what
+ causes them, are <emphasis>guaranteed</emphasis> to be able to
+ advance the table's <structfield>relminmxid</structfield>.
Eventually, as all tables in all databases are scanned and their
oldest multixact values are advanced, on-disk storage for older
multixacts can be removed.
diff --git a/src/test/isolation/expected/vacuum-no-cleanup-lock.out b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
new file mode 100644
index 000000000..f7bc93e8f
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-no-cleanup-lock.out
@@ -0,0 +1,189 @@
+Parsed test spec with 4 sessions
+
+starting permutation: vacuumer_pg_class_stats dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert pinholder_cursor vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats pinholder_cursor dml_insert dml_delete dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: vacuumer_pg_class_stats dml_insert dml_delete pinholder_cursor dml_insert vacuumer_nonaggressive_vacuum vacuumer_pg_class_stats pinholder_commit
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 20
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step dml_delete:
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_insert:
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step vacuumer_pg_class_stats:
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+
+relpages|reltuples
+--------+---------
+ 1| 21
+(1 row)
+
+step pinholder_commit:
+ COMMIT;
+
+
+starting permutation: dml_begin dml_other_begin dml_key_share dml_other_key_share vacuumer_nonaggressive_vacuum pinholder_cursor dml_other_update dml_commit dml_other_commit vacuumer_nonaggressive_vacuum pinholder_commit vacuumer_nonaggressive_vacuum
+step dml_begin: BEGIN;
+step dml_other_begin: BEGIN;
+step dml_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step dml_other_key_share: SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE;
+id
+--
+ 3
+(1 row)
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_cursor:
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+
+dummy
+-----
+ 1
+(1 row)
+
+step dml_other_update: UPDATE smalltbl SET t = 'u' WHERE id = 3;
+step dml_commit: COMMIT;
+step dml_other_commit: COMMIT;
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
+step pinholder_commit:
+ COMMIT;
+
+step vacuumer_nonaggressive_vacuum:
+ VACUUM smalltbl;
+
diff --git a/src/test/isolation/expected/vacuum-reltuples.out b/src/test/isolation/expected/vacuum-reltuples.out
deleted file mode 100644
index ce55376e7..000000000
--- a/src/test/isolation/expected/vacuum-reltuples.out
+++ /dev/null
@@ -1,67 +0,0 @@
-Parsed test spec with 2 sessions
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify open fetch1 vac close stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step open:
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-
-step fetch1:
- fetch next from c1;
-
-dummy
------
- 1
-(1 row)
-
-step vac:
- vacuum smalltbl;
-
-step close:
- commit;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
-
-starting permutation: modify vac stats
-step modify:
- insert into smalltbl select max(id)+1 from smalltbl;
-
-step vac:
- vacuum smalltbl;
-
-step stats:
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-
-relpages|reltuples
---------+---------
- 1| 21
-(1 row)
-
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 00749a40b..a48caae22 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -84,7 +84,7 @@ test: alter-table-4
test: create-trigger
test: sequence-ddl
test: async-notify
-test: vacuum-reltuples
+test: vacuum-no-cleanup-lock
test: timeouts
test: vacuum-concurrent-drop
test: vacuum-conflict
diff --git a/src/test/isolation/specs/vacuum-no-cleanup-lock.spec b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
new file mode 100644
index 000000000..a88be66de
--- /dev/null
+++ b/src/test/isolation/specs/vacuum-no-cleanup-lock.spec
@@ -0,0 +1,150 @@
+# Test for vacuum's reduced processing of heap pages (used for any heap page
+# where a cleanup lock isn't immediately available)
+#
+# Debugging tip: Change VACUUM to VACUUM VERBOSE to get feedback on what's
+# really going on
+
+# Use name type here to avoid TOAST table:
+setup
+{
+ CREATE TABLE smalltbl AS SELECT i AS id, 't'::name AS t FROM generate_series(1,20) i;
+ ALTER TABLE smalltbl SET (autovacuum_enabled = off);
+ ALTER TABLE smalltbl ADD PRIMARY KEY (id);
+}
+setup
+{
+ VACUUM ANALYZE smalltbl;
+}
+
+teardown
+{
+ DROP TABLE smalltbl;
+}
+
+# This session holds a pin on smalltbl's only heap page:
+session pinholder
+step pinholder_cursor
+{
+ BEGIN;
+ DECLARE c1 CURSOR FOR SELECT 1 AS dummy FROM smalltbl;
+ FETCH NEXT FROM c1;
+}
+step pinholder_commit
+{
+ COMMIT;
+}
+
+# This session inserts and deletes tuples, potentially affecting reltuples:
+session dml
+step dml_insert
+{
+ INSERT INTO smalltbl SELECT max(id) + 1 FROM smalltbl;
+}
+step dml_delete
+{
+ DELETE FROM smalltbl WHERE id = (SELECT min(id) FROM smalltbl);
+}
+step dml_begin { BEGIN; }
+step dml_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_commit { COMMIT; }
+
+# Needed for Multixact test:
+session dml_other
+step dml_other_begin { BEGIN; }
+step dml_other_key_share { SELECT id FROM smalltbl WHERE id = 3 FOR KEY SHARE; }
+step dml_other_update { UPDATE smalltbl SET t = 'u' WHERE id = 3; }
+step dml_other_commit { COMMIT; }
+
+# This session runs non-aggressive VACUUM, but with maximally aggressive
+# cutoffs for tuple freezing (e.g., FreezeLimit == OldestXmin):
+session vacuumer
+setup
+{
+ SET vacuum_freeze_min_age = 0;
+ SET vacuum_multixact_freeze_min_age = 0;
+}
+step vacuumer_nonaggressive_vacuum
+{
+ VACUUM smalltbl;
+}
+step vacuumer_pg_class_stats
+{
+ SELECT relpages, reltuples FROM pg_class WHERE oid = 'smalltbl'::regclass;
+}
+
+# Test VACUUM's reltuples counting mechanism.
+#
+# Final pg_class.reltuples should never be affected by VACUUM's inability to
+# get a cleanup lock on any page, except to the extent that any cleanup lock
+# contention changes the number of tuples that remain ("missed dead" tuples
+# are counted in reltuples, much like "recently dead" tuples).
+
+# Easy case:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+
+# Harder case -- count 21 tuples at the end (like last time), but with cleanup
+# lock contention this time:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ pinholder_cursor
+ vacuumer_nonaggressive_vacuum
+ vacuumer_pg_class_stats # End with 21 tuples
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but vary the order, and delete an inserted row:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ pinholder_cursor
+ dml_insert
+ dml_delete
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "recently dead" tuple won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Same as "harder case", but initial insert and delete before cursor:
+permutation
+ vacuumer_pg_class_stats # Start with 20 tuples
+ dml_insert
+ dml_delete
+ pinholder_cursor
+ dml_insert
+ vacuumer_nonaggressive_vacuum
+ # reltuples is 21 here again -- "missed dead" tuple ("recently dead" when
+ # concurrent activity held back VACUUM's OldestXmin) won't be included in
+ # count here:
+ vacuumer_pg_class_stats
+ pinholder_commit # order doesn't matter
+
+# Test VACUUM's mechanism for skipping MultiXact freezing.
+#
+# This provides test coverage for code paths that are only hit when we need to
+# freeze, but inability to acquire a cleanup lock on a heap page makes
+# freezing some XIDs/XMIDs < FreezeLimit/MultiXactCutoff impossible (without
+# waiting for a cleanup lock, which non-aggressive VACUUM is unwilling to do).
+permutation
+ dml_begin
+ dml_other_begin
+ dml_key_share
+ dml_other_key_share
+ # Will get cleanup lock, can't advance relminmxid yet:
+ # (though will usually advance relfrozenxid by ~2 XIDs)
+ vacuumer_nonaggressive_vacuum
+ pinholder_cursor
+ dml_other_update
+ dml_commit
+ dml_other_commit
+ # Can't cleanup lock, so still can't advance relminmxid here:
+ # (relfrozenxid held back by XIDs in MultiXact too)
+ vacuumer_nonaggressive_vacuum
+ pinholder_commit
+ # Pin was dropped, so will advance relminmxid, at long last:
+ # (ditto for relfrozenxid advancement)
+ vacuumer_nonaggressive_vacuum
diff --git a/src/test/isolation/specs/vacuum-reltuples.spec b/src/test/isolation/specs/vacuum-reltuples.spec
deleted file mode 100644
index a2a461f2f..000000000
--- a/src/test/isolation/specs/vacuum-reltuples.spec
+++ /dev/null
@@ -1,49 +0,0 @@
-# Test for vacuum's handling of reltuples when pages are skipped due
-# to page pins. We absolutely need to avoid setting reltuples=0 in
-# such cases, since that interferes badly with planning.
-#
-# Expected result for all three permutation is 21 tuples, including
-# the second permutation. VACUUM is able to count the concurrently
-# inserted tuple in its final reltuples, even when a cleanup lock
-# cannot be acquired on the affected heap page.
-
-setup {
- create table smalltbl
- as select i as id from generate_series(1,20) i;
- alter table smalltbl set (autovacuum_enabled = off);
-}
-setup {
- vacuum analyze smalltbl;
-}
-
-teardown {
- drop table smalltbl;
-}
-
-session worker
-step open {
- begin;
- declare c1 cursor for select 1 as dummy from smalltbl;
-}
-step fetch1 {
- fetch next from c1;
-}
-step close {
- commit;
-}
-step stats {
- select relpages, reltuples from pg_class
- where oid='smalltbl'::regclass;
-}
-
-session vacuumer
-step vac {
- vacuum smalltbl;
-}
-step modify {
- insert into smalltbl select max(id)+1 from smalltbl;
-}
-
-permutation modify vac stats
-permutation modify open fetch1 vac close stats
-permutation modify vac stats
--
2.32.0
Hi,
On 2022-04-01 10:54:14 -0700, Peter Geoghegan wrote:
On Thu, Mar 31, 2022 at 11:19 AM Peter Geoghegan <pg@bowt.ie> wrote:
The assert is "Assert(diff > 0)", and not "Assert(diff >= 0)".
Attached is v15. I plan to commit the first two patches (the most
substantial two patches by far) in the next couple of days, barring
objections.
Just saw that you committed: Wee! I think this will be a substantial
improvement for our users.
While I was writing the above I, again, realized that it'd be awfully nice to
have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
feedback about improvements more easily and for users to know what aspects
they need to tune.
Knowing how many times a table was vacuumed doesn't really tell that much, and
requiring log_autovacuum_min_duration to be enabled and then aggregating those
results is pretty painful (and version dependent).
If we just collected something like:
- number of heap passes
- time spent heap vacuuming
- number of index scans
- time spent index vacuuming
- time spent delaying
- percentage of not-yet-removable vs removable tuples
it'd start to be a heck of a lot easier to judge how well autovacuum is
coping.
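To make the shape of that concrete, here's a minimal sketch of what such
per-table counters might look like. This is purely illustrative -- the struct
and every field name are invented, and nothing like this exists in pgstat
today:

#include <stdint.h>

/*
 * Hypothetical per-table vacuum activity counters, invented purely to
 * illustrate the list above -- no such struct or fields exist today.
 */
typedef struct VacuumActivityStats
{
    int64_t heap_vacuum_passes;     /* second-pass heap vacuuming rounds */
    int64_t index_scans;            /* rounds of index vacuuming/cleanup */
    double  heap_vacuum_secs;       /* time spent vacuuming heap pages */
    double  index_vacuum_secs;      /* time spent vacuuming indexes */
    double  delay_secs;             /* time spent in cost-based delay sleeps */
    int64_t tuples_removed;         /* dead tuples actually removed */
    int64_t tuples_not_removable;   /* dead but not yet removable */
} VacuumActivityStats;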
If we tracked the related pieces above in the index stats (or perhaps
additionally there), it'd also make it easier to judge the cost of different
indexes.
- Andres
On Sun, Apr 3, 2022 at 12:05 PM Andres Freund <andres@anarazel.de> wrote:
Just saw that you committed: Wee! I think this will be a substantial
improvement for our users.
I hope so! I think that it's much more useful as the basis for future
work than as a standalone thing. Users of Postgres 15 might not notice
a huge difference. But it opens up a lot of new directions to take
VACUUM in.
I would like to get rid of anti-wraparound VACUUMs and aggressive
VACUUMs in Postgres 16. This isn't as radical as it sounds. It seems
quite possible to find a way for *every* VACUUM to become aggressive
progressively and dynamically. We'll still need to have autovacuum.c
know about wraparound, but it should be just another threshold,
not fundamentally different from the other thresholds (except that it's
still used when autovacuum is nominally disabled).
The behavior around autovacuum cancellations is probably still going
to be necessary when age(relfrozenxid) gets too high, but it shouldn't
be conditioned on what age(relfrozenxid) *used to be*, when the
autovacuum started. That could have been a long time ago. It should be
based on what's happening *right now*.
While I was writing the above I, again, realized that it'd be awfully nice to
have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
feedback about improvements more easily and for users to know what aspects
they need to tune.
Strongly agree. And I'm excited about the potential of the shared
memory stats patch to enable more thorough instrumentation, which
allows us to improve things with feedback that we just can't get right
now.
VACUUM is still too complicated -- that makes this kind of analysis
much harder, even for experts. You need more continuous behavior to
get value from it. There are too many things that might end up
mattering that really shouldn't ever matter, and too much potential
for strange, illogical discontinuities in performance over time.
Having only one type of VACUUM (excluding VACUUM FULL) will be much
easier for users to reason about. But I also think that it'll be much
easier for us to reason about. For example, better autovacuum
scheduling will be made much easier if autovacuum.c can just assume
that every VACUUM operation will do the same amount of work. (Another
problem with the scheduling is that it uses ANALYZE statistics
(sampling) in a way that just doesn't make any sense for something
like VACUUM, which is an inherently dynamic and cyclic process.)
None of this stuff has to rely on my patch for freezing. We don't
necessarily have to make every VACUUM advance relfrozenxid to do all
this. The important point is that we definitely shouldn't be putting
off *all* freezing of all-visible pages in non-aggressive VACUUMs (or
in VACUUMs that are not expected to advance relfrozenxid). Even a very
conservative implementation could achieve all this; we need only
spread out the burden of freezing all-visible pages over time, across
multiple VACUUM operations. Make the behavior continuous.
Knowing how many times a table was vacuumed doesn't really tell that much, and
requiring log_autovacuum_min_duration to be enabled and then aggregating those
results is pretty painful (and version dependent).
Yeah. Ideally we could avoid making the output of
log_autovacuum_min_duration into an API, by having a real API instead.
The output probably needs to evolve some more. A lot of very basic
information wasn't there until recently.
If we just collected something like:
- number of heap passes
- time spent heap vacuuming
- number of index scans
- time spent index vacuuming
- time spent delaying
You forgot FPIs.
- percentage of not-yet-removable vs removable tuples
I think that we should address this directly too, by "taking a
snapshot of the visibility map", so that we at least don't scan/vacuum heap
pages that don't really need it. This is also valuable because it
makes slowing down VACUUM (maybe slowing it down a lot) have fewer
downsides. At least we'll have "locked in" our scanned_pages, which we
can figure out in full before we really scan even one page.
it'd start to be a heck of a lot easier to judge how well autovacuum is
coping.
What about the potential of the shared memory stats stuff to totally
replace the use of ANALYZE stats in autovacuum.c? Possibly with help
from vacuumlazy.c, and the visibility map?
I see a lot of potential for exploiting the visibility map more, both
within vacuumlazy.c itself, and for autovacuum.c scheduling [1]. I'd
probably start with the scheduling stuff, and only then work out how
to show users more actionable information.
[1]: /messages/by-id/CAH2-Wzkt9Ey9NNm7q9nSaw5jdBjVsAq3yvb4UT4M93UaJVd_xg@mail.gmail.com
--
Peter Geoghegan
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote:
I also refined the WARNING patch in v15. It now actually issues
WARNINGs (rather than PANICs, which were just a temporary debugging
measure in v14).
Going to commit this remaining patch tomorrow, barring objections.
--
Peter Geoghegan
Hi,
On 2022-04-04 19:32:13 -0700, Peter Geoghegan wrote:
On Fri, Apr 1, 2022 at 10:54 AM Peter Geoghegan <pg@bowt.ie> wrote:
I also refined the WARNING patch in v15. It now actually issues
WARNINGs (rather than PANICs, which were just a temporary debugging
measure in v14).
Going to commit this remaining patch tomorrow, barring objections.
The remaining patch is the warnings in vac_update_relstats(), correct? I
guess one could argue they should be LOG rather than WARNING, but I find the
project stance on that pretty impractical. So warning's ok with me.
Not sure why you used errmsg_internal()?
Otherwise LGTM.
Greetings,
Andres Freund
On Mon, Apr 4, 2022 at 8:18 PM Andres Freund <andres@anarazel.de> wrote:
The remaining patch are the warnings in vac_update_relstats(), correct? I
guess one could argue they should be LOG rather than WARNING, but I find the
project stance on that pretty impractical. So warning's ok with me.
Right. The reason I used WARNINGs is that they match vaguely
related WARNINGs in vac_update_relstats()'s sibling function,
vacuum_set_xid_limits().
Not sure why you used errmsg_internal()?
The usual reason for using errmsg_internal(), I suppose. I tend to do
that with corruption-related messages, on the grounds that they're
usually highly obscure issues that are (by definition) never supposed
to happen. The only thing that a user can be expected to do with the
information from the message is to report it to -bugs, or find some
other similar report.
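For anyone following along, the distinction only affects whether the message
string is exposed to translators. A schematic example of the pattern being
discussed (the wording and the variables are invented for illustration, not
the committed message):

/*
 * Schematic example only: errmsg_internal() keeps a "should never
 * happen" message out of the translation catalogs.  The message text
 * and the variables (stored_relfrozenxid, oldest_extant_xid) are
 * invented here; this is not the committed wording.
 */
ereport(WARNING,
        (errcode(ERRCODE_DATA_CORRUPTED),
         errmsg_internal("relfrozenxid %u is ahead of oldest extant XID %u",
                         stored_relfrozenxid, oldest_extant_xid)));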
--
Peter Geoghegan
On Mon, Apr 4, 2022 at 8:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
Right. The reason I used WARNINGs was because it matches vaguely
related WARNINGs in vac_update_relstats()'s sibling function,
vacuum_set_xid_limits().
Okay, pushed the relfrozenxid warning patch.
Thanks
--
Peter Geoghegan
On 4/3/22 12:05 PM, Andres Freund wrote:
While I was writing the above I, again, realized that it'd be awfully nice to
have some accumulated stats about (auto-)vacuum's effectiveness. For us to get
feedback about improvements more easily and for users to know what aspects
they need to tune.
Knowing how many times a table was vacuumed doesn't really tell that much, and
requiring log_autovacuum_min_duration to be enabled and then aggregating those
results is pretty painful (and version dependent).
If we just collected something like:
- number of heap passes
- time spent heap vacuuming
- number of index scans
- time spent index vacuuming
- time spent delaying
The number of passes would let you know if maintenance_work_mem is too
small (or that you should stop killing 187M+ tuples in one go). The timing info would
give you an idea of the impact of throttling.
- percentage of not-yet-removable vs removable tuples
This'd give you an idea how bad your long-running-transaction problem is.
Another metric I think would be useful is the average utilization of
your autovac workers. No spare workers means you almost certainly have
tables that need vacuuming but have to wait. As a single number, it'd
also be much easier for users to understand. I'm no stats expert, but
one way to handle that cheaply would be to maintain an
engineering-weighted-mean of the percentage of autovac workers that are
in use at the end of each autovac launcher cycle (though that would
probably not work great for people who have extreme values for launcher
delay, or constantly muck with launcher_delay).
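Assuming the weighted mean here means something like an exponentially weighted
moving average, the per-launcher-cycle update really is cheap. A sketch, with
all names and the smoothing factor invented for illustration:

/*
 * Sketch only: smooth the "fraction of autovacuum workers in use",
 * sampled once per launcher cycle, with an exponentially weighted
 * moving average.  Names and smoothing factor are invented.
 */
#define WORKER_UTIL_ALPHA 0.1

static double autovac_worker_util = 0.0;    /* smoothed utilization, 0..1 */

static void
update_autovac_worker_utilization(int workers_in_use, int max_workers)
{
    double  sample;

    sample = (max_workers > 0) ?
        (double) workers_in_use / (double) max_workers : 0.0;

    autovac_worker_util = WORKER_UTIL_ALPHA * sample +
        (1.0 - WORKER_UTIL_ALPHA) * autovac_worker_util;
}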
On Thu, Apr 14, 2022 at 4:19 PM Jim Nasby <nasbyj@amazon.com> wrote:
- percentage of not-yet-removable vs removable tuples
This'd give you an idea how bad your long-running-transaction problem is.
VACUUM fundamentally works by removing those tuples that are
considered dead according to an XID-based cutoff established when the
operation begins. And so, many very long-running VACUUM operations will
see dead-but-not-removable tuples even when there are absolutely no
long-running transactions (nor any other VACUUM operations). The only
long-running thing involved might be our own long-running VACUUM
operation.
I would like to reduce the number of non-removable dead tuples
encountered by VACUUM by "locking in" heap pages that we'd like to
scan up front. This would work by having VACUUM create its own local
in-memory copy of the visibility map before it even starts scanning
heap pages. That way VACUUM won't end up visiting heap pages just
because they were concurrently modified halfway through our VACUUM
(by some other transactions). We don't really need to scan these pages
at all -- they have dead tuples, but not tuples that are "dead to
VACUUM".
The key idea here is to remove a big unnatural downside to slowing
VACUUM down. The cutoff would almost work like an MVCC snapshot, one that
describes precisely the work that VACUUM needs to do (which pages to
scan) up-front. Once that's locked in, the amount of work we're
required to do cannot go up as we're doing it (or it'll be less of an
issue, at least).
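As a rough illustration of the shape this could take (this is not code from
any patch -- visibilitymap_get_status() is the real API, everything else is
invented, and details like the all-frozen bit, aggressive VACUUM, and
SKIP_PAGES_THRESHOLD are deliberately ignored):

/*
 * Rough sketch only: capture the set of heap pages this VACUUM will
 * scan, before the main loop starts, by copying visibility map bits
 * into a local array.  Pages that become non-all-visible later on
 * (due to concurrent activity) would then simply be left for a future
 * VACUUM, instead of growing this VACUUM's workload mid-flight.
 */
static bool *
capture_scan_pages(Relation rel, BlockNumber nblocks)
{
    bool       *scan_page = palloc0(nblocks * sizeof(bool));
    Buffer      vmbuffer = InvalidBuffer;

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        uint8       vmstatus = visibilitymap_get_status(rel, blkno, &vmbuffer);

        /* only pages that aren't already all-visible need scanning */
        scan_page[blkno] = (vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0;
    }

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    return scan_page;
}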
It would also help if VACUUM didn't scan pages that it already knows
don't have any dead tuples. The current SKIP_PAGES_THRESHOLD rule
could easily be improved. That's almost the same problem.
--
Peter Geoghegan