Fixing a couple of buglets in how VACUUM sets visibility map bits

Started by Peter Geogheganabout 3 years ago22 messages
2 attachment(s)

Attached are two patches, each of which fixes two historical buglets
around VACUUM's approach to setting bits in the visibility map.
(Whether or not this is actually refactoring work or hardening work is
debatable, I suppose.)

The first patch makes sure that the snapshotConflictHorizon cutoff
(XID cutoff for recovery conflicts) is never a special XID, unless
that XID is InvalidTransactionId, which is interpreted as a
snapshotConflictHorizon value that will never need a recovery conflict
(per the general convention for snapshotConflictHorizon values
explained above ResolveRecoveryConflictWithSnapshot). This patch
establishes a hard rule that snapshotConflictHorizon values can never
be a special XID value, unless it's InvalidTransactionId. An assertion
enforces the rule for us in REDO routines (at the point that they call
ResolveRecoveryConflictWithSnapshot with the WAL record's
snapshotConflictHorizon XID value).

The second patch makes sure that VACUUM can never set a page
all-frozen in the visibility map without also setting the same page
all-visible in the same call to visibilitymap_set() -- regardless of
what we think we know about the current state of the all-visible bit
in the VM.

The second patch adjusts one of the visibilitymap_set() calls in
vacuumlazy.c that would previously sometimes set a page's all-frozen
bit without also setting its all-visible bit. This could allow VACUUM
to leave a page all-frozen but not all-visible in the visibility map
(since the value of all_visible_according_to_vm can go stale). I think
that this should be treated as a basic visibility map invariant: an
all-frozen page must also be all-visible, by definition, so why should
it be physically possible for the VM to give a contradictory picture
of the all_visible/all_frozen status of any one page? Assertions are
added that more or less make this rule into an invariant.
amcheck/pg_visibility coverage might make sense too, but I haven't
done that here.

The second patch also adjusts a later visibilitymap_set() call site
(the one used just after heap vacuuming runs in the final heap pass)
in roughly the same way. It no longer reads from the visibility map to
see what bits need to be changed. The existing approach here seems
rather odd. The whole point of calling lazy_vacuum_heap_page() is to
set LP_DEAD items referenced by VACUUM's dead_items array to LP_UNUSED
-- there has to have been at least one LP_DEAD item on the page for us
to end up here (which a Postgres 14 era assertion verifies for us). So
we already know perfectly well that the visibility map shouldn't
indicate that the page is all-visible yet -- why bother asking the VM?
And besides, any call to visibilitymap_set() will only modify the VM
when it directly observes that the bits have changed -- so why even
attempt to duplicate that on the caller side?

It seems to me that the visibilitymap_get_status() call inside
lazy_vacuum_heap_page() is actually abused to work as a substitute for
visibilitymap_pin(). Why not use top-level visibilitymap_pin() calls
instead, just like we do it in the first heap pass? That's how it's
done in the second patch; it adds a visibilitymap_pin() call in
lazy_vacuum_heap_rel()'s blkno-wise loop. That gives us parity between
the first and second heap pass, which seems like a clear
maintainability win -- everybody can pass the
already-pinned/already-setup vmbuffer by value.

--
Peter Geoghegan

Attachments:

v1-0001-Avoid-special-XIDs-in-snapshotConflictHorizon-val.patchapplication/octet-stream; name=v1-0001-Avoid-special-XIDs-in-snapshotConflictHorizon-val.patchDownload
From 89d80319b8e94512ef1670a15ced8d96f8d9dcb9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 25 Dec 2022 19:06:22 -0800
Subject: [PATCH v1 1/2] Avoid special XIDs in snapshotConflictHorizon values.

---
 src/backend/access/heap/vacuumlazy.c | 8 +++++---
 src/backend/storage/ipc/standby.c    | 1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9923994b5..5d8fd2fb7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1698,7 +1698,8 @@ retry:
 					}
 
 					/* Track newest xmin on page. */
-					if (TransactionIdFollows(xmin, prunestate->visibility_cutoff_xid))
+					if (TransactionIdFollows(xmin, prunestate->visibility_cutoff_xid) &&
+						TransactionIdIsNormal(xmin))
 						prunestate->visibility_cutoff_xid = xmin;
 				}
 				break;
@@ -1863,7 +1864,7 @@ retry:
 		 * because visibility_cutoff_xid will be logged by our caller in a
 		 * moment.
 		 */
-		Assert(cutoff == FrozenTransactionId ||
+		Assert(!TransactionIdIsValid(cutoff) ||
 			   cutoff == prunestate->visibility_cutoff_xid);
 	}
 #endif
@@ -3293,7 +3294,8 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
 					}
 
 					/* Track newest xmin on page. */
-					if (TransactionIdFollows(xmin, *visibility_cutoff_xid))
+					if (TransactionIdFollows(xmin, *visibility_cutoff_xid) &&
+						TransactionIdIsNormal(xmin))
 						*visibility_cutoff_xid = xmin;
 
 					/* Check whether this tuple is already frozen or not */
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index f43229dfd..ede00fee9 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -493,6 +493,7 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
 	if (!TransactionIdIsValid(snapshotConflictHorizon))
 		return;
 
+	Assert(TransactionIdIsNormal(snapshotConflictHorizon));
 	backends = GetConflictingVirtualXIDs(snapshotConflictHorizon,
 										 locator.dbOid);
 	ResolveRecoveryConflictWithVirtualXIDs(backends,
-- 
2.38.1

v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patchapplication/octet-stream; name=v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patchDownload
From 3c20233dfb16003101804c1ad911b904b2c24889 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 31 Dec 2022 15:13:01 -0800
Subject: [PATCH v1 2/2] Never just set the all-frozen bit in VM.

---
 src/backend/access/heap/vacuumlazy.c    | 68 ++++++++++++-------------
 src/backend/access/heap/visibilitymap.c |  5 ++
 2 files changed, 37 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 5d8fd2fb7..4cea237e0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -259,7 +259,7 @@ static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
 static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
-								  Buffer buffer, int index, Buffer *vmbuffer);
+								  Buffer buffer, int index, Buffer vmbuffer);
 static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
 static void lazy_cleanup_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -1038,7 +1038,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			{
 				Size		freespace;
 
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
 
 				/* Forget the LP_DEAD items that we just vacuumed */
 				dead_items->num_items = 0;
@@ -1163,11 +1163,13 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			/*
 			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
+			 * because setting the all-frozen bit doesn't need any recovery
+			 * conflicts (heap_freeze_execute_prepared does all we need).
+			 * Also make sure that the all-visible bit is still set.
 			 */
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
+							  VISIBILITYMAP_ALL_VISIBLE	|
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
 
@@ -2391,8 +2393,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 static void
 lazy_vacuum_heap_rel(LVRelState *vacrel)
 {
-	int			index;
-	BlockNumber vacuumed_pages;
+	int			index = 0;
+	BlockNumber vacuumed_pages = 0;
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
@@ -2409,31 +2411,36 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	vacuumed_pages = 0;
-
-	index = 0;
 	while (index < vacrel->dead_items->num_items)
 	{
-		BlockNumber tblk;
+		BlockNumber blkno;
 		Buffer		buf;
 		Page		page;
 		Size		freespace;
 
 		vacuum_delay_point();
 
-		tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
-		vacrel->blkno = tblk;
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+		blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+		vacrel->blkno = blkno;
+
+		/*
+		 * Pin the visibility map page in case we need to mark the page
+		 * all-visible.  In most cases this will be very cheap, because we'll
+		 * already have the correct page pinned anyway.
+		 */
+		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+
+		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
 								 vacrel->bstrategy);
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-		index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+		index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
 
 		/* Now that we've vacuumed the page, record its available space */
 		page = BufferGetPage(buf);
 		freespace = PageGetHeapFreeSpace(page);
 
 		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
+		RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 		vacuumed_pages++;
 	}
 
@@ -2468,7 +2475,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  *						  vacrel->dead_items array.
  *
  * Caller must have an exclusive buffer lock on the buffer (though a full
- * cleanup lock is also acceptable).
+ * cleanup lock is also acceptable).  vmbuffer must be valid and already have
+ * a pin on blkno's visibility map page.
  *
  * index is an offset into the vacrel->dead_items array for the first listed
  * LP_DEAD item on the page.  The return value is the first index immediately
@@ -2476,7 +2484,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
-					  int index, Buffer *vmbuffer)
+					  int index, Buffer vmbuffer)
 {
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Page		page = BufferGetPage(buffer);
@@ -2557,31 +2565,19 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	 * dirty, exclusively locked, and, if needed, a full page image has been
 	 * emitted.
 	 */
+	Assert(!PageIsAllVisible(page));
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
-		PageSetAllVisible(page);
-
-	/*
-	 * All the changes to the heap page have been done. If the all-visible
-	 * flag is now set, also set the VM all-visible bit (and, if possible, the
-	 * all-frozen bit) unless this has already been done previously.
-	 */
-	if (PageIsAllVisible(page))
 	{
-		uint8		flags = 0;
-		uint8		vm_status = visibilitymap_get_status(vacrel->rel,
-														 blkno, vmbuffer);
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
-		/* Set the VM all-frozen bit to flag, if needed */
-		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
-			flags |= VISIBILITYMAP_ALL_VISIBLE;
-		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+		if (all_frozen)
 			flags |= VISIBILITYMAP_ALL_FROZEN;
 
-		Assert(BufferIsValid(*vmbuffer));
-		if (flags != 0)
-			visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
-							  *vmbuffer, visibility_cutoff_xid, flags);
+		PageSetAllVisible(page);
+
+		visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
+						  vmbuffer, visibility_cutoff_xid, flags);
 	}
 
 	/* Revert to the previous phase information for error traceback */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 4ed70275e..28489919d 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -146,7 +146,9 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags
 	char	   *map;
 	bool		cleared = false;
 
+	/* Must never remove all_visible bit while leaving all_frozen bit set */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_VISIBLE);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
@@ -257,7 +259,10 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 
 	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
 	Assert(InRecovery || BufferIsValid(heapBuf));
+
+	/* Must never set all_frozen bit without also setting all_visible bit */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_FROZEN);
 
 	/* Check that we have the right heap page pinned, if present */
 	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
-- 
2.38.1

In reply to: Peter Geoghegan (#1)
1 attachment(s)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

The first patch makes sure that the snapshotConflictHorizon cutoff
(XID cutoff for recovery conflicts) is never a special XID, unless
that XID is InvalidTransactionId, which is interpreted as a
snapshotConflictHorizon value that will never need a recovery conflict
(per the general convention for snapshotConflictHorizon values
explained above ResolveRecoveryConflictWithSnapshot).

Pushed this just now.

Attached is another very simple refactoring patch for vacuumlazy.c. It
makes vacuumlazy.c save the result of get_database_name() in vacrel,
which matches what we already do with things like
get_namespace_name().

Would be helpful if I could get a +1 on
v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
somewhat more substantial than the others.

--
Peter Geoghegan

Attachments:

0001-Save-get_database_name-in-vacrel.patchapplication/octet-stream; name=0001-Save-get_database_name-in-vacrel.patchDownload
From 5e1a1b0048b531b0f03c205f38ae80528fd1df3f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 31 Dec 2022 22:23:02 -0800
Subject: [PATCH 1/2] Save get_database_name() in vacrel.

---
 src/backend/access/heap/vacuumlazy.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index e962b8d72..457fdacdb 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -172,6 +172,7 @@ typedef struct LVRelState
 	bool		skippedallvis;
 
 	/* Error reporting state */
+	char	   *dbname;
 	char	   *relnamespace;
 	char	   *relname;
 	char	   *indname;		/* Current index name */
@@ -354,6 +355,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * these temp copies.
 	 */
 	vacrel = (LVRelState *) palloc0(sizeof(LVRelState));
+	vacrel->dbname = get_database_name(MyDatabaseId);
 	vacrel->relnamespace = get_namespace_name(RelationGetNamespace(rel));
 	vacrel->relname = pstrdup(RelationGetRelationName(rel));
 	vacrel->indname = NULL;
@@ -475,13 +477,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 		if (vacrel->aggressive)
 			ereport(INFO,
 					(errmsg("aggressively vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
+							vacrel->dbname, vacrel->relnamespace,
+							vacrel->relname)));
 		else
 			ereport(INFO,
 					(errmsg("vacuuming \"%s.%s.%s\"",
-							get_database_name(MyDatabaseId),
-							vacrel->relnamespace, vacrel->relname)));
+							vacrel->dbname, vacrel->relnamespace,
+							vacrel->relname)));
 	}
 
 	/*
@@ -650,7 +652,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 					msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");
 			}
 			appendStringInfo(&buf, msgfmt,
-							 get_database_name(MyDatabaseId),
+							 vacrel->dbname,
 							 vacrel->relnamespace,
 							 vacrel->relname,
 							 vacrel->num_index_scans);
@@ -2620,9 +2622,7 @@ lazy_check_wraparound_failsafe(LVRelState *vacrel)
 
 		ereport(WARNING,
 				(errmsg("bypassing nonessential maintenance of table \"%s.%s.%s\" as a failsafe after %d index scans",
-						get_database_name(MyDatabaseId),
-						vacrel->relnamespace,
-						vacrel->relname,
+						vacrel->dbname, vacrel->relnamespace, vacrel->relname,
 						vacrel->num_index_scans),
 				 errdetail("The table's relfrozenxid or relminmxid is too far in the past."),
 				 errhint("Consider increasing configuration parameter \"maintenance_work_mem\" or \"autovacuum_work_mem\".\n"
-- 
2.38.1

In reply to: Peter Geoghegan (#2)
1 attachment(s)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 2, 2023 at 10:31 AM Peter Geoghegan <pg@bowt.ie> wrote:

Would be helpful if I could get a +1 on
v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
somewhat more substantial than the others.

There has been no response on this thread for over a full week at this
point. I'm CC'ing Robert now, since the bug is from his commit
a892234f83.

Attached revision of the "don't unset all-visible bit while unsetting
all-frozen bit" patch adds some assertions that verify that
visibility_cutoff_xid is InvalidTransactionId as expected when we go
to set any page all-frozen in the VM. It also broadens an existing
nearby test for corruption, which gives us some chance of detecting
and repairing corruption of this sort that might have slipped in in
the field.

My current plan is to commit something like this in another week or
so, barring any objections.

--
Peter Geoghegan

Attachments:

v2-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patchapplication/octet-stream; name=v2-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patchDownload
From 834c22730b8edc663a5c09e529c34c7ba66d44fb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 31 Dec 2022 15:13:01 -0800
Subject: [PATCH v2] Don't accidentally unset all-visible bit in VM.

One of the calls to visibilitymap_set() during VACUUM's initial heap
pass could unset a page's all-visible bit during the process of setting
the same page's all-frozen bit.  This could happen in the event of a
concurrent HOT update from a transaction that aborts soon after.  Since
the all_visible_according_to_vm local variable that lazy_scan_heap works
off of when setting the VM doesn't reflect the current state of the VM,
and since visibilitymap_set() just requested that the all-frozen bit get
set in one case, there was a race condition.  Heap pages could initially
be all-visible just as all_visible_according_to_vm is established, then
not be all-visible after the update, and then become eligible to be set
all-visible once more following pruning by VACUUM.  There is no reason
why VACUUM can't remove a concurrently aborted heap-only tuple right
away, and so no reason why such a page won't be able to reach the
relevant visibilitymap_set() call site.

To fix, tighten up the way that visibilitymap_set() is called: request
that both the all-visible and all-frozen bits get set whenever the
all-frozen bit is set, regardless of what we think we know about the
present state of the all-visible bit.  Also tighten up the defensive VM
checks are performed in passing so that corruption caused by this bug
can be detected and repaired by SKIP_PAGES_THRESHOLD vacuuming.

Oversight in commit a892234f83, which added the all-frozen bit to the VM
fork.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c    | 97 +++++++++++++------------
 src/backend/access/heap/visibilitymap.c |  9 ++-
 2 files changed, 55 insertions(+), 51 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da..853d1d76a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -260,7 +260,7 @@ static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
 static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
-								  Buffer buffer, int index, Buffer *vmbuffer);
+								  Buffer buffer, int index, Buffer vmbuffer);
 static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
 static void lazy_cleanup_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -1040,7 +1040,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			{
 				Size		freespace;
 
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
 
 				/* Forget the LP_DEAD items that we just vacuumed */
 				dead_items->num_items = 0;
@@ -1092,7 +1092,10 @@ lazy_scan_heap(LVRelState *vacrel)
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
 			if (prunestate.all_frozen)
+			{
+				Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 				flags |= VISIBILITYMAP_ALL_FROZEN;
+			}
 
 			/*
 			 * It should never be the case that the visibility map page is set
@@ -1120,8 +1123,8 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * got cleared after lazy_scan_skip() was called, so we must recheck
 		 * with buffer lock before concluding that the VM is corrupt.
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
+				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1165,11 +1168,14 @@ lazy_scan_heap(LVRelState *vacrel)
 		{
 			/*
 			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
+			 * because setting the all-frozen bit doesn't need any recovery
+			 * conflicts (heap_freeze_execute_prepared does all we need).
+			 * Also make sure that the all-visible bit is still set.
 			 */
+			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
+							  VISIBILITYMAP_ALL_VISIBLE	|
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
 
@@ -1818,7 +1824,11 @@ retry:
 			 * cutoff by stepping back from OldestXmin.
 			 */
 			if (prunestate->all_visible && prunestate->all_frozen)
+			{
+				/* Using same cutoff when setting VM is now unnecessary */
 				snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+				prunestate->visibility_cutoff_xid = InvalidTransactionId;
+			}
 			else
 			{
 				/* Avoids false conflicts when hot_standby_feedback in use */
@@ -2388,8 +2398,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 static void
 lazy_vacuum_heap_rel(LVRelState *vacrel)
 {
-	int			index;
-	BlockNumber vacuumed_pages;
+	int			index = 0;
+	BlockNumber vacuumed_pages = 0;
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
@@ -2406,42 +2416,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	vacuumed_pages = 0;
-
-	index = 0;
 	while (index < vacrel->dead_items->num_items)
 	{
-		BlockNumber tblk;
+		BlockNumber blkno;
 		Buffer		buf;
 		Page		page;
 		Size		freespace;
 
 		vacuum_delay_point();
 
-		tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
-		vacrel->blkno = tblk;
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+		blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+		vacrel->blkno = blkno;
+
+		/*
+		 * Pin the visibility map page in case we need to mark the page
+		 * all-visible.  In most cases this will be very cheap, because we'll
+		 * already have the correct page pinned anyway.
+		 */
+		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+
+		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
 								 vacrel->bstrategy);
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-		index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+		index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
 
 		/* Now that we've vacuumed the page, record its available space */
 		page = BufferGetPage(buf);
 		freespace = PageGetHeapFreeSpace(page);
 
 		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
+		RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 		vacuumed_pages++;
 	}
 
-	/* Clear the block number information */
 	vacrel->blkno = InvalidBlockNumber;
-
 	if (BufferIsValid(vmbuffer))
-	{
 		ReleaseBuffer(vmbuffer);
-		vmbuffer = InvalidBuffer;
-	}
 
 	/*
 	 * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
@@ -2465,7 +2475,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  *						  vacrel->dead_items array.
  *
  * Caller must have an exclusive buffer lock on the buffer (though a full
- * cleanup lock is also acceptable).
+ * cleanup lock is also acceptable).  vmbuffer must be valid and already have
+ * a pin on blkno's visibility map page.
  *
  * index is an offset into the vacrel->dead_items array for the first listed
  * LP_DEAD item on the page.  The return value is the first index immediately
@@ -2473,12 +2484,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
-					  int index, Buffer *vmbuffer)
+					  int index, Buffer vmbuffer)
 {
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Page		page = BufferGetPage(buffer);
 	OffsetNumber unused[MaxHeapTuplesPerPage];
-	int			uncnt = 0;
+	int			nunused = 0;
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
@@ -2508,10 +2519,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 
 		Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
 		ItemIdSetUnused(itemid);
-		unused[uncnt++] = toff;
+		unused[nunused++] = toff;
 	}
 
-	Assert(uncnt > 0);
+	Assert(nunused > 0);
 
 	/* Attempt to truncate line pointer array now */
 	PageTruncateLinePointerArray(page);
@@ -2527,13 +2538,13 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		xlrec.nunused = uncnt;
+		xlrec.nunused = nunused;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
 
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+		XLogRegisterBufData(0, (char *) unused, nunused * sizeof(OffsetNumber));
 
 		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
 
@@ -2554,31 +2565,21 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	 * dirty, exclusively locked, and, if needed, a full page image has been
 	 * emitted.
 	 */
+	Assert(!PageIsAllVisible(page));
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
-		PageSetAllVisible(page);
-
-	/*
-	 * All the changes to the heap page have been done. If the all-visible
-	 * flag is now set, also set the VM all-visible bit (and, if possible, the
-	 * all-frozen bit) unless this has already been done previously.
-	 */
-	if (PageIsAllVisible(page))
 	{
-		uint8		flags = 0;
-		uint8		vm_status = visibilitymap_get_status(vacrel->rel,
-														 blkno, vmbuffer);
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
-		/* Set the VM all-frozen bit to flag, if needed */
-		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
-			flags |= VISIBILITYMAP_ALL_VISIBLE;
-		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+		if (all_frozen)
+		{
+			Assert(!TransactionIdIsValid(visibility_cutoff_xid));
 			flags |= VISIBILITYMAP_ALL_FROZEN;
+		}
 
-		Assert(BufferIsValid(*vmbuffer));
-		if (flags != 0)
-			visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
-							  *vmbuffer, visibility_cutoff_xid, flags);
+		PageSetAllVisible(page);
+		visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
+						  vmbuffer, visibility_cutoff_xid, flags);
 	}
 
 	/* Revert to the previous phase information for error traceback */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1d1ca423a..74ff01bb1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -146,7 +146,9 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags
 	char	   *map;
 	bool		cleared = false;
 
+	/* Must never clear all_visible bit while leaving all_frozen bit set */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_VISIBLE);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
@@ -256,8 +258,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 #endif
 
 	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
-	Assert(InRecovery || BufferIsValid(heapBuf));
+	Assert(InRecovery || PageIsAllVisible((Page) BufferGetPage(heapBuf)));
+
+	/* Must never set all_frozen bit without also setting all_visible bit */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_FROZEN);
 
 	/* Check that we have the right heap page pinned, if present */
 	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -299,8 +304,6 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 				{
 					Page		heapPage = BufferGetPage(heapBuf);
 
-					/* caller is expected to set PD_ALL_VISIBLE first */
-					Assert(PageIsAllVisible(heapPage));
 					PageSetLSN(heapPage, recptr);
 				}
 			}
-- 
2.38.1

#4Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#3)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

Hi,

On 2023-01-08 14:39:19 -0800, Peter Geoghegan wrote:

One of the calls to visibilitymap_set() during VACUUM's initial heap
pass could unset a page's all-visible bit during the process of setting
the same page's all-frozen bit.

How? visibilitymap_set() just adds flags, it doesn't remove any already
existing bits:

map[mapByte] |= (flags << mapOffset);

It'll afaict lead to potentially unnecessary WAL records though, which does
seem buggy:
if (flags != (map[mapByte] >> mapOffset & VISIBILITYMAP_VALID_BITS))

here we check for *equivalence*, but then below we just or-in flags. So
visibilitymap_set() with just one of the flags bits set in the parameters,
but both set in the page would end up WAL logging unnecessarily.

@@ -2388,8 +2398,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
-	int			index;
-	BlockNumber vacuumed_pages;
+	int			index = 0;
+	BlockNumber vacuumed_pages = 0;
Buffer		vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;

@@ -2406,42 +2416,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);

- vacuumed_pages = 0;
-
- index = 0;

@@ -2473,12 +2484,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
*/
static int
lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+ int index, Buffer vmbuffer)
{
VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
- int uncnt = 0;
+ int nunused = 0;
TransactionId visibility_cutoff_xid;
bool all_frozen;
LVSavedErrInfo saved_err_info;
@@ -2508,10 +2519,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,

Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
ItemIdSetUnused(itemid);
- unused[uncnt++] = toff;
+ unused[nunused++] = toff;
}

-	Assert(uncnt > 0);
+	Assert(nunused > 0);

/* Attempt to truncate line pointer array now */
PageTruncateLinePointerArray(page);
@@ -2527,13 +2538,13 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
xl_heap_vacuum xlrec;
XLogRecPtr recptr;

-		xlrec.nunused = uncnt;
+		xlrec.nunused = nunused;

XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);

XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+		XLogRegisterBufData(0, (char *) unused, nunused * sizeof(OffsetNumber));

recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);

You have plenty of changes like this, which are afaict entirely unrelated to
the issue the commit is fixing, in here. It just makes it hard to review the
patch.

Greetings,

Andres Freund

In reply to: Andres Freund (#4)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Sun, Jan 8, 2023 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:

How?

See the commit message for the scenario I have in mind, which involves
a concurrent HOT update that aborts.

We're vulnerable to allowing "all-frozen but not all-visible"
inconsistencies because of two factors: this business with not passing
VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
visibilitymap_set(), *and* the use of all_visible_according_to_vm to
set the VM (a local variable that can go stale). We sort of assume
that all_visible_according_to_vm cannot go stale here without our
detecting it. That's almost always the case, but it's not quite
guaranteed.

visibilitymap_set() just adds flags, it doesn't remove any already
existing bits:

I know. The concrete scenario I have in mind is very subtle (if the
problem was this obvious I'm sure somebody would have noticed it by
now, since we do hit this visibilitymap_set() call site reasonably
often). A concurrent HOT update will certainly clear all the bits for
us, which is enough.

You have plenty of changes like this, which are afaict entirely unrelated to
the issue the commit is fixing, in here. It just makes it hard to review the
patch.

I didn't think that it was that big of a deal to tweak the style of
one or two details in and around lazy_vacuum_heap_rel() in passing,
for consistency with lazy_scan_heap(), since the patch already needs
to do some of that. I do take your point, though.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#5)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Sun, Jan 8, 2023 at 4:27 PM Peter Geoghegan <pg@bowt.ie> wrote:

We're vulnerable to allowing "all-frozen but not all-visible"
inconsistencies because of two factors: this business with not passing
VISIBILITYMAP_ALL_VISIBLE along with VISIBILITYMAP_ALL_FROZEN to
visibilitymap_set(), *and* the use of all_visible_according_to_vm to
set the VM (a local variable that can go stale). We sort of assume
that all_visible_according_to_vm cannot go stale here without our
detecting it. That's almost always the case, but it's not quite
guaranteed.

On further reflection even v2 won't repair the page-level
PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we
might actually leave the all-frozen bit set in the VM, while both the
all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset.
Again, all due to the approach we take with
all_visible_according_to_vm, which can go stale independently of both
the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my
example problem scenario).

FWIW I don't have this remaining problem in my VACUUM
freezing/scanning strategies patchset. It just gets rid of
all_visible_according_to_vm altogether, which makes things a lot
simpler at the point that we set VM bits at the end of lazy_scan_heap
-- there is nothing left that can become stale. Quite a lot of the
code is just removed; there is exactly one call to visibilitymap_set()
at the end of lazy_scan_heap with the patchset, that does everything
we need.

The patchset also has logic for setting PD_ALL_VISIBLE when it needs
to be set, which isn't (and shouldn't) be conditioned on whether we're
doing a "all visible -> all frozen " transition or a "neither -> all
visible" transition. What it actually needs to be conditioned on is
whether it's unset now, and so needs to be set in passing, as part of
setting one or both VM bits -- simple as that.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#6)
1 attachment(s)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Sun, Jan 8, 2023 at 6:43 PM Peter Geoghegan <pg@bowt.ie> wrote:

On further reflection even v2 won't repair the page-level
PD_ALL_VISIBLE flag in passing in this scenario. ISTM that on HEAD we
might actually leave the all-frozen bit set in the VM, while both the
all-visible bit and the page-level PD_ALL_VISIBLE bit remain unset.
Again, all due to the approach we take with
all_visible_according_to_vm, which can go stale independently of both
the VM bit being unset and the PD_ALL_VISIBLE bit being unset (in my
example problem scenario).

Attached is v3, which explicitly checks the need to set the
PD_ALL_VISIBLE flag at the relevant visibilitymap_set() call site. It
also has improved comments.

--
Peter Geoghegan

Attachments:

v3-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patchapplication/octet-stream; name=v3-0001-Don-t-accidentally-unset-all-visible-bit-in-VM.patchDownload
From e7788ebdb589fb7c6f866cf53658cc369f9858b5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 31 Dec 2022 15:13:01 -0800
Subject: [PATCH v3] Don't accidentally unset all-visible bit in VM.

One of the calls to visibilitymap_set() during VACUUM's initial heap
pass could unset a page's all-visible bit during the process of setting
the same page's all-frozen bit.  This could happen in the event of a
concurrent HOT update from a transaction that aborts soon after.  Since
the all_visible_according_to_vm local variable that lazy_scan_heap works
off of when setting the VM doesn't reflect the current state of the VM,
and since visibilitymap_set() just requested that the all-frozen bit get
set in one case, there was a race condition.  Heap pages could initially
be all-visible just as all_visible_according_to_vm is established, then
not be all-visible after the update, and then become eligible to be set
all-visible once more following pruning by VACUUM.  There is no reason
why VACUUM can't remove a concurrently aborted heap-only tuple right
away, and so no reason why such a page won't be able to reach the
relevant visibilitymap_set() call site.

To fix, tighten up the way that visibilitymap_set() is called: request
that both the all-visible and all-frozen bits get set whenever the
all-frozen bit is set, regardless of what we think we know about the
present state of the all-visible bit.  Also tighten up the defensive VM
checks that get performed in passing, so that any corruption caused by
this bug can be detected and repaired by SKIP_PAGES_THRESHOLD vacuuming.

Oversight in commit a892234f83, which added the all-frozen bit to the VM
fork.

Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c    | 122 ++++++++++++++----------
 src/backend/access/heap/visibilitymap.c |   9 +-
 2 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da..b7c184ab2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -260,7 +260,7 @@ static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
 static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
-								  Buffer buffer, int index, Buffer *vmbuffer);
+								  Buffer buffer, int index, Buffer vmbuffer);
 static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
 static void lazy_cleanup_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -1040,7 +1040,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			{
 				Size		freespace;
 
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
 
 				/* Forget the LP_DEAD items that we just vacuumed */
 				dead_items->num_items = 0;
@@ -1092,7 +1092,10 @@ lazy_scan_heap(LVRelState *vacrel)
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
 			if (prunestate.all_frozen)
+			{
+				Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 				flags |= VISIBILITYMAP_ALL_FROZEN;
+			}
 
 			/*
 			 * It should never be the case that the visibility map page is set
@@ -1120,8 +1123,8 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * got cleared after lazy_scan_skip() was called, so we must recheck
 		 * with buffer lock before concluding that the VM is corrupt.
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
+				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1164,12 +1167,34 @@ lazy_scan_heap(LVRelState *vacrel)
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
 			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
+			 * Avoid relying on all_visible_according_to_vm as a proxy for the
+			 * page-level PD_ALL_VISIBLE bit being set, since it might have
+			 * become stale -- even when all_visible is set in prunestate.
+			 *
+			 * Consider the example of a page that starts out all-visible and
+			 * then has a tuple concurrently deleted by an xact that aborts.
+			 * The page will be all_visible_according_to_vm, and will have
+			 * all_visible set in prunestate.  It will nevertheless not have
+			 * PD_ALL_VISIBLE set by here (plus neither VM bit will be set).
+			 * And so we must check if PD_ALL_VISIBLE needs to be set.
 			 */
+			if (!PageIsAllVisible(page))
+			{
+				PageSetAllVisible(page);
+				MarkBufferDirty(buf);
+			}
+
+			/*
+			 * Set the page all-frozen (and all-visible) in the VM.
+			 *
+			 * We can pass InvalidTransactionId as our visibility_cutoff_xid,
+			 * since a snapshotConflictHorizon sufficient to make everything
+			 * safe for REDO was logged when the page's tuples were frozen.
+			 */
+			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
+							  VISIBILITYMAP_ALL_VISIBLE	|
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
 
@@ -1311,7 +1336,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		if (!vacrel->skipwithvm)
+		{
+			/* Caller shouldn't rely on all_visible_according_to_vm */
+			*next_unskippable_allvis = false;
 			break;
+		}
 
 		/*
 		 * Aggressive VACUUM caller can't skip pages just because they are
@@ -1818,7 +1847,11 @@ retry:
 			 * cutoff by stepping back from OldestXmin.
 			 */
 			if (prunestate->all_visible && prunestate->all_frozen)
+			{
+				/* Using same cutoff when setting VM is now unnecessary */
 				snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+				prunestate->visibility_cutoff_xid = InvalidTransactionId;
+			}
 			else
 			{
 				/* Avoids false conflicts when hot_standby_feedback in use */
@@ -2388,8 +2421,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 static void
 lazy_vacuum_heap_rel(LVRelState *vacrel)
 {
-	int			index;
-	BlockNumber vacuumed_pages;
+	int			index = 0;
+	BlockNumber vacuumed_pages = 0;
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
@@ -2406,42 +2439,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	vacuumed_pages = 0;
-
-	index = 0;
 	while (index < vacrel->dead_items->num_items)
 	{
-		BlockNumber tblk;
+		BlockNumber blkno;
 		Buffer		buf;
 		Page		page;
 		Size		freespace;
 
 		vacuum_delay_point();
 
-		tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
-		vacrel->blkno = tblk;
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+		blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+		vacrel->blkno = blkno;
+
+		/*
+		 * Pin the visibility map page in case we need to mark the page
+		 * all-visible.  In most cases this will be very cheap, because we'll
+		 * already have the correct page pinned anyway.
+		 */
+		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+
+		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
 								 vacrel->bstrategy);
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-		index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+		index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
 
 		/* Now that we've vacuumed the page, record its available space */
 		page = BufferGetPage(buf);
 		freespace = PageGetHeapFreeSpace(page);
 
 		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
+		RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 		vacuumed_pages++;
 	}
 
-	/* Clear the block number information */
 	vacrel->blkno = InvalidBlockNumber;
-
 	if (BufferIsValid(vmbuffer))
-	{
 		ReleaseBuffer(vmbuffer);
-		vmbuffer = InvalidBuffer;
-	}
 
 	/*
 	 * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
@@ -2465,7 +2498,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  *						  vacrel->dead_items array.
  *
  * Caller must have an exclusive buffer lock on the buffer (though a full
- * cleanup lock is also acceptable).
+ * cleanup lock is also acceptable).  vmbuffer must be valid and already have
+ * a pin on blkno's visibility map page.
  *
  * index is an offset into the vacrel->dead_items array for the first listed
  * LP_DEAD item on the page.  The return value is the first index immediately
@@ -2473,12 +2507,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
-					  int index, Buffer *vmbuffer)
+					  int index, Buffer vmbuffer)
 {
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Page		page = BufferGetPage(buffer);
 	OffsetNumber unused[MaxHeapTuplesPerPage];
-	int			uncnt = 0;
+	int			nunused = 0;
 	TransactionId visibility_cutoff_xid;
 	bool		all_frozen;
 	LVSavedErrInfo saved_err_info;
@@ -2508,10 +2542,10 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 
 		Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
 		ItemIdSetUnused(itemid);
-		unused[uncnt++] = toff;
+		unused[nunused++] = toff;
 	}
 
-	Assert(uncnt > 0);
+	Assert(nunused > 0);
 
 	/* Attempt to truncate line pointer array now */
 	PageTruncateLinePointerArray(page);
@@ -2527,13 +2561,13 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 		xl_heap_vacuum xlrec;
 		XLogRecPtr	recptr;
 
-		xlrec.nunused = uncnt;
+		xlrec.nunused = nunused;
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfHeapVacuum);
 
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-		XLogRegisterBufData(0, (char *) unused, uncnt * sizeof(OffsetNumber));
+		XLogRegisterBufData(0, (char *) unused, nunused * sizeof(OffsetNumber));
 
 		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VACUUM);
 
@@ -2554,31 +2588,21 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	 * dirty, exclusively locked, and, if needed, a full page image has been
 	 * emitted.
 	 */
+	Assert(!PageIsAllVisible(page));
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
-		PageSetAllVisible(page);
-
-	/*
-	 * All the changes to the heap page have been done. If the all-visible
-	 * flag is now set, also set the VM all-visible bit (and, if possible, the
-	 * all-frozen bit) unless this has already been done previously.
-	 */
-	if (PageIsAllVisible(page))
 	{
-		uint8		flags = 0;
-		uint8		vm_status = visibilitymap_get_status(vacrel->rel,
-														 blkno, vmbuffer);
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
-		/* Set the VM all-frozen bit to flag, if needed */
-		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
-			flags |= VISIBILITYMAP_ALL_VISIBLE;
-		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+		if (all_frozen)
+		{
+			Assert(!TransactionIdIsValid(visibility_cutoff_xid));
 			flags |= VISIBILITYMAP_ALL_FROZEN;
+		}
 
-		Assert(BufferIsValid(*vmbuffer));
-		if (flags != 0)
-			visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
-							  *vmbuffer, visibility_cutoff_xid, flags);
+		PageSetAllVisible(page);
+		visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
+						  vmbuffer, visibility_cutoff_xid, flags);
 	}
 
 	/* Revert to the previous phase information for error traceback */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1d1ca423a..74ff01bb1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -146,7 +146,9 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags
 	char	   *map;
 	bool		cleared = false;
 
+	/* Must never clear all_visible bit while leaving all_frozen bit set */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_VISIBLE);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
@@ -256,8 +258,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 #endif
 
 	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
-	Assert(InRecovery || BufferIsValid(heapBuf));
+	Assert(InRecovery || PageIsAllVisible((Page) BufferGetPage(heapBuf)));
+
+	/* Must never set all_frozen bit without also setting all_visible bit */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_FROZEN);
 
 	/* Check that we have the right heap page pinned, if present */
 	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -299,8 +304,6 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 				{
 					Page		heapPage = BufferGetPage(heapBuf);
 
-					/* caller is expected to set PD_ALL_VISIBLE first */
-					Assert(PageIsAllVisible(heapPage));
 					PageSetLSN(heapPage, recptr);
 				}
 			}
-- 
2.38.1

#8Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#5)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

Hi,

On 2023-01-08 16:27:59 -0800, Peter Geoghegan wrote:

On Sun, Jan 8, 2023 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:

How?

See the commit message for the scenario I have in mind, which involves
a concurrent HOT update that aborts.

I looked at it. I specifically was wondering about this part of it:

One of the calls to visibilitymap_set() during VACUUM's initial heap
pass could unset a page's all-visible bit during the process of setting
the same page's all-frozen bit.

Which I just don't see as possible, due to visibilitymap_set() simply never
unsetting bits.

I think that's just an imprecise formulation though - the problem is that we
can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though
VISIBILITYMAP_ALL_VISIBLE was concurrently unset.

ISTM that we ought to update all_visible_according_to_vm from
PageIsAllVisible() once we've locked the page. Even if we avoid this specific
case, it seems a recipe for future bugs to have a potentially outdated
variable around.

Greetings,

Andres Freund

#9Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#7)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

Hi,

On 2023-01-09 10:16:03 -0800, Peter Geoghegan wrote:

Attached is v3, which explicitly checks the need to set the PD_ALL_VISIBLE
flag at the relevant visibilitymap_set() call site. It also has improved
comments.

Afaict we'll need to backpatch this all the way?

From e7788ebdb589fb7c6f866cf53658cc369f9858b5 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 31 Dec 2022 15:13:01 -0800
Subject: [PATCH v3] Don't accidentally unset all-visible bit in VM.

One of the calls to visibilitymap_set() during VACUUM's initial heap
pass could unset a page's all-visible bit during the process of setting
the same page's all-frozen bit.

As just mentioned upthread, this just seems wrong.

This could happen in the event of a
concurrent HOT update from a transaction that aborts soon after. Since
the all_visible_according_to_vm local variable that lazy_scan_heap works
off of when setting the VM doesn't reflect the current state of the VM,
and since visibilitymap_set() just requested that the all-frozen bit get
set in one case, there was a race condition. Heap pages could initially
be all-visible just as all_visible_according_to_vm is established, then
not be all-visible after the update, and then become eligible to be set
all-visible once more following pruning by VACUUM. There is no reason
why VACUUM can't remove a concurrently aborted heap-only tuple right
away, and so no reason why such a page won't be able to reach the
relevant visibilitymap_set() call site.

Do you have a reproducer for this?

@@ -1120,8 +1123,8 @@ lazy_scan_heap(LVRelState *vacrel)
* got cleared after lazy_scan_skip() was called, so we must recheck
* with buffer lock before concluding that the VM is corrupt.
*/
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
+				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
{
elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
vacrel->relname, blkno);

Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it
might be useful to know what bit was wrong when debugging problems.

@@ -1164,12 +1167,34 @@ lazy_scan_heap(LVRelState *vacrel)
!VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
{
/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
+			 * Avoid relying on all_visible_according_to_vm as a proxy for the
+			 * page-level PD_ALL_VISIBLE bit being set, since it might have
+			 * become stale -- even when all_visible is set in prunestate.
+			 *
+			 * Consider the example of a page that starts out all-visible and
+			 * then has a tuple concurrently deleted by an xact that aborts.
+			 * The page will be all_visible_according_to_vm, and will have
+			 * all_visible set in prunestate.  It will nevertheless not have
+			 * PD_ALL_VISIBLE set by here (plus neither VM bit will be set).
+			 * And so we must check if PD_ALL_VISIBLE needs to be set.
*/
+			if (!PageIsAllVisible(page))
+			{
+				PageSetAllVisible(page);
+				MarkBufferDirty(buf);
+			}
+
+			/*
+			 * Set the page all-frozen (and all-visible) in the VM.
+			 *
+			 * We can pass InvalidTransactionId as our visibility_cutoff_xid,
+			 * since a snapshotConflictHorizon sufficient to make everything
+			 * safe for REDO was logged when the page's tuples were frozen.
+			 */
+			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
vmbuffer, InvalidTransactionId,
+							  VISIBILITYMAP_ALL_VISIBLE	|
VISIBILITYMAP_ALL_FROZEN);
}

@@ -1311,7 +1336,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,

/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
if (!vacrel->skipwithvm)
+		{
+			/* Caller shouldn't rely on all_visible_according_to_vm */
+			*next_unskippable_allvis = false;
break;
+		}
/*
* Aggressive VACUUM caller can't skip pages just because they are
@@ -1818,7 +1847,11 @@ retry:
* cutoff by stepping back from OldestXmin.
*/
if (prunestate->all_visible && prunestate->all_frozen)
+			{
+				/* Using same cutoff when setting VM is now unnecessary */
snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+				prunestate->visibility_cutoff_xid = InvalidTransactionId;
+			}
else
{
/* Avoids false conflicts when hot_standby_feedback in use */
@@ -2388,8 +2421,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
-	int			index;
-	BlockNumber vacuumed_pages;
+	int			index = 0;
+	BlockNumber vacuumed_pages = 0;
Buffer		vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;

@@ -2406,42 +2439,42 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);

-	vacuumed_pages = 0;
-
-	index = 0;
while (index < vacrel->dead_items->num_items)
{
-		BlockNumber tblk;
+		BlockNumber blkno;
Buffer		buf;
Page		page;
Size		freespace;

vacuum_delay_point();

I still think such changes are inappropriate for a bugfix, particularly one
that needs to be backpatched.

Greetings,

Andres Freund

In reply to: Andres Freund (#8)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 9, 2023 at 11:44 AM Andres Freund <andres@anarazel.de> wrote:

I think that's just an imprecise formulation though - the problem is that we
can call visibilitymap_set() with just VISIBILITYMAP_ALL_FROZEN, even though
VISIBILITYMAP_ALL_VISIBLE was concurrently unset.

That's correct.

You're right that my description of the problem from the commit
message was confusing. But we're on the same page about the problem
now.

ISTM that we ought to update all_visible_according_to_vm from
PageIsAllVisible() once we've locked the page. Even if we avoid this specific
case, it seems a recipe for future bugs to have a potentially outdated
variable around.

I basically agree, but some of the details are tricky.

As I mentioned already, my work on visibility map snapshots just gets
rid of all_visible_according_to_vm, which is my preferred long term
approach. We will very likely need to keep all_visible_according_to_vm
as a cache for performance reasons for as long as we have
SKIP_PAGES_THRESHOLD.

Can we just update all_visible_according_to_vm using
PageIsAllVisible(), without making all_visible_according_to_vm
significantly less useful as a cache? Maybe. Not sure offhand.

--
Peter Geoghegan

#11Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 2, 2023 at 1:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Dec 31, 2022 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

The first patch makes sure that the snapshotConflictHorizon cutoff
(XID cutoff for recovery conflicts) is never a special XID, unless
that XID is InvalidTransactionId, which is interpreted as a
snapshotConflictHorizon value that will never need a recovery conflict
(per the general convention for snapshotConflictHorizon values
explained above ResolveRecoveryConflictWithSnapshot).

Pushed this just now.

Attached is another very simple refactoring patch for vacuumlazy.c. It
makes vacuumlazy.c save the result of get_database_name() in vacrel,
which matches what we already do with things like
get_namespace_name().

Would be helpful if I could get a +1 on
v1-0002-Never-just-set-the-all-frozen-bit-in-VM.patch, which is
somewhat more substantial than the others.

I feel that you should at least have a reproducer for these problems
posted to the thread, and ideally a regression test, before committing
things. I think it's very hard to understand what the problems are
right now.

I don't particularly have a problem with the idea of 0001, because if
we use InvalidTransactionId to mean that there cannot be any
conflicts, we do not need FrozenTransactionId to mean the same thing.
Picking one or the other makes sense. Perhaps we would need two values
if we both needed a value that meant "conflict with nothing" and also
a value that meant "conflict with everything," but in that case I
suppose we would want FrozenTransactionId to be the one that meant
conflict with nothing, since it logically precedes all other XIDs, and
conflicts are with XIDs that precede the value in the record. However,
I don't find the patch very clear, either. It doesn't update any
comments, not even this one:

         /*
          * It's possible that we froze tuples and made the page's XID cutoff
          * (for recovery conflict purposes) FrozenTransactionId.  This is okay
          * because visibility_cutoff_xid will be logged by our caller in a
          * moment.
          */
-        Assert(cutoff == FrozenTransactionId ||
+        Assert(!TransactionIdIsValid(cutoff) ||
                cutoff == prunestate->visibility_cutoff_xid);

Isn't the comment now incorrect as a direct result of the changes in the patch?

As for 0002, I agree that it's bad if we can get into a state where
the all-frozen bit is set and the all-visible bit is not. I'm not
completely sure what concrete harm that will cause, but it does not
seem good. But I also *entirely* agree with Andres that patches should
run around adjusting nearby code - e.g. variable names - in ways that
aren't truly necessary. That just makes life harder, not only for
anyone who wants to review the patch now, but also for future readers
who may need to understand what the patch changed and why.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Andres Freund (#9)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 9, 2023 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:

Afaict we'll need to backpatch this all the way?

I thought that we probably wouldn't need to, at first. But I now think
that we really have to.

I didn't realize that affected visibilitymap_set() calls could
generate useless set-VM WAL records until you pointed it out. That's
far more likely to happen than the race condition that I described --
it has nothing at all to do with concurrency. That's what clinches it
for me.

One of the calls to visibilitymap_set() during VACUUM's initial heap
pass could unset a page's all-visible bit during the process of setting
the same page's all-frozen bit.

As just mentioned upthread, this just seems wrong.

I don't know why this sentence ever made sense to me. Anyway, it's not
important now.

Do you have a reproducer for this?

No, but I'm quite certain that the race can happen.

If it's important to have a reproducer then I can probably come up
with one. I could likely figure out a way to write an isolation test
that reliably triggers the issue. It would have to work by playing
games with cleanup lock/buffer pin waits, since that's the only thing
that the test can hook into to make things happen in just the
right/wrong order.

elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
vacrel->relname, blkno);

Hm. The message gets a bit less accurate with the change. Perhaps OK? OTOH, it
might be useful to know what bit was wrong when debugging problems.

Theoretically it might change again, if we call
visibilitymap_get_status() again. Maybe I should just broaden the
error message a bit instead?

I still think such changes are inappropriate for a bugfix, particularly one
that needs to be backpatched.

I'll remove the changes that are inessential in the next revision. I
wouldn't have done it if I'd fully understood the seriousness of the
issue from the start.

If you're really concerned about diff size then I should point out
that the changes to lazy_vacuum_heap_rel() aren't strictly necessary,
and probably shouldn't be backpatched. I deemed that in scope because
it's part of the same overall problem of updating the visibility map
based on potentially stale information. It makes zero sense to check
with the visibility map before updating it when we already know that
the page is all-visible. I mean, are we trying to avoid the work of
needlessly updating the visibility map in cases where its state was
corrupt, but then became uncorrupt (relative to the heap page) by
mistake?

--
Peter Geoghegan

In reply to: Robert Haas (#11)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 9, 2023 at 12:51 PM Robert Haas <robertmhaas@gmail.com> wrote:

I feel that you should at least have a reproducer for these problems
posted to the thread, and ideally a regression test, before committing
things. I think it's very hard to understand what the problems are
right now.

Hard to understand relative to what, exactly? We're talking about a
very subtle race condition here.

I'll try to come up with a reproducer, but I *utterly* reject your
assertion that it's a hard requirement, sight unseen. Why should those
be the parameters of the discussion?

For one thing I'm quite confident that I'm right, with or without a
reproducer. And my argument isn't all that hard to follow, if you have
relevant expertise, and actually take the time. But even this is
unlikely to matter much. Even if I somehow turn out to have been
completely wrong about the race condition, it is still self-evident
that the problem of uselessly WAL logging non-changes to the VM
exists. That doesn't require any concurrent access at all. It's a
natural consequence of calling visibilitymap_set() with
VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
for 2 minutes to see it.

I don't particularly have a problem with the idea of 0001, because if
we use InvalidTransactionId to mean that there cannot be any
conflicts, we do not need FrozenTransactionId to mean the same thing.
Picking one or the other makes sense.

We've already picked one, many years ago -- InvalidTransactionId. This
is a long established convention, common to all REDO routines that are
capable of creating granular conflicts.

I already committed 0001 over a week ago. We were calling
ResolveRecoveryConflictWithSnapshot with FrozenTransactionId arguments
before now, which was 100% guaranteed to be a waste of cycles. I saw
no need to wait more than a few days for a +1, given that this
particular issue was so completely clear cut.

--
Peter Geoghegan

In reply to: Peter Geoghegan (#12)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

I didn't realize that affected visibilitymap_set() calls could
generate useless set-VM WAL records until you pointed it out. That's
far more likely to happen than the race condition that I described --
it has nothing at all to do with concurrency. That's what clinches it
for me.

I didn't spend as much time on this as I'd like to so far, but I think
that this concern about visibilitymap_set() actually turns out to not
apply. The visibilitymap_set() call in question is gated by a
"!VM_ALL_FROZEN()", which is enough to avoid the problem with writing
useless VM set records.

That doesn't make me doubt my original concern about races where the
all-frozen bit can be set, without setting the all-visible bit, and
without accounting for the fact that it changed underneath us. That
scenario will have !VM_ALL_FROZEN(), so that won't save us. (And we
won't test VM_ALL_VISIBLE() or PD_ALL_VISIBLE in a way that is
sufficient to realize that all_visible_according_to_vm is stale.
prunestate.all_visible being set doesn't reliably indicate that it's not stale,
but lazy_scan_heap incorrectly believes that it really does work that way.)

--
Peter Geoghegan

#15Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#13)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 9, 2023 at 5:59 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jan 9, 2023 at 12:51 PM Robert Haas <robertmhaas@gmail.com> wrote:

I feel that you should at least have a reproducer for these problems
posted to the thread, and ideally a regression test, before committing
things. I think it's very hard to understand what the problems are
right now.

Hard to understand relative to what, exactly? We're talking about a
very subtle race condition here.

I'll try to come up with a reproducer, but I *utterly* reject your
assertion that it's a hard requirement, sight unseen. Why should those
be the parameters of the discussion?

For one thing I'm quite confident that I'm right, with or without a
reproducer. And my argument isn't all that hard to follow, if you have
relevant expertise, and actually take the time.

Look, I don't want to spend time arguing about what seem to me to be
basic principles of good software engineering. When I don't put test
cases into my patches, people complain at me and tell me that I'm a
bad software engineer because I didn't include test cases. Your
argument here seems to be that you're such a good software engineer
that you don't need any test cases to know what the bug is or that
you've fixed it correctly. That seems like a surprising argument, but
even if it's true, test cases can have considerable value to future
code authors, because it allows them to avoid reintroducing bugs that
have previously been fixed. In my opinion, it's not worth trying to
have automated test cases for absolutely every bug we fix, because
many of them would be really hard to develop and executing all of them
every time we do anything would be unduly time-consuming. But I can't
remember the last time before this that someone wanted to commit a
patch for a data corruption issue without even providing a test case
that other people can run manually. If you think that is or ought to
be standard practice, I can only say that I disagree.

I don't particularly appreciate the implication that I either lack
relevant or expertise or don't actually take time, either. I spent an
hour yesterday looking at your patches yesterday and didn't feel I was
very close to understanding 0002 in that time. I feel that if the
patches were better-written, with relevant comments and test cases and
really good commit messages and a lack of extraneous changes, I
believe I probably would have gotten a lot further in the same amount
of time. There is certainly an alternate explanation, which is that I
am stupid. I'm inclined to think that's not the correct explanation,
but most stupid people believe that they aren't, so that doesn't
really prove anything.

But even this is
unlikely to matter much. Even if I somehow turn out to have been
completely wrong about the race condition, it is still self-evident
that the problem of uselessly WAL logging non-changes to the VM
exists. That doesn't require any concurrent access at all. It's a
natural consequence of calling visibilitymap_set() with
VISIBILITYMAP_ALL_FROZEN-only flags. You need only look at the code
for 2 minutes to see it.

Apparently not, because I spent more time than that.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#15)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Tue, Jan 10, 2023 at 10:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

Look, I don't want to spend time arguing about what seem to me to be
basic principles of good software engineering. When I don't put test
cases into my patches, people complain at me and tell me that I'm a
bad software engineer because I didn't include test cases. Your
argument here seems to be that you're such a good software engineer
that you don't need any test cases to know what the bug is or that
you've fixed it correctly.

That's not what I said. This is a straw man.

What I actually said was that there is no reason to declare up front
that the only circumstances under which a fix could be committed is
when a clean repro is available. I never said that a test case has
little or no value, and I certainly didn't assert that we definitely
don't need a test case to proceed with a commit -- since I am not in
the habit of presumptuously attaching conditions to such things well
in advance.

I don't particularly appreciate the implication that I either lack
relevant or expertise or don't actually take time, either.

The implication was only that you didn't take the time. Clearly you
have the expertise. Obviously you're very far from stupid.

I have been unable to reproduce the problem, and think it's possible
that the issue cannot be triggered in practice. Though only through
sheer luck. Here's why that is:

While pruning will remove aborted dead tuples, freezing will not
remove the xmax of an aborted update unless the XID happens to be <
OldestXmin. With my problem scenario, the page will be all_visible in
prunestate, but not all_frozen -- so it dodges the relevant
visibilitymap_set() call site.

That just leaves inserts that abort, I think. An aborted insert will
be totally undone by pruning, but that does still leave behind an
LP_DEAD item that needs to be vacuumed in the second heap pass. This
means that we can only set the page all-visible/all-frozen in the VM
in the second heap pass, which also dodges the relevant
visibilitymap_set() call site.

In summary, I think that there is currently no way that we can have
the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
the page all_frozen. It can happen and leave the page all_visible, but
not all_frozen, due to these very fine details. (Assuming I haven't
missed another path to the problem with aborted Multis or something,
but looks like I haven't.)

--
Peter Geoghegan

In reply to: Peter Geoghegan (#16)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Tue, Jan 10, 2023 at 11:47 AM Peter Geoghegan <pg@bowt.ie> wrote:

In summary, I think that there is currently no way that we can have
the VM (or the PD_ALL_VISIBLE flag) concurrently unset, while leaving
the page all_frozen. It can happen and leave the page all_visible, but
not all_frozen, due to these very fine details. (Assuming I haven't
missed another path to the problem with aborted Multis or something,
but looks like I haven't.)

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
possible, at least when no individual member is still running. Doesn't
have to involve transaction aborts at all.

Let me go try to break it that way...

--
Peter Geoghegan

#18Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#16)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Tue, Jan 10, 2023 at 2:48 PM Peter Geoghegan <pg@bowt.ie> wrote:

What I actually said was that there is no reason to declare up front
that the only circumstances under which a fix could be committed is
when a clean repro is available. I never said that a test case has
little or no value, and I certainly didn't assert that we definitely
don't need a test case to proceed with a commit -- since I am not in
the habit of presumptuously attaching conditions to such things well
in advance.

I don't understand what distinction you're making. It seems like
hair-splitting to me. We should be able to reproduce problems like
this reliably, at least with the aid of a debugger and some
breakpoints, before we go changing the code. The risk of being wrong
is quite high because the code is subtle, and the consequences of
being wrong are potentially very bad because the code is critical to
data integrity. If the reproducer doesn't require a debugger or other
extreme contortions, then we should consider reducing it to a TAP test
that can be committed. If you agree with that, then I'm not sure what
your last email was complaining about. If you disagree, then I don't
know why.

I have been unable to reproduce the problem, and think it's possible
that the issue cannot be triggered in practice. Though only through
sheer luck. Here's why that is:

While pruning will remove aborted dead tuples, freezing will not
remove the xmax of an aborted update unless the XID happens to be <
OldestXmin. With my problem scenario, the page will be all_visible in
prunestate, but not all_frozen -- so it dodges the relevant
visibilitymap_set() call site.

That just leaves inserts that abort, I think. An aborted insert will
be totally undone by pruning, but that does still leave behind an
LP_DEAD item that needs to be vacuumed in the second heap pass. This
means that we can only set the page all-visible/all-frozen in the VM
in the second heap pass, which also dodges the relevant
visibilitymap_set() call site.

I guess I'm not very sure that this is sheer luck. It seems like we
could equally well suppose that the people who wrote the code
correctly understood the circumstances under which we needed to avoid
calling visibilitymap_set(), and wrote the code in a way that
accomplished that purpose. Maybe there's contrary evidence or maybe it
is actually broken somehow, but that's not currently clear to me.

For the purposes of clarifying my understanding, is this the code
you're principally worried about?

/*
* If the all-visible page is all-frozen but not marked as such yet,
* mark it as all-frozen. Note that all_frozen is only valid if
* all_visible is true, so we must check both prunestate fields.
*/
else if (all_visible_according_to_vm && prunestate.all_visible &&
prunestate.all_frozen &&
!VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
{
/*
* We can pass InvalidTransactionId as the cutoff XID here,
* because setting the all-frozen bit doesn't cause recovery
* conflicts.
*/
visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
vmbuffer, InvalidTransactionId,
VISIBILITYMAP_ALL_FROZEN);
}

Or maybe this one?

if (PageIsAllVisible(page))
{
uint8 flags = 0;
uint8 vm_status = visibilitymap_get_status(vacrel->rel,
blkno, vmbuffer);

/* Set the VM all-frozen bit to flag, if needed */
if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
flags |= VISIBILITYMAP_ALL_VISIBLE;
if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
flags |= VISIBILITYMAP_ALL_FROZEN;

Assert(BufferIsValid(*vmbuffer));
if (flags != 0)
visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
*vmbuffer, visibility_cutoff_xid, flags);
}

These are the only two call sites in vacuumlazy.c where I can see
there being a theoretical risk of the kind of problem that you're
describing.

--
Robert Haas
EDB: http://www.enterprisedb.com

In reply to: Robert Haas (#18)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Tue, Jan 10, 2023 at 12:19 PM Robert Haas <robertmhaas@gmail.com> wrote:

I don't understand what distinction you're making. It seems like
hair-splitting to me. We should be able to reproduce problems like
this reliably, at least with the aid of a debugger and some
breakpoints, before we go changing the code.

So we can *never* change something defensively, on the basis of a
suspected or theoretical hazard, either in backbranches or just on
HEAD? Not under any circumstances, ever?

The risk of being wrong
is quite high because the code is subtle, and the consequences of
being wrong are potentially very bad because the code is critical to
data integrity. If the reproducer doesn't require a debugger or other
extreme contortions, then we should consider reducing it to a TAP test
that can be committed. If you agree with that, then I'm not sure what
your last email was complaining about.

I was complaining about your prescribing conditions on proceeding with
a commit, based on an understanding of things that you yourself
acknowledged as incomplete. I cannot imagine how you read that as an
unwillingness to test the issue, especially given that I agreed to
work on that before you chimed in.

I have been unable to reproduce the problem, and think it's possible
that the issue cannot be triggered in practice. Though only through
sheer luck. Here's why that is:

I guess I'm not very sure that this is sheer luck.

That's just my characterization. Other people can make up their own minds.

For the purposes of clarifying my understanding, is this the code
you're principally worried about?

visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
vmbuffer, InvalidTransactionId,
VISIBILITYMAP_ALL_FROZEN);

Obviously I meant this call site, since it's the only one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE -- in vacuumlazy.c, and in general.

The other visibilitymap_set() callsite that you quoted is from the
second heap pass, where LP_DEAD items are vacuumed and become
LP_UNUSED items. That isn't buggy, but it is a silly approach, in that
it cares about what the visibility map says about the page being
all-visible, as if it might take a dissenting view that needs to be
taken into consideration (obviously we know what's going on with the
page because we just scanned it ourselves, and determined that it was
at least all-visible).

--
Peter Geoghegan

In reply to: Peter Geoghegan (#17)
1 attachment(s)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Tue, Jan 10, 2023 at 12:08 PM Peter Geoghegan <pg@bowt.ie> wrote:

Actually, FreezeMultiXactId() can fully remove an xmax that has some
member XIDs >= OldestXmin, provided FRM_NOOP processing isn't
possible, at least when no individual member is still running. Doesn't
have to involve transaction aborts at all.

Let me go try to break it that way...

Attached patch shows how this could break.

It adds an assertion that checks that the expected
PD_ALL_VISIBLE/VM_ALL_VISIBLE() conditions hold at the right point. It
also comments out FreezeMultiXactId()'s FRM_NOOP handling.

The FRM_NOOP case is really just an optimization, and shouldn't be
needed for correctness. This is amply demonstrated by running "meston
test" with the patch applied, which will pass without incident.

I can get the PD_ALL_VISIBLE assertion to fail by following this
procedure with the patch applied:

* Run a plain VACUUM to set all the pages from a table all-visible,
but not all-frozen.

* Set a breakpoint that will hit after all_visible_according_to_vm is
set to true, for an interesting blkno.

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

* End every session whose XID is stored in our multi (commit/abort).

* Within GDB, continue from before -- observe assertion failure.

Obviously this scenario doesn't demonstrate the presence of a bug --
not quite. But it does prove that we rely on FRM_NOOP to not allow the
VM to become corrupt, which just doesn't make any sense, and can't
have been intended. At a minimum, it strongly suggests that the
current approach is very fragile.

--
Peter Geoghegan

Attachments:

0001-Debug-freeze-map-issue.patchapplication/octet-stream; name=0001-Debug-freeze-map-issue.patchDownload
From 866c6759a89c8f7b9f1ed05c4b8ce075f87f7ab7 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 10 Jan 2023 16:14:39 -0800
Subject: [PATCH] Debug freeze map issue.

---
 src/backend/access/heap/heapam.c     | 2 ++
 src/backend/access/heap/vacuumlazy.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 63c4f01f0..7a9e3c526 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6281,6 +6281,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 	if (!need_replace)
 		need_replace = MultiXactIdPrecedes(multi, cutoffs->MultiXactCutoff);
 
+#if 0
 	if (!need_replace)
 	{
 		/*
@@ -6295,6 +6296,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		pfree(members);
 		return multi;
 	}
+#endif
 
 	/*
 	 * Do a more thorough second pass over the multi to figure out which
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da..11d548436 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1168,6 +1168,8 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * because setting the all-frozen bit doesn't cause recovery
 			 * conflicts.
 			 */
+			Assert(PageIsAllVisible(page));
+			Assert(VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer));
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
 							  VISIBILITYMAP_ALL_FROZEN);
-- 
2.39.0

In reply to: Peter Geoghegan (#20)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Tue, Jan 10, 2023 at 4:39 PM Peter Geoghegan <pg@bowt.ie> wrote:

* Run VACUUM FREEZE. We need FREEZE in order to be able to hit the
relevant visibilitymap_set() call site (the one that passes
VISIBILITYMAP_ALL_FROZEN as its flags, without also passing
VISIBILITYMAP_ALL_VISIBLE).

Now all_visible_according_to_vm is set to true, but we don't have a
lock/pin on the same heap page just yet.

* Acquire several non-conflicting row locks on a row on the block in
question, so that a new multi is allocated.

Forgot to mention that there needs to be a HOT update mixed in with
these SELECT ... FOR SHARE row lockers, too, which must abort once its
XID has been added to a multi. Obviously heap_lock_tuple() won't ever
unset VISIBILITYMAP_ALL_VISIBLE or PD_ALL_VISIBLE (it only ever clears
VISIBILITYMAP_ALL_FROZEN) -- so we need a heap_update() to clear all
of these status bits.

This enables the assertion to fail because:

* Pruning can get rid of the aborted successor heap-only tuple right
away, so it is not going to block us from setting the page all_visible
(that just leaves the original tuple to consider).

* The original tuple's xmax is a Multi, so it won't automatically be
ineligible for freezing because it's > OldestXmin in this scenario.

* FreezeMultiXactId() processing will completely remove xmax, without
caring too much about cutoffs like OldestXmin -- it only cares about
whether each individual XID needs to be kept or not.

(Granted, FreezeMultiXactId() will only remove xmax like this because
I deliberately removed its FRM_NOOP handling, but that is a very
delicate thing to rely on, especially from such a great distance. I
can't imagine that it doesn't fail on HEAD for any reason beyond sheer
luck.)

--
Peter Geoghegan

In reply to: Peter Geoghegan (#12)
1 attachment(s)
Re: Fixing a couple of buglets in how VACUUM sets visibility map bits

On Mon, Jan 9, 2023 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jan 9, 2023 at 11:57 AM Andres Freund <andres@anarazel.de> wrote:

Afaict we'll need to backpatch this all the way?

I thought that we probably wouldn't need to, at first. But I now think
that we really have to.

Attached is v4. This is almost the same as v3. The only notable change
is in how the issue is explained in comments, and in the commit
message.

I have revised my opinion on this question once more. In light of what
has come to light about the issue from recent testing, I lean towards
a HEAD-only commit once again. What do you think?

I still hope to be able to commit this on my original timeline (on
Monday or so), without the issue taking up too much more of
everybody's time.

Thanks
--
Peter Geoghegan

Attachments:

v4-0001-Tighten-up-VACUUM-s-approach-to-setting-VM-bits.patchapplication/octet-stream; name=v4-0001-Tighten-up-VACUUM-s-approach-to-setting-VM-bits.patchDownload
From 7b8f1f4e2b9b8c3067f9c4750005d3f31d2256ef Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 11 Jan 2023 18:06:51 -0800
Subject: [PATCH v4] Tighten up VACUUM's approach to setting VM bits.

One of the calls to visibilitymap_set() during VACUUM's initial heap
pass could theoretically set a page's all-frozen bit without also
setting the page's all-visible bit, leaving the visibility map in an
inconsistent state.  The problem was the tacit assumption that the
lazy_scan_heap local variable all_visible_according_to_vm could not
become stale without that being detected afterwards, via the prunestate
all_visible/all_frozen fields being set false by lazy_scan_prune.
However, prunestate only indicates whether the page is now eligible to
become all-visible, which doesn't reliably indicate anything about the
state of the page immediately before it was scanned -- including in
cases where all_visible_according_to_vm happened to be set from earlier
on.  In other words, lazy_scan_heap failed to account for the fact that
a page that was set all-visible in the VM in the recent past may not
still be all-visible in the VM by now, even when it's eligible to become
all-visible now.

In practice there doesn't seem to be a concrete scenario in which this
is possible.  It was almost possible in scenarios involving concurrent
HOT updates from transactions that abort, but (unlike pruning) freezing
can never remove an XID that's > VACUUM's OldestXmin, even for an XID
that is known to have aborted, which was protective here.

Tighten up the way that visibilitymap_set() is called: request that both
the all-visible and all-frozen bits get set whenever the all-frozen bit
is set, regardless of what we think we know about the present state of
the all-visible bit.  Also make sure that the page level PD_ALL_VISIBLE
flag is set in the same code path.

These issues have been around since commit a892234f83, which added the
all-frozen bit to the VM fork.

In passing, stop reading the existing state of the VM when setting the
VM in VACUUM's final heap pass.  The existing state of the visibility
map is irrelevant, since pages with LP_DEAD items are not all-visible
(or all-frozen) by definition, and we already know that affected pages
must have had at least one LP_DEAD item before we set it LP_UNUSED.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-WznuNGSzF8v6OsgjaC5aYsb3cZ6HW6MLm30X0d65cmSH6A@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c    | 83 ++++++++++++++++---------
 src/backend/access/heap/visibilitymap.c |  9 ++-
 2 files changed, 60 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 369451516..81698d0d3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -260,7 +260,7 @@ static void lazy_vacuum(LVRelState *vacrel);
 static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
 static void lazy_vacuum_heap_rel(LVRelState *vacrel);
 static int	lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
-								  Buffer buffer, int index, Buffer *vmbuffer);
+								  Buffer buffer, int index, Buffer vmbuffer);
 static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
 static void lazy_cleanup_all_indexes(LVRelState *vacrel);
 static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -1040,7 +1040,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			{
 				Size		freespace;
 
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
 
 				/* Forget the LP_DEAD items that we just vacuumed */
 				dead_items->num_items = 0;
@@ -1092,7 +1092,10 @@ lazy_scan_heap(LVRelState *vacrel)
 			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
 			if (prunestate.all_frozen)
+			{
+				Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 				flags |= VISIBILITYMAP_ALL_FROZEN;
+			}
 
 			/*
 			 * It should never be the case that the visibility map page is set
@@ -1120,8 +1123,8 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * got cleared after lazy_scan_skip() was called, so we must recheck
 		 * with buffer lock before concluding that the VM is corrupt.
 		 */
-		else if (all_visible_according_to_vm && !PageIsAllVisible(page)
-				 && VM_ALL_VISIBLE(vacrel->rel, blkno, &vmbuffer))
+		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
+				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
@@ -1164,12 +1167,27 @@ lazy_scan_heap(LVRelState *vacrel)
 				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
 		{
 			/*
-			 * We can pass InvalidTransactionId as the cutoff XID here,
-			 * because setting the all-frozen bit doesn't cause recovery
-			 * conflicts.
+			 * Avoid relying on all_visible_according_to_vm as a proxy for the
+			 * page-level PD_ALL_VISIBLE bit being set, since it might have
+			 * become stale -- even when all_visible is set in prunestate
 			 */
+			if (!PageIsAllVisible(page))
+			{
+				PageSetAllVisible(page);
+				MarkBufferDirty(buf);
+			}
+
+			/*
+			 * Set the page all-frozen (and all-visible) in the VM.
+			 *
+			 * We can pass InvalidTransactionId as our visibility_cutoff_xid,
+			 * since a snapshotConflictHorizon sufficient to make everything
+			 * safe for REDO was logged when the page's tuples were frozen.
+			 */
+			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
 							  vmbuffer, InvalidTransactionId,
+							  VISIBILITYMAP_ALL_VISIBLE	|
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
 
@@ -1311,7 +1329,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		if (!vacrel->skipwithvm)
+		{
+			/* Caller shouldn't rely on all_visible_according_to_vm */
+			*next_unskippable_allvis = false;
 			break;
+		}
 
 		/*
 		 * Aggressive VACUUM caller can't skip pages just because they are
@@ -1818,7 +1840,11 @@ retry:
 			 * cutoff by stepping back from OldestXmin.
 			 */
 			if (prunestate->all_visible && prunestate->all_frozen)
+			{
+				/* Using same cutoff when setting VM is now unnecessary */
 				snapshotConflictHorizon = prunestate->visibility_cutoff_xid;
+				prunestate->visibility_cutoff_xid = InvalidTransactionId;
+			}
 			else
 			{
 				/* Avoids false conflicts when hot_standby_feedback in use */
@@ -2417,10 +2443,18 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 
 		blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
 		vacrel->blkno = blkno;
+
+		/*
+		 * Pin the visibility map page in case we need to mark the page
+		 * all-visible.  In most cases this will be very cheap, because we'll
+		 * already have the correct page pinned anyway.
+		 */
+		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+
 		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
 								 vacrel->bstrategy);
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-		index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, &vmbuffer);
+		index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
 
 		/* Now that we've vacuumed the page, record its available space */
 		page = BufferGetPage(buf);
@@ -2457,7 +2491,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  *						  vacrel->dead_items array.
  *
  * Caller must have an exclusive buffer lock on the buffer (though a full
- * cleanup lock is also acceptable).
+ * cleanup lock is also acceptable).  vmbuffer must be valid and already have
+ * a pin on blkno's visibility map page.
  *
  * index is an offset into the vacrel->dead_items array for the first listed
  * LP_DEAD item on the page.  The return value is the first index immediately
@@ -2465,7 +2500,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
-					  int index, Buffer *vmbuffer)
+					  int index, Buffer vmbuffer)
 {
 	VacDeadItems *dead_items = vacrel->dead_items;
 	Page		page = BufferGetPage(buffer);
@@ -2546,31 +2581,21 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
 	 * dirty, exclusively locked, and, if needed, a full page image has been
 	 * emitted.
 	 */
+	Assert(!PageIsAllVisible(page));
 	if (heap_page_is_all_visible(vacrel, buffer, &visibility_cutoff_xid,
 								 &all_frozen))
-		PageSetAllVisible(page);
-
-	/*
-	 * All the changes to the heap page have been done. If the all-visible
-	 * flag is now set, also set the VM all-visible bit (and, if possible, the
-	 * all-frozen bit) unless this has already been done previously.
-	 */
-	if (PageIsAllVisible(page))
 	{
-		uint8		flags = 0;
-		uint8		vm_status = visibilitymap_get_status(vacrel->rel,
-														 blkno, vmbuffer);
+		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
 
-		/* Set the VM all-frozen bit to flag, if needed */
-		if ((vm_status & VISIBILITYMAP_ALL_VISIBLE) == 0)
-			flags |= VISIBILITYMAP_ALL_VISIBLE;
-		if ((vm_status & VISIBILITYMAP_ALL_FROZEN) == 0 && all_frozen)
+		if (all_frozen)
+		{
+			Assert(!TransactionIdIsValid(visibility_cutoff_xid));
 			flags |= VISIBILITYMAP_ALL_FROZEN;
+		}
 
-		Assert(BufferIsValid(*vmbuffer));
-		if (flags != 0)
-			visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
-							  *vmbuffer, visibility_cutoff_xid, flags);
+		PageSetAllVisible(page);
+		visibilitymap_set(vacrel->rel, blkno, buffer, InvalidXLogRecPtr,
+						  vmbuffer, visibility_cutoff_xid, flags);
 	}
 
 	/* Revert to the previous phase information for error traceback */
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1d1ca423a..74ff01bb1 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -146,7 +146,9 @@ visibilitymap_clear(Relation rel, BlockNumber heapBlk, Buffer vmbuf, uint8 flags
 	char	   *map;
 	bool		cleared = false;
 
+	/* Must never clear all_visible bit while leaving all_frozen bit set */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_VISIBLE);
 
 #ifdef TRACE_VISIBILITYMAP
 	elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
@@ -256,8 +258,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 #endif
 
 	Assert(InRecovery || XLogRecPtrIsInvalid(recptr));
-	Assert(InRecovery || BufferIsValid(heapBuf));
+	Assert(InRecovery || PageIsAllVisible((Page) BufferGetPage(heapBuf)));
+
+	/* Must never set all_frozen bit without also setting all_visible bit */
 	Assert(flags & VISIBILITYMAP_VALID_BITS);
+	Assert(flags != VISIBILITYMAP_ALL_FROZEN);
 
 	/* Check that we have the right heap page pinned, if present */
 	if (BufferIsValid(heapBuf) && BufferGetBlockNumber(heapBuf) != heapBlk)
@@ -299,8 +304,6 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 				{
 					Page		heapPage = BufferGetPage(heapBuf);
 
-					/* caller is expected to set PD_ALL_VISIBLE first */
-					Assert(PageIsAllVisible(heapPage));
 					PageSetLSN(heapPage, recptr);
 				}
 			}
-- 
2.39.0