Document WAL rules related to PD_ALL_VISIBLE in README
Proposed wording attached.
The typical WAL rules are broken for setting PD_ALL_VISIBLE. I'm OK
with that -- rules are meant to be broken -- but it's confusing enough
that I think we should (internally) document it better. This doesn't
guarantee things won't change again in the future, but this behavior
has been stable for a while.
The thread here:
/messages/by-id/ee47ee24-2928-96e3-a2b1-97cbe07b2c7b@garret.ru
also indicates that external projects and tools are relying on our
rules for the page LSNs.
Regards,
Jeff Davis
Attachments:
v1-0001-Document-WAL-rules-related-to-PD_ALL_VISIBLE-in-R.patchtext/x-patch; charset=UTF-8; name=v1-0001-Document-WAL-rules-related-to-PD_ALL_VISIBLE-in-R.patchDownload
From 6594a9698a9edde666f883fcb14ced392066103f Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 11 Nov 2022 14:02:52 -0800
Subject: [PATCH v1] Document WAL rules related to PD_ALL_VISIBLE in README.
Also improve comments.
---
src/backend/access/heap/heapam.c | 5 +++--
src/backend/access/heap/visibilitymap.c | 17 +++++++++++------
src/backend/access/transam/README | 17 +++++++++++++++++
3 files changed, 31 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5c8cdfb9b2..fc77fd85ed 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8193,8 +8193,9 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
* corresponding visibility map block. Both should have already been modified
* and dirtied.
*
- * If checksums are enabled, we also generate a full-page image of
- * heap_buffer, if necessary.
+ * If checksums or wal_log_hints are enabled, we also generate a full-page
+ * image of heap_buffer, if necessary. If not, we optimize away the FPI, which
+ * can substantially reduce the WAL volume in common cases.
*/
XLogRecPtr
log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer,
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index d62761728b..4ed70275e2 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -223,13 +223,13 @@ visibilitymap_pin_ok(BlockNumber heapBlk, Buffer vmbuf)
* visibilitymap_set - set bit(s) on a previously pinned page
*
* recptr is the LSN of the XLOG record we're replaying, if we're in recovery,
- * or InvalidXLogRecPtr in normal running. The page LSN is advanced to the
+ * or InvalidXLogRecPtr in normal running. The VM page LSN is advanced to the
* one provided; in normal running, we generate a new XLOG record and set the
- * page LSN to that value. cutoff_xid is the largest xmin on the page being
- * marked all-visible; it is needed for Hot Standby, and can be
- * InvalidTransactionId if the page contains no tuples. It can also be set
- * to InvalidTransactionId when a page that is already all-visible is being
- * marked all-frozen.
+ * page LSN to that value (though the heap page's LSN may *not* be updated;
+ * see below). cutoff_xid is the largest xmin on the page being marked
+ * all-visible; it is needed for Hot Standby, and can be InvalidTransactionId
+ * if the page contains no tuples. It can also be set to InvalidTransactionId
+ * when a page that is already all-visible is being marked all-frozen.
*
* Caller is expected to set the heap page's PD_ALL_VISIBLE bit before calling
* this function. Except in recovery, caller should also pass the heap
@@ -289,6 +289,11 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
/*
* If data checksums are enabled (or wal_log_hints=on), we
* need to protect the heap page from being torn.
+ *
+ * If not, then we must *not* update the heap page's LSN. In
+ * this case, the FPI for the heap page was omitted from the
+ * WAL record inserted above, so it would be incorrect to
+ * update the heap page's LSN.
*/
if (XLogHintBitIsNeeded())
{
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 72af656060..e771640b57 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -645,6 +645,23 @@ If you do decide to optimise away a WAL record, then any calls to
MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
otherwise you will expose the risk of partial page writes.
+The all-visible hint in a heap page (PD_ALL_VISIBLE) is a special
+case, because it is treated like a durable change in some respects and
+a hint in other respects. It must satisfy the invariant that, if a
+heap page's associated visibilitymap (VM) bit is set, then
+PD_ALL_VISIBLE is set on the heap page itself. Clearing of
+PD_ALL_VISIBLE is always treated like a fully-durable change to
+maintain this invariant. Additionally, if checksums or wal_log_hints
+are enabled, setting PD_ALL_VISIBLE is also treated like a
+fully-durable change to protect against torn pages.
+
+But, if neither checksums or wal_log_hints is enabled, torn pages are
+of no consequence if the only change is to PD_ALL_VISIBLE; so no full
+heap page image is taken, and the heap page's LSN is not updated. NB:
+it would be incorrect to update the heap page's LSN when applying this
+optimization, even though there is an associated WAL record, because
+subsequent modifiers (e.g. an unrelated UPDATE) of the page may
+falsely believe that a full page image is not required.
Write-Ahead Logging for Filesystem Actions
------------------------------------------
--
2.34.1
On Fri, Nov 11, 2022 at 2:14 PM Jeff Davis <pgsql@j-davis.com> wrote:
The typical WAL rules are broken for setting PD_ALL_VISIBLE. I'm OK
with that -- rules are meant to be broken -- but it's confusing enough
that I think we should (internally) document it better.
+1. I think that there is a lot of value in being deliberate about
invariants like this.
If it's too awkward to list special cases like this one in some
central place, then maybe those special cases shouldn't exist in the
first place. I don't love the fact that we have this PD_ALL_VISIBLE
special case -- "it's kind of a hint but also not really" doesn't
inspire confidence. But I don't see it changing anytime soon.
Acknowledging that it is an odd special case in a full throated sort
of way at least minimizes confusion. It kind of makes sense in one
way, I suppose -- maybe the visibility map itself is the special case.
The visibility map has its own unique definition of crash safe that
makes losing set bits tolerable, while failing to unset a bit remains
intolerable (only the latter could result in wrong answers to
queries). I think that every other on-disk structure is either a pure
hint without any accompanying WAL record, or an atomic action with a
WAL record whose REDO routine needs to reliably reproduce the same
on-disk state as original execution (barring preexisting differences
in how hint bits are set between original execution and a hot
standby).
--
Peter Geoghegan