Lack of PageSetLSN in heap_xlog_visible

Started by Konstantin Knizhnik over 3 years ago, 5 messages
#1 Konstantin Knizhnik
knizhnik@garret.ru

Hi hackers!

heap_xlog_visible is not bumping the heap page LSN when setting the
all-visible flag in it.
There is a long comment explaining it:

        /*
         * We don't bump the LSN of the heap page when setting the visibility
         * map bit (unless checksums or wal_hint_bits is enabled, in which
         * case we must), because that would generate an unworkable volume of
         * full-page writes.  This exposes us to torn page hazards, but since
         * we're not inspecting the existing page contents in any way, we
         * don't care.
         *
         * However, all operations that clear the visibility map bit *do* bump
         * the LSN, and those operations will only be replayed if the XLOG LSN
         * follows the page LSN.  Thus, if the page LSN has advanced past our
         * XLOG record's LSN, we mustn't mark the page all-visible, because
         * the subsequent update won't be replayed to clear the flag.
         */
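
For context, the interlock described in the second paragraph is the
standard redo-ordering check. Roughly (a simplified sketch only; the
current code reaches the same decision through XLogReadBufferForRedo()
rather than an explicit comparison):

    /*
     * Sketch: apply a VM-bit-clearing change during redo only if replay
     * has not already passed this page, i.e. the record's LSN is newer
     * than the LSN stamped on the page.
     */
    if (lsn > PageGetLSN(page))
    {
        /* ... apply the change that clears the VM bit ... */
        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }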

But it is still not clear to me that not bumping the LSN in this place is
correct if wal_log_hints is set. In this case we will have a VM page with
a larger LSN than the heap page, because visibilitymap_set bumps the LSN
of the VM page. It means that in theory, after recovery, we may have a
page marked as all-visible in the VM but without PD_ALL_VISIBLE in its
page header. And that violates the VM constraint:

 * When we *set* a visibility map during VACUUM, we must write WAL.  This may
 * seem counterintuitive, since the bit is basically a hint: if it is clear,
 * it may still be the case that every tuple on the page is visible to all
 * transactions; we just don't know that for certain.  The difficulty is that
 * there are two bits which are typically set together: the PD_ALL_VISIBLE bit
 * on the page itself, and the visibility map bit.  If a crash occurs after the
 * visibility map page makes it to disk and before the updated heap page makes
 * it to disk, redo must set the bit on the heap page.  Otherwise, the next
 * insert, update, or delete on the heap page will fail to realize that the
 * visibility map bit must be cleared, possibly causing index-only scans to
 * return wrong answers.
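
For reference, this is roughly how the two bits are coupled when VACUUM
sets them (a paraphrase of the lazy_scan_heap()/visibilitymap_set()
interplay, not the verbatim code):

    PageSetAllVisible(page);    /* PD_ALL_VISIBLE on the heap page itself */
    MarkBufferDirty(buf);

    /* Sets the bit in the VM fork and writes WAL for it. */
    visibilitymap_set(rel, blkno, buf, InvalidXLogRecPtr,
                      vmbuffer, visibility_cutoff_xid,
                      VISIBILITYMAP_ALL_VISIBLE);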

#2 Jeff Davis
pgsql@j-davis.com
In reply to: Konstantin Knizhnik (#1)
Re: Lack of PageSetLSN in heap_xlog_visible

On Thu, 2022-10-13 at 12:50 +0300, Konstantin Knizhnik wrote:

         /*
          * We don't bump the LSN of the heap page when setting the visibility
          * map bit (unless checksums or wal_hint_bits is enabled, in which
          * case we must), because that would generate an unworkable volume of
          * full-page writes.

It clearly says there that it must set the page LSN, but I don't see
where that's happening. It seems to go all the way back to the original
checksums commit, 96ef3b8ff1.

I can reproduce a case where a replica ends up with a different page
header than the primary (checksums enabled):

Primary:
create extension pageinspect;
create table t(i int) with (autovacuum_enabled=off);
insert into t values(0);

Shut down and restart primary and replica.

Primary:
insert into t values(1);
vacuum t;

Crash replica and let it recover.

Shut down and restart primary and replica.

Primary:
select * from page_header(get_raw_page('t', 0));

Replica:
select * from page_header(get_raw_page('t', 0));

The LSN on the replica is lower, but the flags are the same
(PD_ALL_VISIBLE set). That's a problem, right? The checksums are valid
on both, though.

It may violate our torn page protections for checksums, as well, but I
couldn't construct a scenario for that because recovery can only create
restartpoints at certain times.

But it is still not clear to me that not bumping the LSN in this place is
correct if wal_log_hints is set. In this case we will have a VM page with
a larger LSN than the heap page, because visibilitymap_set bumps the LSN
of the VM page. It means that in theory, after recovery, we may have a
page marked as all-visible in the VM but without PD_ALL_VISIBLE in its
page header. And that violates the VM constraint:

I'm not quite following this scenario. If the heap page has a lower LSN
than the VM page, how could we recover to a point where the VM bit is
set but the heap flag isn't? And what does it have to do with
wal_log_hints/checksums?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#3 Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#2)
1 attachment(s)
Re: Lack of PageSetLSN in heap_xlog_visible

On Thu, 2022-10-13 at 12:49 -0700, Jeff Davis wrote:

It may violate our torn page protections for checksums, as well...

I could not reproduce a problem here, but I believe one exists when
checksums are enabled, because it bypasses the protections of
UpdateMinRecoveryPoint(). By not updating the page's LSN, it would
allow the page to be flushed (and torn) without updating the minimum
recovery point. That would allow the administrator to subsequently do a
PITR or standby promotion while there's a torn page on disk.
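
To spell out the mechanism: the protection normally works at buffer flush
time, roughly like this (a simplified sketch modeled on FlushBuffer(), not
the verbatim code):

    XLogRecPtr  recptr = BufferGetLSN(buf);

    XLogFlush(recptr);  /* during recovery, this advances minRecoveryPoint
                         * up to the page LSN */
    /* ... only after the flush is the (possibly torn) page written out ... */

If PageSetLSN() is never called, recptr lags behind, so minRecoveryPoint
is not pushed forward before the page can reach disk.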

I'm considering this a theoretical risk because, for a page tear to be
consequential in this case, it would need to happen between the checksum
and the flags; and I doubt that's a real possibility. But until we
formalize that declaration, this problem should be fixed.

Patch attached. I also fixed:

* The comment in heap_xlog_visible() says that not updating the page
LSN avoids full page writes; but the causation is backwards: avoiding
full page images requires that we don't update the page's LSN.
* Also in heap_xlog_visible(), there are comments and a branch left over
from before commit f8f4227976 which no longer seem necessary.
* In visibilitymap_set(), I clarified that we *must* not set the page
LSN of the heap page if no full page image was taken.

But it is still not clear to me that not bumping the LSN in this place is
correct if wal_log_hints is set. In this case we will have a VM page with
a larger LSN than the heap page, because visibilitymap_set bumps the LSN
of the VM page. It means that in theory, after recovery, we may have a
page marked as all-visible in the VM but without PD_ALL_VISIBLE in its
page header. And that violates the VM constraint:

I'm not quite following this scenario. If the heap page has a lower LSN
than the VM page, how could we recover to a point where the VM bit is
set but the heap flag isn't?

I still don't understand this problem scenario.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments:

v1-0001-Fix-theoretical-torn-page-hazard.patch (text/x-patch)
From aa3890a7364ee8f67f04a59c7a8a9cb65086b084 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jdavis@postgresql.org>
Date: Thu, 10 Nov 2022 14:46:30 -0800
Subject: [PATCH v1] Fix theoretical torn page hazard.

The original report was concerned with a possible inconsistency
between the heap and the visibility map, which I was unable to
confirm.

However, there does seem to be a torn page hazard when checksums are
enabled. It may be impossible in practice, because it would require a
page tear between the checksum and the flags, so I am considering this
a theoretical risk.

Also fix related comments, which are misleading.

Reported-by: Konstantin Knizhnik
Discussion: https://postgr.es/m/fed17dac-8cb8-4f5b-d462-1bb4908c029e@garret.ru
Backpatch-through: 11
---
 src/backend/access/heap/heapam.c        | 28 ++++++-------------------
 src/backend/access/heap/visibilitymap.c | 13 ++++++++++--
 2 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 12be87efed..5c8cdfb9b2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8823,21 +8823,17 @@ heap_xlog_visible(XLogReaderState *record)
 		/*
 		 * We don't bump the LSN of the heap page when setting the visibility
 		 * map bit (unless checksums or wal_hint_bits is enabled, in which
-		 * case we must), because that would generate an unworkable volume of
-		 * full-page writes.  This exposes us to torn page hazards, but since
+		 * case we must). This exposes us to torn page hazards, but since
 		 * we're not inspecting the existing page contents in any way, we
 		 * don't care.
-		 *
-		 * However, all operations that clear the visibility map bit *do* bump
-		 * the LSN, and those operations will only be replayed if the XLOG LSN
-		 * follows the page LSN.  Thus, if the page LSN has advanced past our
-		 * XLOG record's LSN, we mustn't mark the page all-visible, because
-		 * the subsequent update won't be replayed to clear the flag.
 		 */
 		page = BufferGetPage(buffer);
 
 		PageSetAllVisible(page);
 
+		if (XLogHintBitIsNeeded())
+			PageSetLSN(page, lsn);
+
 		MarkBufferDirty(buffer);
 	}
 	else if (action == BLK_RESTORED)
@@ -8901,20 +8897,8 @@ heap_xlog_visible(XLogReaderState *record)
 		reln = CreateFakeRelcacheEntry(rlocator);
 		visibilitymap_pin(reln, blkno, &vmbuffer);
 
-		/*
-		 * Don't set the bit if replay has already passed this point.
-		 *
-		 * It might be safe to do this unconditionally; if replay has passed
-		 * this point, we'll replay at least as far this time as we did
-		 * before, and if this bit needs to be cleared, the record responsible
-		 * for doing so should be again replayed, and clear it.  For right
-		 * now, out of an abundance of conservatism, we use the same test here
-		 * we did for the heap page.  If this results in a dropped bit, no
-		 * real harm is done; and the next VACUUM will fix it.
-		 */
-		if (lsn > PageGetLSN(vmpage))
-			visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
-							  xlrec->cutoff_xid, xlrec->flags);
+		visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
+						  xlrec->cutoff_xid, xlrec->flags);
 
 		ReleaseBuffer(vmbuffer);
 		FreeFakeRelcacheEntry(reln);
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index d62761728b..e029ae9ae2 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -287,8 +287,17 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 										  cutoff_xid, flags);
 
 				/*
-				 * If data checksums are enabled (or wal_log_hints=on), we
-				 * need to protect the heap page from being torn.
+				 * If XLogHintBitIsNeeded(), we protect against torn pages
+				 * like usual: the call to log_heap_visible() has inserted a
+				 * WAL record which includes an image of the heap page if
+				 * necessary, and we update the heap page's LSN below.
+				 *
+				 * If !XLogHintBitIsNeeded(), a torn page is inconsequential
+				 * if the only change was to a hint bit. The above call to
+				 * log_heap_visible() has specified REGBUF_NO_IMAGE for the
+				 * heap buffer, so we must not update the heap page's LSN
+				 * because that would signal to the next modification that a
+				 * full page image has already been taken.
 				 */
 				if (XLogHintBitIsNeeded())
 				{
-- 
2.34.1

#4 Konstantin Knizhnik
knizhnik@garret.ru
In reply to: Jeff Davis (#3)
Re: Lack of PageSetLSN in heap_xlog_visible

On 11.11.2022 03:20, Jeff Davis wrote:

On Thu, 2022-10-13 at 12:49 -0700, Jeff Davis wrote:

It may violate our torn page protections for checksums, as well...

I could not reproduce a problem here, but I believe one exists when
checksums are enabled, because it bypasses the protections of
UpdateMinRecoveryPoint(). By not updating the page's LSN, it would
allow the page to be flushed (and torn) without updating the minimum
recovery point. That would allow the administrator to subsequently do a
PITR or standby promotion while there's a torn page on disk.

I'm considering this a theoretical risk because for a page tear to be
consequential in this case, it would need to happen between the
checksum and the flags; and I doubt that's a real possibility. But
until we formalize that declaration, this problem should be fixed.

Patch attached. I also fixed:

* The comment in heap_xlog_visible() says that not updating the page
LSN avoids full page writes; but the causation is backwards: avoiding
full page images requires that we don't update the page's LSN.
* Also in heap_xlog_visible(), there are comments and a branch left over
from before commit f8f4227976 which no longer seem necessary.
* In visibilitymap_set(), I clarified that we *must* not set the page
LSN of the heap page if no full page image was taken.

But it is still not clear to me that not bumping the LSN in this place is
correct if wal_log_hints is set. In this case we will have a VM page with
a larger LSN than the heap page, because visibilitymap_set bumps the LSN
of the VM page. It means that in theory, after recovery, we may have a
page marked as all-visible in the VM but without PD_ALL_VISIBLE in its
page header. And that violates the VM constraint:

I'm not quite following this scenario. If the heap page has a lower LSN
than the VM page, how could we recover to a point where the VM bit is
set but the heap flag isn't?

I still don't understand this problem scenario.

I am sorry that I reported the issue and then didn't reply.
Yes, you are right: my original concern that it may cause problems with
recovery on a replica was not correct.
Not updating the page LSN may just force its reconstruction.
I am also not sure that it can cause problems with checksums, since the
page is marked dirty in any case.
Yes, the content and checksum of the page will differ between the primary
and the replica. That may be a problem when recovering pages from a
replica.

Where it really becomes critical is incremental backup (like pg_probackup),
which looks at the page LSN to determine when a page was modified.

And it is definitely critical for Neon, because the LSN of a page
reconstructed by the pageserver differs from the one expected by the
compute node.

Maybe I am missing something, but I do not see any penalty from setting
the page LSN here.
So even if there is no obvious failure scenario, I still think it should
be fixed. Breaking invariants is always a bad thing, and there should be
very strong arguments for doing so. I do not see them here.

So it would be great if your patch could be committed.

#5 Jeff Davis
pgsql@j-davis.com
In reply to: Konstantin Knizhnik (#4)
Re: Lack of PageSetLSN in heap_xlog_visible

On Fri, 2022-11-11 at 12:43 +0200, Konstantin Knizhnik wrote:

Yes, you are right: my original concern that it may cause problems with
recovery on a replica was not correct.

Great, thank you for following up.

I am also not sure that it can cause problems with checksums, since the
page is marked dirty in any case.
Yes, the content and checksum of the page will differ between the primary
and the replica. That may be a problem when recovering pages from a
replica.

It could cause a theoretical problem: during recovery, the page could
be flushed out before minRecoveryPoint is updated, and while that's
happening, a crash could tear it. Then, a subsequent partial recovery
(e.g. PITR) could recover without fixing the torn page.

That would not be a practical problem, because it would require a tear
between two fields of the page header, which I don't think is possible.

Where it really becomes critical is incremental backup (like
pg_probackup), which looks at the page LSN to determine when a page was
modified.

And it is definitely critical for Neon, because the LSN of a page
reconstructed by the pageserver differs from the one expected by the
compute node.

Maybe I am missing something, but I do not see any penalty from setting
the page LSN here.
So even if there is no obvious failure scenario, I still think it should
be fixed. Breaking invariants is always a bad thing, and there should be
very strong arguments for doing so. I do not see them here.

Agreed, thank you for the report!

Committed d6a3dbe14f and backpatched through version 11.

Also committed an unrelated cleanup patch in the same area (3eb8eeccbe)
and a README patch (97c61f70d1) to master. The README patch doesn't
guarantee that things won't change in the future, but the behavior it
describes has been true for quite a while now.

--
Jeff Davis
PostgreSQL Contributor Team - AWS