Avoiding another needless ERROR during nbtree page deletion
Work from commit 5b861baa (later backpatched as commit 43e409ce)
taught nbtree to press on with vacuuming an index when page deletion
fails to "re-find" a downlink in the target page's parent (or in some
page to the right of the parent) due to index corruption.
To recap, avoiding ERRORs during vacuuming (even those caused by index
corruption) is useful because there is no reason to expect the error
to go away on its own; we're relying on the DBA to notice the error
and REINDEX before wraparound/xidStopLimit kicks in. This is at least
the case on versions before 14, where the failsafe can eventually
kick-in and avoid catastrophe (though the failsafe can only be
expected to avoid the worst consequences).
It has come to my attention that there is a remaining issue of the
same general nature in nbtree VACUUM's page deletion code. Though this
remaining issue seems significantly less likely to come up in
practice, there is no reason to take any chances here. Attached patch
fixes it.
Also attached is a bugfix for a minor issue in amcheck's
bt_index_parent_check() function, which I noticed in passing, while I
tested the first patch. We assumed that we'd always land on the
leftmost page on each level first (the leftmost according to internal
pages one level up). That assumption is faulty because page deletion
of the leftmost page is quite possible. Page deletion can be
interrupted, leaving a half-dead leaf page (possibly the leftmost leaf
page) without any downlink one level up, while still leaving a left
sibling link on the leaf level (in the leaf page that isn't about to
become the leftmost, but won't until the interrupted page deletion can
be completed).
IMV this should be backpatched all the way. The issue in question is
rather unlikely to come up. But the fix that I've come up with is very
well targeted. It seems just about impossible for it to affect any
user that didn't already have a serious problem (without the fix).
--
Peter Geoghegan
Attachments:
v1-0001-nbtree-VACUUM-cope-with-right-sibling-link-corrup.patchapplication/octet-stream; name=v1-0001-nbtree-VACUUM-cope-with-right-sibling-link-corrup.patchDownload
From b33e9b435954f4bc8cda481df697bdb4b1e04b9e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 May 2023 16:20:55 -0700
Subject: [PATCH v1 1/2] nbtree VACUUM: cope with right sibling link
corruption.
Avoid "right sibling's left-link doesn't match" errors when vacuuming a
corrupt nbtree index: just LOG the issue and press on with vacuuming.
This ERROR is seen in the field from time to time (it's not just a
theoretical possibility), even though the relevant check takes place
after various similar checks ran without detecting any problems.
It's worth avoiding such errors, even in the event of corruption.
Anything that makes VACUUM unable to make forward progress against one
table/index ultimately risks making the system enter xidStopLimit mode.
There is no good reason to take any chances here.
This is similar to the work from commit 5b861baa (later backpatched as
commit 43e409ce), which taught nbtree to press on with vacuuming an
index when page deletion fails to "re-find" a downlink in the target
page's parent (or in some page to the right of the parent). Both cases
involve throwing errors in nbtree page deletion that correspond to
similar errors at a corresponding point during nbtree page splits. And
in both cases the solution is to demote the ERROR to a LOG message, and
press on (without changing anything about the page split errors).
Note that _bt_unlink_halfdead_page() has long been able to bail on page
deletion when the target page's left sibling was in an inconsistent
state (that goes all the way back to the initial introduction of page
deletion by commit 88dc31e3f2). _bt_unlink_halfdead_page() can now bail
on page deletion in much the same way if the sibling link corruption
happens to be in the right sibling page's link. While there was already
significant overlap between each of the two sibling checks (elevel
notwithstanding), there is no reason to take any chances here.
---
src/backend/access/nbtree/nbtpage.c | 36 ++++++++++++++++++++++++-----
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 41aa1c4cc..def0d7926 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2412,16 +2412,16 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
}
else
{
- /* we have only a pin on leafbuf */
+ /* we have only a pin on target, which is also leafbuf */
ReleaseBuffer(leafbuf);
}
ereport(LOG,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("valid left sibling for deletion target could not be located: "
- "left sibling %u of target %u with leafblkno %u and scanblkno %u in index \"%s\"",
+ "left sibling %u of target %u with leafblkno %u and scanblkno %u on level %u of index \"%s\"",
leftsib, target, leafblkno, scanblkno,
- RelationGetRelationName(rel))));
+ targetlevel, RelationGetRelationName(rel))));
return false;
}
@@ -2496,13 +2496,37 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
rbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
page = BufferGetPage(rbuf);
opaque = BTPageGetOpaque(page);
+
+ /*
+ * Validate target's right sibling page's left link points back to target.
+ *
+ * This is known to fail in the field in the presence of index corruption,
+ * so we go to the trouble of avoiding a hard ERROR. This is fairly close
+ * to what we did earlier on when we located the target's left sibling
+ * (iff target has a left sibling).
+ */
if (opaque->btpo_prev != target)
- ereport(ERROR,
+ {
+ ereport(LOG,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("right sibling's left-link doesn't match: "
- "block %u links to %u instead of expected %u in index \"%s\"",
+ "block %u links to %u instead of expected %u on level %u of index \"%s\"",
rightsib, opaque->btpo_prev, target,
- RelationGetRelationName(rel))));
+ targetlevel, RelationGetRelationName(rel))));
+ /* release siblings */
+ if (BufferIsValid(lbuf))
+ _bt_relbuf(rel, lbuf);
+ _bt_relbuf(rel, rbuf);
+
+ /* release target */
+ _bt_relbuf(rel, buf);
+ /* release leafbuf when it isn't also the target */
+ if (target != leafblkno)
+ _bt_relbuf(rel, leafbuf);
+
+ return false;
+ }
+
rightsib_is_rightmost = P_RIGHTMOST(opaque);
*rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
--
2.40.1
v1-0002-amcheck-Fix-bt_index_parent_check-false-positive.patchapplication/octet-stream; name=v1-0002-amcheck-Fix-bt_index_parent_check-false-positive.patchDownload
From d2f6e01ec2ad76fabc340bda9dfe7ba37d46e223 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 May 2023 16:32:40 -0700
Subject: [PATCH v1 2/2] amcheck: Fix bt_index_parent_check false positive.
There is no reason why nbtree VACUUM page deletion that targets the
leftmost leaf page cannot be interrupted (e.g., by a hard crash). That
violates the assumption that nbtree verification (when run with relation
level ShareLock) ought to reliably land on the leftmost page on each
level according to the leftmost internal page one level up.
To fix, demote the ERROR to a DEBUG1 message, since it does still seem
useful to draw a little attention to the issue. It might help when
debugging amcheck itself.
---
contrib/amcheck/verify_nbtree.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6979aff72..5cad9f88f 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -767,10 +767,13 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
if (state->readonly)
{
if (!P_LEFTMOST(opaque))
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
+ {
+ ereport(DEBUG1,
+ (errcode(ERRCODE_NO_DATA),
errmsg("block %u is not leftmost in index \"%s\"",
current, RelationGetRelationName(state->rel))));
+ leftcurrent = opaque->btpo_prev;
+ }
if (level.istruerootlevel && !P_ISROOT(opaque))
ereport(ERROR,
@@ -1924,10 +1927,14 @@ bt_child_highkey_check(BtreeCheckState *state,
opaque = BTPageGetOpaque(page);
- /* The first page we visit at the level should be leftmost */
+ /*
+ * The first page we visit at the level will usually be the leftmost,
+ * but not if there was an interrupted page deletion during some
+ * earlier VACUUM
+ */
if (first && !BlockNumberIsValid(state->prevrightlink) && !P_LEFTMOST(opaque))
- ereport(ERROR,
- (errcode(ERRCODE_INDEX_CORRUPTED),
+ ereport(DEBUG1,
+ (errcode(ERRCODE_NO_DATA),
errmsg("the first child of leftmost target page is not leftmost of its level in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Target block=%u child block=%u target page lsn=%X/%X.",
--
2.40.1
On 20/05/2023 05:17, Peter Geoghegan wrote:
It has come to my attention that there is a remaining issue of the
same general nature in nbtree VACUUM's page deletion code. Though this
remaining issue seems significantly less likely to come up in
practice, there is no reason to take any chances here.
Yeah, let's be consistent. Any idea what might cause this corruption?
Attached patch fixes it. + /* + * Validate target's right sibling page's left link points back to target. + * + * This is known to fail in the field in the presence of index corruption, + * so we go to the trouble of avoiding a hard ERROR. This is fairly close + * to what we did earlier on when we located the target's left sibling + * (iff target has a left sibling). + */
This comment notes that this is similar to what we did with the left
sibling, but there isn't really any mention at the left sibling code
about avoiding hard ERRORs. Feels a bit backwards. Maybe move the
comment about avoiding the hard ERROR to where the left sibling is
handled. Or explain it in the function comment and just have short
"shouldn't happen, but avoid hard ERROR if the index is corrupt" comment
here.
Also attached is a bugfix for a minor issue in amcheck's
bt_index_parent_check() function, which I noticed in passing, while I
tested the first patch. We assumed that we'd always land on the
leftmost page on each level first (the leftmost according to internal
pages one level up). That assumption is faulty because page deletion
of the leftmost page is quite possible. Page deletion can be
interrupted, leaving a half-dead leaf page (possibly the leftmost leaf
page) without any downlink one level up, while still leaving a left
sibling link on the leaf level (in the leaf page that isn't about to
become the leftmost, but won't until the interrupted page deletion can
be completed).
You could check that the left sibling is indeed a half-dead page.
ereport(DEBUG1,
(errcode(ERRCODE_NO_DATA),
errmsg("block %u is not leftmost in index \"%s\"",
current, RelationGetRelationName(state->rel))));
ERRCODE_NO_DATA doesn't look right. Let's just leave out the errcode.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Sun, May 21, 2023 at 11:51 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Any idea what might cause this corruption?
Not really, no. As far as I know the specific case that was brought to
my attention (that put me on the path to writing this patch) was just
an isolated incident. The interesting detail (if any) is that it was a
relatively recent version of Postgres (13), and that there were no
other known problems. This means that there is a plausible remaining
gap in the defensive checks in nbtree VACUUM on recent versions -- we
might have expected to avoid a hard ERROR in some other way, from one
of the earlier checks, but that didn't happen on at least one
occasion.
You can find several references to the "right sibling's left-link
doesn't match:" error message by googling. Most of them are clearly
from the page split ERROR. But there are some from VACUUM, too:
Granted, that was from a 9.2 database -- before your 9.4 work that
made this whole area much more robust.
This comment notes that this is similar to what we did with the left
sibling, but there isn't really any mention at the left sibling code
about avoiding hard ERRORs. Feels a bit backwards. Maybe move the
comment about avoiding the hard ERROR to where the left sibling is
handled. Or explain it in the function comment and just have short
"shouldn't happen, but avoid hard ERROR if the index is corrupt" comment
here.
Good point. Will do it that way.
Also attached is a bugfix for a minor issue in amcheck's
bt_index_parent_check() function, which I noticed in passing, while I
tested the first patch.
You could check that the left sibling is indeed a half-dead page.
It's very hard to see, but...I think that we do. Sort of. Since
bt_recheck_sibling_links() is prepared to check that the left
sibling's right link points back to the target page.
One problem with that is that it only happens in the AccessShareLock
case, whereas we're concerned with fixing an issue in the ShareLock
case. Another problem is that it's awkward and complicated to explain.
It's not obvious that it's worth trying to explain all this and/or
making sure that it happens in the ShareLock case, so that we have
everything covered. I'm unsure.
ERRCODE_NO_DATA doesn't look right. Let's just leave out the errcode.
Agreed.
--
Peter Geoghegan
On Mon, May 22, 2023 at 9:22 AM Peter Geoghegan <pg@bowt.ie> wrote:
This comment notes that this is similar to what we did with the left
sibling, but there isn't really any mention at the left sibling code
about avoiding hard ERRORs. Feels a bit backwards. Maybe move the
comment about avoiding the hard ERROR to where the left sibling is
handled. Or explain it in the function comment and just have short
"shouldn't happen, but avoid hard ERROR if the index is corrupt" comment
here.Good point. Will do it that way.
Attached is v2, which does it that way. It also adjusts the approach
taken to release locks and pins when the left sibling validation check
fails. This makes it simpler and more consistent with surrounding
code. I might not include this change in the backpatch.
Not including a revised amcheck patch here, since I'm not exactly sure
what to do with your feedback on that one just yet.
--
Peter Geoghegan
Attachments:
v2-0001-nbtree-VACUUM-cope-with-right-sibling-link-corrup.patchapplication/octet-stream; name=v2-0001-nbtree-VACUUM-cope-with-right-sibling-link-corrup.patchDownload
From 836b57525c3c221ef4f2826abed9b32d7f48ad8e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 19 May 2023 16:20:55 -0700
Subject: [PATCH v2 1/2] nbtree VACUUM: cope with right sibling link
corruption.
Avoid "right sibling's left-link doesn't match" errors when vacuuming a
corrupt nbtree index: just LOG the issue and press on with vacuuming.
This ERROR is seen in the field from time to time (it's not just a
theoretical possibility), even though the relevant check takes place
after various similar checks ran without detecting any problems.
It's worth avoiding such errors, even in the event of corruption.
Anything that makes VACUUM unable to make forward progress against one
table/index ultimately risks making the system enter xidStopLimit mode.
There is no good reason to take any chances here.
This is similar to the work from commit 5b861baa (later backpatched as
commit 43e409ce), which taught nbtree to press on with vacuuming an
index when page deletion fails to "re-find" a downlink in the target
page's parent (or in some page to the right of the parent). Both cases
involve throwing errors in nbtree page deletion that correspond to
similar errors at a corresponding point during nbtree page splits. And
in both cases the solution is to demote the ERROR to a LOG message, and
press on (without changing anything about the page split errors).
Note that _bt_unlink_halfdead_page() has long been able to bail on page
deletion when the target page's left sibling was in an inconsistent
state (that goes all the way back to the initial introduction of page
deletion by commit 88dc31e3f2). _bt_unlink_halfdead_page() can now bail
on page deletion in much the same way if the sibling link corruption
happens to be in the right sibling page's link. While there was already
significant overlap between each of the two sibling checks (elevel
notwithstanding), there is no reason to take any chances here.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wzko2q2kP1+UvgJyP9g0mF4hopK0NtQZcxwvMv9_ytGhkQ@mail.gmail.com
Backpatch: 11- (all supported versions).
---
src/backend/access/nbtree/nbtpage.c | 62 ++++++++++++++++++++---------
src/backend/access/nbtree/nbtree.c | 3 +-
2 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 6be891522..486d1563d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2404,24 +2404,22 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (!leftsibvalid)
{
- if (target != leafblkno)
- {
- /* we have only a pin on target, but pin+lock on leafbuf */
- ReleaseBuffer(buf);
- _bt_relbuf(rel, leafbuf);
- }
- else
- {
- /* we have only a pin on leafbuf */
- ReleaseBuffer(leafbuf);
- }
-
+ /*
+ * This is known to fail in the field; sibling link corruption
+ * is relatively common. Press on with vacuuming rather than
+ * just throwing an ERROR.
+ */
ereport(LOG,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("valid left sibling for deletion target could not be located: "
- "left sibling %u of target %u with leafblkno %u and scanblkno %u in index \"%s\"",
+ "left sibling %u of target %u with leafblkno %u and scanblkno %u on level %u of index \"%s\"",
leftsib, target, leafblkno, scanblkno,
- RelationGetRelationName(rel))));
+ targetlevel, RelationGetRelationName(rel))));
+
+ /* Must release all pins and locks on failure exit */
+ ReleaseBuffer(buf);
+ if (target != leafblkno)
+ _bt_relbuf(rel, leafbuf);
return false;
}
@@ -2496,13 +2494,40 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
rbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
page = BufferGetPage(rbuf);
opaque = BTPageGetOpaque(page);
+
+ /*
+ * Validate target's right sibling page. Its left link must point back to
+ * the target page.
+ */
if (opaque->btpo_prev != target)
- ereport(ERROR,
+ {
+ /*
+ * This is known to fail in the field; sibling link corruption is
+ * relatively common. Press on with vacuuming rather than just
+ * throwing an ERROR (same approach used for left-sibling's-right-link
+ * validation check a moment ago).
+ */
+ ereport(LOG,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("right sibling's left-link doesn't match: "
- "block %u links to %u instead of expected %u in index \"%s\"",
- rightsib, opaque->btpo_prev, target,
- RelationGetRelationName(rel))));
+ "right sibling %u of target %u with leafblkno %u "
+ "and scanblkno %u spuriously links to non-target %u "
+ "on level %u of index \"%s\"",
+ rightsib, target, leafblkno,
+ scanblkno, opaque->btpo_prev,
+ targetlevel, RelationGetRelationName(rel))));
+
+ /* Must release all pins and locks on failure exit */
+ if (BufferIsValid(lbuf))
+ _bt_relbuf(rel, lbuf);
+ _bt_relbuf(rel, rbuf);
+ _bt_relbuf(rel, buf);
+ if (target != leafblkno)
+ _bt_relbuf(rel, leafbuf);
+
+ return false;
+ }
+
rightsib_is_rightmost = P_RIGHTMOST(opaque);
*rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
@@ -2727,6 +2752,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
*/
_bt_pendingfsm_add(vstate, target, safexid);
+ /* Success - hold on to lock on leafbuf (might also have been target) */
return true;
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 2df884985..1ce5b1519 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1093,8 +1093,7 @@ backtrack:
* can't be half-dead because only an interrupted VACUUM process can
* leave pages in that state, so we'd definitely have dealt with it
* back when the page was the scanblkno page (half-dead pages are
- * always marked fully deleted by _bt_pagedel()). This assumes that
- * there can be only one vacuum process running at a time.
+ * always marked fully deleted by _bt_pagedel(), barring corruption).
*/
if (!opaque || !P_ISLEAF(opaque) || P_ISHALFDEAD(opaque))
{
--
2.40.1
On Mon, May 22, 2023 at 10:59 AM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v2, which does it that way. It also adjusts the approach
taken to release locks and pins when the left sibling validation check
fails.
I pushed this just now, backpatching all the way.
Not including a revised amcheck patch here, since I'm not exactly sure
what to do with your feedback on that one just yet.
I'd like to go with a minimal approach in my patch to address the
remaining issue in amcheck. Something similar to the patch that was
posted as part of v1. While it seems important to address the issue,
making sure that we have coverage of the leftmost page really being
half-dead (as opposed to something that would constitute corruption)
seems much less important. Ideally we'd have exhaustive coverage, but
it's not a priority for me right now.
--
Peter Geoghegan
On Mon, May 22, 2023, 12:31 Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, May 21, 2023 at 11:51 PM Heikki Linnakangas <hlinnaka@iki.fi>
wrote:Any idea what might cause this corruption?
Not really, no. As far as I know the specific case that was brought to
my attention (that put me on the path to writing this patch) was just
an isolated incident. The interesting detail (if any) is that it was a
relatively recent version of Postgres (13), and that there were no
other known problems. This means that there is a plausible remaining
gap in the defensive checks in nbtree VACUUM on recent versions -- we
might have expected to avoid a hard ERROR in some other way, from one
of the earlier checks, but that didn't happen on at least one
occasion.
What error would one expect to see? I did have a case where vacuum was
erroring on a btree in $previous_job.
On Thu, Jun 1, 2023 at 6:47 AM Greg Stark <stark@mit.edu> wrote:
What error would one expect to see? I did have a case where vacuum was erroring on a btree in $previous_job.
You mean in general? It's usually this one:
https://gitlab.com/gitlab-org/gitlab/-/issues/381443
In the case of this particular issue, the error is "right sibling's
left-link doesn't match". Per:
--
Peter Geoghegan