post-recovery amcheck expectations
Suppose we start with this nbtree (subset of a diagram from verify_nbtree.c):
* 1
* / \
* 2 <-> 3
We're deleting 2, the leftmost leaf under a leftmost internal page. After the
MARK_PAGE_HALFDEAD record, the first downlink from 1 will lead to 3, which
still has a btpo_prev pointing to 2. bt_index_parent_check() complains here:
/* The first page we visit at the level should be leftmost */
if (first && !BlockNumberIsValid(state->prevrightlink) && !P_LEFTMOST(opaque))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("the first child of leftmost target page is not leftmost of its level in index \"%s\"",
RelationGetRelationName(state->rel)),
errdetail_internal("Target block=%u child block=%u target page lsn=%X/%X.",
state->targetblock, blkno,
LSN_FORMAT_ARGS(state->targetlsn))));
One can encounter this if recovery ends between a MARK_PAGE_HALFDEAD record
and its corresponding UNLINK_PAGE record. See the attached test case. The
index is actually fine in such a state, right? I lean toward fixing this by
having amcheck scan left; if left links reach only half-dead or deleted pages,
that's as good as the present child block being P_LEFTMOST. There's a
different error from bt_index_check(), and I've not yet studied how to fix
that:
ERROR: left link/right link pair in index "not_leftmost_pk" not in agreement
DETAIL: Block=0 left block=0 left link from block=4294967295.
Alternatively, one could view this as a need for the user to VACUUM between
recovery and amcheck. The documentation could direct users to "VACUUM
(DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off)" if not done since
last recovery. Does anyone prefer that or some other alternative?
For some other amcheck expectations, the comments suggest reliance on the
bt_index_parent_check() ShareLock. I haven't tried to make test cases for
them, but perhaps recovery can trick them the same way. Examples:
errmsg("downlink or sibling link points to deleted block in index \"%s\"",
errmsg("block %u is not leftmost in index \"%s\"",
errmsg("block %u is not true root in index \"%s\"",
Thanks,
nm
Attachments:
amcheck-post-recovery-v0.patch (text/plain; charset=us-ascii)
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index b82f221..9a7f4f7 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -13,6 +13,7 @@ PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree check_heap
TAP_TESTS = 1
+EXTRA_INSTALL=contrib/pg_walinspect
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/amcheck/t/004_pitr.pl b/contrib/amcheck/t/004_pitr.pl
new file mode 100644
index 0000000..ec6d87e
--- /dev/null
+++ b/contrib/amcheck/t/004_pitr.pl
@@ -0,0 +1,65 @@
+
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Test integrity of intermediate states by PITR to those states
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# origin node: generate WAL records of interest.
+my $origin = PostgreSQL::Test::Cluster->new('origin');
+$origin->init(has_archiving => 1, allows_streaming => 1);
+$origin->append_conf('postgresql.conf', 'autovacuum = off');
+$origin->start;
+$origin->backup('my_backup');
+# Create a table with each of 6 PK values spanning 1/4 of a block. Delete the
+# first four, so one index leaf is eligible for deletion. Make a replication
+# slot just so pg_walinspect will always have access to later WAL.
+my $setup = <<EOSQL;
+BEGIN;
+CREATE EXTENSION amcheck;
+CREATE EXTENSION pg_walinspect;
+CREATE TABLE not_leftmost (c text STORAGE PLAIN);
+INSERT INTO not_leftmost
+ SELECT repeat(n::text, database_block_size / 4)
+ FROM generate_series(1,6) t(n), pg_control_init();
+ALTER TABLE not_leftmost ADD CONSTRAINT not_leftmost_pk PRIMARY KEY (c);
+DELETE FROM not_leftmost WHERE c ~ '^[1-4]';
+SELECT pg_create_physical_replication_slot('for_walinspect', true, false);
+COMMIT;
+EOSQL
+$origin->safe_psql('postgres', $setup);
+my $before_vacuum_lsn =
+ $origin->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+# VACUUM to delete the aforementioned leaf page. Find the LSN of that
+# UNLINK_PAGE record.
+my $exec = <<EOSQL;
+VACUUM VERBOSE not_leftmost;
+SELECT max(start_lsn)
+ FROM pg_get_wal_records_info('$before_vacuum_lsn', 'FFFFFFFF/FFFFFFFF')
+ WHERE resource_manager = 'Btree' AND record_type = 'UNLINK_PAGE';
+EOSQL
+my $unlink_lsn = $origin->safe_psql('postgres', $exec);
+$origin->stop;
+
+# replica node: amcheck at notable points in the WAL stream
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($origin, 'my_backup', has_restoring => 1);
+$replica->append_conf('postgresql.conf',
+ "recovery_target_lsn = '$unlink_lsn'");
+$replica->append_conf('postgresql.conf', 'recovery_target_inclusive = off');
+$replica->append_conf('postgresql.conf', 'recovery_target_action = promote');
+$replica->start;
+$replica->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
+ or die "Timed out while waiting for PITR promotion";
+# recovery done; run amcheck
+is( $replica->psql(
+ 'postgres', "SELECT bt_index_parent_check('not_leftmost_pk', true)"),
+ 0);
+is( $replica->psql(
+ 'postgres', "SELECT bt_index_check('not_leftmost_pk', true)"),
+ 0);
+
+done_testing();
On Wed, Oct 4, 2023 at 7:52 PM Noah Misch <noah@leadboat.com> wrote:
Suppose we start with this nbtree (subset of a diagram from verify_nbtree.c):
* 1
* / \
* 2 <-> 3
We're deleting 2, the leftmost leaf under a leftmost internal page. After the
MARK_PAGE_HALFDEAD record, the first downlink from 1 will lead to 3, which
still has a btpo_prev pointing to 2. bt_index_parent_check() complains here:
Thanks for working on this. Your analysis seems sound to me.
When I reviewed the patch that became commit d114cc53, I presented
Alexander with a test case that demonstrated false positive reports of
corruption involving interrupted page splits (not interrupted page
deletions). Obviously I didn't give sufficient thought to this case,
which is analogous.
Might make sense to test the fix for this issue using a similar
approach: by adding custom code that randomly throws errors at a point
that stresses the implementation. I'm referring to the point at which
VACUUM is between the first and second phase of page deletion: right
before (or directly after) _bt_unlink_halfdead_page() is called. That
isn't fundamentally different to the approach from your TAP test, but
seems like it might add some interesting variation.
One can encounter this if recovery ends between a MARK_PAGE_HALFDEAD record
and its corresponding UNLINK_PAGE record. See the attached test case. The
index is actually fine in such a state, right?
Yes, it is fine.
FWIW, this feels like it might be related to the fact that (unlike
Lanin & Shasha), we don't make the key space move left; we make it
move right instead (just like page splits). In other words, page
deletion isn't the precise opposite of a page split, which is a bit
awkward.
Note, in particular, that _bt_mark_page_halfdead() doesn't do a
straight delete of the pivot tuple in the parent page that points to
the target page, as you might expect. It actually deletes the right
sibling of the target page's pivot, and then performs an in-place overwrite of
the downlink from the pivot tuple that originally pointed to the
target page. Perhaps this isn't worth going into now, but thought you
might appreciate the context.
Terminology note: we sometimes use "downlink" as a synonym of "pivot
tuple" or even "separator key", which is misleading.
I lean toward fixing this by
having amcheck scan left; if left links reach only half-dead or deleted pages,
that's as good as the present child block being P_LEFTMOST.
Also my preference.
There's a
different error from bt_index_check(), and I've not yet studied how to fix
that:
ERROR: left link/right link pair in index "not_leftmost_pk" not in agreement
DETAIL: Block=0 left block=0 left link from block=4294967295.
This looks like it might be a straightforward case of confusing
P_NONE for InvalidBlockNumber (or vice-versa). They're both used to
indicate "no such block" here.
Alternatively, one could view this as a need for the user to VACUUM between
recovery and amcheck. The documentation could direct users to "VACUUM
(DISABLE_PAGE_SKIPPING off, INDEX_CLEANUP on, TRUNCATE off)" if not done since
last recovery. Does anyone prefer that or some other alternative?
I'd rather not go that route. That strikes me as defining the problem
out of existence.
For some other amcheck expectations, the comments suggest reliance on the
bt_index_parent_check() ShareLock. I haven't tried to make test cases for
them, but perhaps recovery can trick them the same way. Examples:
errmsg("downlink or sibling link points to deleted block in index \"%s\"",
errmsg("block %u is not leftmost in index \"%s\"",
errmsg("block %u is not true root in index \"%s\"",
These are all much older. They're certainly all from before the
relevant checks were first added (by commit d114cc53), and seem much
less likely to be buggy.
These older cases are all cases where we descend directly from an
internal page to one of its child pages. Whereas the problem you've
demonstrated involves traversal across levels *and* across siblings in
newer code. That's quite a bit more complicated, since it requires
that we worry about both phases of page deletion -- not just the
first. That in itself necessitates that we deal with various edge
cases. (The really prominent edge-case is the interrupted page
deletion case, which requires significant handling, but evidently
missed a subtlety with leftmost pages).
--
Peter Geoghegan
On Mon, Oct 09, 2023 at 04:46:26PM -0700, Peter Geoghegan wrote:
On Wed, Oct 4, 2023 at 7:52 PM Noah Misch <noah@leadboat.com> wrote:
Might make sense to test the fix for this issue using a similar
approach: by adding custom code that randomly throws errors at a point
that stresses the implementation. I'm referring to the point at which
VACUUM is between the first and second phase of page deletion: right
before (or directly after) _bt_unlink_halfdead_page() is called. That
isn't fundamentally different to the approach from your TAP test, but
seems like it might add some interesting variation.
My initial manual test was like that, actually.
I lean toward fixing this by
having amcheck scan left; if left links reach only half-dead or deleted pages,
that's as good as the present child block being P_LEFTMOST.
Also my preference.
Done mostly that way, except I didn't accept deleted pages. Making this work
on !readonly would take more than that, and readonly shouldn't need that.
There's a
different error from bt_index_check(), and I've not yet studied how to fix
that:
ERROR: left link/right link pair in index "not_leftmost_pk" not in agreement
DETAIL: Block=0 left block=0 left link from block=4294967295.
This looks like it might be a straightforward case of confusing
P_NONE for InvalidBlockNumber (or vice-versa). They're both used to
indicate "no such block" here.
Roughly so. It turned out this scenario was passing leftcurrent=P_NONE to
bt_recheck_sibling_links(), causing that function to use BTPageGetOpaque() on
the metapage.
For some other amcheck expectations, the comments suggest reliance on the
bt_index_parent_check() ShareLock. I haven't tried to make test cases for
them, but perhaps recovery can trick them the same way. Examples:
errmsg("downlink or sibling link points to deleted block in index \"%s\"",
errmsg("block %u is not leftmost in index \"%s\"",
errmsg("block %u is not true root in index \"%s\"",
These are all much older. They're certainly all from before the
relevant checks were first added (by commit d114cc53), and seem much
less likely to be buggy.
After I fixed the original error, the "block %u is not leftmost" surfaced
next. The attached patch fixes that, too. I didn't investigate the others.
The original test was flaky in response to WAL flush timing, but this one
survives thousands of runs.
Attachments:
amcheck-post-recovery-v1.patch (text/plain; charset=us-ascii)
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
amcheck: Distinguish interrupted page deletion from corruption.
This prevents false-positive reports about "the first child of leftmost
target page is not leftmost of its level", "block %u is not leftmost"
and "left link/right link pair". They appeared if amcheck ran before
VACUUM cleaned things, after a cluster exited recovery between the
first-stage and second-stage WAL records of a deletion. Back-patch to
v11 (all supported versions).
Reviewed by FIXME.
Discussion: https://postgr.es/m/20231005025232.c7.nmisch@google.com
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index b82f221..f830bd3 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -12,6 +12,7 @@ PGFILEDESC = "amcheck - function for verifying relation integrity"
REGRESS = check check_btree check_heap
+EXTRA_INSTALL = contrib/pg_walinspect
TAP_TESTS = 1
ifdef USE_PGXS
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index 5b55cf3..51d8107 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -42,6 +42,7 @@ tests += {
't/001_verify_heapam.pl',
't/002_cic.pl',
't/003_cic_2pc.pl',
+ 't/004_pitr.pl',
],
},
}
diff --git a/contrib/amcheck/t/004_pitr.pl b/contrib/amcheck/t/004_pitr.pl
new file mode 100644
index 0000000..6bcc159
--- /dev/null
+++ b/contrib/amcheck/t/004_pitr.pl
@@ -0,0 +1,82 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Test integrity of intermediate states by PITR to those states
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# origin node: generate WAL records of interest.
+my $origin = PostgreSQL::Test::Cluster->new('origin');
+$origin->init(has_archiving => 1, allows_streaming => 1);
+$origin->append_conf('postgresql.conf', 'autovacuum = off');
+$origin->start;
+$origin->backup('my_backup');
+# Create a table with each of 6 PK values spanning 1/4 of a block. Delete the
+# first four, so one index leaf is eligible for deletion. Make a replication
+# slot just so pg_walinspect will always have access to later WAL.
+my $setup = <<EOSQL;
+BEGIN;
+CREATE EXTENSION amcheck;
+CREATE EXTENSION pg_walinspect;
+CREATE TABLE not_leftmost (c text STORAGE PLAIN);
+INSERT INTO not_leftmost
+ SELECT repeat(n::text, database_block_size / 4)
+ FROM generate_series(1,6) t(n), pg_control_init();
+ALTER TABLE not_leftmost ADD CONSTRAINT not_leftmost_pk PRIMARY KEY (c);
+DELETE FROM not_leftmost WHERE c ~ '^[1-4]';
+SELECT pg_create_physical_replication_slot('for_walinspect', true, false);
+COMMIT;
+EOSQL
+$origin->safe_psql('postgres', $setup);
+my $before_vacuum_lsn =
+ $origin->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
+# VACUUM to delete the aforementioned leaf page. Force an XLogFlush() by
+# dropping a permanent table. That way, the XLogReader infrastructure can
+# always see VACUUM's records, even under synchronous_commit=off. Finally,
+# find the LSN of that VACUUM's last UNLINK_PAGE record.
+my $vacuum = <<EOSQL;
+SET synchronous_commit = off;
+VACUUM (VERBOSE, INDEX_CLEANUP ON) not_leftmost;
+CREATE TABLE XLogFlush ();
+DROP TABLE XLogFlush;
+SELECT max(start_lsn)
+ FROM pg_get_wal_records_info('$before_vacuum_lsn', 'FFFFFFFF/FFFFFFFF')
+ WHERE resource_manager = 'Btree' AND record_type = 'UNLINK_PAGE';
+EOSQL
+my $unlink_lsn = $origin->safe_psql('postgres', $vacuum);
+$origin->stop;
+die "did not find UNLINK_PAGE record" unless $unlink_lsn;
+
+# replica node: amcheck at notable points in the WAL stream
+my $replica = PostgreSQL::Test::Cluster->new('replica');
+$replica->init_from_backup($origin, 'my_backup', has_restoring => 1);
+$replica->append_conf('postgresql.conf',
+ "recovery_target_lsn = '$unlink_lsn'");
+$replica->append_conf('postgresql.conf', 'recovery_target_inclusive = off');
+$replica->append_conf('postgresql.conf', 'recovery_target_action = promote');
+$replica->start;
+$replica->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
+ or die "Timed out while waiting for PITR promotion";
+# recovery done; run amcheck
+my $debug = "SET client_min_messages = 'debug1'";
+my ($rc, $stderr);
+$rc = $replica->psql(
+ 'postgres',
+ "$debug; SELECT bt_index_parent_check('not_leftmost_pk', true)",
+ stderr => \$stderr);
+print STDERR $stderr, "\n";
+is($rc, 0, "bt_index_parent_check passes");
+like(
+ $stderr,
+ qr/interrupted page deletion detected/,
+ "bt_index_parent_check: interrupted page deletion detected");
+$rc = $replica->psql(
+ 'postgres',
+ "$debug; SELECT bt_index_check('not_leftmost_pk', true)",
+ stderr => \$stderr);
+print STDERR $stderr, "\n";
+is($rc, 0, "bt_index_check passes");
+
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3e07a3e..faaf579 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -148,6 +148,9 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
bool rootdescend);
static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
BtreeLevel level);
+static bool bt_leftmost_ignoring_half_dead(BtreeCheckState *state,
+ BlockNumber start,
+ BTPageOpaque start_opaque);
static void bt_recheck_sibling_links(BtreeCheckState *state,
BlockNumber btpo_prev_from_target,
BlockNumber leftcurrent);
@@ -776,7 +779,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
*/
if (state->readonly)
{
- if (!P_LEFTMOST(opaque))
+ if (!bt_leftmost_ignoring_half_dead(state, current, opaque))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("block %u is not leftmost in index \"%s\"",
@@ -830,8 +833,16 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
*/
}
- /* Sibling links should be in mutual agreement */
- if (opaque->btpo_prev != leftcurrent)
+ /*
+ * Sibling links should be in mutual agreement. There arises
+ * leftcurrent == P_NONE && btpo_prev != P_NONE when the left sibling
+ * of the parent's low-key downlink is half-dead. (A half-dead page
+ * has no downlink from its parent.) Under heavyweight locking, the
+ * last bt_leftmost_ignoring_half_dead() validated this btpo_prev.
+ * Without heavyweight locking, validation of the P_NONE case remains
+ * unimplemented.
+ */
+ if (opaque->btpo_prev != leftcurrent && leftcurrent != P_NONE)
bt_recheck_sibling_links(state, opaque->btpo_prev, leftcurrent);
/* Check level */
@@ -913,6 +924,61 @@ nextpage:
}
/*
+ * Like P_LEFTMOST(start_opaque), but accept an arbitrarily-long chain of
+ * half-dead, sibling-linked pages to the left. If a half-dead page appears
+ * under state->readonly, the database exited recovery between the first-stage
+ * and second-stage WAL records of a deletion.
+ */
+static bool
+bt_leftmost_ignoring_half_dead(BtreeCheckState *state,
+ BlockNumber start,
+ BTPageOpaque start_opaque)
+{
+ BlockNumber reached = start_opaque->btpo_prev,
+ reached_from = start;
+ bool all_half_dead = true;
+
+ /*
+ * To handle the !readonly case, we'd need to accept BTP_DELETED pages and
+ * potentially observe nbtree/README "Page deletion and backwards scans".
+ */
+ Assert(state->readonly);
+
+ while (reached != P_NONE && all_half_dead)
+ {
+ Page page = palloc_btree_page(state, reached);
+ BTPageOpaque reached_opaque = BTPageGetOpaque(page);
+
+ /*
+ * _bt_unlink_halfdead_page() writes that side-links will continue to
+ * point to the siblings. We can easily check btpo_next.
+ */
+ all_half_dead = P_ISHALFDEAD(reached_opaque) &&
+ reached_opaque->btpo_next == reached_from;
+ if (all_half_dead)
+ {
+ XLogRecPtr pagelsn = PageGetLSN(page);
+
+ /* pagelsn should point to an XLOG_BTREE_MARK_PAGE_HALFDEAD */
+ ereport(DEBUG1,
+ (errcode(ERRCODE_NO_DATA),
+ errmsg_internal("harmless interrupted page deletion detected in index \"%s\"",
+ RelationGetRelationName(state->rel)),
+ errdetail_internal("Block=%u right block=%u page lsn=%X/%X.",
+ reached, reached_from,
+ LSN_FORMAT_ARGS(pagelsn))));
+
+ reached_from = reached;
+ reached = reached_opaque->btpo_prev;
+ }
+
+ pfree(page);
+ }
+
+ return all_half_dead;
+}
+
+/*
* Raise an error when target page's left link does not point back to the
* previous target page, called leftcurrent here. The leftcurrent page's
* right link was followed to get to the current target page, and we expect
@@ -952,6 +1018,9 @@ bt_recheck_sibling_links(BtreeCheckState *state,
BlockNumber btpo_prev_from_target,
BlockNumber leftcurrent)
{
+ /* passing metapage to BTPageGetOpaque() would give irrelevant findings */
+ Assert(leftcurrent != P_NONE);
+
if (!state->readonly)
{
Buffer lbuf;
@@ -1935,7 +2004,8 @@ bt_child_highkey_check(BtreeCheckState *state,
opaque = BTPageGetOpaque(page);
/* The first page we visit at the level should be leftmost */
- if (first && !BlockNumberIsValid(state->prevrightlink) && !P_LEFTMOST(opaque))
+ if (first && !BlockNumberIsValid(state->prevrightlink) &&
+ !bt_leftmost_ignoring_half_dead(state, blkno, opaque))
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("the first child of leftmost target page is not leftmost of its level in index \"%s\"",
On Fri, Oct 20, 2023 at 8:55 PM Noah Misch <noah@leadboat.com> wrote:
I lean toward fixing this by
having amcheck scan left; if left links reach only half-dead or deleted pages,
that's as good as the present child block being P_LEFTMOST.
Also my preference.
Done mostly that way, except I didn't accept deleted pages. Making this work
on !readonly would take more than that, and readonly shouldn't need that.
That makes sense to me. I believe that it's not possible to have a
string of consecutive sibling pages that are all half-dead (regardless
of the BlockNumber order of sibling pages, even). But I'd probably
have written the fix in roughly the same way. Although...maybe you
should try to detect a string of half-dead pages? Hard to say if it's
worth the trouble.
Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
for good luck.
After I fixed the original error, the "block %u is not leftmost" surfaced
next. The attached patch fixes that, too. I didn't investigate the others.
The original test was flaky in response to WAL flush timing, but this one
survives thousands of runs.
Hmm. Can't argue with that. Your fix seems sound.
--
Peter Geoghegan
On Mon, Oct 23, 2023 at 04:46:23PM -0700, Peter Geoghegan wrote:
On Fri, Oct 20, 2023 at 8:55 PM Noah Misch <noah@leadboat.com> wrote:
I lean toward fixing this by
having amcheck scan left; if left links reach only half-dead or deleted pages,
that's as good as the present child block being P_LEFTMOST.
Also my preference.
Done mostly that way, except I didn't accept deleted pages. Making this work
on !readonly would take more than that, and readonly shouldn't need that.
That makes sense to me. I believe that it's not possible to have a
string of consecutive sibling pages that are all half-dead (regardless
of the BlockNumber order of sibling pages, even). But I'd probably
have written the fix in roughly the same way. Although...maybe you
should try to detect a string of half-dead pages? Hard to say if it's
worth the trouble.
I imagined a string of half-dead siblings could arise in structure like this:
* 1
* / | \
* 4 <-> 2 <-> 3
With events like this:
- DELETE renders blk 4 deletable.
- Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
- DELETE renders blk 2 deletable.
- Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.
I didn't try to reproduce that, and something may well prevent it.
Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
for good luck.
Added. That gave me the idea to check for circular links, like other parts of
amcheck do. Net diff:
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -949,11 +949,16 @@ bt_leftmost_ignoring_half_dead(BtreeCheckState *state,
Page page = palloc_btree_page(state, reached);
BTPageOpaque reached_opaque = BTPageGetOpaque(page);
+ CHECK_FOR_INTERRUPTS();
+
/*
- * _bt_unlink_halfdead_page() writes that side-links will continue to
- * point to the siblings. We can easily check btpo_next.
+ * Try to detect btpo_prev circular links. _bt_unlink_halfdead_page()
+ * writes that side-links will continue to point to the siblings.
+ * Check btpo_next for that property.
*/
- all_half_dead = P_ISHALFDEAD(reached_opaque) &&
+ all_half_dead = P_ISHALFDEAD(reached_opaque) &&
+ reached != start &&
+ reached != reached_from &&
reached_opaque->btpo_next == reached_from;
if (all_half_dead)
{
On Mon, Oct 23, 2023 at 7:28 PM Noah Misch <noah@leadboat.com> wrote:
That makes sense to me. I believe that it's not possible to have a
string of consecutive sibling pages that are all half-dead (regardless
of the BlockNumber order of sibling pages, even). But I'd probably
have written the fix in roughly the same way. Although...maybe you
should try to detect a string of half-dead pages? Hard to say if it's
worth the trouble.
I imagined a string of half-dead siblings could arise in structure like this:
* 1
* / | \
* 4 <-> 2 <-> 3
With events like this:
- DELETE renders blk 4 deletable.
- Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
- DELETE renders blk 2 deletable.
- Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.
I didn't try to reproduce that, and something may well prevent it.
FWIW a couple of factors prevent it (in the absence of corruption). These are:
1. Only VACUUM can delete pages, and in general the only possible
source of half-dead pages is an unfortunately timed crash/error within
VACUUM. Each interrupted VACUUM can leave behind at most one half-dead
page.
2. One thing that makes VACUUM back out of deleting an empty page is
the presence of a half-dead right sibling leaf page left behind by
some VACUUM that was interrupted at some point in the past -- see
_bt_rightsib_halfdeadflag() for details.
Obviously, factors 1 and 2 together make three consecutive half-dead
sibling pages impossible. I'm not quite prepared to say that even two
neighboring half-dead sibling pages are an impossibility right now,
but I think that it might well be. Possibly for reasons that are more
accidental than anything else (is the _bt_rightsib_halfdeadflag thing
a "real invariant" or just something we do because we don't want to
add additional complicated handling to nbtpage.c?), so I'll avoid
going into further detail for now.
I'm pointing this out because it argues for softening the wording
about "accept[ing] an arbitrarily-long chain of half-dead,
sibling-linked pages to the left" from your patch.
I was also wondering (mostly to myself) about the relationship (if
any) between the _bt_rightsib_halfdeadflag/_bt_leftsib_splitflag
"invariants" and the bt_child_highkey_check() check. But I don't think
that it makes sense to put that in scope -- your fix seems like a
strict improvement. This relationship is perhaps in scope here to the
limited extent that talking about strings of consecutive half-dead
pages might make it even harder to understand the design of the
bt_child_highkey_check() check. On the other hand...I'm not sure that
I understand every nuance of it myself.
Suggest adding a CHECK_FOR_INTERRUPTS() call to the loop, too, just
for good luck.
Added. That gave me the idea to check for circular links, like other parts of
amcheck do.
Good idea.
--
Peter Geoghegan
On Tue, Oct 24, 2023 at 07:03:34PM -0700, Peter Geoghegan wrote:
On Mon, Oct 23, 2023 at 7:28 PM Noah Misch <noah@leadboat.com> wrote:
That makes sense to me. I believe that it's not possible to have a
string of consecutive sibling pages that are all half-dead (regardless
of the BlockNumber order of sibling pages, even). But I'd probably
have written the fix in roughly the same way. Although...maybe you
should try to detect a string of half-dead pages? Hard to say if it's
worth the trouble.
I imagined a string of half-dead siblings could arise in structure like this:
* 1
* / | \
* 4 <-> 2 <-> 3
With events like this:
- DELETE renders blk 4 deletable.
- Crash with concurrent VACUUM, leaving 4 half-dead after having visited 1-4.
- DELETE renders blk 2 deletable.
- Crash with concurrent VACUUM, leaving 2 half-dead after having visited 1-2.
I didn't try to reproduce that, and something may well prevent it.
FWIW a couple of factors prevent it (in the absence of corruption). These are:
1. Only VACUUM can delete pages, and in general the only possible
source of half-dead pages is an unfortunately timed crash/error within
VACUUM. Each interrupted VACUUM can leave behind at most one half-dead
page.
Agreed.
2. One thing that makes VACUUM back out of deleting an empty page is
the presence of a half-dead right sibling leaf page left behind by
some VACUUM that was interrupted at some point in the past -- see
_bt_rightsib_halfdeadflag() for details.
Obviously, factors 1 and 2 together make three consecutive half-dead
sibling pages impossible.
Can't it still happen if the sequence of unfortunately timed crashes causes
deletions from left to right? Take this example, expanding the one above.
Half-kill 4, crash, half-kill 3, crash, half-kill 2 in:
* 1
* / / | \ \
* 4 <-> 3 <-> 2 <-> 1
(That's not to say it has ever happened outside of a test.)
On Tue, Oct 24, 2023 at 8:05 PM Noah Misch <noah@leadboat.com> wrote:
Can't it still happen if the sequence of unfortunately timed crashes causes
deletions from left to right? Take this example, expanding the one above.
Half-kill 4, crash, half-kill 3, crash, half-kill 2 in:
* 1
* / / | \ \
* 4 <-> 3 <-> 2 <-> 1
(That's not to say it has ever happened outside of a test.)
Hmm. Perhaps you're right. I thought that this wasn't possible in part
due to the fact that you'd have to access all of these leaf pages in
the same order each time, without ever passing over a previous
half-dead page. But I suppose that there's nothing stopping the index
tuples from being deleted from each page in an order that leaves open
the possibility of something like this. (It's extremely unlikely, of
course, but that wasn't ever in question.)
I withdraw my suggestion about the wording from your patch. It seems
committable.
Thanks
--
Peter Geoghegan