amcheck: add index-all-keys-match verification for B-Tree

Started by Andrey Borodin4 months ago9 messageshackers

amborodin@acm.org

4 months ago

Hi hackers,

Attached patch adds a new "indexallkeysmatch" option to bt_index_check()
and bt_index_parent_check() that verifies each index tuple points to a
heap tuple with the same key - the reverse of "heapallindexed".

I need the tool to investigate corruption, possibly inflicted by us
ourselves. But the tool might be useful for the community too.
We hit B-tree corruptions where index entries stored different keys than
their heap tuples (e.g. "foobar" in index vs "foo-bar" in heap).
This happened with UTF-8 Russian locales around hyphens/spaces. The
index structure stayed valid so existing checks didn't catch it.

The implementation uses a Bloom filter to avoid excessive random heap
I/O. A sequential heap scan fingerprints visible (key, tid) pairs
first. During the index traversal, each leaf tuple is probed against
the filter; only when the filter says "missing" do we fetch the heap
tuple and compare keys. Posting list entries are expanded and checked
individually.

When both heapallindexed and indexallkeysmatch are enabled, the heap
is scanned twice. Combining them into one pass would complicate the
code and possibly introduce some errors.

There's also a TAP test that detects corruption via expression function swap.
Someone might consider not using bug (corrupting indexes by changing expression)
in tests, but it's already used, so I reused this bug too.

WDYT? Would you like to see it on CF, or do we have enough amcheck patches
there already and it's better to postpone it to v20?

Best regards, Andrey Borodin.

Zsolt Parragi

zsolt.parragi@percona.com

4 months ago

In reply to: Andrey Borodin (#1)

Re: amcheck: add index-all-keys-match verification for B-Tree

Hello

+ /* Only verify tuples pointing to visible heap rows */
+ if (!heap_entry_is_visible(state, tid))
+ return;
...
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+   state->snapshot, slot);
+ if (!found)
+ {

This seems like a duplication, heap_entry_is_visible does the same
thing and returns found.

This also means that the if (!found) block should be unreachable?

Wouldn't it be simpler to remove the is_visible check completely and
use an if(found) later?

+ indexinfo = state->indexinfo;
+ estate = CreateExecutorState();
+ GetPerTupleExprContext(estate)->ecxt_scantuple = slot;
+ FormIndexDatum(indexinfo, slot, estate, values, isnull);
+ FreeExecutorState(estate);

This doesn't need the same cleanup code as heapam.c:1754 and :1997?
Seems like state remains, so we have dangling pointers there and could
crash later.

/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;

Andrey Borodin

amborodin@acm.org

4 months ago

In reply to: Zsolt Parragi (#2)

Re: amcheck: add index-all-keys-match verification for B-Tree

Hi Zsolt!

Many thanks for the review!

On 26 Feb 2026, at 04:18, Zsolt Parragi <zsolt.parragi@percona.com> wrote:

Hello
+ /* Only verify tuples pointing to visible heap rows */
+ if (!heap_entry_is_visible(state, tid))
+ return;
...
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+   state->snapshot, slot);
+ if (!found)
+ {
This seems like a duplication, heap_entry_is_visible does the same
thing and returns found.

This also means that the if (!found) block should be unreachable?

Wouldn't it be simpler to remove the is_visible check completely and
use an if(found) later?

Yup, done so.

Also thinking about it more, I decided to add more corruption checks here:
non-existent tuples are also kind of corruption that must be reported.
This issue addressed in patch 2 so you can verify your notes were addressed.

I have a gut feeling that we might do without snapshot at all...

+ indexinfo = state->indexinfo;
+ estate = CreateExecutorState();
+ GetPerTupleExprContext(estate)->ecxt_scantuple = slot;
+ FormIndexDatum(indexinfo, slot, estate, values, isnull);
+ FreeExecutorState(estate);
This doesn't need the same cleanup code as heapam.c:1754 and :1997?
Seems like state remains, so we have dangling pointers there and could
crash later.

/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;

Yup, fixed and added tests exercising ii_ExpressionsState.

Thanks!

Best regards, Andrey Borodin.

Andrey Borodin

amborodin@acm.org

4 months ago

In reply to: Andrey Borodin (#3)

Re: amcheck: add index-all-keys-match verification for B-Tree

On 26 Feb 2026, at 10:28, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I have a gut feeling that we might do without snapshot at all...

I decided that it's a good idea to verify only visible tuples.
Mismatch of dead tuples might be bad sign, but it's not a corruption.
So we need a snapshot.

PFA patch with documentation.

Thanks!

Best regards, Andrey Borodin.

Zsolt Parragi

zsolt.parragi@percona.com

4 months ago

In reply to: Andrey Borodin (#4)

Re: amcheck: add index-all-keys-match verification for B-Tree

I don't see any other logic problems in the code, I only have a few
minor comments/questions:

+ * Bloom filter says (key, tid) not in heap.  Follow TID to verify; this
+ * amortizes random heap lookups when the filter has false negatives, or

This comment could be a bit confusing, as bloom filters typically have
false positives, not negatives.
Maybe it would be better to phrase this somehow differently?

Another thing is that now amcheck can create two bloom filters,
allocated at the same time, both up to maintenance_work_mem.
Isn't that a detail that should be at least mentioned somewhere?

And in the documentation, shouldn't this new check mention something
similar to heapallindexed, which describes the check possibly missing
corruption because of the bloom filter?

+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index tuple in index \"%s\" does not match heap tuple",
+ RelationGetRelationName(state->rel)),

Shouldn't this also print out the table name?

Andrey Borodin

amborodin@acm.org

4 months ago

In reply to: Zsolt Parragi (#5)

Re: amcheck: add index-all-keys-match verification for B-Tree

On 3 Mar 2026, at 04:19, Zsolt Parragi <zsolt.parragi@percona.com> wrote:

I don't see any other logic problems in the code, I only have a few
minor comments/questions:
+ * Bloom filter says (key, tid) not in heap.  Follow TID to verify; this
+ * amortizes random heap lookups when the filter has false negatives, or
This comment could be a bit confusing, as bloom filters typically have
false positives, not negatives.
Maybe it would be better to phrase this somehow differently?

Another thing is that now amcheck can create two bloom filters,
allocated at the same time, both up to maintenance_work_mem.
Isn't that a detail that should be at least mentioned somewhere?

And in the documentation, shouldn't this new check mention something
similar to heapallindexed, which describes the check possibly missing
corruption because of the bloom filter?
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index tuple in index \"%s\" does not match heap tuple",
+ RelationGetRelationName(state->rel)),
Shouldn't this also print out the table name?

Thanks for the review! I've addressed all these issues in patch steps 4 and 5. Other steps are intact. PFA.

Best regards, Andrey Borodin.

Wenbo Lin

linwenbo1994@gmail.com

3 months ago

In reply to: Andrey Borodin (#6)

Re: amcheck: add index-all-keys-match verification for B-Tree

Hi Andrey
I've been reviewing the patch and ran some concurrent tests against it.
I found an issue with false positive corruption reports under concurrent
VACUUM.

+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+                         SnapshotAny, slot);
+ if (!found)
+ {
+         ExecDropSingleTupleTableSlot(slot);
+         ereport(ERROR,

bt_verify_index_tuple_points_to_heap uses SnapshotAny with
table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
"tuple exists but is dead". However, bt_index_check only holds
AccessShareLock which compatible with VACUUM's ShareUpdateExclusiveLock.
A concurrent VACUUM Phase 1 can prune heap pages(ItemIdSetDead) between
Bloom filter probe and the heap lookup. Causing
table_tuple_fetch_row_version to return false for a legal
dead-but-now-pruned tuple and a false positive corruption report.

The table_tuple_satisfies_snapshot check does not help, it only runs
when the fetch succeeds(LP_NORMAL). Once VACUUM sets LP_DEAD,
heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
snapshot checks.

The reproduce should be: DELETE all rows form a big table, then launch
VACUUM and bt_index_check concurrently. A POC test script attached.

Jakub Wartak

jakub.wartak@enterprisedb.com

3 months ago

In reply to: Andrey Borodin (#6)

Re: amcheck: add index-all-keys-match verification for B-Tree

On Thu, Mar 5, 2026 at 7:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Thanks for the review! I've addressed all these issues in patch steps 4 and 5. Other steps are intact. PFA.

Hi Andrey and Wenboo,

1. I tried the v4 to see how it would cope with heap missing segments[0]/messages/by-id/013D63E2-5D75-492E-85FF-1D5CC0148C82@gmail.com
(with -Dsegsize_blocks=6), so given:

create table ptest(id bigint, tenant_id bigint);
insert into ptest select g, mod(g,10) from generate_series(1, 3000) g;
alter table ptest add primary key (id);
checkpoint;

postgres=# select count(*) from ptest;
count
-------
3000

-- simulate loss of middle segment:
$ mv base/5/24578.1 base/5/24578.1.ORG
$ ls -1 base/5/24578*
base/5/24578
base/5/24578.1.ORG
base/5/24578.2
base/5/24578_fsm
base/5/24578_vm

-- force _mfd_openseg() via new backend to show silent data loss:
postgres=# select count(*) from ptest;
count
-------
1110

-- try new amcheck
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>false);
bt_index_check
----------------

-- try the new indexallkeysmatch
-- might need restart before it is able to properly bailout
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>true);
ERROR: could not open file "base/5/24578.1" (target block 6): No such
file or directory

this error was not quite expected for me (altough it is much better than nothing
as on master today), because patch v4-0002 assumes in
bt_verify_index_tuple_points_to_heap() that we should get false and error in
more human-firendly way ("index tuple points to non-existent heap"). The error
itself is coming out of the following stacktrace:

_mdfd_getseg (forknum=MAIN_FORKNUM, blkno=6)
<- mdstartreadv (..) <- smgrstartreadv (blocknum=blocknum@entry=6 (!))
<- StartReadBuffersImpl (blockNum=6 (!)) <- StartReadBuffer()
<- ReadBuffer_common() <- ReadBufferExtended (blockNum=6 (!))
<- heap_fetch() <- heapam_fetch_row_version()
<- table_tuple_fetch_row_version()
<- bt_verify_index_tuple_points_to_heap(targetblock=5, offset=14)
<- bt_target_page_check() <- bt_check_level_from_leftmost()
<- bt_check_every_level() <- bt_index_check_callback()
<- amcheck_lock_relation_and_check() <- bt_index_check()
and it's because itup in bt_target_page_check() has bi_lo = 6, ip_posid = 1

So with the attached patch it becomes more human understandable, but it is
somehow orthogonal check? Still maybe we can combine this under this $thread
ambrella and make it as an option of indexallkeysmatch too? Attached
behaves likes this:

postgres=# select bt_index_check(index=>'ptest_pkey');
ERROR: index line pointer in index "ptest_pkey" points to missing
page in table "ptest"
DETAIL: Index tid=(5,14) points to heap tid=(6,1) but heap has only 6 blocks.
HINT: this can be caused by lost relation segment (missing or removed file).

(note it does not need to indexallkeysmatch, but it's related)

Also, the attached patch right now blows up pg_amcheck test (not
contrib/amcheck), where we ask to ignore certain corrupted schemas/tables to
test exclusion logic, but we still verify btrees and with attached this btree
verification ends up discovering short heap segment :D I honestly do not know
what to do in this situation (we ask to ignore one table T, but still launch
bt_index_check() on it's index and discover illegal sitation about state of T).

Also I'm not sure if the rechecking of heapnblocks is good way. It's just some
idea.

2. For some reason email from Wenbo didn't get to my emailbox, but I can
see his email here:
/messages/by-id/resend/ba92ac77-24f1-44ad-abf0-11e64e0a7831@gmail.com
(and resending is is somewhat not effective)

I've been reviewing the patch and ran some concurrent tests against it.
I found an issue with false positive corruption reports under concurrent
VACUUM.
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+                         SnapshotAny, slot);
+ if (!found)
+ {
+         ExecDropSingleTupleTableSlot(slot);
+         ereport(ERROR,
bt_verify_index_tuple_points_to_heap uses SnapshotAny with
table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
"tuple exists but is dead". However, bt_index_check only holds
AccessShareLock which compatible with VACUUM's ShareUpdateExclusiveLock.
A concurrent VACUUM Phase 1 can prune heap pages(ItemIdSetDead) between
Bloom filter probe and the heap lookup. Causing
table_tuple_fetch_row_version to return false for a legal
dead-but-now-pruned tuple and a false positive corruption report.

The table_tuple_satisfies_snapshot check does not help, it only runs
when the fetch succeeds(LP_NORMAL). Once VACUUM sets LP_DEAD,
heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
snapshot checks.

The reproduce should be: DELETE all rows form a big table, then launch
VACUUM and bt_index_check concurrently. A POC test script attached.

I think this is spot-on. I haven't run repro, but if we take ASL and VACUUM
is allowed to run (under ShareUpdateExclusiveLock) then amcheck performs the
heap bloom building (but ignores dead tuples, while btree still can
point to those),
but VACUUM could later remove the heap tuple. It seems to be impossible (under
ASL) to differentiate orphaned index tuple from concurrently vacuumed
dead tuple?

Maybe we should grab ShareUpdateExclusiveLock in bt_index_check for
indexallkeysmatch==true? With some new documentation warning about
this? (to block
just VACUUM, but not DML)

Alternative is to do it via bt_index_parent_check()/ShareLock, but it's kind of
hardly acceptable in many sitations.

3. This is more a question than finding: assuming extremly big tables, wouldn't
we benefit from some form of caching for EState in
bt_verify_index_tuple_points_to_heap()? Or we do not care about such stuff
in amcheck? (we seem to re-create Estate with each call to FormIndexDatum())

-J.

[0]: /messages/by-id/013D63E2-5D75-492E-85FF-1D5CC0148C82@gmail.com

Jakub Wartak

jakub.wartak@enterprisedb.com

about 2 months ago

In reply to: Jakub Wartak (#8)

Re: amcheck: add index-all-keys-match verification for B-Tree

Hi Andrey,

On Wed, Apr 8, 2026 at 3:17 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:

On Thu, Mar 5, 2026 at 7:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Thanks for the review! I've addressed all these issues in patch steps 4 and 5. Other steps are intact. PFA.

Hi Andrey and Wenboo,

1. I tried the v4 to see how it would cope with heap missing segments[0]
(with -Dsegsize_blocks=6), so given:

[..]

postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>true);
ERROR: could not open file "base/5/24578.1" (target block 6): No such
file or directory

this error was not quite expected for me (altough it is much better than nothing
as on master today), because patch v4-0002 assumes in
bt_verify_index_tuple_points_to_heap() that we should get false and error in
more human-firendly way ("index tuple points to non-existent heap"). The error
itself is coming out of the following stacktrace:

[..]

So with the attached patch it becomes more human understandable, but it is
somehow orthogonal check? Still maybe we can combine this under this $thread
ambrella and make it as an option of indexallkeysmatch too? Attached
behaves likes this:

postgres=# select bt_index_check(index=>'ptest_pkey');
ERROR: index line pointer in index "ptest_pkey" points to missing
page in table "ptest"
DETAIL: Index tid=(5,14) points to heap tid=(6,1) but heap has only 6 blocks.
HINT: this can be caused by lost relation segment (missing or removed file).

(note it does not need to indexallkeysmatch, but it's related)

[..]

2. For some reason email from Wenbo didn't get to my emailbox, but I can
see his email here:
/messages/by-id/resend/ba92ac77-24f1-44ad-abf0-11e64e0a7831@gmail.com
(and resending is is somewhat not effective)
I've been reviewing the patch and ran some concurrent tests against it.
I found an issue with false positive corruption reports under concurrent
VACUUM.
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+                         SnapshotAny, slot);
+ if (!found)
+ {
+         ExecDropSingleTupleTableSlot(slot);
+         ereport(ERROR,
bt_verify_index_tuple_points_to_heap uses SnapshotAny with
table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
"tuple exists but is dead". However, bt_index_check only holds
AccessShareLock which compatible with VACUUM's ShareUpdateExclusiveLock.
A concurrent VACUUM Phase 1 can prune heap pages(ItemIdSetDead) between
Bloom filter probe and the heap lookup. Causing
table_tuple_fetch_row_version to return false for a legal
dead-but-now-pruned tuple and a false positive corruption report.

The table_tuple_satisfies_snapshot check does not help, it only runs
when the fetch succeeds(LP_NORMAL). Once VACUUM sets LP_DEAD,
heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
snapshot checks.

The reproduce should be: DELETE all rows form a big table, then launch
VACUUM and bt_index_check concurrently. A POC test script attached.
I think this is spot-on. I haven't run repro, but if we take ASL and VACUUM
is allowed to run (under ShareUpdateExclusiveLock) then amcheck performs the
heap bloom building (but ignores dead tuples, while btree still can
point to those) (..)

[..]

[0] - /messages/by-id/013D63E2-5D75-492E-85FF-1D5CC0148C82@gmail.com

Here's my attempt at the fixing the LP_DEAD tuples being reported as false
positives by the v4 patchset as stated earlier. The simplest way for the
v4 to hit those was to:
1) get a fresh initdb cluster
2) run it on all pg_catalog tables/indexes, and to me it (v4*) always reliably
spotted some LP_DEAD ones there, e.g.:

ERROR: index tuple in index "pg_proc_proname_args_nsp_index" points
to non-existent heap tuple in table "pg_proc"
DETAIL: Index tid=(5,21) points to heap tid=(18,32) that no longer exists.

where lp_flags=3 ==> LP_DEAD

When we do bt_index_check() still with AccessShareLock we may expect:
- concurrent VACUUM producing LP_DEADs
- SELECT doing opportunistic heap pruning (micro-vacuuming) and that is
allowed to run even with ShareUpdate (bt_index_parent_check)
- LP_DEAD being already there before we even start (aborted VACUUM?)

I've tried to enhance v4 based on above assumptions, so v5 is attached.

Note: AFAIU opportunistic pruning can produce LP_UNUSED only on HOT-chain
intermediate tuples, and those should be never referenced by a btree index.
I think it's safe to still to stick to AccessShareLock (earlier I've
mentioned ShareUpdateExclusiveLock, but it looks it is doable without that)
as we are not affected by truncation and we have xmin horizon so stuff should
not be removed from underneath us.

Note that the pg_amcheck/003_check (not contrib/amcheck) still blows-up:
not ok 46 - pg_amcheck over schema s2 with corrupt tables excluded
reports no corruption status (got 2 vs expected 0)
not ok 47 - pg_amcheck over schema s2 with corrupt tables excluded
reports no corruption stdout /(?^:^$)/

it's due to SELECT "amcheck_schema".bt_index_check(index := c.oid,
heapallindexed := false )
ERROR: could not open file "base/16384/16508": No such file or directory

being called on index, while the *heap* relation has been intentionally
removed by plan_to_remove_relation_file() from that test. So even if the
tables are exluded by the pg_amcheck --exclude-table , amcheck with
v5-0005 blows up on RelationGetNumberOfBlocks() as the file is gone.
IMHO, we should remove that tes there, but I have not done so far this
far as I simply do not know if that's acceptable.

Moving that check into heapallindexed option is also a possibilty, but
that would
hide cheap check under big and costly option (reading whole heap).

Also, some rough measurements: One can expect indexallkeysmatched to multiply
bt_index_check runtime by roughly ~5x compared to a bare check (all flags off)
run, or - as expected - ~2x when compared to heapallindexed (that makes sense,
we are building another full bloom filter). Sample performance baseline of
corruption checking seems to be like that:
- ~5s for COPY pgbench_accounts to /dev/null (wont discover missing stuff)
- ~2s for COPY (SELECT aid FROM pgbench_accounts) to /dev/null (IOT,
wont discover
missing stuff)
amcheck:
- ~1.5s for bt_index_check('pgbench_accounts_pkey' (with v5b-0005 might discover
loss of heap relation segments)
- ~2s for bt_index_check() with just checkunique=on
- ~7.1s for bt_index_check() with just heapallindexed=on
- ~8.1s for bt_index_check() with just indexallkeysmatched=on (this new feature,
discovers dangling/missing/corrupted entries as Andrey explain in
original post)
- ~14s for full verification (everything on)

-J.

amcheck: add index-all-keys-match verification for B-Tree

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: