amcheck: add index-all-keys-match verification for B-Tree
Hi hackers,
Attached patch adds a new "indexallkeysmatch" option to bt_index_check()
and bt_index_parent_check() that verifies each index tuple points to a
heap tuple with the same key - the reverse of "heapallindexed".
I need this tool to investigate corruption, possibly inflicted by
ourselves. But the tool might be useful to the community too.
We hit B-tree corruptions where index entries stored different keys than
their heap tuples (e.g. "foobar" in index vs "foo-bar" in heap).
This happened with UTF-8 Russian locales around hyphens/spaces. The
index structure stayed valid so existing checks didn't catch it.
The implementation uses a Bloom filter to avoid excessive random heap
I/O. A sequential heap scan fingerprints visible (key, tid) pairs
first. During the index traversal, each leaf tuple is probed against
the filter; only when the filter says "missing" do we fetch the heap
tuple and compare keys. Posting list entries are expanded and checked
individually.
When both heapallindexed and indexallkeysmatch are enabled, the heap
is scanned twice. Combining them into one pass would complicate the
code and risk introducing bugs.
There's also a TAP test that detects the corruption by swapping an
expression function. One might object to exploiting this bug (corrupting
indexes by changing an index expression) in tests, but existing tests
already rely on it, so I reused it here too.
WDYT? Would you like to see it on CF, or do we have enough amcheck patches
there already and it's better to postpone it to v20?
Best regards, Andrey Borodin.
Attachments:
v1-0001-amcheck-add-indexallkeysmatch-verification-for-B-.patch (+432 -41)
Hello
+ /* Only verify tuples pointing to visible heap rows */
+ if (!heap_entry_is_visible(state, tid))
+ return;
...
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+ state->snapshot, slot);
+ if (!found)
+ {
This seems like duplication: heap_entry_is_visible does the same
thing and returns found.
This also means the if (!found) block should be unreachable?
Wouldn't it be simpler to remove the is_visible check completely and
use an if (found) check later?
+ indexinfo = state->indexinfo;
+ estate = CreateExecutorState();
+ GetPerTupleExprContext(estate)->ecxt_scantuple = slot;
+ FormIndexDatum(indexinfo, slot, estate, values, isnull);
+ FreeExecutorState(estate);
Doesn't this need the same cleanup code as heapam.c:1754 and :1997?
It seems the state remains, so we have dangling pointers there and
could crash later:
/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;
Hi Zsolt!
Many thanks for the review!
On 26 Feb 2026, at 04:18, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
Hello
+ /* Only verify tuples pointing to visible heap rows */
+ if (!heap_entry_is_visible(state, tid))
+ return;
...
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+ state->snapshot, slot);
+ if (!found)
+ {

This seems like duplication: heap_entry_is_visible does the same
thing and returns found.
This also means the if (!found) block should be unreachable?
Wouldn't it be simpler to remove the is_visible check completely and
use an if (found) check later?
Yup, done so.
Also, thinking about it more, I decided to add more corruption checks
here: non-existent tuples are also a kind of corruption that must be
reported. This is addressed in patch 2, so you can verify your notes
were addressed.
I have a gut feeling that we might do without snapshot at all...
+ indexinfo = state->indexinfo;
+ estate = CreateExecutorState();
+ GetPerTupleExprContext(estate)->ecxt_scantuple = slot;
+ FormIndexDatum(indexinfo, slot, estate, values, isnull);
+ FreeExecutorState(estate);

Doesn't this need the same cleanup code as heapam.c:1754 and :1997?
It seems the state remains, so we have dangling pointers there and
could crash later:

/* These may have been pointing to the now-gone estate */
indexInfo->ii_ExpressionsState = NIL;
indexInfo->ii_PredicateState = NULL;
Yup, fixed and added tests exercising ii_ExpressionsState.
Thanks!
Best regards, Andrey Borodin.
Attachments:
v22-0001-amcheck-add-indexallkeysmatch-verification-for-B.patch (+437 -41)
v22-0002-amcheck-report-corruption-when-index-points-to-n.patch (+22 -5)
On 26 Feb 2026, at 10:28, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I have a gut feeling that we might do without snapshot at all...
I decided that it's a good idea to verify only visible tuples.
A mismatch in dead tuples might be a bad sign, but it's not corruption.
So we need a snapshot.
PFA patch with documentation.
Thanks!
Best regards, Andrey Borodin.
Attachments:
v3-0001-amcheck-add-indexallkeysmatch-verification-for-B-.patch (+437 -41)
v3-0002-amcheck-report-corruption-when-index-points-to-no.patch (+22 -5)
v3-0003-Document-indexallkeysmatch-parameter-for-amcheck-.patch (+49 -7)
I don't see any other logic problems in the code, I only have a few
minor comments/questions:
+ * Bloom filter says (key, tid) not in heap. Follow TID to verify; this
+ * amortizes random heap lookups when the filter has false negatives, or
This comment could be a bit confusing, as bloom filters typically have
false positives, not negatives.
Maybe it would be better to phrase this somehow differently?
Another thing is that now amcheck can create two bloom filters,
allocated at the same time, both up to maintenance_work_mem.
Isn't that a detail that should be at least mentioned somewhere?
And in the documentation, shouldn't this new check mention something
similar to heapallindexed, which describes the check possibly missing
corruption because of the bloom filter?
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index tuple in index \"%s\" does not match heap tuple",
+ RelationGetRelationName(state->rel)),
Shouldn't this also print out the table name?
On 3 Mar 2026, at 04:19, Zsolt Parragi <zsolt.parragi@percona.com> wrote:
I don't see any other logic problems in the code, I only have a few
minor comments/questions:

+ * Bloom filter says (key, tid) not in heap. Follow TID to verify; this
+ * amortizes random heap lookups when the filter has false negatives, or

This comment could be a bit confusing, as Bloom filters typically have
false positives, not negatives.
Maybe it would be better to phrase this somehow differently?

Another thing is that now amcheck can create two Bloom filters,
allocated at the same time, both up to maintenance_work_mem.
Isn't that a detail that should be at least mentioned somewhere?

And in the documentation, shouldn't this new check mention something
similar to heapallindexed, which describes the check possibly missing
corruption because of the Bloom filter?

+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index tuple in index \"%s\" does not match heap tuple",
+ RelationGetRelationName(state->rel)),

Shouldn't this also print out the table name?
Thanks for the review! I've addressed all these issues in patches 4 and 5; the other patches are intact. PFA.
Best regards, Andrey Borodin.
Attachments:
v4-0001-amcheck-add-indexallkeysmatch-verification-for-B-.patch (+437 -41)
v4-0002-amcheck-report-corruption-when-index-points-to-no.patch (+22 -5)
v4-0003-Document-indexallkeysmatch-parameter-for-amcheck-.patch (+49 -7)
v4-0004-amcheck-address-review-fix-Bloom-filter-comment-a.patch (+10 -9)
v4-0005-amcheck-docs-note-two-Bloom-filters-memory-usage-.patch (+11 -4)
Hi Andrey
I've been reviewing the patch and ran some concurrent tests against it.
I found an issue with false positive corruption reports under concurrent
VACUUM.
+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+ SnapshotAny, slot);
+ if (!found)
+ {
+ ExecDropSingleTupleTableSlot(slot);
+ ereport(ERROR,
bt_verify_index_tuple_points_to_heap uses SnapshotAny with
table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
"tuple exists but is dead". However, bt_index_check only holds
AccessShareLock, which is compatible with VACUUM's
ShareUpdateExclusiveLock. A concurrent VACUUM phase 1 can prune heap
pages (ItemIdSetDead) between the Bloom filter probe and the heap
lookup, causing table_tuple_fetch_row_version to return false for a
legitimate dead-but-now-pruned tuple, and so a false positive
corruption report.
The table_tuple_satisfies_snapshot check does not help; it only runs
when the fetch succeeds (LP_NORMAL). Once VACUUM sets LP_DEAD,
heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
snapshot checks.
The repro is: DELETE all rows from a big table, then launch VACUUM and
bt_index_check concurrently. A POC test script is attached.
Attachments:
008_poc_false_positive.pl
On Thu, Mar 5, 2026 at 7:00 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Thanks for the review! I've addressed all these issues in patch steps 4 and 5. Other steps are intact. PFA.
Hi Andrey and Wenboo,
1. I tried v4 to see how it would cope with missing heap segments [0]
(with -Dsegsize_blocks=6), so given:
create table ptest(id bigint, tenant_id bigint);
insert into ptest select g, mod(g,10) from generate_series(1, 3000) g;
alter table ptest add primary key (id);
checkpoint;
postgres=# select count(*) from ptest;
count
-------
3000
-- simulate loss of middle segment:
$ mv base/5/24578.1 base/5/24578.1.ORG
$ ls -1 base/5/24578*
base/5/24578
base/5/24578.1.ORG
base/5/24578.2
base/5/24578_fsm
base/5/24578_vm
-- force _mfd_openseg() via new backend to show silent data loss:
postgres=# select count(*) from ptest;
count
-------
1110
-- try new amcheck
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>false);
bt_index_check
----------------
-- try the new indexallkeysmatch
-- might need a restart before it is able to properly bail out
postgres=# select bt_index_check(index=>'ptest_pkey',
heapallindexed=>true, checkunique=>true, indexallkeysmatch=>true);
ERROR: could not open file "base/5/24578.1" (target block 6): No such
file or directory
this error was not quite what I expected (although it is much better than
nothing, as on master today), because patch v4-0002 assumes in
bt_verify_index_tuple_points_to_heap() that we should get false and error
out in a more human-friendly way ("index tuple points to non-existent
heap"). The error itself is coming out of the following stack trace:
_mdfd_getseg (forknum=MAIN_FORKNUM, blkno=6)
<- mdstartreadv (..) <- smgrstartreadv (blocknum=blocknum@entry=6 (!))
<- StartReadBuffersImpl (blockNum=6 (!)) <- StartReadBuffer()
<- ReadBuffer_common() <- ReadBufferExtended (blockNum=6 (!))
<- heap_fetch() <- heapam_fetch_row_version()
<- table_tuple_fetch_row_version()
<- bt_verify_index_tuple_points_to_heap(targetblock=5, offset=14)
<- bt_target_page_check() <- bt_check_level_from_leftmost()
<- bt_check_every_level() <- bt_index_check_callback()
<- amcheck_lock_relation_and_check() <- bt_index_check()
and it's because itup in bt_target_page_check() has bi_lo = 6, ip_posid = 1.
So with the attached patch it becomes more human-understandable, but it is
a somewhat orthogonal check? Still, maybe we can combine this under this
$thread umbrella and make it an option of indexallkeysmatch too? The
attached behaves like this:
postgres=# select bt_index_check(index=>'ptest_pkey');
ERROR: index line pointer in index "ptest_pkey" points to missing
page in table "ptest"
DETAIL: Index tid=(5,14) points to heap tid=(6,1) but heap has only 6 blocks.
HINT: this can be caused by lost relation segment (missing or removed file).
(note it does not need indexallkeysmatch, but it's related)
Also, the attached patch currently blows up the pg_amcheck test (not
contrib/amcheck), where we ask to ignore certain corrupted schemas/tables
to test the exclusion logic, but we still verify btrees, and with the
attached patch this btree verification ends up discovering the short heap
segment :D I honestly do not know what to do in this situation (we ask to
ignore one table T, but still launch bt_index_check() on its index and
discover an illegal situation about the state of T).
Also I'm not sure if rechecking heap nblocks is a good way to do this.
It's just an idea.
2. For some reason Wenbo's email didn't reach my mailbox, but I can see
it here:
/messages/by-id/resend/ba92ac77-24f1-44ad-abf0-11e64e0a7831@gmail.com
(and resending it is somewhat ineffective)
I've been reviewing the patch and ran some concurrent tests against it.
I found an issue with false positive corruption reports under concurrent
VACUUM.

+ slot = table_slot_create(state->heaprel, NULL);
+ found = table_tuple_fetch_row_version(state->heaprel, tid,
+ SnapshotAny, slot);
+ if (!found)
+ {
+ ExecDropSingleTupleTableSlot(slot);
+ ereport(ERROR,

bt_verify_index_tuple_points_to_heap uses SnapshotAny with
table_tuple_fetch_row_version to distinguish "tuple doesn't exist" from
"tuple exists but is dead". However, bt_index_check only holds
AccessShareLock, which is compatible with VACUUM's
ShareUpdateExclusiveLock. A concurrent VACUUM phase 1 can prune heap
pages (ItemIdSetDead) between the Bloom filter probe and the heap
lookup, causing table_tuple_fetch_row_version to return false for a
legitimate dead-but-now-pruned tuple, and so a false positive
corruption report.

The table_tuple_satisfies_snapshot check does not help; it only runs
when the fetch succeeds (LP_NORMAL). Once VACUUM sets LP_DEAD,
heapam_fetch sees !ItemIdIsNormal(lp) and returns false before any
snapshot checks.

The repro is: DELETE all rows from a big table, then launch VACUUM and
bt_index_check concurrently. A POC test script is attached.
I think this is spot-on. I haven't run the repro, but if we take
AccessShareLock and VACUUM is allowed to run (under
ShareUpdateExclusiveLock), then amcheck builds the heap Bloom filter
while ignoring dead tuples (which the btree can still point to), and
VACUUM could later remove such a heap tuple. It seems impossible (under
AccessShareLock) to differentiate an orphaned index tuple from a
concurrently vacuumed dead tuple?
Maybe we should grab ShareUpdateExclusiveLock in bt_index_check for
indexallkeysmatch==true, with some new documentation warning about
this? (to block just VACUUM, but not DML)
The alternative is to do it via bt_index_parent_check()/ShareLock, but
that's hardly acceptable in many situations.
3. This is more a question than a finding: assuming extremely big
tables, wouldn't we benefit from some form of caching for the EState in
bt_verify_index_tuple_points_to_heap()? Or do we not care about such
things in amcheck? (we seem to re-create the EState with each call to
FormIndexDatum())
-J.
[0]: /messages/by-id/013D63E2-5D75-492E-85FF-1D5CC0148C82@gmail.com