bt_index_parent_check and concurrently build indexes
Hello, everyone!
While working on [0]https://commitfest.postgresql.org/51/4971/, I encountered an issue involving a missing tuple in
an index that was built concurrently. The problem only occurred once, but
it caused me a significant amount of frustration. :)
After some time, I managed to find a way to reproduce the issue. It turns
out that bt_index_parent_check is not suitable for validating indexes built
concurrently. The reason is that bt_index_parent_check uses SnapshotAny
during the heap scan, whereas an MVCC snapshot is used for the index build.
I’ve attached a patch that reproduces the issue (it incorrectly reports the
index as invalid, even though it is actually valid).
I’m unsure of the best way to address this issue, but here are some
possible options:
* Simply update the documentation.
* Issue a WARNING if !tupleIsAlive.
* Modify bt_index_parent_check to use an MVCC snapshot for the heap scan
Best regards,
Mikhail.
Attachments:
v1-0001-test-to-reproduce-issue-with-bt_index_parent_chec.patchapplication/x-patch; name=v1-0001-test-to-reproduce-issue-with-bt_index_parent_chec.patchDownload
From 60d17152f47e8d85fd3bad961d1965c335493c94 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sun, 8 Dec 2024 17:32:15 +0100
Subject: [PATCH v1] test to reproduce issue with bt_index_parent_check and
CREATE INDEX CONCURRENTLY
---
contrib/amcheck/meson.build | 1 +
.../t/006_cic_bt_index_parent_check.pl | 59 +++++++++++++++++++
2 files changed, 60 insertions(+)
create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index fc08e32539a..2eb7ff11bd7 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -45,6 +45,7 @@ tests += {
't/003_cic_2pc.pl',
't/004_verify_nbtree_unique.pl',
't/005_pitr.pl',
+ 't/006_cic_bt_index_parent_check.pl',
],
},
}
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..d4bb9c2a99d
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,59 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Test CREATE INDEX CONCURRENTLY with concurrent prepared-xact modifications
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_HOT_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key, updated_at timestamp)));
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i,now() FROM generate_series(1, 10000) s(i);));
+
+# Run pgbench.
+$node->pgbench(
+ '--no-vacuum --client=30 --jobs=5 --exit-on-abort --transactions=500',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent UPDATES w/ and CIC',
+ {
+ 'pgbench_concurrent_cic' => q(
+ SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+ \if :gotlock
+ CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);
+ SELECT bt_index_parent_check('idx', heapallindexed => true, rootdescend => true, checkunique => true);
+ DROP INDEX CONCURRENTLY idx;
+ SELECT pg_advisory_unlock(42);
+ \else
+ BEGIN;
+ UPDATE tbl SET updated_at = now() WHERE i = floor(random()*10000);
+ UPDATE tbl SET updated_at = now() WHERE i = floor(random()*10000);
+ UPDATE tbl SET updated_at = now() WHERE i = floor(random()*10000);
+ UPDATE tbl SET updated_at = now() WHERE i = floor(random()*10000);
+ UPDATE tbl SET updated_at = now() WHERE i = floor(random()*10000);
+ UPDATE tbl SET updated_at = now() WHERE i = floor(random()*10000);
+ COMMIT;
+ \endif
+ )
+ });
+
+$node->stop;
+done_testing();
--
2.43.0
Hello, everyone and Peter.
I simplified the patch to reproduce the issue more effectively. Now,
bt_index_parent_check fails on a valid index containing just two tuples.
Peter, I included you in this message because you are the author of
bt_index_parent_check, so I thought you might find this relevant.
Best regards,
Mikhail.
Show quoted text
Attachments:
v2-0001-test-to-reproduce-issue-with-bt_index_parent_chec.patchtext/x-patch; charset=US-ASCII; name=v2-0001-test-to-reproduce-issue-with-bt_index_parent_chec.patchDownload
From 6de9549360b7d92b0184c50016aec1c41534e127 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sun, 8 Dec 2024 17:32:15 +0100
Subject: [PATCH v2] test to reproduce issue with bt_index_parent_check and
CREATE INDEX CONCURRENTLY
---
contrib/amcheck/meson.build | 1 +
.../t/006_cic_bt_index_parent_check.pl | 51 +++++++++++++++++++
2 files changed, 52 insertions(+)
create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index fc08e32539..2eb7ff11bd 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -45,6 +45,7 @@ tests += {
't/003_cic_2pc.pl',
't/004_verify_nbtree_unique.pl',
't/005_pitr.pl',
+ 't/006_cic_bt_index_parent_check.pl',
],
},
}
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 0000000000..1d8bb84f22
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,51 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(
+ q(
+BEGIN;
+SELECT pg_current_xact_id();
+));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+# it fails, because it is expect to find the deleted row in index
+is($result, '0', 'bt_index_parent_check after removed rows');
+
+$in_progress_h->quit;
+done_testing();
--
2.43.0
On 9 Dec 2024, at 23:51, Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
* Modify bt_index_parent_check to use an MVCC snapshot for the heap scan
Hi!
Interesting bug. It's amazing how long it stand, giving that it would be triggered by almost any check after updating a table.
From my POV correct fix direction is to use approach similar to index building.
E.i. remove "if (!state->readonly)" check. Are there any known downsides of this?
Best regards, Andrey Borodin.
Hello, Andrey!
Interesting bug. It's amazing how long it stand, giving that it would be
triggered by almost any check after updating a table.
Probably because in real cases, bt_index_check is used much more frequently
than bt_index_parent_check.
From my POV correct fix direction is to use approach similar to index
building.
E.i. remove "if (!state->readonly)" check. Are there any known downsides
of this?
Yes, it also looks correct to me. I have attached the patch with such
changes.
Also, I have registered a commit fest entry for the issue:
https://commitfest.postgresql.org/51/5438/
Attachments:
v3-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchapplication/octet-stream; name=v3-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchDownload
From adb42e0093cdc5f1e4546d4499760144478679a5 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 13 Dec 2024 00:53:49 +0100
Subject: [PATCH v3] amcheck: Fix bt_index_parent_check behavior with CREATE
INDEX CONCURRENTLY
Modify snapshot handling in verify_nbtree to properly handle indexes created with
CREATE INDEX CONCURRENTLY. Previously, the function could fail to find some tuples in bloom when
verifying such indexes because it was using the SnapshotAny, while indexes were build using MVCC snapshot.
This could lead to false alarms in bt_index_parent_check's validation.
Add a regression test that demonstrates the issue.
Discussion: https://postgr.es/m/CANtu0ojmVd27fEhfpST7RG2KZvwkX%3DdMyKUqg0KM87FkOSdz8Q%40mail.gmail.com
---
contrib/amcheck/meson.build | 1 +
.../t/006_cic_bt_index_parent_check.pl | 51 ++++++++++++++
contrib/amcheck/verify_nbtree.c | 68 ++++++++-----------
3 files changed, 81 insertions(+), 39 deletions(-)
create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index fc08e32539a..2eb7ff11bd7 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -45,6 +45,7 @@ tests += {
't/003_cic_2pc.pl',
't/004_verify_nbtree_unique.pl',
't/005_pitr.pl',
+ 't/006_cic_bt_index_parent_check.pl',
],
},
}
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..1d8bb84f225
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,51 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->append_conf('postgresql.conf',
+ 'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(
+ q(
+BEGIN;
+SELECT pg_current_xact_id();
+));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+# it fails, because it is expect to find the deleted row in index
+is($result, '0', 'bt_index_parent_check after removed rows');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..89526900d21 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -506,7 +506,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
- Snapshot snapshot = SnapshotAny;
if (!readonly)
elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -557,38 +556,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent = 0;
/*
- * Register our own snapshot in !readonly case, rather than asking
+ * Register our own snapshot for heapallindexed, rather than asking
* table_index_build_scan() to do this for us later. This needs to
* happen before index fingerprinting begins, so we can later be
* certain that index fingerprinting should have reached all tuples
* returned by table_index_build_scan().
*/
- if (!state->readonly)
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
- /*
- * GetTransactionSnapshot() always acquires a new MVCC snapshot in
- * READ COMMITTED mode. A new snapshot is guaranteed to have all
- * the entries it requires in the index.
- *
- * We must defend against the possibility that an old xact
- * snapshot was returned at higher isolation levels when that
- * snapshot is not safe for index scans of the target index. This
- * is possible when the snapshot sees tuples that are before the
- * index's indcheckxmin horizon. Throwing an error here should be
- * very rare. It doesn't seem worth using a secondary snapshot to
- * avoid this.
- */
- if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
- !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
- snapshot->xmin))
- ereport(ERROR,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("index \"%s\" cannot be verified using transaction snapshot",
- RelationGetRelationName(rel))));
- }
- }
+ /*
+ * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+ * READ COMMITTED mode. A new snapshot is guaranteed to have all
+ * the entries it requires in the index.
+ *
+ * We must defend against the possibility that an old xact
+ * snapshot was returned at higher isolation levels when that
+ * snapshot is not safe for index scans of the target index. This
+ * is possible when the snapshot sees tuples that are before the
+ * index's indcheckxmin horizon. Throwing an error here should be
+ * very rare. It doesn't seem worth using a secondary snapshot to
+ * avoid this.
+ */
+ if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+ state->snapshot->xmin))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("index \"%s\" cannot be verified using transaction snapshot",
+ RelationGetRelationName(rel))));
+}
/*
* We need a snapshot to check the uniqueness of the index. For better
@@ -600,9 +596,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->indexinfo = BuildIndexInfo(state->rel);
if (state->indexinfo->ii_Unique)
{
- if (snapshot != SnapshotAny)
- state->snapshot = snapshot;
- else
+ if (state->snapshot == InvalidSnapshot)
state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
}
}
@@ -679,13 +673,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Create our own scan for table_index_build_scan(), rather than
* getting it to do so for us. This is required so that we can
- * actually use the MVCC snapshot registered earlier in !readonly
- * case.
+ * actually use the MVCC snapshot registered earlier.
*
* Note that table_index_build_scan() calls heap_endscan() for us.
*/
scan = table_beginscan_strat(state->heaprel, /* relation */
- snapshot, /* snapshot */
+ state->snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
@@ -693,7 +686,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
- * behaves in !readonly case.
+ * behaves.
*
* It's okay that we don't actually use the same lock strength for the
* heap relation as any other ii_Concurrent caller would in !readonly
@@ -702,7 +695,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
* that needs to be sure that there was no concurrent recycling of
* TIDs.
*/
- indexinfo->ii_Concurrent = !state->readonly;
+ indexinfo->ii_Concurrent = true;
/*
* Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -726,14 +719,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent, RelationGetRelationName(heaprel),
100.0 * bloom_prop_bits_set(state->filter))));
- if (snapshot != SnapshotAny)
- UnregisterSnapshot(snapshot);
-
bloom_free(state->filter);
}
/* Be tidy: */
- if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+ if (state->snapshot != InvalidSnapshot)
UnregisterSnapshot(state->snapshot);
MemoryContextDelete(state->targetcontext);
}
--
2.43.0
On Mon, Dec 9, 2024 at 3:51 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
After some time, I managed to find a way to reproduce the issue. It turns out that bt_index_parent_check is not suitable for validating indexes built concurrently.
Good catch.
I’ve attached a patch that reproduces the issue (it incorrectly reports the index as invalid, even though it is actually valid).
I’m unsure of the best way to address this issue, but here are some possible options:
* Simply update the documentation.
* Issue a WARNING if !tupleIsAlive.
* Modify bt_index_parent_check to use an MVCC snapshot for the heap scan
Offhand, I think that using an MVCC snapshot would make the most sense.
It's not as if there is a big benefit to not doing so with
bt_index_parent_check. My recollection is that we did it this way
because it was as close as possible to the CREATE INDEX code that
heapallindexed verification was loosely based on.
--
Peter Geoghegan
On 13 Dec 2024, at 04:59, Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
<v3-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patch>
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
I think usually write only commit year. Something tells me you can safely write 2025 there.
+Test::More->builder->todo_start('filesystem bug')
+ if PostgreSQL::Test::Utils::has_wal_read_bug;
Can't wrap my head why do you need this?
+# it fails, because it is expect to find the deleted row in index
I think this comment describes behavior before the fix in present tense.
- if (snapshot != SnapshotAny)
- UnregisterSnapshot(snapshot);
Snapshot business seems incorrect to me here...
Best regards, Andrey Borodin.
Hello, Andrey!
Thanks for the review!
I think usually write only commit year. Something tells me you can safely
write 2025 there.
Done.
Can't wrap my head why do you need this?
Oops, copy-paste, fixed.
I think this comment describes behavior before the fix in present tense.
Fixed.
Snapshot business seems incorrect to me here...
Hm, it seems like it is correct. `snapshot` variable is deleted, we only
use `state->snapshot` now (if it is required at all).
Best regards,
Mikhail.
Attachments:
v2-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchapplication/octet-stream; name=v2-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchDownload
From 53f9e0f3e42d0804e4af635a785b591b960dcf21 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 13 Dec 2024 00:53:49 +0100
Subject: [PATCH v2] amcheck: Fix bt_index_parent_check behavior with CREATE
INDEX CONCURRENTLY
Modify snapshot handling in verify_nbtree to properly handle indexes created with
CREATE INDEX CONCURRENTLY. Previously, the function could fail to find some tuples in bloom when
verifying such indexes because it was using the SnapshotAny, while indexes were build using MVCC snapshot.
This could lead to false alarms in bt_index_parent_check's validation.
Add a regression test for the the issue.
Discussion: https://postgr.es/m/CANtu0ojmVd27fEhfpST7RG2KZvwkX%3DdMyKUqg0KM87FkOSdz8Q%40mail.gmail.com
---
contrib/amcheck/meson.build | 1 +
.../t/006_cic_bt_index_parent_check.pl | 39 +++++++++++
contrib/amcheck/verify_nbtree.c | 68 ++++++++-----------
3 files changed, 69 insertions(+), 39 deletions(-)
create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index fc08e32539a..2eb7ff11bd7 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -45,6 +45,7 @@ tests += {
't/003_cic_2pc.pl',
't/004_verify_nbtree_unique.pl',
't/005_pitr.pl',
+ 't/006_cic_bt_index_parent_check.pl',
],
},
}
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..89526900d21 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -506,7 +506,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
- Snapshot snapshot = SnapshotAny;
if (!readonly)
elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -557,38 +556,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent = 0;
/*
- * Register our own snapshot in !readonly case, rather than asking
+ * Register our own snapshot for heapallindexed, rather than asking
* table_index_build_scan() to do this for us later. This needs to
* happen before index fingerprinting begins, so we can later be
* certain that index fingerprinting should have reached all tuples
* returned by table_index_build_scan().
*/
- if (!state->readonly)
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
- /*
- * GetTransactionSnapshot() always acquires a new MVCC snapshot in
- * READ COMMITTED mode. A new snapshot is guaranteed to have all
- * the entries it requires in the index.
- *
- * We must defend against the possibility that an old xact
- * snapshot was returned at higher isolation levels when that
- * snapshot is not safe for index scans of the target index. This
- * is possible when the snapshot sees tuples that are before the
- * index's indcheckxmin horizon. Throwing an error here should be
- * very rare. It doesn't seem worth using a secondary snapshot to
- * avoid this.
- */
- if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
- !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
- snapshot->xmin))
- ereport(ERROR,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("index \"%s\" cannot be verified using transaction snapshot",
- RelationGetRelationName(rel))));
- }
- }
+ /*
+ * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+ * READ COMMITTED mode. A new snapshot is guaranteed to have all
+ * the entries it requires in the index.
+ *
+ * We must defend against the possibility that an old xact
+ * snapshot was returned at higher isolation levels when that
+ * snapshot is not safe for index scans of the target index. This
+ * is possible when the snapshot sees tuples that are before the
+ * index's indcheckxmin horizon. Throwing an error here should be
+ * very rare. It doesn't seem worth using a secondary snapshot to
+ * avoid this.
+ */
+ if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+ state->snapshot->xmin))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("index \"%s\" cannot be verified using transaction snapshot",
+ RelationGetRelationName(rel))));
+}
/*
* We need a snapshot to check the uniqueness of the index. For better
@@ -600,9 +596,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->indexinfo = BuildIndexInfo(state->rel);
if (state->indexinfo->ii_Unique)
{
- if (snapshot != SnapshotAny)
- state->snapshot = snapshot;
- else
+ if (state->snapshot == InvalidSnapshot)
state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
}
}
@@ -679,13 +673,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Create our own scan for table_index_build_scan(), rather than
* getting it to do so for us. This is required so that we can
- * actually use the MVCC snapshot registered earlier in !readonly
- * case.
+ * actually use the MVCC snapshot registered earlier.
*
* Note that table_index_build_scan() calls heap_endscan() for us.
*/
scan = table_beginscan_strat(state->heaprel, /* relation */
- snapshot, /* snapshot */
+ state->snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
@@ -693,7 +686,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
- * behaves in !readonly case.
+ * behaves.
*
* It's okay that we don't actually use the same lock strength for the
* heap relation as any other ii_Concurrent caller would in !readonly
@@ -702,7 +695,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
* that needs to be sure that there was no concurrent recycling of
* TIDs.
*/
- indexinfo->ii_Concurrent = !state->readonly;
+ indexinfo->ii_Concurrent = true;
/*
* Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -726,14 +719,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent, RelationGetRelationName(heaprel),
100.0 * bloom_prop_bits_set(state->filter))));
- if (snapshot != SnapshotAny)
- UnregisterSnapshot(snapshot);
-
bloom_free(state->filter);
}
/* Be tidy: */
- if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+ if (state->snapshot != InvalidSnapshot)
UnregisterSnapshot(state->snapshot);
MemoryContextDelete(state->targetcontext);
}
--
2.43.0
Hi Michail
It turns out that bt_index_parent_check is not suitable for
validating indexes built concurrently.
Your finding is right on point! We recently used bt_index_parent_check to
verify concurrently built indexes in a concurrent workload,
bt_index_parent_check often gave such false positive error.
- indexinfo->ii_Concurrent = !state->readonly;
+ indexinfo->ii_Concurrent = true;
One suggestion to this change is that we might need to update the amcheck
doc to reflect that
"This consists of a “dummy” CREATE INDEX CONCURRENTLY operation" rather
than "CREATE INDEX" operation.
Regards,
Donghang
On Mon, Jun 02, 2025 at 05:40:18PM -0700, Donghang Lin wrote:
Your finding is right on point! We recently used bt_index_parent_check to
verify concurrently built indexes in a concurrent workload,
bt_index_parent_check often gave such false positive error.
Good thing is that this is tracked in the CF app:
https://commitfest.postgresql.org/patch/5438/
Peter, could you look at that? amcheck and btree are both in your
area of expertise. Getting this error because of routine CIC or
REINDEX CONCURRENTLY runs is annoying.
--
Michael
Hello, Donghang!
One suggestion to this change is that we might need to update the amcheck doc to reflect that
"This consists of a “dummy” CREATE INDEX CONCURRENTLY operation" rather than "CREATE INDEX" operation.
+1, done. Also fixed some typos in the commit message.
Best regards,
Mikhail.
Attachments:
v3-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchapplication/octet-stream; name=v3-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchDownload
From 2deb56411509955d66399bb052c966c5b77d61b7 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 13 Dec 2024 00:53:49 +0100
Subject: [PATCH v3] amcheck: Fix bt_index_parent_check behavior with CREATE
INDEX CONCURRENTLY
Modify snapshot handling in verify_nbtree to properly handle indexes created with CREATE INDEX CONCURRENTLY.
Previously, the function could fail to find some tuples in bloom when verifying such indexes because it was using the SnapshotAny, while indexes were built using MVCC snapshot.
This could lead to false alarms in bt_index_parent_check's validation.
Add a regression test for the issue.
Discussion: https://postgr.es/m/CANtu0ojmVd27fEhfpST7RG2KZvwkX%3DdMyKUqg0KM87FkOSdz8Q%40mail.gmail.com
---
contrib/amcheck/meson.build | 1 +
.../t/006_cic_bt_index_parent_check.pl | 39 +++++++++++
contrib/amcheck/verify_nbtree.c | 68 ++++++++-----------
doc/src/sgml/amcheck.sgml | 2 +-
4 files changed, 70 insertions(+), 40 deletions(-)
create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index b33e8c9b062..b040000dd55 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -49,6 +49,7 @@ tests += {
't/003_cic_2pc.pl',
't/004_verify_nbtree_unique.pl',
't/005_pitr.pl',
+ 't/006_cic_bt_index_parent_check.pl',
],
},
}
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
- Snapshot snapshot = SnapshotAny;
if (!readonly)
elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent = 0;
/*
- * Register our own snapshot in !readonly case, rather than asking
+ * Register our own snapshot for heapallindexed, rather than asking
* table_index_build_scan() to do this for us later. This needs to
* happen before index fingerprinting begins, so we can later be
* certain that index fingerprinting should have reached all tuples
* returned by table_index_build_scan().
*/
- if (!state->readonly)
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
- /*
- * GetTransactionSnapshot() always acquires a new MVCC snapshot in
- * READ COMMITTED mode. A new snapshot is guaranteed to have all
- * the entries it requires in the index.
- *
- * We must defend against the possibility that an old xact
- * snapshot was returned at higher isolation levels when that
- * snapshot is not safe for index scans of the target index. This
- * is possible when the snapshot sees tuples that are before the
- * index's indcheckxmin horizon. Throwing an error here should be
- * very rare. It doesn't seem worth using a secondary snapshot to
- * avoid this.
- */
- if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
- !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
- snapshot->xmin))
- ereport(ERROR,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("index \"%s\" cannot be verified using transaction snapshot",
- RelationGetRelationName(rel))));
- }
- }
+ /*
+ * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+ * READ COMMITTED mode. A new snapshot is guaranteed to have all
+ * the entries it requires in the index.
+ *
+ * We must defend against the possibility that an old xact
+ * snapshot was returned at higher isolation levels when that
+ * snapshot is not safe for index scans of the target index. This
+ * is possible when the snapshot sees tuples that are before the
+ * index's indcheckxmin horizon. Throwing an error here should be
+ * very rare. It doesn't seem worth using a secondary snapshot to
+ * avoid this.
+ */
+ if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+ state->snapshot->xmin))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("index \"%s\" cannot be verified using transaction snapshot",
+ RelationGetRelationName(rel))));
+}
/*
* We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->indexinfo = BuildIndexInfo(state->rel);
if (state->indexinfo->ii_Unique)
{
- if (snapshot != SnapshotAny)
- state->snapshot = snapshot;
- else
+ if (state->snapshot == InvalidSnapshot)
state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
}
}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Create our own scan for table_index_build_scan(), rather than
* getting it to do so for us. This is required so that we can
- * actually use the MVCC snapshot registered earlier in !readonly
- * case.
+ * actually use the MVCC snapshot registered earlier.
*
* Note that table_index_build_scan() calls heap_endscan() for us.
*/
scan = table_beginscan_strat(state->heaprel, /* relation */
- snapshot, /* snapshot */
+ state->snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
- * behaves in !readonly case.
+ * behaves.
*
* It's okay that we don't actually use the same lock strength for the
* heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
* that needs to be sure that there was no concurrent recycling of
* TIDs.
*/
- indexinfo->ii_Concurrent = !state->readonly;
+ indexinfo->ii_Concurrent = true;
/*
* Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent, RelationGetRelationName(heaprel),
100.0 * bloom_prop_bits_set(state->filter))));
- if (snapshot != SnapshotAny)
- UnregisterSnapshot(snapshot);
-
bloom_free(state->filter);
}
/* Be tidy: */
- if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+ if (state->snapshot != InvalidSnapshot)
UnregisterSnapshot(state->snapshot);
MemoryContextDelete(state->targetcontext);
}
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 98f836e15e7..62e2a51406c 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -382,7 +382,7 @@ SET client_min_messages = DEBUG1;
verification functions is <literal>true</literal>, an additional
phase of verification is performed against the table associated with
the target index relation. This consists of a <quote>dummy</quote>
- <command>CREATE INDEX</command> operation, which checks for the
+ <command>CREATE INDEX CONCURRENTLY</command> operation, which checks for the
presence of all hypothetical new index tuples against a temporary,
in-memory summarizing structure (this is built when needed during
the basic first phase of verification). The summarizing structure
--
2.43.0
Hello, everyone!
Rebased.
Best regards,
Mikhail.
Attachments:
v4-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchapplication/octet-stream; name=v4-0001-amcheck-Fix-bt_index_parent_check-behavior-with-C.patchDownload
From f69cd99cdb82c5ddcc6486a431f9ad73cfe9ccfd Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Fri, 13 Dec 2024 00:53:49 +0100
Subject: [PATCH v4] amcheck: Fix bt_index_parent_check behavior with CREATE
INDEX CONCURRENTLY
Modify snapshot handling in verify_nbtree to properly handle indexes created with CREATE INDEX CONCURRENTLY.
Previously, the function could fail to find some tuples in bloom when verifying such indexes because it was using the SnapshotAny, while indexes were built using MVCC snapshot.
This could lead to false alarms in bt_index_parent_check's validation.
Add a regression test for the issue.
Discussion: https://postgr.es/m/CANtu0ojmVd27fEhfpST7RG2KZvwkX%3DdMyKUqg0KM87FkOSdz8Q%40mail.gmail.com
---
contrib/amcheck/meson.build | 1 +
.../t/007_cic_bt_index_parent_check.pl | 39 +++++++++++
contrib/amcheck/verify_nbtree.c | 68 ++++++++-----------
doc/src/sgml/amcheck.sgml | 2 +-
4 files changed, 70 insertions(+), 40 deletions(-)
create mode 100644 contrib/amcheck/t/007_cic_bt_index_parent_check.pl
diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index 1f0c347ed54..decbd441266 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -50,6 +50,7 @@ tests += {
't/004_verify_nbtree_unique.pl',
't/005_pitr.pl',
't/006_verify_gin.pl',
+ 't/007_cic_bt_index_parent_check.pl',
],
},
}
diff --git a/contrib/amcheck/t/007_cic_bt_index_parent_check.pl b/contrib/amcheck/t/007_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/007_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
- Snapshot snapshot = SnapshotAny;
if (!readonly)
elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent = 0;
/*
- * Register our own snapshot in !readonly case, rather than asking
+ * Register our own snapshot for heapallindexed, rather than asking
* table_index_build_scan() to do this for us later. This needs to
* happen before index fingerprinting begins, so we can later be
* certain that index fingerprinting should have reached all tuples
* returned by table_index_build_scan().
*/
- if (!state->readonly)
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
- /*
- * GetTransactionSnapshot() always acquires a new MVCC snapshot in
- * READ COMMITTED mode. A new snapshot is guaranteed to have all
- * the entries it requires in the index.
- *
- * We must defend against the possibility that an old xact
- * snapshot was returned at higher isolation levels when that
- * snapshot is not safe for index scans of the target index. This
- * is possible when the snapshot sees tuples that are before the
- * index's indcheckxmin horizon. Throwing an error here should be
- * very rare. It doesn't seem worth using a secondary snapshot to
- * avoid this.
- */
- if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
- !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
- snapshot->xmin))
- ereport(ERROR,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("index \"%s\" cannot be verified using transaction snapshot",
- RelationGetRelationName(rel))));
- }
- }
+ /*
+ * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+ * READ COMMITTED mode. A new snapshot is guaranteed to have all
+ * the entries it requires in the index.
+ *
+ * We must defend against the possibility that an old xact
+ * snapshot was returned at higher isolation levels when that
+ * snapshot is not safe for index scans of the target index. This
+ * is possible when the snapshot sees tuples that are before the
+ * index's indcheckxmin horizon. Throwing an error here should be
+ * very rare. It doesn't seem worth using a secondary snapshot to
+ * avoid this.
+ */
+ if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+ state->snapshot->xmin))
+ ereport(ERROR,
+ (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("index \"%s\" cannot be verified using transaction snapshot",
+ RelationGetRelationName(rel))));
+}
/*
* We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->indexinfo = BuildIndexInfo(state->rel);
if (state->indexinfo->ii_Unique)
{
- if (snapshot != SnapshotAny)
- state->snapshot = snapshot;
- else
+ if (state->snapshot == InvalidSnapshot)
state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
}
}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Create our own scan for table_index_build_scan(), rather than
* getting it to do so for us. This is required so that we can
- * actually use the MVCC snapshot registered earlier in !readonly
- * case.
+ * actually use the MVCC snapshot registered earlier.
*
* Note that table_index_build_scan() calls heap_endscan() for us.
*/
scan = table_beginscan_strat(state->heaprel, /* relation */
- snapshot, /* snapshot */
+ state->snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
- * behaves in !readonly case.
+ * behaves.
*
* It's okay that we don't actually use the same lock strength for the
* heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
* that needs to be sure that there was no concurrent recycling of
* TIDs.
*/
- indexinfo->ii_Concurrent = !state->readonly;
+ indexinfo->ii_Concurrent = true;
/*
* Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent, RelationGetRelationName(heaprel),
100.0 * bloom_prop_bits_set(state->filter))));
- if (snapshot != SnapshotAny)
- UnregisterSnapshot(snapshot);
-
bloom_free(state->filter);
}
/* Be tidy: */
- if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+ if (state->snapshot != InvalidSnapshot)
UnregisterSnapshot(state->snapshot);
MemoryContextDelete(state->targetcontext);
}
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 211a0ae1945..0402172a5ff 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -382,7 +382,7 @@ SET client_min_messages = DEBUG1;
verification functions is <literal>true</literal>, an additional
phase of verification is performed against the table associated with
the target index relation. This consists of a <quote>dummy</quote>
- <command>CREATE INDEX</command> operation, which checks for the
+ <command>CREATE INDEX CONCURRENTLY</command> operation, which checks for the
presence of all hypothetical new index tuples against a temporary,
in-memory summarizing structure (this is built when needed during
the basic first phase of verification). The summarizing structure
--
2.43.0
On 21 Jun 2025, at 21:10, Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
Rebased
IMO the patch is RfC, I've just updated the status of the CF iteam.
Thanks.
Best regards, Andrey Borodin.
Hello, everyone!
Could someone please take a look?
We have a confirmed bug, which happens in production [0]/messages/by-id/CAA=D8a2Q23miHJtHRDk_TSQKvd6oHk8wGpkQt99B=9yd-oqnfg@mail.gmail.com, probably
needs to be backported and has a simple patch to address for about 9
months already....
I am not trying to blame someone, just to highlight how it sometimes
feels from the non-committer side.
Best regards,
Mikhail.
[0]: /messages/by-id/CAA=D8a2Q23miHJtHRDk_TSQKvd6oHk8wGpkQt99B=9yd-oqnfg@mail.gmail.com
Hi, I pushed this to 17 and up. Thanks!
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Selbst das größte Genie würde nicht weit kommen, wenn es
alles seinem eigenen Innern verdanken wollte." (Johann Wolfgang von Goethe)
Ni aún el genio más grande llegaría muy lejos si
quisiera sacarlo todo de su propio interior.
Hi, I pushed this to 17 and up. Thanks!
Thanks!
I'll update my CIC patch then - now it is not required to have any
other patch inbuilt.
Hi Álvaro,
Thanks for looking at it. The test in the commit still fails back to 14.
I think it should be backported to 14 and up.
Thanks,
Donghang
On Thu, Dec 4, 2025 at 9:47 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
Show quoted text
Hi, I pushed this to 17 and up. Thanks!
--
Álvaro Herrera 48°01'N 7°57'E —
https://www.EnterpriseDB.com/
"Selbst das größte Genie würde nicht weit kommen, wenn es
alles seinem eigenen Innern verdanken wollte." (Johann Wolfgang von Goethe)
Ni aún el genio más grande llegaría muy lejos si
quisiera sacarlo todo de su propio interior.
Hello,
On 2025-Dec-04, Donghang Lin wrote:
Hi Álvaro,
Thanks for looking at it. The test in the commit still fails back to 14.
I think it should be backported to 14 and up.
You're correct that the bug is older. I'm not sure about backporting
the fix to 16 and earlier though, because it depends on BtreeCheckState
having the snapshot member which was only added in commit 5ae2087202af.
It's not a difficult patch -- attached.
I was being cautious and under the impression that the bug had never
been actually encountered by users, but I just noticed that in your
reply from 2 Jun 2025 you mentioned hitting this problem. So maybe we
should bite the bullet ...
Is anybody against backpatching this to 14-16?
(One thing I realized is that the comment for BtreeCheckState->snapshot
says that it's used for the uniqueness test only, but that's no longer
the case. So that needs fixed ...)
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Attachments:
0001-amcheck-Fix-snapshot-usage-in-bt_index_parent_check.patchtext/x-diff; charset=utf-8Download
From d6109ae0970e3d1c3c1c79de30ce663cb4f82d39 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <alvherre@kurilemu.de>
Date: Thu, 4 Dec 2025 18:12:08 +0100
Subject: [PATCH] amcheck: Fix snapshot usage in bt_index_parent_check
We were using SnapshotAny to do some index checks, but that's wrong and
causes spurious errors when used on indexes created by CREATE INDEX
CONCURRENTLY. Fix it to use an MVCC snapshot, and add a test for it.
This problem came in with commit 5ae2087202af, which introduced
uniqueness check. Backpatch to 17.
Author: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Backpatch-through: 17
Discussion: https://postgr.es/m/CANtu0ojmVd27fEhfpST7RG2KZvwkX=dMyKUqg0KM87FkOSdz8Q@mail.gmail.com
---
contrib/amcheck/t/002_cic.pl | 23 +++++++++++
contrib/amcheck/verify_nbtree.c | 72 +++++++++++++++------------------
doc/src/sgml/amcheck.sgml | 2 +-
3 files changed, 57 insertions(+), 40 deletions(-)
diff --git a/contrib/amcheck/t/002_cic.pl b/contrib/amcheck/t/002_cic.pl
index 42a047a3571..9a23109052d 100644
--- a/contrib/amcheck/t/002_cic.pl
+++ b/contrib/amcheck/t/002_cic.pl
@@ -60,5 +60,28 @@ $node->pgbench(
)
});
+# Test bt_index_parent_check() with indexes created with
+# CREATE INDEX CONCURRENTLY.
+$node->safe_psql('postgres', q(CREATE TABLE quebec(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres',
+ q(INSERT INTO quebec SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM quebec WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY oscar ON quebec(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres',
+ q(SELECT bt_index_parent_check('oscar', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+
$node->stop;
done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ce80052ea66..b4ee073fce9 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -87,6 +87,8 @@ typedef struct BtreeCheckState
/* Buffer access strategy */
BufferAccessStrategy checkstrategy;
+ Snapshot snapshot;
+
/*
* Mutable state, for verification of particular page:
*/
@@ -468,7 +470,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
BTMetaPageData *metad;
uint32 previouslevel;
BtreeLevel current;
- Snapshot snapshot = SnapshotAny;
if (!readonly)
elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -517,37 +518,33 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent = 0;
/*
- * Register our own snapshot in !readonly case, rather than asking
+ * Register our own snapshot for heapallindexed, rather than asking
* table_index_build_scan() to do this for us later. This needs to
* happen before index fingerprinting begins, so we can later be
* certain that index fingerprinting should have reached all tuples
* returned by table_index_build_scan().
*/
- if (!state->readonly)
- {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
- /*
- * GetTransactionSnapshot() always acquires a new MVCC snapshot in
- * READ COMMITTED mode. A new snapshot is guaranteed to have all
- * the entries it requires in the index.
- *
- * We must defend against the possibility that an old xact
- * snapshot was returned at higher isolation levels when that
- * snapshot is not safe for index scans of the target index. This
- * is possible when the snapshot sees tuples that are before the
- * index's indcheckxmin horizon. Throwing an error here should be
- * very rare. It doesn't seem worth using a secondary snapshot to
- * avoid this.
- */
- if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
- !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
- snapshot->xmin))
- ereport(ERROR,
- (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
- errmsg("index \"%s\" cannot be verified using transaction snapshot",
- RelationGetRelationName(rel))));
- }
+ /*
+ * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+ * READ COMMITTED mode. A new snapshot is guaranteed to have all the
+ * entries it requires in the index.
+ *
+ * We must defend against the possibility that an old xact snapshot
+ * was returned at higher isolation levels when that snapshot is not
+ * safe for index scans of the target index. This is possible when
+ * the snapshot sees tuples that are before the index's indcheckxmin
+ * horizon. Throwing an error here should be very rare. It doesn't
+ * seem worth using a secondary snapshot to avoid this.
+ */
+ if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+ !TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+ state->snapshot->xmin))
+ ereport(ERROR,
+ errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+ errmsg("index \"%s\" cannot be verified using transaction snapshot",
+ RelationGetRelationName(rel)));
}
Assert(!state->rootdescend || state->readonly);
@@ -622,13 +619,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Create our own scan for table_index_build_scan(), rather than
* getting it to do so for us. This is required so that we can
- * actually use the MVCC snapshot registered earlier in !readonly
- * case.
+ * actually use the MVCC snapshot registered earlier.
*
* Note that table_index_build_scan() calls heap_endscan() for us.
*/
scan = table_beginscan_strat(state->heaprel, /* relation */
- snapshot, /* snapshot */
+ state->snapshot, /* snapshot */
0, /* number of keys */
NULL, /* scan key */
true, /* buffer access strategy OK */
@@ -636,16 +632,15 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
/*
* Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
- * behaves in !readonly case.
+ * behaves.
*
* It's okay that we don't actually use the same lock strength for the
- * heap relation as any other ii_Concurrent caller would in !readonly
- * case. We have no reason to care about a concurrent VACUUM
- * operation, since there isn't going to be a second scan of the heap
- * that needs to be sure that there was no concurrent recycling of
- * TIDs.
+ * heap relation as any other ii_Concurrent caller would. We have no
+ * reason to care about a concurrent VACUUM operation, since there
+ * isn't going to be a second scan of the heap that needs to be sure
+ * that there was no concurrent recycling of TIDs.
*/
- indexinfo->ii_Concurrent = !state->readonly;
+ indexinfo->ii_Concurrent = true;
/*
* Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -669,13 +664,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
state->heaptuplespresent, RelationGetRelationName(heaprel),
100.0 * bloom_prop_bits_set(state->filter))));
- if (snapshot != SnapshotAny)
- UnregisterSnapshot(snapshot);
-
bloom_free(state->filter);
}
/* Be tidy: */
+ if (state->snapshot != InvalidSnapshot)
+ UnregisterSnapshot(state->snapshot);
MemoryContextDelete(state->targetcontext);
}
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 2b9c1a9205f..4c7c6e88820 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -353,7 +353,7 @@ SET client_min_messages = DEBUG1;
verification functions is <literal>true</literal>, an additional
phase of verification is performed against the table associated with
the target index relation. This consists of a <quote>dummy</quote>
- <command>CREATE INDEX</command> operation, which checks for the
+ <command>CREATE INDEX CONCURRENTLY</command> operation, which checks for the
presence of all hypothetical new index tuples against a temporary,
in-memory summarizing structure (this is built when needed during
the basic first phase of verification). The summarizing structure
--
2.47.3
Hi Álvaro
You're correct that the bug is older. I'm not sure about backporting
the fix to 16 and earlier though, because it depends on BtreeCheckState
having the snapshot member which was only added in commit 5ae2087202af.
It's not a difficult patch -- attached.
The issue itself in fact doesn't depend on BtreeCheckState's
snapshot member.
So the fix for 14-16 can make snapshot a local variable in
the bt_check_every_level
function to make the scope small. Do you think it's worth changing it to a
local variable?
I don't think the issue itself is related to 5ae2087202af but actually
7f563c09f890 where
the heapallindexed flag is introduced for bt_index_check. The committed test
also doesn't check the uniqueness of an index. Should the commit message be
restated
to reflect this and also restate it's backport to 14 and up?
I applied the patch in 14 and noticed that it needs to fix the test number
to get the test passed.
15 and 16 do not need this diff.
-use Test::More tests => 3;
+use Test::More tests => 4;
(One thing I realized is that the comment for BtreeCheckState->snapshot
says that it's used for the uniqueness test only, but that's no longer
the case. So that needs fixed ...)
I think we'll have another chance to reflect that the issue is since
7f563c09f890 but not 5ae2087202af
in master for this fix. I can give it a try.
Thanks,
Donghang
On 2025-Dec-06, Donghang Lin wrote:
The issue itself in fact doesn't depend on BtreeCheckState's snapshot
member. So the fix for 14-16 can make snapshot a local variable in
the bt_check_every_level function to make the scope small. Do you
think it's worth changing it to a local variable?
Ah, excellent observation. Yes, it's definitely worth it.
I don't think the issue itself is related to 5ae2087202af but actually
7f563c09f890 where the heapallindexed flag is introduced for
bt_index_check.
Hmm, interesting. I wonder what made me think it was about uniqueness.
The committed test also doesn't check the uniqueness of an index.
Should the commit message be restated to reflect this and also restate
it's backport to 14 and up?
Sure.
I think we'll have another chance to reflect that the issue is since
7f563c09f890 but not 5ae2087202af in master for this fix. I can give
it a try.
Go ahead.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes... Fixed.
/messages/by-id/482D1632.8010507@sigaev.ru
Hi Álvaro,
I think we'll have another chance to reflect that the issue is since
7f563c09f890 but not 5ae2087202af in master for this fix. I can give
it a try.Go ahead.
Attach the fix for the inaccurate comment in master.
Thanks,
Donghang
Attachments:
0001-Fix-the-comment-for-snapshot-field-BtreeCheckState.patchapplication/octet-stream; name=0001-Fix-the-comment-for-snapshot-field-BtreeCheckState.patchDownload
From d5456ab0bcf35adeaafb06c7a43fc684389de42c Mon Sep 17 00:00:00 2001
From: Donghang Lin <donghanglin@gmail.com>
Date: Sun, 21 Dec 2025 07:46:36 -0800
Subject: [PATCH] Fix the comment for snapshot field BtreeCheckState.
The comment was not updated in commit 6bd469d which fixed the snapshot
usage. it's used for both heapallindexed and uniqueness checking now.
The reasoning in 6bd469d is not accurate. The original problem actually
comes from 7f563c0 rather than 5ae2087. Correct it in this message.
---
contrib/amcheck/verify_nbtree.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f91392a3a49..545c9b2cf32 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -92,9 +92,13 @@ typedef struct BtreeCheckState
BufferAccessStrategy checkstrategy;
/*
- * Info for uniqueness checking. Fill these fields once per index check.
+ * Info for uniqueness checking. Fill this field once per index check.
*/
IndexInfo *indexinfo;
+
+ /**
+ * Table scan snapshot for heapallindexed and uniqueness checking.
+ */
Snapshot snapshot;
/*
--
2.49.0