[BUG?] check_exclusion_or_unique_constraint false negative
Hello, everyone!
While reviewing [1], I noticed that check_exclusion_or_unique_constraint
occasionally returns false negatives for btree unique indexes during UPSERT
operations.
Although this doesn't cause any real issues with INSERT ON CONFLICT, I
wanted to bring it to your attention, as it might indicate an underlying
problem.
Attached is a patch to reproduce the issue.
make -C src/test/modules/test_misc/ check PROVE_TESTS='t/006_*'
....
# Failed test 'concurrent INSERTs status (got 2 vs expected 0)'
# at t/006_concurrently_unique_fail.pl line 26.
# Failed test 'concurrent INSERTs stderr /(?^:^$)/'
# at t/006_concurrently_unique_fail.pl line 26.
# 'pgbench: error: client 34 script 0 aborted in command
0 query 0: ERROR: we know 31337 in the index!
Best regards,
Mikhail
[1]: /messages/by-id/CANtu0ogs10w=DgbYzZ8MswXE3PUC3J4SGDc0YEuZZeWbL0b6HA@mail.gmail.com
Attachments:
test_+_assert_to_reproduce_possible_issue_with_check_exclusion_or_unique_constraint.patch
Subject: [PATCH] test + assert to reproduce possible issue with check_exclusion_or_unique_constraint
---
Index: src/backend/executor/execIndexing.c
===================================================================
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
--- a/src/backend/executor/execIndexing.c (revision d5e6891502ca9e359aa5f5a381d904fe9d606338)
+++ b/src/backend/executor/execIndexing.c (date 1720979367766)
@@ -889,6 +889,11 @@
}
index_endscan(index_scan);
+ if (!conflict && values[0] == 31337) {
+ ereport(ERROR,
+ (errcode(ERRCODE_EXCLUSION_VIOLATION),
+ errmsg("we know 31337 in the index!")));
+ }
/*
* Ordinarily, at this point the search should have found the originally
Index: src/test/modules/test_misc/t/006_concurrently_unique_fail.pl
===================================================================
diff --git a/src/test/modules/test_misc/t/006_concurrently_unique_fail.pl b/src/test/modules/test_misc/t/006_concurrently_unique_fail.pl
new file mode 100644
--- /dev/null (date 1720979285840)
+++ b/src/test/modules/test_misc/t/006_concurrently_unique_fail.pl (date 1720979285840)
@@ -0,0 +1,39 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test check_exclusion_or_unique_constraint() under concurrent UPSERTs
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key, n int)));
+
+$node->safe_psql('postgres', q(INSERT INTO tbl VALUES(31337,1)));
+
+$node->pgbench(
+ '--no-vacuum --client=40 --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent INSERTs',
+ {
+ 'on_conflicts' => q(
+ INSERT INTO tbl VALUES(31337,1) on conflict(i) do update set n = EXCLUDED.n + 1;
+ )
+ });
+
+$node->stop;
+done_testing();
\ No newline at end of file
Hello, Andres.
Sorry to bother you, but I feel it's necessary to have the possible issue
validated by someone who can decide whether it is okay or not.
The issue is reproducible from the first UPSERT implementation (your commit
168d5805e4c08bed7b95d351bf097cff7c07dd65 from 2015) up to the current master.
The problem appears as follows:
* A unique index contains a specific value (in the test, it is the only
value for the entire index).
* check_exclusion_or_unique_constraint returns FALSE for that value in some
random cases.
* Technically, this means index_getnext_slot finds 0 records, even though we
know the value exists in the index.
I was able to reproduce this only with an UNLOGGED table.
I can't find any scenarios that are actually broken (since the issue is
resolved by speculative insertion later), but this looks suspicious to me.
It could be a symptom of some tricky race condition in the btree.
Best regards,
Mikhail
Hello, everyone!
Updates so far:
* the issue happens with both LOGGED and UNLOGGED relations
* the issue happens with DirtySnapshot
* it does not happen with SnapshotSelf
* it does not happen with SnapshotAny
* it is not related to speculatively inserted tuples - I commented out the
code that inserts them, and the issue still occurs.
Best regards,
Mikhail.
It seems like I've identified the cause of the issue.
Currently, any DirtySnapshot (or SnapshotSelf) scan over a B-tree index may
skip (not find the TID for) some records in the case of parallel updates.
The following scenario is possible (a rough SQL timeline follows the list):
* Session 1 reads a B-tree page using SnapshotDirty and copies item X to
the buffer.
* Session 2 updates item X, inserting a new TID Y into the same page.
* Session 2 commits its transaction.
* Session 1 starts to fetch from the heap and tries to fetch X, but it was
already deleted by session 2. So, it goes to the B-tree for the next TID.
* The B-tree goes to the next page, skipping Y.
* Therefore, the search finds nothing, but tuple Y is still alive.
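In plain SQL terms, the interleaving looks roughly like this (a sketch only,
using the tbl(i, n) table from the attached test; the pause point is internal
to the scan and is shown as a comment, and the concurrent update is assumed
to end up non-HOT, e.g. due to a page split):

-- Session 1: the DirtySnapshot probe performed by an UPSERT
INSERT INTO tbl VALUES (31337, 1)
ON CONFLICT (i) DO UPDATE SET n = EXCLUDED.n + 1;
-- ...suppose the scan has just read the B-tree page and cached item X

-- Session 2: concurrent update of the same row
BEGIN;
UPDATE tbl SET n = n + 1 WHERE i = 31337; -- deletes X, inserts new TID Y into the same page
COMMIT;

-- Session 1 resumes: the heap fetch of X fails the visibility check, the scan
-- continues past the already-cached page, Y is never returned, and the probe
-- reports "no conflict" even though the row exists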
This situation is somewhat counterintuitive: DirtySnapshot might be expected
to show more (or more recent, even uncommitted) data than MVCC, but never
less. Yet, as far as I understand, a DirtySnapshot scan over a B-tree provides
no such guarantee.
Why does it work for MVCC? Because tuple X remains visible under the snapshot,
making Y unnecessary.
This might be "as designed," but I think it needs to be clearly documented (I
couldn't find any documentation of this particular case, only the
_bt_drop_lock_and_maybe_pin-related notes).
Here are the potential consequences of the issue:
* check_exclusion_or_unique_constraint
It may not find a record in a UNIQUE index during INSERT ON CONFLICT
UPDATE. However, this is just a minor performance issue.
* Exclusion constraints with B-tree, like ADD CONSTRAINT exclusion_data
EXCLUDE USING btree (data WITH =)
It should work correctly because the first inserter may "skip" the TID from
a concurrent inserter, but the second one should still find the TID from
the first.
* RelationFindReplTupleByIndex
Amit, this is why I've included you in this previously solo thread :)
RelationFindReplTupleByIndex uses DirtySnapshot and may not find some
records if they are updated by a parallel transaction. This could lead to
lost deletes/updates, especially in the case of streaming=parallel mode.
I'm not familiar with how parallel workers apply transactions, so maybe
this isn't possible.
Best regards,
Mikhail
Dear Michail,
Thanks for pointing out the issue!
* RelationFindReplTupleByIndex
Amit, this is why I've included you in this previously solo thread :)
RelationFindReplTupleByIndex uses DirtySnapshot and may not find some records
if they are updated by a parallel transaction. This could lead to lost
deletes/updates, especially in the case of streaming=parallel mode.
I'm not familiar with how parallel workers apply transactions, so maybe this
isn't possible.
IIUC, the issue can happen when two concurrent transactions using DirtySnapshot access
the same tuples, which is not specific to the parallel apply. Consider that two
subscriptions exist and publishers modify the same tuple of the same table.
In this case, two workers access the tuple, so one of the changes may be missed
via the scenario you described. I feel we do not need special treatment for
parallel apply.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hello, Hayato!
Thanks for pointing out the issue!
Thanks for your attention!
IIUC, the issue can happen when two concurrent transactions using
DirtySnapshot access the same tuples, which is not specific to the parallel
apply
Not exactly, it happens for any DirtySnapshot scan over a B-tree index with
some other transaction updating the same index page (even using the MVCC
snapshot).
So, the logical replication scenario looks like this (a sketch in SQL follows
the list):
* the subscriber worker receives a tuple update/delete from the publisher
* it calls RelationFindReplTupleByIndex to find the tuple in the local table
* some other transaction updates the tuple in the local table (on the
subscriber side) in parallel
* RelationFindReplTupleByIndex may not find the tuple because it uses
DirtySnapshot
* the update/delete is lost
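A rough end-to-end sketch (table and index names here are illustrative; the
extra index on the updated column keeps the update non-HOT):

-- publisher:
CREATE TABLE t (a int PRIMARY KEY, b int);
INSERT INTO t VALUES (1, 1);
CREATE PUBLICATION pub FOR TABLE t;

-- subscriber:
CREATE TABLE t (a int PRIMARY KEY, b int);
CREATE INDEX t_b_idx ON t(b); -- prevents HOT updates of b
CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub;

-- the publisher then runs: UPDATE t SET b = b + 1;
-- if, while the apply worker's DirtySnapshot scan sits between the index read
-- and the heap fetch, a local transaction commits on the subscriber:
--   UPDATE t SET b = b + 100 WHERE a = 1;
-- then RelationFindReplTupleByIndex finds nothing and the incoming change is
-- lost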
Parallel apply mode looks more dangerous because it uses multiple workers on
the subscriber side, so the probability of hitting the issue is higher.
In that case, "some other transaction" is just another worker applying the
changes of a different transaction in parallel.
Best regards,
Mikhail.
On Thu, Aug 1, 2024 at 2:55 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
Thanks for pointing out the issue!
Thanks for your attention!
IIUC, the issue can happen when two concurrent transactions using DirtySnapshot access
the same tuples, which is not specific to the parallel apply
Not exactly, it happens for any DirtySnapshot scan over a B-tree index with some other transaction updating the same index page (even using the MVCC snapshot).
So, the logical replication scenario looks like this:
* the subscriber worker receives a tuple update/delete from the publisher
* it calls RelationFindReplTupleByIndex to find the tuple in the local table
* some other transaction updates the tuple in the local table (on the subscriber side) in parallel
* RelationFindReplTupleByIndex may not find the tuple because it uses DirtySnapshot
* the update/delete is lost
Parallel apply mode looks more dangerous because it uses multiple workers on the subscriber side, so the probability of hitting the issue is higher.
In that case, "some other transaction" is just another worker applying the changes of a different transaction in parallel.
I think it is rather less likely or not possible in a parallel apply
case because such conflicting updates (updates on the same tuple)
should be serialized at the publisher itself. So one of the updates
will be after the commit that has the second update.
I haven't tried the test based on your description of the general
problem with DirtySnapshot scan. In case of logical replication, we
will LOG update_missing type of conflict and the user may need to take
some manual action based on that. I have not tried a test so I could
be wrong as well. I am not sure we can do anything specific to logical
replication for this but feel free to suggest if you have ideas to
solve this problem in general or specific to logical replication.
--
With Regards,
Amit Kapila.
Hello, Amit!
I think it is rather less likely or not possible in a parallel apply
case because such conflicting updates (updates on the same tuple)
should be serialized at the publisher itself. So one of the updates
will be after the commit that has the second update.
Glad to hear! But anyway, such logic looks very fragile to me.
I haven't tried the test based on your description of the general
problem with DirtySnapshot scan. In case of logical replication, we
will LOG update_missing type of conflict and the user may need to take
some manual action based on that.
Currently it is just DEBUG1, so it will probably be missed by the user.
* XXX should this be promoted to ereport(LOG) perhaps?
*/
elog(DEBUG1,
"logical replication did not find row to be updated "
"in replication target relation \"%s\"",
RelationGetRelationName(localrel));
}
I have not tried a test so I could
be wrong as well. I am not sure we can do anything specific to logical
replication for this but feel free to suggest if you have ideas to
solve this problem in general or specific to logical replication.
I've implemented a solution to address the problem more generally; the patch
is attached (see also the branch at [1]).
Here's a summary of the changes:
* For each tuple skipped because it was deleted, we now accumulate the
maximum xmax.
* Before the scan begins, we store the value of the latest completed
transaction.
* If no tuples are found in the index, we check the max(xmax) value. If
this value is newer than the latest completed transaction stored before the
scan, it indicates that a tuple was deleted by another transaction after
the scan started. To ensure all tuples are correctly processed, we then
rescan the index.
I also added a test case covering this scenario using the new injection point
mechanism, and updated the B-tree index documentation to describe this case.
I'll add this to the next commitfest.
Best regards,
Mikhail.
[1]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:concurrent_unique
Attachments:
v1-0001-fix-for-lost-record-in-case-of-DirtySnapshot-inde.patch
From b639379c393c70ca322bab57222513e71c96ad78 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Fri, 2 Aug 2024 16:20:32 +0200
Subject: [PATCH v1] fix for lost record in case of DirtySnapshot index scans +
docs + test
---
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/heapam_visibility.c | 20 +++++++-
src/backend/access/nbtree/README | 10 ++++
src/backend/access/nbtree/nbtinsert.c | 2 +-
src/backend/access/transam/varsup.c | 11 +++++
src/backend/executor/execIndexing.c | 29 +++++++++++-
src/backend/executor/execReplication.c | 26 +++++++++-
src/backend/replication/logical/origin.c | 2 +-
src/include/access/transam.h | 16 +++++++
src/include/utils/snapmgr.h | 8 +++-
src/include/utils/snapshot.h | 5 +-
.../test_misc/t/006_dirty_index_scan.pl | 47 +++++++++++++++++++
13 files changed, 168 insertions(+), 12 deletions(-)
create mode 100644 src/test/modules/test_misc/t/006_dirty_index_scan.pl
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 3bd8b96197..4d1b469d8e 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
hscan = (HeapScanDesc) scan;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
nblocks = hscan->rs_nblocks; /* # blocks to be scanned */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..7ea4b205d5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -404,7 +404,7 @@ tuple_lock_retry:
*
* Loop here to deal with updated or busy tuples
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
for (;;)
{
if (ItemPointerIndicatesMovedPartitions(tid))
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..91de5dcea1 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -719,6 +719,12 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
return TM_Deleted; /* deleted by other */
}
+inline static void UpdateDirtyMaxXmax(Snapshot snapshot, TransactionId xmax)
+{
+ if (snapshot->xip != NULL)
+ snapshot->xip[0] = TransactionIdNewer(xmax, snapshot->xip[0]);
+}
+
/*
* HeapTupleSatisfiesDirty
* True iff heap tuple is valid including effects of open transactions.
@@ -737,7 +743,9 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
* Similarly for snapshot->xmax and the tuple's xmax. If the tuple was
* inserted speculatively, meaning that the inserter might still back down
* on the insertion without aborting the whole transaction, the associated
- * token is also returned in snapshot->speculativeToken.
+ * token is also returned in snapshot->speculativeToken. If xip is not NULL,
+ * xip[0] may be set to the xid of the deleter if it is newer than the
+ * previously stored value.
*/
static bool
HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
@@ -750,6 +758,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
snapshot->xmin = snapshot->xmax = InvalidTransactionId;
snapshot->speculativeToken = 0;
+ /*
+ * We intentionally keep snapshot->xip values unchanged, as they are
+ * expected to be reset by logic outside of a single heap fetch.
+ */
if (!HeapTupleHeaderXminCommitted(tuple))
{
@@ -870,6 +882,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
@@ -893,7 +906,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
return true;
}
if (TransactionIdDidCommit(xmax))
+ {
+ UpdateDirtyMaxXmax(snapshot, xmax);
return false;
+ }
/* it must have aborted or crashed */
return true;
}
@@ -902,6 +918,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false;
}
@@ -931,6 +948,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
HeapTupleHeaderGetRawXmax(tuple));
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..6de72a29ca 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -489,6 +489,16 @@ on the leaf page at all when the page's LSN has changed. (That won't work
with an unlogged index, so for now we don't ever apply the "don't hold
onto pin" optimization there.)
+Despite the locking protocol in place, it is still possible to receive an
+incorrect result during non-MVCC scans. This issue can occur if a concurrent
+transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and copied its content
+into a backend-local buffer, it might skip the old tuple due to the deletion
+and miss the new tuple because of that stale copy. This is a known limitation of the
+SnapshotDirty and SnapshotAny non-MVCC scans. However, for SnapshotDirty,
+it is possible to work around this limitation by using the returned max(xmax)
+to compare it with the latest committed transaction before the scan started.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7e8902e48c..943aee087a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -427,7 +427,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* Assume unique until we find a duplicate */
*is_unique = true;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fb6a86afcb..52109635b4 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -296,6 +296,17 @@ ReadNextFullTransactionId(void)
return fullXid;
}
+FullTransactionId ReadLastCompletedFullTransactionId(void)
+{
+ FullTransactionId fullXid;
+
+ LWLockAcquire(XidGenLock, LW_SHARED);
+ fullXid = TransamVariables->latestCompletedXid;
+ LWLockRelease(XidGenLock);
+
+ return fullXid;
+}
+
/*
* Advance nextXid to the value after a given xid. The epoch is inferred.
* This must only be called during recovery or from two-phase start-up code.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9f05b3654c..45767b4e20 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -115,6 +115,7 @@
#include "nodes/nodeFuncs.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -702,6 +703,8 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
IndexScanDesc index_scan;
ScanKeyData scankeys[INDEX_MAX_KEYS];
SnapshotData DirtySnapshot;
+ TransactionId maxXmax,
+ latestCompletedXid;
int i;
bool conflict;
bool found_self;
@@ -738,9 +741,10 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
- * tuples that aren't visible yet.
+ * tuples that aren't visible yet. Also, detect cases where the index scan
+ * skips the tuple due to a parallel update after the index page was cached.
*/
- InitDirtySnapshot(DirtySnapshot);
+ InitDirtySnapshot(DirtySnapshot, &maxXmax);
for (i = 0; i < indnkeyatts; i++)
{
@@ -774,6 +778,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
retry:
conflict = false;
found_self = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
+ maxXmax = InvalidTransactionId;
index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
@@ -889,6 +899,19 @@ retry:
}
index_endscan(index_scan);
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!conflict && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
/*
* Ordinarily, at this point the search should have found the originally
@@ -902,6 +925,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index d0a89cd577..fbddb6442b 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -183,6 +183,8 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
IndexScanDesc scan;
SnapshotData snap;
TransactionId xwait;
+ TransactionId maxXmax,
+ latestCompletedXid;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -193,7 +195,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, &maxXmax);
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
@@ -203,6 +205,12 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
retry:
found = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ maxXmax = InvalidTransactionId;
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
index_rescan(scan, skey, skey_attoff, NULL, 0);
@@ -242,6 +250,20 @@ retry:
break;
}
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!found && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
+
/* Found tuple, try to lock it in the lockmode. */
if (found)
{
@@ -391,7 +413,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
/* Start a heap scan. */
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, NULL);
scan = table_beginscan(rel, &snap, 0, NULL);
scanslot = table_slot_create(rel, NULL);
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 419e4814f0..04ba0d9ba1 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -278,7 +278,7 @@ replorigin_create(const char *roname)
* to the exclusive lock there's no danger that new rows can appear while
* we're checking.
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd..aae0ad90c2 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -288,6 +288,7 @@ extern void VarsupShmemInit(void);
extern FullTransactionId GetNewTransactionId(bool isSubXact);
extern void AdvanceNextFullTransactionIdPastXid(TransactionId xid);
extern FullTransactionId ReadNextFullTransactionId(void);
+extern FullTransactionId ReadLastCompletedFullTransactionId(void);
extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
Oid oldest_datoid);
extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
@@ -344,6 +345,21 @@ TransactionIdOlder(TransactionId a, TransactionId b)
return b;
}
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+ if (!TransactionIdIsValid(a))
+ return b;
+
+ if (!TransactionIdIsValid(b))
+ return a;
+
+ if (TransactionIdPrecedes(a, b))
+ return b;
+ return a;
+}
+
/* return the older of the two IDs, assuming they're both normal */
static inline TransactionId
NormalTransactionIdOlder(TransactionId a, TransactionId b)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..5e6f3a7e76 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -36,9 +36,13 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
* We don't provide a static SnapshotDirty variable because it would be
* non-reentrant. Instead, users of that snapshot type should declare a
* local variable of type SnapshotData, and initialize it with this macro.
+ * pxid is optional and can be NULL. If it is not NULL, pxid[0] will be set
+ * to the transaction ID of the deleting transaction if the tuple is deleted
+ * and that ID is newer than pxid[0].
*/
-#define InitDirtySnapshot(snapshotdata) \
- ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY)
+#define InitDirtySnapshot(snapshotdata, pxid) \
+ ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY, \
+ (snapshotdata).xip = (pxid))
/*
* Similarly, some initialization is required for a NonVacuumable snapshot.
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888..a68114e500 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -96,7 +96,10 @@ typedef enum SnapshotType
* xmax. If the tuple was inserted speculatively, meaning that the
* inserter might still back down on the insertion without aborting the
* whole transaction, the associated token is also returned in
- * snapshot->speculativeToken. See also InitDirtySnapshot().
+ * snapshot->speculativeToken. If xip is non-NULL, the xid of the
+ * deleting transaction is stored into xip[0] if it is newer than the
+ * existing xip[0] value.
+ * See also InitDirtySnapshot().
* -------------------------------------------------------------------------
*/
SNAPSHOT_DIRTY,
diff --git a/src/test/modules/test_misc/t/006_dirty_index_scan.pl b/src/test/modules/test_misc/t/006_dirty_index_scan.pl
new file mode 100644
index 0000000000..4d116e659e
--- /dev/null
+++ b/src/test/modules/test_misc/t/006_dirty_index_scan.pl
@@ -0,0 +1,47 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test issue with lost tuple in case of DirtySnapshot index scans
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('DirtyScan_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key, n int)));
+
+$node->safe_psql('postgres', q(INSERT INTO tbl VALUES(42,1)));
+$node->safe_psql('postgres', q(SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error')));
+
+$node->pgbench(
+ '--no-vacuum --client=40 --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent UPSERT',
+ {
+ 'on_conflicts' => q(
+ INSERT INTO tbl VALUES(42,1) on conflict(i) do update set n = EXCLUDED.n + 1;
+ )
+ });
+
+$node->safe_psql('postgres', q(SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict')));
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.34.1
On Fri, Aug 2, 2024 at 10:38 PM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
I think it is rather less likely or not possible in a parallel apply
case because such conflicting updates (updates on the same tuple)
should be serialized at the publisher itself. So one of the updates
will be after the commit that has the second update.
Glad to hear! But anyway, such logic looks very fragile to me.
I haven't tried the test based on your description of the general
problem with DirtySnapshot scan. In case of logical replication, we
will LOG update_missing type of conflict and the user may need to take
some manual action based on that.
Currently it is just DEBUG1, so it will probably be missed by the user.
* XXX should this be promoted to ereport(LOG) perhaps?
*/
elog(DEBUG1,
"logical replication did not find row to be updated "
"in replication target relation \"%s\"",
RelationGetRelationName(localrel));
}
Right, but we are extending this functionality to detect and resolve
such conflicts [1][2]. I am hoping that after that such updates won't be
missed.
[1]: https://commitfest.postgresql.org/49/5064/
[2]: https://commitfest.postgresql.org/49/5021/
--
With Regards,
Amit Kapila.
Hello!
Right, but we are extending this functionality to detect and resolve
such conflicts [1][2]. I am hoping after that such updates won't be
missed.
Yes, this is a nice feature. However, without the DirtySnapshot index scan
fix, it will fail in numerous instances, especially in master-master
replication.
The update_missing feature is helpful in this case, but it is still not the
correct event because a real tuple exists, and we should receive
update_differ instead. As a result, some conflict resolution systems may
malfunction. For example, if the resolution method is set to apply_or_skip,
it will insert the new row, causing two rows to exist. This system is quite
fragile, and I am sure there are many more complicated scenarios that could
arise.
Best regards,
Mikhail.
Hi,
Thanks for reporting the issue !
I tried to reproduce this in logical replication but failed. If possible,
could you please share some steps to reproduce it in the logical replication
context?
In my test, if the tuple is updated and new tuple is in the same page,
heapam_index_fetch_tuple should find the new tuple using HOT chain. So, it's a
bit unclear to me how the updated tuple is missing. Maybe I missed some other
conditions for this issue.
It would be better if we could reproduce this by adding some breakpoints using
gdb, which may help us to write a TAP test using an injection point to
reproduce this reliably. I see the TAP test you shared used pgbench to
reproduce this; it works, but it would be great if we could analyze the issue
more deeply by debugging the code.
And I have a few questions related to the steps you shared:
* Session 1 reads a B-tree page using SnapshotDirty and copies item X to the buffer.
* Session 2 updates item X, inserting a new TID Y into the same page.
* Session 2 commits its transaction.
* Session 1 starts to fetch from the heap and tries to fetch X, but it was
already deleted by session 2. So, it goes to the B-tree for the next TID.
* The B-tree goes to the next page, skipping Y.
* Therefore, the search finds nothing, but tuple Y is still alive.
I am wondering at which point the update should happen. Should it happen after
calling index_getnext_tid and before index_fetch_heap? It would be great if
you could give more details on the above steps. Thanks!
Best Regards,
Hou zj
Hello, Hou zj!
In my test, if the tuple is updated and new tuple is in the same page,
heapam_index_fetch_tuple should find the new tuple using HOT chain. So, it's a
bit unclear to me how the updated tuple is missing. Maybe I missed some other
conditions for this issue.
Yeah, I think the pgbench-based reproducer may also cause page splits in the
btree.
But we may add an index to the table to disable HOT (see the example below).
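For example, an index on the updated column (this is what the attached spec
does with tbl_n_idx) makes such updates non-HOT:

CREATE INDEX tbl_n_idx ON test.tbl(n);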
I have attached a reproducer for this case using a spec and injection
points.
I hope it helps; check the attached files.
Best regards,
Mikhail.
Attachments:
v2-0002-additional-test-spec-to-reproduce-dirty-snapshot-.patch
From e3c1beb4d2739fb3b1cb7e068a7ef91b0da61fc6 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Mon, 12 Aug 2024 13:07:15 +0200
Subject: [PATCH v2 2/2] additional test spec to reproduce dirty snapshot scan
issue
---
src/backend/access/index/indexam.c | 8 ++++
src/backend/executor/execIndexing.c | 1 +
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 39 +++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 43 +++++++++++++++++++
6 files changed, 93 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index dcd04b813d..78b3a58b3b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -694,6 +695,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelationOid(scan->indexRelation->rd_id))
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch");
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 45767b4e20..479f145d99 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -775,6 +775,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
* May have to restart scan from this point if a potential conflict is
* found.
*/
+ INJECTION_POINT("check_exclusion_or_unique_constraint_before_index_scan");
retry:
conflict = false;
found_self = false;
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2ffd2f77ed..981dff4e6f 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -9,7 +9,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = inplace
+ISOLATION = inplace dirty_index_scan
# The injection points are cluster-wide, so disable installcheck
NO_INSTALLCHECK = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 0000000000..249e69cb72
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,39 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s3_s1 s2_s1 s3_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s3_s1:
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_before_index_scan');
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_before_index_scan');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42;
+step s3_s2:
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+step s1_s1: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 3c23c14d81..4969952797 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -40,6 +40,7 @@ tests += {
'isolation': {
'specs': [
'inplace',
+ 'dirty_index_scan',
],
},
}
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 0000000000..17c5ec37e6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,43 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_before_index_scan', 'wait');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_before_index_scan');
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_before_index_scan');
+}
+step s3_s2 {
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+}
+
+permutation
+ s1_s1
+ s3_s1
+ s2_s1
+ s3_s2
\ No newline at end of file
--
2.34.1
v2-0001-fix-for-lost-record-in-case-of-DirtySnapshot-inde.patch
From b639379c393c70ca322bab57222513e71c96ad78 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Fri, 2 Aug 2024 16:20:32 +0200
Subject: [PATCH v2 1/2] fix for lost record in case of DirtySnapshot index
scans + docs + test
---
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/heapam_visibility.c | 20 +++++++-
src/backend/access/nbtree/README | 10 ++++
src/backend/access/nbtree/nbtinsert.c | 2 +-
src/backend/access/transam/varsup.c | 11 +++++
src/backend/executor/execIndexing.c | 29 +++++++++++-
src/backend/executor/execReplication.c | 26 +++++++++-
src/backend/replication/logical/origin.c | 2 +-
src/include/access/transam.h | 16 +++++++
src/include/utils/snapmgr.h | 8 +++-
src/include/utils/snapshot.h | 5 +-
.../test_misc/t/006_dirty_index_scan.pl | 47 +++++++++++++++++++
13 files changed, 168 insertions(+), 12 deletions(-)
create mode 100644 src/test/modules/test_misc/t/006_dirty_index_scan.pl
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 3bd8b96197..4d1b469d8e 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
hscan = (HeapScanDesc) scan;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
nblocks = hscan->rs_nblocks; /* # blocks to be scanned */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..7ea4b205d5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -404,7 +404,7 @@ tuple_lock_retry:
*
* Loop here to deal with updated or busy tuples
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
for (;;)
{
if (ItemPointerIndicatesMovedPartitions(tid))
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..91de5dcea1 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -719,6 +719,12 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
return TM_Deleted; /* deleted by other */
}
+inline static void UpdateDirtyMaxXmax(Snapshot snapshot, TransactionId xmax)
+{
+ if (snapshot->xip != NULL)
+ snapshot->xip[0] = TransactionIdNewer(xmax, snapshot->xip[0]);
+}
+
/*
* HeapTupleSatisfiesDirty
* True iff heap tuple is valid including effects of open transactions.
@@ -737,7 +743,9 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
* Similarly for snapshot->xmax and the tuple's xmax. If the tuple was
* inserted speculatively, meaning that the inserter might still back down
* on the insertion without aborting the whole transaction, the associated
- * token is also returned in snapshot->speculativeToken.
+ * token is also returned in snapshot->speculativeToken. If xip is not NULL,
+ * xip[0] may be set to the xid of the deleter if it is newer than the
+ * previously stored value.
*/
static bool
HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
@@ -750,6 +758,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
snapshot->xmin = snapshot->xmax = InvalidTransactionId;
snapshot->speculativeToken = 0;
+ /*
+ * We intentionally keep snapshot->xip values unchanged, as they are
+ * expected to be reset by logic outside of a single heap fetch.
+ */
if (!HeapTupleHeaderXminCommitted(tuple))
{
@@ -870,6 +882,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
@@ -893,7 +906,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
return true;
}
if (TransactionIdDidCommit(xmax))
+ {
+ UpdateDirtyMaxXmax(snapshot, xmax);
return false;
+ }
/* it must have aborted or crashed */
return true;
}
@@ -902,6 +918,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false;
}
@@ -931,6 +948,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
HeapTupleHeaderGetRawXmax(tuple));
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..6de72a29ca 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -489,6 +489,16 @@ on the leaf page at all when the page's LSN has changed. (That won't work
with an unlogged index, so for now we don't ever apply the "don't hold
onto pin" optimization there.)
+Despite the locking protocol in place, it is still possible to receive an
+incorrect result during non-MVCC scans. This issue can occur if a concurrent
+transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and copied its content
+into a backend-local buffer, it might skip the old tuple due to the deletion
+and miss the new tuple because of that stale copy. This is a known limitation of the
+SnapshotDirty and SnapshotAny non-MVCC scans. However, for SnapshotDirty,
+it is possible to work around this limitation by using the returned max(xmax)
+to compare it with the latest committed transaction before the scan started.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7e8902e48c..943aee087a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -427,7 +427,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* Assume unique until we find a duplicate */
*is_unique = true;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fb6a86afcb..52109635b4 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -296,6 +296,17 @@ ReadNextFullTransactionId(void)
return fullXid;
}
+FullTransactionId ReadLastCompletedFullTransactionId(void)
+{
+ FullTransactionId fullXid;
+
+ LWLockAcquire(XidGenLock, LW_SHARED);
+ fullXid = TransamVariables->latestCompletedXid;
+ LWLockRelease(XidGenLock);
+
+ return fullXid;
+}
+
/*
* Advance nextXid to the value after a given xid. The epoch is inferred.
* This must only be called during recovery or from two-phase start-up code.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9f05b3654c..45767b4e20 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -115,6 +115,7 @@
#include "nodes/nodeFuncs.h"
#include "storage/lmgr.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -702,6 +703,8 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
IndexScanDesc index_scan;
ScanKeyData scankeys[INDEX_MAX_KEYS];
SnapshotData DirtySnapshot;
+ TransactionId maxXmax,
+ latestCompletedXid;
int i;
bool conflict;
bool found_self;
@@ -738,9 +741,10 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
- * tuples that aren't visible yet.
+ * tuples that aren't visible yet. Also, detect cases where the index scan
+ * skips the tuple due to a parallel update after the index page was cached.
*/
- InitDirtySnapshot(DirtySnapshot);
+ InitDirtySnapshot(DirtySnapshot, &maxXmax);
for (i = 0; i < indnkeyatts; i++)
{
@@ -774,6 +778,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
retry:
conflict = false;
found_self = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
+ maxXmax = InvalidTransactionId;
index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
@@ -889,6 +899,19 @@ retry:
}
index_endscan(index_scan);
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!conflict && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
/*
* Ordinarily, at this point the search should have found the originally
@@ -902,6 +925,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index d0a89cd577..fbddb6442b 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -183,6 +183,8 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
IndexScanDesc scan;
SnapshotData snap;
TransactionId xwait;
+ TransactionId maxXmax,
+ latestCompletedXid;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -193,7 +195,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, &maxXmax);
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
@@ -203,6 +205,12 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
retry:
found = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ maxXmax = InvalidTransactionId;
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
index_rescan(scan, skey, skey_attoff, NULL, 0);
@@ -242,6 +250,20 @@ retry:
break;
}
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!found && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
+
/* Found tuple, try to lock it in the lockmode. */
if (found)
{
@@ -391,7 +413,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
/* Start a heap scan. */
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, NULL);
scan = table_beginscan(rel, &snap, 0, NULL);
scanslot = table_slot_create(rel, NULL);
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 419e4814f0..04ba0d9ba1 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -278,7 +278,7 @@ replorigin_create(const char *roname)
* to the exclusive lock there's no danger that new rows can appear while
* we're checking.
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd..aae0ad90c2 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -288,6 +288,7 @@ extern void VarsupShmemInit(void);
extern FullTransactionId GetNewTransactionId(bool isSubXact);
extern void AdvanceNextFullTransactionIdPastXid(TransactionId xid);
extern FullTransactionId ReadNextFullTransactionId(void);
+extern FullTransactionId ReadLastCompletedFullTransactionId(void);
extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
Oid oldest_datoid);
extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
@@ -344,6 +345,21 @@ TransactionIdOlder(TransactionId a, TransactionId b)
return b;
}
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+ if (!TransactionIdIsValid(a))
+ return b;
+
+ if (!TransactionIdIsValid(b))
+ return a;
+
+ if (TransactionIdPrecedes(a, b))
+ return b;
+ return a;
+}
+
/* return the older of the two IDs, assuming they're both normal */
static inline TransactionId
NormalTransactionIdOlder(TransactionId a, TransactionId b)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051..5e6f3a7e76 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -36,9 +36,13 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
* We don't provide a static SnapshotDirty variable because it would be
* non-reentrant. Instead, users of that snapshot type should declare a
* local variable of type SnapshotData, and initialize it with this macro.
+ * pxid is optional and can be NULL. If it is not NULL, pxid[0] will be set
+ * to the transaction ID of the deleting transaction if the tuple is deleted
+ * and that ID is newer than pxid[0].
*/
-#define InitDirtySnapshot(snapshotdata) \
- ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY)
+#define InitDirtySnapshot(snapshotdata, pxid) \
+ ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY, \
+ (snapshotdata).xip = (pxid))
/*
* Similarly, some initialization is required for a NonVacuumable snapshot.
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888..a68114e500 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -96,7 +96,10 @@ typedef enum SnapshotType
* xmax. If the tuple was inserted speculatively, meaning that the
* inserter might still back down on the insertion without aborting the
* whole transaction, the associated token is also returned in
- * snapshot->speculativeToken. See also InitDirtySnapshot().
+ * snapshot->speculativeToken. If xip is non-NULL, the xid of the
+ * deleting transaction is stored into xip[0] if it is newer than the
+ * existing xip[0] value.
+ * See also InitDirtySnapshot().
* -------------------------------------------------------------------------
*/
SNAPSHOT_DIRTY,
diff --git a/src/test/modules/test_misc/t/006_dirty_index_scan.pl b/src/test/modules/test_misc/t/006_dirty_index_scan.pl
new file mode 100644
index 0000000000..4d116e659e
--- /dev/null
+++ b/src/test/modules/test_misc/t/006_dirty_index_scan.pl
@@ -0,0 +1,47 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test issue with lost tuple in case of DirtySnapshot index scans
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('DirtyScan_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key, n int)));
+
+$node->safe_psql('postgres', q(INSERT INTO tbl VALUES(42,1)));
+$node->safe_psql('postgres', q(SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error')));
+
+$node->pgbench(
+ '--no-vacuum --client=40 --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent UPSERT',
+ {
+ 'on_conflicts' => q(
+ INSERT INTO tbl VALUES(42,1) on conflict(i) do update set n = EXCLUDED.n + 1;
+ )
+ });
+
+$node->safe_psql('postgres', q(SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict')));
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.34.1
On Monday, August 12, 2024 7:11 PM Michail Nikolaev <michail.nikolaev@gmail.com> wrote:
In my test, if the tuple is updated and new tuple is in the same page,
heapam_index_fetch_tuple should find the new tuple using HOT chain. So, it's a
bit unclear to me how the updated tuple is missing. Maybe I missed some other
conditions for this issue.
Yeah, I think the pgbench-based reproducer may also cause page splits in btree.
But we may add an index to the table to disable HOT.
I have attached a reproducer for this case using a spec and injection points.
I hope it helps, check the attached files.
Thanks a lot for the steps!
I successfully reproduced the issue you mentioned in the context of logical
replication [1]. As you said, it could increase the possibility of a tuple
being missed when applying updates or deletes in the logical apply worker. I
think this is a long-standing issue and I will investigate the fix you
proposed.
In addition, I think the bug is not a blocker for the conflict detection
feature, since the feature simply reports the current behavior of the logical
apply worker (either unique violation or tuple missing) without introducing
any new functionality. Furthermore, I think that the new
ExecCheckIndexConstraints call after ExecInsertIndexTuples() is not affected
by the dirty snapshot bug. This is because a tuple has already been inserted
into the btree before the dirty snapshot scan, which means that a concurrent
non-HOT update would not be possible (it would be blocked after finding the
just-inserted tuple and would wait for the apply worker to commit the current
transaction).
It would be good if others could also share their opinion on this.
[1]: The steps to reproduce the tuple missing in logical replication.
1. setup pub/sub env, and publish a table with 1 row.
pub:
CREATE TABLE t(a int primary key, b int);
INSERT INTO t VALUES(1,1);
CREATE PUBLICATION pub FOR TABLE t;
sub:
CREATE TABLE t (a int primary key, b int check (b < 5));
CREATE INDEX t_b_idx ON t(b);
CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=$port_publisher' PUBLICATION pub;
2. Execute an UPDATE (UPDATE t SET b = b + 1) on the publisher and use gdb to
stop the apply worker at the point after index_getnext_tid() and before
index_fetch_heap() (a sample gdb session is shown after step 4).
3. Execute a concurrent UPDATE (UPDATE t SET b = b + 100) on the subscriber to
update a non-key column value and commit the update.
4. Release the apply worker; it will report the update_missing conflict.
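A possible gdb session for step 2 (a sketch only; breaking at the entry of
index_fetch_heap stops the worker right after index_getnext_tid() has
returned, and the exact frames may differ by build):

$ gdb -p <pid of the logical replication apply worker>
(gdb) break index_fetch_heap
(gdb) continue
# the worker is now stopped between index_getnext_tid() and index_fetch_heap();
# run step 3 in another session, then release the worker:
(gdb) continue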
Best Regards,
Hou zj
Hello!
> In addition, I think the bug is not a blocker for the conflict detection
> feature, since the feature simply reports the current behavior of the logical
> apply worker (either unique violation or tuple missing) without introducing
> any new functionality. Furthermore, I think that the new
> ExecCheckIndexConstraints call after ExecInsertIndexTuples() is not affected
> by the dirty snapshot bug. This is because a tuple has already been inserted
> into the btree before the dirty snapshot scan, which means that a concurrent
> non-HOT update would not be possible (it would be blocked after finding the
> just-inserted tuple and would wait for the apply worker to commit the current
> transaction).
> It would be good if others could also share their opinion on this.
Yes, you are right. At least, I can't find any scenario for that case.
Best regards,
Mikhail.
Hello, Hou!
I have sent a reproducer within the context of conflict detection and
resolution to the original thread [0].
[0]: /messages/by-id/CANtu0ojMjAwMRJK=H8y0YBB0ZEcN+JbdZeoXQn8dWO5F67jgsA@mail.gmail.com
Hello, everyone!
A rebased version is attached. I have also fixed a potential race in the test
and written more detailed commit messages.
Best regards,
Mikhail.
Attachments:
v3-0001-Fix-possible-lost-tuples-in-non-MVCC-index-scans-.patch
From f1d6b9d0096e18c9f187914d3722ba8675bc964d Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:14:40 +0100
Subject: [PATCH v3 1/2] Fix possible lost tuples in non-MVCC index scans using
SnapshotDirty
In certain scenarios, non-MVCC index scans using SnapshotDirty can miss tuples that are deleted and re-inserted concurrently during the scan. This issue arises because the scan might skip over deleted tuples and fail to see newly inserted ones if the page content is cached.
To address this, we modify the SnapshotDirty mechanism to track the maximum xmax (the highest transaction ID of deleting transactions) encountered during the scan. If this xmax is newer than the latest completed transaction ID at the start of the scan, we retry the index scan to ensure all relevant tuples are observed.
Key changes include:
* Updated HeapTupleSatisfiesDirty to record the maximum xmax seen.
* Modified InitDirtySnapshot to accept an optional parameter for tracking xmax.
* Added retry logic in the executor's uniqueness checks (check_exclusion_or_unique_constraint) and replication tuple searches (RelationFindReplTupleByIndex) based on the tracked xmax.
* Introduced a new function ReadLastCompletedFullTransactionId to obtain the latest completed transaction ID.
* Updated documentation in nbtree/README to explain the issue and the solution.
* Added a regression test to cover this edge case.
This fix ensures that non-MVCC index scans are more robust in the face of concurrent data modifications, preventing potential data inconsistencies.
---
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/heapam_visibility.c | 20 +++++++-
src/backend/access/nbtree/README | 10 ++++
src/backend/access/nbtree/nbtinsert.c | 2 +-
src/backend/access/transam/varsup.c | 11 +++++
src/backend/executor/execIndexing.c | 29 +++++++++++-
src/backend/executor/execReplication.c | 26 +++++++++-
src/backend/replication/logical/origin.c | 2 +-
src/include/access/transam.h | 16 +++++++
src/include/utils/snapmgr.h | 8 +++-
src/include/utils/snapshot.h | 5 +-
src/test/modules/test_misc/meson.build | 1 +
.../test_misc/t/007_dirty_index_scan.pl | 47 +++++++++++++++++++
14 files changed, 169 insertions(+), 12 deletions(-)
create mode 100644 src/test/modules/test_misc/t/007_dirty_index_scan.pl
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..bc310fcb332 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
hscan = (HeapScanDesc) scan;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
nblocks = hscan->rs_nblocks; /* # blocks to be scanned */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..4e99ea61c6e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -402,7 +402,7 @@ tuple_lock_retry:
*
* Loop here to deal with updated or busy tuples
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
for (;;)
{
if (ItemPointerIndicatesMovedPartitions(tid))
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01f..9001ce4362c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -719,6 +719,12 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
return TM_Deleted; /* deleted by other */
}
+static void UpdateDirtyMaxXmax(Snapshot snapshot, TransactionId xmax)
+{
+ if (snapshot->xip != NULL)
+ snapshot->xip[0] = TransactionIdNewer(xmax, snapshot->xip[0]);
+}
+
/*
* HeapTupleSatisfiesDirty
* True iff heap tuple is valid including effects of open transactions.
@@ -737,7 +743,9 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
* Similarly for snapshot->xmax and the tuple's xmax. If the tuple was
* inserted speculatively, meaning that the inserter might still back down
* on the insertion without aborting the whole transaction, the associated
- * token is also returned in snapshot->speculativeToken.
+ * token is also returned in snapshot->speculativeToken. If xip is not NULL,
+ * xip[0] may be set to the xid of the deleter if it is newer than the
+ * previously stored value.
*/
static bool
HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
@@ -750,6 +758,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
snapshot->xmin = snapshot->xmax = InvalidTransactionId;
snapshot->speculativeToken = 0;
+ /*
+ * We intentionally keep snapshot->xip values unchanged here: they are
+ * reset by the caller, outside the scope of a single heap fetch.
+ */
if (!HeapTupleHeaderXminCommitted(tuple))
{
@@ -870,6 +882,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
@@ -893,7 +906,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
return true;
}
if (TransactionIdDidCommit(xmax))
+ {
+ UpdateDirtyMaxXmax(snapshot, xmax);
return false;
+ }
/* it must have aborted or crashed */
return true;
}
@@ -902,6 +918,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false;
}
@@ -931,6 +948,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
HeapTupleHeaderGetRawXmax(tuple));
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f75..6de72a29ca0 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -489,6 +489,16 @@ on the leaf page at all when the page's LSN has changed. (That won't work
with an unlogged index, so for now we don't ever apply the "don't hold
onto pin" optimization there.)
+Despite the locking protocol in place, it is still possible to receive an
+incorrect result during non-MVCC scans. This issue can occur if a concurrent
+transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and cached its content
+in the buffer cache, it might skip the old tuple due to deletion and miss
+the new tuple because of the cache. This is a known limitation of the
+SnapshotDirty and SnapshotAny non-MVCC scans. However, for SnapshotDirty,
+it is possible to work around this limitation by comparing the returned
+max(xmax) with the latest completed transaction ID before the scan started.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 99043da8412..ef63d62778e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -427,7 +427,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* Assume unique until we find a duplicate */
*is_unique = true;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index cfe8c6cf8dc..43026e58406 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -296,6 +296,17 @@ ReadNextFullTransactionId(void)
return fullXid;
}
+FullTransactionId ReadLastCompletedFullTransactionId(void)
+{
+ FullTransactionId fullXid;
+
+ LWLockAcquire(XidGenLock, LW_SHARED);
+ fullXid = TransamVariables->latestCompletedXid;
+ LWLockRelease(XidGenLock);
+
+ return fullXid;
+}
+
/*
* Advance nextXid to the value after a given xid. The epoch is inferred.
* This must only be called during recovery or from two-phase start-up code.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f9a2fac79e4..34e09dee17f 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -711,6 +712,8 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
IndexScanDesc index_scan;
ScanKeyData scankeys[INDEX_MAX_KEYS];
SnapshotData DirtySnapshot;
+ TransactionId maxXmax,
+ latestCompletedXid;
int i;
bool conflict;
bool found_self;
@@ -773,9 +776,10 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
- * tuples that aren't visible yet.
+ * tuples that aren't visible yet. Also, detect cases where the index scan
+ * skips a tuple due to a parallel update after the index page was cached.
*/
- InitDirtySnapshot(DirtySnapshot);
+ InitDirtySnapshot(DirtySnapshot, &maxXmax);
for (i = 0; i < indnkeyatts; i++)
{
@@ -809,6 +813,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
retry:
conflict = false;
found_self = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
+ maxXmax = InvalidTransactionId;
index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
@@ -924,6 +934,19 @@ retry:
}
index_endscan(index_scan);
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!conflict && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
/*
* Ordinarily, at this point the search should have found the originally
@@ -937,6 +960,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 54025c9f150..cf8e847262a 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -229,6 +229,8 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
IndexScanDesc scan;
SnapshotData snap;
TransactionId xwait;
+ TransactionId maxXmax,
+ latestCompletedXid;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -239,7 +241,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, &maxXmax);
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
@@ -249,6 +251,12 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
retry:
found = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ maxXmax = InvalidTransactionId;
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
index_rescan(scan, skey, skey_attoff, NULL, 0);
@@ -288,6 +296,20 @@ retry:
break;
}
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!found && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
+
/* Found tuple, try to lock it in the lockmode. */
if (found)
{
@@ -411,7 +433,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
/* Start a heap scan. */
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, NULL);
scan = table_beginscan(rel, &snap, 0, NULL);
scanslot = table_slot_create(rel, NULL);
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index baf696d8e68..e1e52c639d1 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -282,7 +282,7 @@ replorigin_create(const char *roname)
* to the exclusive lock there's no danger that new rows can appear while
* we're checking.
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd5..aae0ad90c2e 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -288,6 +288,7 @@ extern void VarsupShmemInit(void);
extern FullTransactionId GetNewTransactionId(bool isSubXact);
extern void AdvanceNextFullTransactionIdPastXid(TransactionId xid);
extern FullTransactionId ReadNextFullTransactionId(void);
+extern FullTransactionId ReadLastCompletedFullTransactionId(void);
extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
Oid oldest_datoid);
extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
@@ -344,6 +345,21 @@ TransactionIdOlder(TransactionId a, TransactionId b)
return b;
}
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+ if (!TransactionIdIsValid(a))
+ return b;
+
+ if (!TransactionIdIsValid(b))
+ return a;
+
+ if (TransactionIdPrecedes(a, b))
+ return b;
+ return a;
+}
+
/* return the older of the two IDs, assuming they're both normal */
static inline TransactionId
NormalTransactionIdOlder(TransactionId a, TransactionId b)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 9398a84051c..5e6f3a7e76a 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -36,9 +36,13 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
* We don't provide a static SnapshotDirty variable because it would be
* non-reentrant. Instead, users of that snapshot type should declare a
* local variable of type SnapshotData, and initialize it with this macro.
+ * pxid is optional and can be NULL. If it is not NULL, pxid[0] will be set
+ * to the transaction ID of the deleting transaction if the tuple is deleted
+ * and that ID is newer than pxid[0].
*/
-#define InitDirtySnapshot(snapshotdata) \
- ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY)
+#define InitDirtySnapshot(snapshotdata, pxid) \
+ ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY, \
+ (snapshotdata).xip = (pxid))
/*
* Similarly, some initialization is required for a NonVacuumable snapshot.
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 8d1e31e888e..a68114e500f 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -96,7 +96,10 @@ typedef enum SnapshotType
* xmax. If the tuple was inserted speculatively, meaning that the
* inserter might still back down on the insertion without aborting the
* whole transaction, the associated token is also returned in
- * snapshot->speculativeToken. See also InitDirtySnapshot().
+ * snapshot->speculativeToken. If xip is non-NULL, the xid of the
+ * deleting transaction is stored into xip[0] if it is newer than the
+ * existing xip[0] value.
+ * See also InitDirtySnapshot().
* -------------------------------------------------------------------------
*/
SNAPSHOT_DIRTY,
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa751aa..68d291336e1 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -15,6 +15,7 @@ tests += {
't/004_io_direct.pl',
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
+ 't/007_dirty_index_scan.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/007_dirty_index_scan.pl b/src/test/modules/test_misc/t/007_dirty_index_scan.pl
new file mode 100644
index 00000000000..4d116e659e7
--- /dev/null
+++ b/src/test/modules/test_misc/t/007_dirty_index_scan.pl
@@ -0,0 +1,47 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test issue with lost tuple in case of DirtySnapshot index scans
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('DirtyScan_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key, n int)));
+
+$node->safe_psql('postgres', q(INSERT INTO tbl VALUES(42,1)));
+$node->safe_psql('postgres', q(SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error')));
+
+$node->pgbench(
+ '--no-vacuum --client=40 --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent UPSERT',
+ {
+ 'on_conflicts' => q(
+ INSERT INTO tbl VALUES(42,1) on conflict(i) do update set n = EXCLUDED.n + 1;
+ )
+ });
+
+$node->safe_psql('postgres', q(SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict')));
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.43.0
v3-0002-Add-isolation-test-to-reproduce-dirty-snapshot-sc.patch
From 3c663bfd3fcb62b5a4d4726919be67becc7c1ba5 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v3 2/2] Add isolation test to reproduce dirty snapshot scan
issue
This commit introduces an isolation test to reliably reproduce the issue where non-MVCC index scans using SnapshotDirty can miss tuples due to concurrent modifications. This situation can lead to incorrect results.
To facilitate this test, new injection points are added in the index_getnext_slot and check_exclusion_or_unique_constraint functions. These injection points allow the test to control the timing of operations, ensuring the race condition is triggered consistently.
Changes include:
* Added injection points in src/backend/access/index/indexam.c and src/backend/executor/execIndexing.c.
* Updated Makefile and meson.build to include the new dirty_index_scan isolation test.
* Created a new isolation spec dirty_index_scan.spec and its expected output to define and verify the test steps.
* This test complements the previous fix by demonstrating the issue and verifying that the fix effectively addresses it.
---
src/backend/access/index/indexam.c | 8 ++++
src/backend/executor/execIndexing.c | 1 +
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 39 +++++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 43 +++++++++++++++++++
6 files changed, 93 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 1859be614c0..1a70de2f470 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -696,6 +697,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelationOid(scan->indexRelation->rd_id))
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch");
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 34e09dee17f..2cf5bd236d4 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -810,6 +810,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
* May have to restart scan from this point if a potential conflict is
* found.
*/
+ INJECTION_POINT("check_exclusion_or_unique_constraint_before_index_scan");
retry:
conflict = false;
found_self = false;
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..11a9bacc750 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..0451c7513b6
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,39 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s3_s1 s2_s1 s3_s2
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s3_s1:
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_before_index_scan');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_before_index_scan');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42;
+step s3_s2:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
+step s1_s1: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..26a98bfa148 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,6 +44,7 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'dirty_index_scan',
],
},
'tap': {
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..6fb5b985431
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,43 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_before_index_scan', 'wait');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('check_exclusion_or_unique_constraint_before_index_scan');
+ SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_before_index_scan');
+}
+step s3_s2 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+}
+
+permutation
+ s1_s1
+ s3_s1
+ s2_s1
+ s3_s2
\ No newline at end of file
--
2.43.0
Hello, everyone!
Simplified (and stabilized, I hope) the test.
Best regards,
Mikhail.
Attachments:
v4-0001-Fix-possible-lost-tuples-in-non-MVCC-index-scans-.patch
From 1ee7ec9565679d167585acc1068b1726374c556e Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:14:40 +0100
Subject: [PATCH v4 1/2] Fix possible lost tuples in non-MVCC index scans using
SnapshotDirty
In certain scenarios, non-MVCC index scans using SnapshotDirty can miss tuples that are deleted and re-inserted concurrently during the scan. This issue arises because the scan might skip over deleted tuples and fail to see newly inserted ones if the page content is cached.
To address this, we modify the SnapshotDirty mechanism to track the maximum xmax (the highest transaction ID of deleting transactions) encountered during the scan. If this xmax is newer than the latest completed transaction ID at the start of the scan, we retry the index scan to ensure all relevant tuples are observed.
Key changes include:
* Updated HeapTupleSatisfiesDirty to record the maximum xmax seen.
* Modified InitDirtySnapshot to accept an optional parameter for tracking xmax.
* Added retry logic in the executor's uniqueness checks (check_exclusion_or_unique_constraint) and replication tuple searches (RelationFindReplTupleByIndex) based on the tracked xmax.
* Introduced a new function ReadLastCompletedFullTransactionId to obtain the latest completed transaction ID.
* Updated documentation in nbtree/README to explain the issue and the solution.
* Added a regression test to cover this edge case.
This fix ensures that non-MVCC index scans are more robust in the face of concurrent data modifications, preventing potential data inconsistencies.
---
contrib/pgstattuple/pgstattuple.c | 2 +-
src/backend/access/heap/heapam_handler.c | 2 +-
src/backend/access/heap/heapam_visibility.c | 20 +++++++-
src/backend/access/nbtree/README | 10 ++++
src/backend/access/nbtree/nbtinsert.c | 2 +-
src/backend/access/transam/varsup.c | 11 +++++
src/backend/executor/execIndexing.c | 29 +++++++++++-
src/backend/executor/execReplication.c | 26 +++++++++-
src/backend/replication/logical/origin.c | 2 +-
src/include/access/transam.h | 16 +++++++
src/include/utils/snapmgr.h | 8 +++-
src/include/utils/snapshot.h | 5 +-
src/test/modules/test_misc/meson.build | 1 +
.../test_misc/t/007_dirty_index_scan.pl | 47 +++++++++++++++++++
14 files changed, 169 insertions(+), 12 deletions(-)
create mode 100644 src/test/modules/test_misc/t/007_dirty_index_scan.pl
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..bc310fcb332 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
hscan = (HeapScanDesc) scan;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
nblocks = hscan->rs_nblocks; /* # blocks to be scanned */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e817f8f8f84..9836bd6c530 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -402,7 +402,7 @@ tuple_lock_retry:
*
* Loop here to deal with updated or busy tuples
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
for (;;)
{
if (ItemPointerIndicatesMovedPartitions(tid))
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index e146605bd57..e713d5df24c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -719,6 +719,12 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
return TM_Deleted; /* deleted by other */
}
+static void UpdateDirtyMaxXmax(Snapshot snapshot, TransactionId xmax)
+{
+ if (snapshot->xip != NULL)
+ snapshot->xip[0] = TransactionIdNewer(xmax, snapshot->xip[0]);
+}
+
/*
* HeapTupleSatisfiesDirty
* True iff heap tuple is valid including effects of open transactions.
@@ -737,7 +743,9 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
* Similarly for snapshot->xmax and the tuple's xmax. If the tuple was
* inserted speculatively, meaning that the inserter might still back down
* on the insertion without aborting the whole transaction, the associated
- * token is also returned in snapshot->speculativeToken.
+ * token is also returned in snapshot->speculativeToken. If xip is not NULL,
+ * xip[0] may be set to the xid of the deleter if it is newer than the
+ * previously stored value.
*/
static bool
HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
@@ -750,6 +758,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
snapshot->xmin = snapshot->xmax = InvalidTransactionId;
snapshot->speculativeToken = 0;
+ /*
+ * We intentionally keep snapshot->xip values unchanged here: they are
+ * reset by the caller, outside the scope of a single heap fetch.
+ */
if (!HeapTupleHeaderXminCommitted(tuple))
{
@@ -870,6 +882,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
@@ -893,7 +906,10 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
return true;
}
if (TransactionIdDidCommit(xmax))
+ {
+ UpdateDirtyMaxXmax(snapshot, xmax);
return false;
+ }
/* it must have aborted or crashed */
return true;
}
@@ -902,6 +918,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
{
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
return true;
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false;
}
@@ -931,6 +948,7 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
HeapTupleHeaderGetRawXmax(tuple));
+ UpdateDirtyMaxXmax(snapshot, HeapTupleHeaderGetRawXmax(tuple));
return false; /* updated by other */
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..c8f6812cf60 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -489,6 +489,16 @@ on the leaf page at all when the page's LSN has changed. (That won't work
with an unlogged index, so for now we don't ever apply the "don't hold
onto pin" optimization there.)
+Despite the locking protocol in place, it is still possible to receive an
+incorrect result during non-MVCC scans. This issue can occur if a concurrent
+transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and cached its content
+in the buffer cache, it might skip the old tuple due to deletion and miss
+the new tuple because of the cache. This is a known limitation of the
+SnapshotDirty and SnapshotAny non-MVCC scans. However, for SnapshotDirty,
+it is possible to work around this limitation by comparing the returned
+max(xmax) with the latest completed transaction ID before the scan started.
+
Fastpath For Index Insertion
----------------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 3eddbcf3a82..33ae2d58097 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -427,7 +427,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
/* Assume unique until we find a duplicate */
*is_unique = true;
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index fe895787cb7..c6f0c769f62 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -296,6 +296,17 @@ ReadNextFullTransactionId(void)
return fullXid;
}
+FullTransactionId ReadLastCompletedFullTransactionId(void)
+{
+ FullTransactionId fullXid;
+
+ LWLockAcquire(XidGenLock, LW_SHARED);
+ fullXid = TransamVariables->latestCompletedXid;
+ LWLockRelease(XidGenLock);
+
+ return fullXid;
+}
+
/*
* Advance nextXid to the value after a given xid. The epoch is inferred.
* This must only be called during recovery or from two-phase start-up code.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 7c87f012c30..e05a8767348 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -711,6 +712,8 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
IndexScanDesc index_scan;
ScanKeyData scankeys[INDEX_MAX_KEYS];
SnapshotData DirtySnapshot;
+ TransactionId maxXmax,
+ latestCompletedXid;
int i;
bool conflict;
bool found_self;
@@ -773,9 +776,10 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
- * tuples that aren't visible yet.
+ * tuples that aren't visible yet. Also, detect cases where the index scan
+ * skips a tuple due to a parallel update after the index page was cached.
*/
- InitDirtySnapshot(DirtySnapshot);
+ InitDirtySnapshot(DirtySnapshot, &maxXmax);
for (i = 0; i < indnkeyatts; i++)
{
@@ -809,6 +813,12 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
retry:
conflict = false;
found_self = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
+ maxXmax = InvalidTransactionId;
index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
@@ -924,6 +934,19 @@ retry:
}
index_endscan(index_scan);
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!conflict && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
/*
* Ordinarily, at this point the search should have found the originally
@@ -937,6 +960,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index e3e4e41ac38..0ebfcc7005f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -218,6 +218,8 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
IndexScanDesc scan;
SnapshotData snap;
TransactionId xwait;
+ TransactionId maxXmax,
+ latestCompletedXid;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -228,7 +230,7 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, &maxXmax);
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
@@ -238,6 +240,12 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
retry:
found = false;
+ /*
+ * Each time we retry, remember the last completed transaction before the
+ * start of the scan. Also reset maxXmax.
+ */
+ maxXmax = InvalidTransactionId;
+ latestCompletedXid = XidFromFullTransactionId(ReadLastCompletedFullTransactionId());
index_rescan(scan, skey, skey_attoff, NULL, 0);
@@ -277,6 +285,20 @@ retry:
break;
}
+ /*
+ * Check for the case when the index scan fetched records before some
+ * other transaction deleted a tuple and inserted a new one.
+ */
+ if (!found && TransactionIdIsValid(maxXmax) && !TransactionIdIsCurrentTransactionId(maxXmax))
+ {
+ /*
+ * If we have skipped some tuple because it was deleted, but the deletion
+ * happened after the start of the index scan, retry to be sure.
+ */
+ if (TransactionIdPrecedes(latestCompletedXid, maxXmax))
+ goto retry;
+ }
+
/* Found tuple, try to lock it in the lockmode. */
if (found)
{
@@ -400,7 +422,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
/* Start a heap scan. */
- InitDirtySnapshot(snap);
+ InitDirtySnapshot(snap, NULL);
scan = table_beginscan(rel, &snap, 0, NULL);
scanslot = table_slot_create(rel, NULL);
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 1b586cb1cf2..2dbe20912fa 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -282,7 +282,7 @@ replorigin_create(const char *roname)
* to the exclusive lock there's no danger that new rows can appear while
* we're checking.
*/
- InitDirtySnapshot(SnapshotDirty);
+ InitDirtySnapshot(SnapshotDirty, NULL);
rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 0cab8653f1b..dce992325d7 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -288,6 +288,7 @@ extern void VarsupShmemInit(void);
extern FullTransactionId GetNewTransactionId(bool isSubXact);
extern void AdvanceNextFullTransactionIdPastXid(TransactionId xid);
extern FullTransactionId ReadNextFullTransactionId(void);
+extern FullTransactionId ReadLastCompletedFullTransactionId(void);
extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
Oid oldest_datoid);
extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
@@ -344,6 +345,21 @@ TransactionIdOlder(TransactionId a, TransactionId b)
return b;
}
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+ if (!TransactionIdIsValid(a))
+ return b;
+
+ if (!TransactionIdIsValid(b))
+ return a;
+
+ if (TransactionIdPrecedes(a, b))
+ return b;
+ return a;
+}
+
/* return the older of the two IDs, assuming they're both normal */
static inline TransactionId
NormalTransactionIdOlder(TransactionId a, TransactionId b)
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index d346be71642..0b1adfc1ea5 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -38,9 +38,13 @@ extern PGDLLIMPORT SnapshotData SnapshotToastData;
* We don't provide a static SnapshotDirty variable because it would be
* non-reentrant. Instead, users of that snapshot type should declare a
* local variable of type SnapshotData, and initialize it with this macro.
+ * pxid is optional and can be NULL. If it is not NULL, pxid[0] will be set
+ * to the transaction ID of the deleting transaction if the tuple is deleted
+ * and that ID is newer than pxid[0].
*/
-#define InitDirtySnapshot(snapshotdata) \
- ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY)
+#define InitDirtySnapshot(snapshotdata, pxid) \
+ ((snapshotdata).snapshot_type = SNAPSHOT_DIRTY, \
+ (snapshotdata).xip = (pxid))
/*
* Similarly, some initialization is required for a NonVacuumable snapshot.
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..4f45fccbe31 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -92,7 +92,10 @@ typedef enum SnapshotType
* xmax. If the tuple was inserted speculatively, meaning that the
* inserter might still back down on the insertion without aborting the
* whole transaction, the associated token is also returned in
- * snapshot->speculativeToken. See also InitDirtySnapshot().
+ * snapshot->speculativeToken. If xip is non-NULL, the xid of the
+ * deleting transaction is stored into xip[0] if it is newer than the
+ * existing xip[0] value.
+ * See also InitDirtySnapshot().
* -------------------------------------------------------------------------
*/
SNAPSHOT_DIRTY,
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 65a9518a00d..31f7901bdd4 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -15,6 +15,7 @@ tests += {
't/004_io_direct.pl',
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
+ 't/007_dirty_index_scan.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/007_dirty_index_scan.pl b/src/test/modules/test_misc/t/007_dirty_index_scan.pl
new file mode 100644
index 00000000000..4d116e659e7
--- /dev/null
+++ b/src/test/modules/test_misc/t/007_dirty_index_scan.pl
@@ -0,0 +1,47 @@
+
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test issue with lost tuple in case of DirtySnapshot index scans
+use strict;
+use warnings;
+
+use Config;
+use Errno;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+$node = PostgreSQL::Test::Cluster->new('DirtyScan_test');
+$node->init;
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key, n int)));
+
+$node->safe_psql('postgres', q(INSERT INTO tbl VALUES(42,1)));
+$node->safe_psql('postgres', q(SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error')));
+
+$node->pgbench(
+ '--no-vacuum --client=40 --transactions=1000',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'concurrent UPSERT',
+ {
+ 'on_conflicts' => q(
+ INSERT INTO tbl VALUES(42,1) on conflict(i) do update set n = EXCLUDED.n + 1;
+ )
+ });
+
+$node->safe_psql('postgres', q(SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict')));
+
+$node->stop;
+done_testing();
\ No newline at end of file
--
2.43.0
v4-0002-Add-isolation-test-to-reproduce-dirty-snapshot-sc.patch
From 2e20bc45afc2a4a530d786d7911ac1aadf57c47a Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v4 2/2] Add isolation test to reproduce dirty snapshot scan
issue
This commit introduces an isolation test to reliably reproduce the issue where non-MVCC index scans using SnapshotDirty can miss tuples due to concurrent modifications. This situation can lead to incorrect results.
To facilitate this test, a new injection point is added in index_getnext_slot.
Changes include:
* Added injection point in src/backend/access/index/indexam.c
* Updated Makefile and meson.build to include the new dirty_index_scan isolation test.
* Created a new isolation spec dirty_index_scan.spec and its expected output to define and verify the test steps.
* This test complements the previous fix by demonstrating the issue and verifying that the fix effectively addresses it.
---
src/backend/access/index/indexam.c | 8 ++++
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 26 +++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 37 +++++++++++++++++++
5 files changed, 73 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 8b1f555435b..ad3a3605282 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -696,6 +697,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelationOid(scan->indexRelation->rd_id))
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch");
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..11a9bacc750 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace
+ISOLATION = basic inplace dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..51ff2d0b0d0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,26 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s2_s1 s3_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42;
+step s3_s1:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+ <waiting ...>
+step s1_s1: <... completed>
+step s3_s1: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 989b4db226b..3911aa0274d 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,6 +44,7 @@ tests += {
'specs': [
'basic',
'inplace',
+ 'dirty_index_scan',
],
},
'tap': {
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..54065f233e4
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,37 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+}
+
+permutation
+ s1_s1
+ s2_s1
+ s3_s1(s1_s1)
\ No newline at end of file
--
2.43.0
Hello!
I realize the proposed solution does not guarantee the absence of false
negative cases...
It happens because I am looking just at XID values, but in the general case
they tell us nothing about transaction commit order.
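A trivial example of the mismatch, assuming two concurrent transactions:

xid 100: BEGIN; ...                  -- still running
xid 101: BEGIN; DELETE ...; COMMIT;  -- larger XID, but commits first
xid 100: COMMIT;                     -- smaller XID, commits last

So comparing the tracked max(xmax) with latestCompletedXid by XID order alone
cannot reliably tell whether the deletion happened after the scan started.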
I'll look for some other option.
Best regards,
Mikhail.
Hello, everyone and Peter!
Peter, I have added you because you may be interested in (or already know
about) this btree-related issue.
Short description of the problem:
I noticed a concurrency issue in btree index scans that affects
SnapshotDirty and SnapshotSelf scan types.
When using these non-MVCC snapshot types, a scan can miss tuples if
concurrent transactions delete existing tuples and insert new ones with
different TIDs on the same page.
The problem occurs because:
1. The scan reads a page and caches its tuples in backend-local storage
2. A concurrent transaction deletes a tuple and inserts a new one with a
different TID on the same page
3. The old tuple fails the visibility check because it was already deleted by
a committed transaction
4. The new version on the page is missed because it is not among the cached
tuples (see the timeline sketch below)
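To make the ordering concrete, a minimal two-session timeline of this race
(sessions and TIDs are illustrative):

S1: starts a SnapshotDirty index scan, reads the leaf page, and caches the
    matching items (heap TID t1) in backend-local storage
S2: UPDATE deletes the tuple at t1, inserts the new version at t2 on the
    same page, and commits
S1: fetches the heap tuple at t1, which now fails the visibility check
S1: never considers t2, because it is not among the cached items
S1: finishes having seen no matching tuple at all, a false negative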
This may cause issues with:
- logical replication (RelationFindReplTupleByIndex failure): an invalid
conflict message (MISSING instead of ORIGIN_DIFFERS), and probably other
issues with the upcoming conflict resolution for logical replication
- check_exclusion_or_unique_constraint false negatives (though currently they
do not cause any real issues as far as I can see)
The fix implemented in this version of the patch:
- Retains the read lock on a page for SnapshotDirty and SnapshotSelf
scans until we're completely done with all tuples from that page
- Introduces a new 'extra_unlock' field in BTScanPos to track when a lock
is being held longer than usual
- Updates documentation to explain this special locking behavior
Yes, it may cause some degradation in performance because of that
additional lock.
Another possible idea is to use a fresh MVCC snapshot for such cases (but I
think it is still better to fix or at least document that issue anyway).
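For illustration, a minimal sketch of how that alternative might look in
RelationFindReplTupleByIndex (an assumption about the shape of such a change,
not part of the attached patch; the retry loop and error handling are
omitted):

/* sketch: take a fresh MVCC snapshot instead of SnapshotDirty */
Snapshot snap = RegisterSnapshot(GetLatestSnapshot());

scan = index_beginscan(rel, idxrel, snap, skey_attoff, 0);
index_rescan(scan, skey, skey_attoff, NULL, 0);
/* ... fetch and lock the matching tuple as before ... */
index_endscan(scan);
UnregisterSnapshot(snap);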
Best regards,
Mikhail.
Attachments:
v5-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patch
From f0d9dffb33d1c069f47fb05f6662351520bbc3d2 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Wed, 12 Mar 2025 00:53:02 +0100
Subject: [PATCH v5 2/2] Fix btree index scan concurrency issues with dirty
snapshots
This patch addresses an issue where non-MVCC index scans using SnapshotDirty
or SnapshotSelf could miss tuples due to concurrent modifications. The fix
retains read locks on pages for these special snapshot types until the scan
is done with the page's tuples, preventing concurrent modifications from
causing inconsistent results.
Updated README to document this special case in the btree locking mechanism.
---
src/backend/access/nbtree/README | 13 +++++++++-
src/backend/access/nbtree/nbtree.c | 17 +++++++++++++
src/backend/access/nbtree/nbtsearch.c | 35 +++++++++++++++++++++++----
src/backend/access/nbtree/nbtutils.c | 8 +++++-
src/include/access/nbtree.h | 3 +++
5 files changed, 69 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..a9280415633 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -85,7 +85,8 @@ move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see page deletion
discussion below.)
-Page read locks are held only for as long as a scan is examining a page.
+Page read locks are held only for as long as a scan is examining a page
+(with an exception for SnapshotDirty and SnapshotSelf scans; see below).
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
@@ -103,6 +104,16 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite the mechanics described above, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This issue can occur if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and cached its content in the
+backend-local storage, it might skip the old tuple due to deletion and miss the new
+tuple because the scan does not re-read the page. To address this issue, for
+SnapshotDirty and SnapshotSelf scans, we retain the read lock on the page until
+we're completely done processing all the tuples from that page, preventing
+concurrent modifications that could lead to inconsistent results.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 45ea6afba1d..bd2c0d57de6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -373,6 +373,12 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ /* Release any extended lock held for SnapshotDirty/Self scans */
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
}
@@ -429,6 +435,12 @@ btendscan(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ /* Release any extended lock held for SnapshotDirty/Self scans */
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
}
@@ -509,6 +521,11 @@ btrestrpos(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 472ce06f190..0ab89c770ce 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -61,11 +61,22 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* This will prevent vacuum from stalling in a blocked state trying to read a
* page when a cursor is sitting on it.
*
+ * For SnapshotDirty and SnapshotSelf scans, we don't actually unlock the buffer
+ * here, but instead set extra_unlock to indicate that the lock should be held
+ * until we're completely done with this page. This prevents concurrent
+ * modifications from causing inconsistent results during non-MVCC scans.
+ *
* See nbtree/README section on making concurrent TID recycling safe.
*/
static void
_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
+ if (scan->xs_snapshot->snapshot_type == SNAPSHOT_DIRTY ||
+ scan->xs_snapshot->snapshot_type == SNAPSHOT_SELF)
+ {
+ sp->extra_unlock = true;
+ return;
+ }
_bt_unlockbuf(scan->indexRelation, sp->buf);
if (IsMVCCSnapshot(scan->xs_snapshot) &&
@@ -1434,7 +1445,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* _bt_next() -- Get the next item in a scan.
*
* On entry, so->currPos describes the current page, which may be pinned
- * but is not locked, and so->currPos.itemIndex identifies which item was
+ * but is not locked (except for SnapshotDirty and SnapshotSelf scans, where
+ * the page remains locked), and so->currPos.itemIndex identifies which item was
* previously returned.
*
* On success exit, so->currPos is updated as needed, and _bt_returnitem
@@ -2002,10 +2014,11 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
*
* Wrapper on _bt_readnextpage that performs final steps for the current page.
*
- * On entry, if so->currPos.buf is valid the buffer is pinned but not locked.
- * If there's no pin held, it's because _bt_drop_lock_and_maybe_pin dropped
- * the pin eagerly earlier on. The scan must have so->currPos.currPage set to
- * a valid block, in any case.
+ * On entry, if so->currPos.buf is valid the buffer is pinned but not locked,
+ * except for SnapshotDirty and SnapshotSelf scans where the buffer remains locked
+ * until we're done with all tuples from the page. If there's no pin held, it's
+ * because _bt_drop_lock_and_maybe_pin dropped the pin eagerly earlier on.
+ * The scan must have so->currPos.currPage set to a valid block, in any case.
*/
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
@@ -2064,8 +2077,20 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
/* mark/restore not supported by parallel scans */
Assert(!scan->parallel_scan);
+ Assert(scan->xs_snapshot->snapshot_type != SNAPSHOT_DIRTY);
+ Assert(scan->xs_snapshot->snapshot_type != SNAPSHOT_SELF);
}
+ /*
+ * For SnapshotDirty/Self scans, we kept the read lock after processing
+ * the page's tuples (see _bt_drop_lock_and_maybe_pin). Now that we're
+ * moving to another page, we need to explicitly release that lock.
+ */
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
/* Walk to the next page with data */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 693e43c674b..2704f1e46fd 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2359,13 +2359,17 @@ _bt_killitems(IndexScanDesc scan)
* LSN.
*/
droppedpin = false;
- _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
+ /* For SnapshotDirty/Self scans, the buffer is already locked */
+ if (!so->currPos.extra_unlock)
+ _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
page = BufferGetPage(so->currPos.buf);
}
else
{
Buffer buf;
+ /* extra_unlock should never be set without a valid buffer pin */
+ Assert(!so->currPos.extra_unlock);
droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
@@ -2502,6 +2506,8 @@ _bt_killitems(IndexScanDesc scan)
}
_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ /* Reset the extra_unlock flag since we've now released the lock */
+ so->currPos.extra_unlock = false;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e4fdeca3402..067d1caf9cf 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -956,6 +956,7 @@ typedef struct BTScanPosItem /* what we remember about each match */
typedef struct BTScanPosData
{
Buffer buf; /* currPage buf (invalid means unpinned) */
+ bool extra_unlock; /* for SnapshotDirty/Self, read lock is held even after _bt_drop_lock_and_maybe_pin */
/* page details as of the saved position's call to _bt_readpage */
BlockNumber currPage; /* page referenced by items array */
@@ -1003,6 +1004,7 @@ typedef BTScanPosData *BTScanPos;
)
#define BTScanPosUnpin(scanpos) \
do { \
+ Assert(!(scanpos).extra_unlock); \
ReleaseBuffer((scanpos).buf); \
(scanpos).buf = InvalidBuffer; \
} while (0)
@@ -1022,6 +1024,7 @@ typedef BTScanPosData *BTScanPos;
do { \
(scanpos).buf = InvalidBuffer; \
(scanpos).currPage = InvalidBlockNumber; \
+ (scanpos).extra_unlock = false; \
} while (0)
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
--
2.43.0
v5-0001-Add-isolation-test-to-reproduce-dirty-snapshot-sc.patchapplication/octet-stream; name=v5-0001-Add-isolation-test-to-reproduce-dirty-snapshot-sc.patchDownload
From 085f5635c1ddba02b74c2bdc7588d556cc0bd136 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v5 1/2] Add isolation test to reproduce dirty snapshot scan
issue
This commit introduces an isolation test to reliably reproduce the issue where non-MVCC index scans using SnapshotDirty can miss tuples due to concurrent modifications. This situation can lead to incorrect results.
To facilitate this test, a new injection point is added in index_getnext_slot.
Changes include:
* Added injection point in src/backend/access/index/indexam.c
* Updated Makefile and meson.build to include the new dirty_index_scan isolation test.
* Created a new isolation spec dirty_index_scan.spec and its expected output to define and verify the test steps.
* This test complements the previous fix by demonstrating the issue and verifying that the fix effectively addresses it.
---
src/backend/access/index/indexam.c | 8 ++++
src/backend/executor/execIndexing.c | 3 ++
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 27 ++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 37 +++++++++++++++++++
6 files changed, 77 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 8b1f555435b..ad3a3605282 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -696,6 +697,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelationOid(scan->indexRelation->rd_id))
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch");
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 742f3f8c08d..deca14d6326 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -944,6 +945,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
return !conflict;
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..b73f8ac80f2 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..c286a9fd5b6
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s2_s1 s3_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42; <waiting ...>
+step s3_s1:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+ <waiting ...>
+step s1_s1: <... completed>
+step s2_s1: <... completed>
+step s3_s1: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..bb3869f9a75 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -47,6 +47,7 @@ tests += {
'basic',
'inplace',
'syscache-update-pruned',
+ 'dirty_index_scan',
],
'runningcheck': false, # see syscache-update-pruned
},
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..373bcaf4929
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,37 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+}
+
+permutation
+ s1_s1
+ s2_s1(*)
+ s3_s1(s1_s1)
\ No newline at end of file
--
2.43.0
Hello, everyone!
Rebased + fix for compilation due to the new INJECTION_POINT signature.
Best regards,
Mikhail.
Attachments:
v6-0001-Add-an-isolation-test-to-reproduce-a-dirty-snapsh.patchapplication/octet-stream; name=v6-0001-Add-an-isolation-test-to-reproduce-a-dirty-snapsh.patchDownload
From 07f2d816c94d94c587cef16a6448de09ef1dc2d6 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v6 1/2] Add an isolation test to reproduce a dirty snapshot
scan issue
This commit introduces an isolation test to reliably reproduce the issue where non-MVCC index scans using SnapshotDirty can miss tuples due to concurrent modifications.
When using non-MVCC snapshot types, a scan could miss tuples if concurrent transactions delete existing tuples and insert new ones with different TIDs on the same page.
---
src/backend/access/index/indexam.c | 8 ++++
src/backend/executor/execIndexing.c | 3 ++
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 27 ++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 37 +++++++++++++++++++
6 files changed, 77 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 219df1971da..676e06c095c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -741,6 +742,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelationOid(scan->indexRelation->rd_id))
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..36748b39e68 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -943,6 +944,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..b73f8ac80f2 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..c286a9fd5b6
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s2_s1 s3_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42; <waiting ...>
+step s3_s1:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+ <waiting ...>
+step s1_s1: <... completed>
+step s2_s1: <... completed>
+step s3_s1: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..bb3869f9a75 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -47,6 +47,7 @@ tests += {
'basic',
'inplace',
'syscache-update-pruned',
+ 'dirty_index_scan',
],
'runningcheck': false, # see syscache-update-pruned
},
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..373bcaf4929
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,37 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+}
+
+permutation
+ s1_s1
+ s2_s1(*)
+ s3_s1(s1_s1)
\ No newline at end of file
--
2.43.0
v6-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchapplication/octet-stream; name=v6-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchDownload
From 318822f0d2848fb3a3400dd2ea5f2c0400f9505b Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Wed, 12 Mar 2025 00:53:02 +0100
Subject: [PATCH v6 2/2] Fix btree index scan concurrency issues with dirty
snapshots
This patch addresses an issue where non-MVCC index scans using SnapshotDirty
or SnapshotSelf could miss tuples due to concurrent modifications. The fix
retains read locks on pages for these special snapshot types until the scan
is done with the page's tuples, preventing concurrent modifications from
causing inconsistent results.
Updated README to document this special case in the btree locking mechanism.
---
src/backend/access/nbtree/README | 13 +++++++++-
src/backend/access/nbtree/nbtree.c | 17 +++++++++++++
src/backend/access/nbtree/nbtsearch.c | 35 +++++++++++++++++++++++----
src/backend/access/nbtree/nbtutils.c | 8 +++++-
src/include/access/nbtree.h | 3 +++
5 files changed, 69 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..a9280415633 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -85,7 +85,8 @@ move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see page deletion
discussion below.)
-Page read locks are held only for as long as a scan is examining a page.
+Page read locks are held only for as long as a scan is examining a page
+(with an exception for SnapshotDirty and SnapshotSelf scans - see below).
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
@@ -103,6 +104,16 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite these mechanics being in place, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This issue can occur if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and cached its content in the
+backend-local storage, it might skip the old tuple due to deletion and miss the new
+tuple because the scan does not re-read the page. To address this issue, for
+SnapshotDirty and SnapshotSelf scans, we retain the read lock on the page until
+we're completely done processing all the tuples from that page, preventing
+concurrent modifications that could lead to inconsistent results.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 765659887af..8239076c518 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -389,6 +389,12 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ /* Release any extended lock held for SnapshotDirty/Self scans */
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
}
@@ -445,6 +451,12 @@ btendscan(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ /* Release any extended lock held for SnapshotDirty/Self scans */
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
}
@@ -525,6 +537,11 @@ btrestrpos(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fe9a3886913..2c5a34cacfe 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -61,11 +61,22 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* This will prevent vacuum from stalling in a blocked state trying to read a
* page when a cursor is sitting on it.
*
+ * For SnapshotDirty and SnapshotSelf scans, we don't actually unlock the buffer
+ * here, but instead set extra_unlock to indicate that the lock should be held
+ * until we're completely done with this page. This prevents concurrent
+ * modifications from causing inconsistent results during non-MVCC scans.
+ *
* See nbtree/README section on making concurrent TID recycling safe.
*/
static void
_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
+ if (scan->xs_snapshot->snapshot_type == SNAPSHOT_DIRTY ||
+ scan->xs_snapshot->snapshot_type == SNAPSHOT_SELF)
+ {
+ sp->extra_unlock = true;
+ return;
+ }
_bt_unlockbuf(scan->indexRelation, sp->buf);
if (IsMVCCSnapshot(scan->xs_snapshot) &&
@@ -1527,7 +1538,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* _bt_next() -- Get the next item in a scan.
*
* On entry, so->currPos describes the current page, which may be pinned
- * but is not locked, and so->currPos.itemIndex identifies which item was
+ * but is not locked (except for SnapshotDirty and SnapshotSelf scans, where
+ * the page remains locked), and so->currPos.itemIndex identifies which item was
* previously returned.
*
* On success exit, so->currPos is updated as needed, and _bt_returnitem
@@ -2107,10 +2119,11 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
*
* Wrapper on _bt_readnextpage that performs final steps for the current page.
*
- * On entry, if so->currPos.buf is valid the buffer is pinned but not locked.
- * If there's no pin held, it's because _bt_drop_lock_and_maybe_pin dropped
- * the pin eagerly earlier on. The scan must have so->currPos.currPage set to
- * a valid block, in any case.
+ * On entry, if so->currPos.buf is valid the buffer is pinned but not locked,
+ * except for SnapshotDirty and SnapshotSelf scans where the buffer remains locked
+ * until we're done with all tuples from the page. If there's no pin held, it's
+ * because _bt_drop_lock_and_maybe_pin dropped the pin eagerly earlier on.
+ * The scan must have so->currPos.currPage set to a valid block, in any case.
*/
static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
@@ -2169,8 +2182,20 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
/* mark/restore not supported by parallel scans */
Assert(!scan->parallel_scan);
+ Assert(scan->xs_snapshot->snapshot_type != SNAPSHOT_DIRTY);
+ Assert(scan->xs_snapshot->snapshot_type != SNAPSHOT_SELF);
}
+ /*
+ * For SnapshotDirty/Self scans, we kept the read lock after processing
+ * the page's tuples (see _bt_drop_lock_and_maybe_pin). Now that we're
+ * moving to another page, we need to explicitly release that lock.
+ */
+ if (so->currPos.extra_unlock)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ so->currPos.extra_unlock = false;
+ }
BTScanPosUnpinIfPinned(so->currPos);
/* Walk to the next page with data */
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7d3..61008a36b5d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3383,13 +3383,17 @@ _bt_killitems(IndexScanDesc scan)
* LSN.
*/
droppedpin = false;
- _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
+ /* For SnapshotDirty/Self scans, the buffer is already locked */
+ if (!so->currPos.extra_unlock)
+ _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
page = BufferGetPage(so->currPos.buf);
}
else
{
Buffer buf;
+ /* extra_unlock should never be set without a valid buffer pin */
+ Assert(!so->currPos.extra_unlock);
droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
@@ -3526,6 +3530,8 @@ _bt_killitems(IndexScanDesc scan)
}
_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ /* Reset the extra_unlock flag since we've now released the lock */
+ so->currPos.extra_unlock = false;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..2c2485f34bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -962,6 +962,7 @@ typedef struct BTScanPosItem /* what we remember about each match */
typedef struct BTScanPosData
{
Buffer buf; /* currPage buf (invalid means unpinned) */
+ bool extra_unlock; /* for SnapshotDirty/Self, read lock is held even after _bt_drop_lock_and_maybe_pin */
/* page details as of the saved position's call to _bt_readpage */
BlockNumber currPage; /* page referenced by items array */
@@ -1009,6 +1010,7 @@ typedef BTScanPosData *BTScanPos;
)
#define BTScanPosUnpin(scanpos) \
do { \
+ Assert(!(scanpos).extra_unlock); \
ReleaseBuffer((scanpos).buf); \
(scanpos).buf = InvalidBuffer; \
} while (0)
@@ -1028,6 +1030,7 @@ typedef BTScanPosData *BTScanPos;
do { \
(scanpos).buf = InvalidBuffer; \
(scanpos).currPage = InvalidBlockNumber; \
+ (scanpos).extra_unlock = false; \
} while (0)
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
--
2.43.0
Hello!
Rebased/reworked to align with the changes of [0].
Best regards.
[0]: /messages/by-id/CAH2-WznZBhWqDBDVGh1VhVBLgLqaYHEkPhmVV7mJCr1Y3ZQhQQ@mail.gmail.com
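In short, the rework replaces the per-position extra_unlock flag with a
scan-level dropLock flag that is decided once per (re)scan and also gates
the existing dropPin optimization. Condensed from the attached v7 patch:

/* btrescan(): decide once whether the leaf-page read lock may be
 * dropped eagerly; SnapshotDirty/Self scans must keep it. */
so->dropLock = scan->xs_snapshot->snapshot_type != SNAPSHOT_DIRTY &&
               scan->xs_snapshot->snapshot_type != SNAPSHOT_SELF;

/* Dropping the pin early only makes sense if the lock is dropped too. */
so->dropPin = (so->dropLock &&
               !scan->xs_want_itup &&
               IsMVCCSnapshot(scan->xs_snapshot) &&
               RelationNeedsWAL(scan->indexRelation) &&
               scan->heapRelation != NULL);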
Attachments:
v7-0001-Add-an-isolation-test-to-reproduce-a-dirty-snapsh.patchapplication/x-patch; name=v7-0001-Add-an-isolation-test-to-reproduce-a-dirty-snapsh.patchDownload
From 13934bd0b3588e71a96228f31395ab661e0e749d Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v7 1/2] Add an isolation test to reproduce a dirty snapshot
scan issue
This commit introduces an isolation test to reliably reproduce the issue where non-MVCC index scans using SnapshotDirty can miss tuples due to concurrent modifications.
When using non-MVCC snapshot types, a scan could miss tuples if concurrent transactions delete existing tuples and insert new ones with different TIDs on the same page.
---
src/backend/access/index/indexam.c | 8 ++++
src/backend/executor/execIndexing.c | 3 ++
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 27 ++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 37 +++++++++++++++++++
6 files changed, 77 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 219df1971da..676e06c095c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -741,6 +742,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelationOid(scan->indexRelation->rd_id))
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..36748b39e68 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -943,6 +944,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..b73f8ac80f2 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..c286a9fd5b6
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s2_s1 s3_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42; <waiting ...>
+step s3_s1:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+ <waiting ...>
+step s1_s1: <... completed>
+step s2_s1: <... completed>
+step s3_s1: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..bb3869f9a75 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -47,6 +47,7 @@ tests += {
'basic',
'inplace',
'syscache-update-pruned',
+ 'dirty_index_scan',
],
'runningcheck': false, # see syscache-update-pruned
},
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..373bcaf4929
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,37 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+}
+
+permutation
+ s1_s1
+ s2_s1(*)
+ s3_s1(s1_s1)
\ No newline at end of file
--
2.43.0
v7-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchapplication/x-patch; name=v7-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchDownload
From 261361b0016a729a46b6b8a24548645f118fd79b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <mihailnikalayeu@gmail.com>
Date: Mon, 9 Jun 2025 23:31:18 +0200
Subject: [PATCH v7 2/2] Fix btree index scan concurrency issues with dirty
snapshots
This patch addresses an issue where non-MVCC index scans using SnapshotDirty
or SnapshotSelf could miss tuples due to concurrent modifications. The fix
retains read locks on pages for these special snapshot types until the scan
is done with the page's tuples, preventing concurrent modifications from
causing inconsistent results.
Updated README to document this special case in the btree locking mechanism.
---
src/backend/access/nbtree/README | 13 ++++++++++++-
src/backend/access/nbtree/nbtree.c | 19 ++++++++++++++++++-
src/backend/access/nbtree/nbtsearch.c | 16 ++++++++++++----
src/backend/access/nbtree/nbtutils.c | 4 +++-
src/include/access/nbtree.h | 1 +
5 files changed, 46 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..a9280415633 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -85,7 +85,8 @@ move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see page deletion
discussion below.)
-Page read locks are held only for as long as a scan is examining a page.
+Page read locks are held only for as long as a scan is examining a page
+(with an exception for SnapshotDirty and SnapshotSelf scans - see below).
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
@@ -103,6 +104,16 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite these mechanics being in place, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This issue can occur if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and cached its content in the
+backend-local storage, it might skip the old tuple due to deletion and miss the new
+tuple because the scan does not re-read the page. To address this issue, for
+SnapshotDirty and SnapshotSelf scans, we retain the read lock on the page until
+we're completely done processing all the tuples from that page, preventing
+concurrent modifications that could lead to inconsistent results.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 03a1d7b027a..f9ef256561d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -393,10 +393,22 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
}
+ /*
+ * For SnapshotDirty and SnapshotSelf scans, we don't unlock the buffer;
+ * instead, the lock is kept until we're completely done with this page.
+ * This prevents concurrent modifications from causing inconsistent
+ * results during non-MVCC scans.
+ *
+ * See nbtree/README for information about SnapshotDirty and SnapshotSelf.
+ */
+ so->dropLock = scan->xs_snapshot->snapshot_type != SNAPSHOT_DIRTY
+ && scan->xs_snapshot->snapshot_type != SNAPSHOT_SELF;
/*
* We prefer to eagerly drop leaf page pins before btgettuple returns.
* This avoids making VACUUM wait to acquire a cleanup lock on the page.
@@ -418,7 +430,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
*
* See nbtree/README section on making concurrent TID recycling safe.
*/
- so->dropPin = (!scan->xs_want_itup &&
+ so->dropPin = (so->dropLock &&
+ !scan->xs_want_itup &&
IsMVCCSnapshot(scan->xs_snapshot) &&
RelationNeedsWAL(scan->indexRelation) &&
scan->heapRelation != NULL);
@@ -475,6 +488,8 @@ btendscan(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
@@ -555,6 +570,8 @@ btrestrpos(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 070f14c8b91..6e7f3c76162 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -57,12 +57,14 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
*
- * Unlock so->currPos.buf. If scan is so->dropPin, drop the pin, too.
+ * Unlock so->currPos.buf if so->dropLock. If scan is so->dropPin, drop the pin, too.
* Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
*/
static inline void
_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
{
+ if (!so->dropLock)
+ return;
if (!so->dropPin)
{
/* Just drop the lock (not the pin) */
@@ -1532,7 +1534,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* _bt_next() -- Get the next item in a scan.
*
* On entry, so->currPos describes the current page, which may be pinned
- * but is not locked, and so->currPos.itemIndex identifies which item was
+ * but is not locked (except for SnapshotDirty and SnapshotSelf scans, where
+ * the page remains locked), and so->currPos.itemIndex identifies which item was
* previously returned.
*
* On success exit, so->currPos is updated as needed, and _bt_returnitem
@@ -2111,7 +2114,9 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
* Wrapper on _bt_readnextpage that performs final steps for the current page.
*
* On entry, so->currPos must be valid. Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
+ * never locked, except for SnapshotDirty and SnapshotSelf scans where the buffer
+ * remains locked until we're done with all tuples from the page
+ * (Actually, when so->dropPin there won't even be a pin held,
* though so->currPos.currPage must still be set to a valid block number.)
*/
static bool
@@ -2126,6 +2131,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/*
* Before we modify currPos, make a copy of the page data if there was a
@@ -2265,7 +2272,8 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
}
/* There's no actually-matching data on the page in so->currPos.buf */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (so->dropLock)
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/* Call _bt_readnextpage using its _bt_steppage wrapper function */
if (!_bt_steppage(scan, dir))
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 29f0dca1b08..a79d8bfc906 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3369,7 +3369,9 @@ _bt_killitems(IndexScanDesc scan)
* concurrent VACUUMs from recycling any of the TIDs on the page.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_lockbuf(rel, so->currPos.buf, BT_READ);
+ /* Relock only if the lock was dropped earlier (see so->dropLock). */
+ if (so->dropLock)
+ _bt_lockbuf(rel, so->currPos.buf, BT_READ);
}
else
{
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e709d2e0afe..ca8ebd7a418 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1070,6 +1070,7 @@ typedef struct BTScanOpaqueData
/* info about killed items if any (killedItems is NULL if never used) */
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ bool dropLock; /* drop lock before btgettuple returns? */
bool dropPin; /* drop leaf pin before btgettuple returns? */
/*
--
2.43.0
Rebased.
Attachments:
v8-0001-Add-an-isolation-test-to-reproduce-a-dirty-snapsh.patchapplication/octet-stream; name=v8-0001-Add-an-isolation-test-to-reproduce-a-dirty-snapsh.patchDownload
From eca59038f86d038b68304e4fba4274b4d93dd189 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v8 1/2] Add an isolation test to reproduce a dirty snapshot
scan issue
This commit introduces an isolation test to reliably reproduce the issue where non-MVCC index scans using SnapshotDirty can miss tuples due to concurrent modifications.
When using non-MVCC snapshot types, a scan could miss tuples if concurrent transactions delete existing tuples and insert new ones with different TIDs on the same page.
---
src/backend/access/index/indexam.c | 8 ++++
src/backend/executor/execIndexing.c | 3 ++
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 27 ++++++++++++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 37 +++++++++++++++++++
6 files changed, 77 insertions(+), 1 deletion(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 219df1971da..676e06c095c 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -741,6 +742,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelationOid(scan->indexRelation->rd_id))
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..36748b39e68 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -943,6 +944,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..b73f8ac80f2 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..c286a9fd5b6
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s2_s1 s3_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42; <waiting ...>
+step s3_s1:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+ <waiting ...>
+step s1_s1: <... completed>
+step s2_s1: <... completed>
+step s3_s1: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..bb3869f9a75 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -47,6 +47,7 @@ tests += {
'basic',
'inplace',
'syscache-update-pruned',
+ 'dirty_index_scan',
],
'runningcheck': false, # see syscache-update-pruned
},
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..373bcaf4929
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,37 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch');
+}
+
+permutation
+ s1_s1
+ s2_s1(*)
+ s3_s1(s1_s1)
\ No newline at end of file
--
2.43.0
v8-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchapplication/octet-stream; name=v8-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchDownload
From 4793759ffa5aa1a31bca8e19d8aab863adbbd9c4 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <mihailnikalayeu@gmail.com>
Date: Mon, 16 Jun 2025 22:20:38 +0200
Subject: [PATCH v8 2/2] Fix btree index scan concurrency issues with dirty
snapshots
This patch addresses an issue where non-MVCC index scans using SnapshotDirty
or SnapshotSelf could miss tuples due to concurrent modifications. The fix
retains read locks on pages for these special snapshot types until the scan
is done with the page's tuples, preventing concurrent modifications from
causing inconsistent results.
Updated README to document this special case in the btree locking mechanism.
---
src/backend/access/nbtree/README | 13 ++++++++++++-
src/backend/access/nbtree/nbtree.c | 19 ++++++++++++++++++-
src/backend/access/nbtree/nbtsearch.c | 16 ++++++++++++----
src/backend/access/nbtree/nbtutils.c | 4 +++-
src/include/access/nbtree.h | 1 +
5 files changed, 46 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..a9280415633 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -85,7 +85,8 @@ move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see page deletion
discussion below.)
-Page read locks are held only for as long as a scan is examining a page.
+Page read locks are held only for as long as a scan is examining a page
+(with an exception for SnapshotDirty and SnapshotSelf scans - see below).
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
@@ -103,6 +104,16 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite these mechanics being in place, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This issue can occur if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID in the
+same page. If the scan has already visited the page and cached its content in the
+backend-local storage, it might skip the old tuple due to deletion and miss the new
+tuple because the scan does not re-read the page. To address this issue, for
+SnapshotDirty and SnapshotSelf scans, we retain the read lock on the page until
+we're completely done processing all the tuples from that page, preventing
+concurrent modifications that could lead to inconsistent results.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index fdff960c130..bda2b821a51 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -393,10 +393,22 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
}
+ /*
+ * For SnapshotDirty and SnapshotSelf scans, we don't unlock the buffer;
+ * instead, the lock is kept until we're completely done with this page.
+ * This prevents concurrent modifications from causing inconsistent
+ * results during non-MVCC scans.
+ *
+ * See nbtree/README for information about SnapshotDirty and SnapshotSelf.
+ */
+ so->dropLock = scan->xs_snapshot->snapshot_type != SNAPSHOT_DIRTY
+ && scan->xs_snapshot->snapshot_type != SNAPSHOT_SELF;
/*
* We prefer to eagerly drop leaf page pins before btgettuple returns.
* This avoids making VACUUM wait to acquire a cleanup lock on the page.
@@ -420,7 +432,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
*
* Note: so->dropPin should never change across rescans.
*/
- so->dropPin = (!scan->xs_want_itup &&
+ so->dropPin = (so->dropLock &&
+ !scan->xs_want_itup &&
IsMVCCSnapshot(scan->xs_snapshot) &&
RelationNeedsWAL(scan->indexRelation) &&
scan->heapRelation != NULL);
@@ -477,6 +490,8 @@ btendscan(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
@@ -557,6 +572,8 @@ btrestrpos(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 36544ecfd58..04a7485c643 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -57,12 +57,14 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
*
- * Unlock so->currPos.buf. If scan is so->dropPin, drop the pin, too.
+ * Unlock so->currPos.buf if so->dropLock. If scan is so->dropPin, drop the pin, too.
* Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
*/
static inline void
_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
{
+ if (!so->dropLock)
+ return;
if (!so->dropPin)
{
/* Just drop the lock (not the pin) */
@@ -1532,7 +1534,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* _bt_next() -- Get the next item in a scan.
*
* On entry, so->currPos describes the current page, which may be pinned
- * but is not locked, and so->currPos.itemIndex identifies which item was
+ * but is not locked (except for SnapshotDirty and SnapshotSelf scans, where
+ * the page remains locked), and so->currPos.itemIndex identifies which item was
* previously returned.
*
* On success exit, so->currPos is updated as needed, and _bt_returnitem
@@ -2111,7 +2114,9 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
* Wrapper on _bt_readnextpage that performs final steps for the current page.
*
* On entry, so->currPos must be valid. Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
+ * never locked, except for SnapshotDirty and SnapshotSelf scans, where the
+ * buffer remains locked until we're done with all tuples from the page.
+ * (Actually, when so->dropPin there won't even be a pin held,
* though so->currPos.currPage must still be set to a valid block number.)
*/
static bool
@@ -2126,6 +2131,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/*
* Before we modify currPos, make a copy of the page data if there was a
@@ -2265,7 +2272,8 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
}
/* There's no actually-matching data on the page in so->currPos.buf */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (so->dropLock)
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/* Call _bt_readnextpage using its _bt_steppage wrapper function */
if (!_bt_steppage(scan, dir))
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c71d1b6f2e1..33215c89dde 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3379,8 +3379,10 @@ _bt_killitems(IndexScanDesc scan)
* concurrent VACUUMs from recycling any of the TIDs on the page.
*/
Assert(BTScanPosIsPinned(so->currPos));
+ /* Re-acquire the lock only if the scan dropped it (so->dropLock). */
buf = so->currPos.buf;
- _bt_lockbuf(rel, buf, BT_READ);
+ if (so->dropLock)
+ _bt_lockbuf(rel, buf, BT_READ);
}
else
{
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e709d2e0afe..ca8ebd7a418 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1070,6 +1070,7 @@ typedef struct BTScanOpaqueData
/* info about killed items if any (killedItems is NULL if never used) */
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ bool dropLock; /* drop lock before btgettuple returns? */
bool dropPin; /* drop leaf pin before btgettuple returns? */
/*
--
2.43.0
Hello, everyone!
The issue description is available at [0]/messages/by-id/CADzfLwWuXh8KO=OZvB71pZnQ8nH0NYXfuGbFU6FBiVZUbmuFGg@mail.gmail.com (in a few words: a SnapshotDirty
scan may miss a tuple in the index because of a race condition with a
concurrent update of that tuple).
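To illustrate the race, here is a hypothetical interleaving reconstructed
from the description above (names and values are made up for illustration):
-- S1: DirtySnapshot index scan (e.g. a replica identity lookup) reads a
--     leaf page and caches all matching TIDs in backend-local storage
-- S2: UPDATE conf_tab SET data = 'new' WHERE a = 1;
--     non-HOT update: the new row version gets a new TID and a new index
--     entry, possibly on the same leaf page; S2 commits
-- S1: fetches the cached TID, finds only the dead version, and never sees
--     the new TID because the page is not re-read - tuple reported missing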
But I have realised there are cases much more severe than invalid
conflict messages in logical replication: lost deletes/updates.
New tests with reproducers are included in the new patch version.
Short description of the issues:
1) Lost delete
Setup:
CREATE TABLE conf_tab(a int PRIMARY key, data text);
CREATE INDEX data_index ON conf_tab(data);
INSERT INTO conf_tab(a, data) VALUES (1,'frompub');
On publisher:
DELETE FROM conf_tab WHERE a=1;
On subscriber:
UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
Expected result:
Tuple is deleted on both subscriber and publisher.
Actual result:
Either as expected, or:
- Tuple is deleted on publisher, but 'fromsubnew' remains on subscriber.
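An illustrative check on the subscriber once replication has caught up (not
part of the tests):
SELECT count(*) FROM conf_tab;
-- expected: 0 (the delete is applied); with the bug the 'fromsubnew' row
-- may survive, i.e. count = 1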
2) Lost update
Setup:
On publisher:
CREATE TABLE conf_tab(a int PRIMARY key, data text);
INSERT INTO conf_tab(a, data) VALUES (1,'frompub');
On subscriber:
-- note additional subscriber-only column - i
CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
CREATE INDEX i_index ON conf_tab(i);
On publisher:
UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);
On subscriber:
UPDATE conf_tab SET i = 1 WHERE (a=1);
Expected result:
On subscriber: tuple (a=1, data='frompubnew', i=1).
Actual result:
Either as expected, or:
- Publisher update is lost, leaving (a=1, data='frompub', i=1) on subscriber.
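Again, an illustrative check on the subscriber:
SELECT a, data, i FROM conf_tab WHERE a = 1;
-- expected: (1, 'frompubnew', 1); with the bug the publisher's update may
-- be lost, leaving (1, 'frompub', 1)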
Best regards,
Mikhail.
[0]: /messages/by-id/CADzfLwWuXh8KO=OZvB71pZnQ8nH0NYXfuGbFU6FBiVZUbmuFGg@mail.gmail.com
Attachments:
v9-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchapplication/octet-stream; name=v9-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patchDownload
From 62a3fcf6118ed5858d2899f40a61b101ebe872d9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <mihailnikalayeu@gmail.com>
Date: Mon, 16 Jun 2025 22:20:38 +0200
Subject: [PATCH v9 2/2] Fix btree index scan concurrency issues with dirty
snapshots
This patch addresses an issue where non-MVCC index scans using SnapshotDirty or SnapshotSelf could miss tuples due to concurrent modifications. The fix retains read locks on pages for these special snapshot types until the scan is done with the page's tuples, preventing concurrent modifications from causing inconsistent results.
Updated README to document this special case in the btree locking mechanism.
---
src/backend/access/nbtree/README | 13 ++++++++++++-
src/backend/access/nbtree/nbtree.c | 19 ++++++++++++++++++-
src/backend/access/nbtree/nbtsearch.c | 16 ++++++++++++----
src/backend/access/nbtree/nbtutils.c | 4 +++-
src/backend/executor/execReplication.c | 8 ++++++--
src/include/access/nbtree.h | 1 +
6 files changed, 52 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..a9280415633 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -85,7 +85,8 @@ move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see page deletion
discussion below.)
-Page read locks are held only for as long as a scan is examining a page.
+Page read locks are held only for as long as a scan is examining a page
+(with an exception for SnapshotDirty and SnapshotSelf scans; see below).
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
@@ -103,6 +104,16 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite the mechanics described above, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This happens if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID on the
+same page. If the scan has already visited the page and cached its content in the
+backend-local storage, it might skip the old tuple due to deletion and miss the new
+tuple because the scan does not re-read the page. To address this issue, for
+SnapshotDirty and SnapshotSelf scans, we retain the read lock on the page until
+we're completely done processing all the tuples from that page, preventing
+concurrent modifications that could lead to inconsistent results.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index fdff960c130..bda2b821a51 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -393,10 +393,22 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
}
+ /*
+ * For SnapshotDirty and SnapshotSelf scans, we don't unlock the buffer;
+ * the lock is kept until we're completely done with this page.
+ * This prevents concurrent modifications from causing inconsistent
+ * results during non-MVCC scans.
+ *
+ * See nbtree/README for information about SnapshotDirty and SnapshotSelf.
+ */
+ so->dropLock = scan->xs_snapshot->snapshot_type != SNAPSHOT_DIRTY
+ && scan->xs_snapshot->snapshot_type != SNAPSHOT_SELF;
/*
* We prefer to eagerly drop leaf page pins before btgettuple returns.
* This avoids making VACUUM wait to acquire a cleanup lock on the page.
@@ -420,7 +432,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
*
* Note: so->dropPin should never change across rescans.
*/
- so->dropPin = (!scan->xs_want_itup &&
+ so->dropPin = (so->dropLock &&
+ !scan->xs_want_itup &&
IsMVCCSnapshot(scan->xs_snapshot) &&
RelationNeedsWAL(scan->indexRelation) &&
scan->heapRelation != NULL);
@@ -477,6 +490,8 @@ btendscan(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
@@ -557,6 +572,8 @@ btrestrpos(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d69798795b4..f92dba17fa4 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -57,12 +57,14 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
*
- * Unlock so->currPos.buf. If scan is so->dropPin, drop the pin, too.
+ * Unlock so->currPos.buf if so->dropLock. If scan is so->dropPin, drop the pin, too.
* Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
*/
static inline void
_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
{
+ if (!so->dropLock)
+ return;
if (!so->dropPin)
{
/* Just drop the lock (not the pin) */
@@ -1579,7 +1581,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* _bt_next() -- Get the next item in a scan.
*
* On entry, so->currPos describes the current page, which may be pinned
- * but is not locked, and so->currPos.itemIndex identifies which item was
+ * but is not locked (except for SnapshotDirty and SnapshotSelf scans, where
+ * the page remains locked), and so->currPos.itemIndex identifies which item was
* previously returned.
*
* On success exit, so->currPos is updated as needed, and _bt_returnitem
@@ -2158,7 +2161,9 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
* Wrapper on _bt_readnextpage that performs final steps for the current page.
*
* On entry, so->currPos must be valid. Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
+ * never locked, except for SnapshotDirty and SnapshotSelf scans, where the
+ * buffer remains locked until we're done with all tuples from the page.
+ * (Actually, when so->dropPin there won't even be a pin held,
* though so->currPos.currPage must still be set to a valid block number.)
*/
static bool
@@ -2173,6 +2178,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/*
* Before we modify currPos, make a copy of the page data if there was a
@@ -2312,7 +2319,8 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
}
/* There's no actually-matching data on the page in so->currPos.buf */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (so->dropLock)
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/* Call _bt_readnextpage using its _bt_steppage wrapper function */
if (!_bt_steppage(scan, dir))
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index edfea2acaff..56d5bf44785 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3283,8 +3283,10 @@ _bt_killitems(IndexScanDesc scan)
* concurrent VACUUMs from recycling any of the TIDs on the page.
*/
Assert(BTScanPosIsPinned(so->currPos));
+ /* Re-acquire the lock only if the scan dropped it (so->dropLock). */
buf = so->currPos.buf;
- _bt_lockbuf(rel, buf, BT_READ);
+ if (so->dropLock)
+ _bt_lockbuf(rel, buf, BT_READ);
}
else
{
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index da0cbf41d6f..c2f5aa2ba5c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,12 +205,11 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Start an index scan. */
scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ index_rescan(scan, skey, skey_attoff, NULL, 0);
retry:
found = false;
- index_rescan(scan, skey, skey_attoff, NULL, 0);
-
/* Try to find the tuple */
while (index_getnext_slot(scan, ForwardScanDirection, outslot))
{
@@ -238,6 +237,8 @@ retry:
*/
if (TransactionIdIsValid(xwait))
{
+ /* Rescan before waiting, to ensure all index page locks are released. */
+ index_rescan(scan, skey, skey_attoff, NULL, 0);
XactLockTableWait(xwait, NULL, NULL, XLTW_None);
goto retry;
}
@@ -266,7 +267,10 @@ retry:
PopActiveSnapshot();
if (should_refetch_tuple(res, &tmfd))
+ {
+ index_rescan(scan, skey, skey_attoff, NULL, 0);
goto retry;
+ }
}
index_endscan(scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ab467cb8fd..9c10931c8e2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1069,6 +1069,7 @@ typedef struct BTScanOpaqueData
/* info about killed items if any (killedItems is NULL if never used) */
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ bool dropLock; /* drop lock before btgettuple returns? */
bool dropPin; /* drop leaf pin before btgettuple returns? */
/*
--
2.43.0
v9-0001-This-patch-introduces-new-injection-points-and-TA.patchapplication/octet-stream; name=v9-0001-This-patch-introduces-new-injection-points-and-TA.patchDownload
From 48d9d5ce095e1de1af5a14d211fb5de50fa9d510 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v9 1/2] This patch introduces new injection points and TAP
tests to reproduce and verify conflict detection issues that arise during
SNAPSHOT_DIRTY index scans in logical replication and
check_exclusion_or_unique_constraint.
---
src/backend/access/index/indexam.c | 8 +
src/backend/executor/execIndexing.c | 3 +
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 27 ++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 37 +++++
src/test/subscription/Makefile | 1 +
src/test/subscription/meson.build | 7 +-
.../subscription/t/036_delete_missing_race.pl | 135 +++++++++++++++++
.../subscription/t/037_update_missing_race.pl | 137 ++++++++++++++++++
10 files changed, 356 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
create mode 100644 src/test/subscription/t/036_delete_missing_race.pl
create mode 100644 src/test/subscription/t/037_update_missing_race.pl
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 1a4f36fe0a9..2e65750979e 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -741,6 +742,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (scan->xs_snapshot->snapshot_type == SNAPSHOT_DIRTY)
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..c07ba230946 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -943,6 +944,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..15f5e6d23d0 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc vacuum
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..82d46397d61
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s2_s1 s3_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42; <waiting ...>
+step s3_s1:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ <waiting ...>
+step s1_s1: <... completed>
+step s2_s1: <... completed>
+step s3_s1: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 20390d6b4bf..a126fe20c2d 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -48,6 +48,7 @@ tests += {
'basic',
'inplace',
'syscache-update-pruned',
+ 'dirty_index_scan',
],
'runningcheck': false, # see syscache-update-pruned
},
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..91d20ab4612
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,37 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+}
+
+permutation
+ s1_s1
+ s2_s1(*)
+ s3_s1(s1_s1)
\ No newline at end of file
diff --git a/src/test/subscription/Makefile b/src/test/subscription/Makefile
index 50b65d8f6ea..51d28eca091 100644
--- a/src/test/subscription/Makefile
+++ b/src/test/subscription/Makefile
@@ -16,6 +16,7 @@ include $(top_builddir)/src/Makefile.global
EXTRA_INSTALL = contrib/hstore
export with_icu
+export enable_injection_points
check:
$(prove_check)
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 586ffba434e..8ed38cec2d0 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -5,7 +5,10 @@ tests += {
'sd': meson.current_source_dir(),
'bd': meson.current_build_dir(),
'tap': {
- 'env': {'with_icu': icu.found() ? 'yes' : 'no'},
+ 'env': {
+ 'with_icu': icu.found() ? 'yes' : 'no',
+ 'enable_injection_points': get_option('injection_points') ? 'yes' : 'no'
+ },
'tests': [
't/001_rep_changes.pl',
't/002_types.pl',
@@ -42,6 +45,8 @@ tests += {
't/033_run_as_table_owner.pl',
't/034_temporal.pl',
't/035_conflicts.pl',
+ 't/036_delete_missing_race.pl',
+ 't/037_update_missing_race.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/036_delete_missing_race.pl b/src/test/subscription/t/036_delete_missing_race.pl
new file mode 100644
index 00000000000..446011ff54c
--- /dev/null
+++ b/src/test/subscription/t/036_delete_missing_race.pl
@@ -0,0 +1,135 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);
+ CREATE INDEX data_index ON conf_tab(data);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Delete tuple on publisher
+$node_publisher->safe_psql('postgres', "DELETE FROM conf_tab WHERE a=1;");
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker',
+ 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update tuple on subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# Tuple was updated, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# But the tuple should be deleted on the subscriber anyway
+is($node_subscriber->safe_psql('postgres', 'SELECT count(*) from conf_tab'), 0, 'record deleted on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation "public.conf_tab": conflict=delete_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/037_update_missing_race.pl b/src/test/subscription/t/037_update_missing_race.pl
new file mode 100644
index 00000000000..acd2c87601a
--- /dev/null
+++ b/src/test/subscription/t/037_update_missing_race.pl
@@ -0,0 +1,137 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# Tuple was updated, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new column value should be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column should keep its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation "public.conf_tab": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
--
2.43.0
Hello,
Added one more test, covering invalid "update_deleted" conflict detection.
Best regards,
Mikhail.
Attachments:
v10-0002-Fix-btree-index-scan-concurrency-issues-with-dir.patchapplication/octet-stream; name=v10-0002-Fix-btree-index-scan-concurrency-issues-with-dir.patchDownload
From 660d39f2a31882427522fe48387922dcd4091101 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <mihailnikalayeu@gmail.com>
Date: Mon, 16 Jun 2025 22:20:38 +0200
Subject: [PATCH v10 2/2] Fix btree index scan concurrency issues with dirty
snapshots
This patch addresses an issue where non-MVCC index scans using SnapshotDirty or SnapshotSelf could miss tuples due to concurrent modifications. The fix retains read locks on pages for these special snapshot types until the scan is done with the page's tuples, preventing concurrent modifications from causing inconsistent results.
Updated README to document this special case in the btree locking mechanism.
---
src/backend/access/nbtree/README | 13 ++++++++++++-
src/backend/access/nbtree/nbtree.c | 19 ++++++++++++++++++-
src/backend/access/nbtree/nbtsearch.c | 16 ++++++++++++----
src/backend/access/nbtree/nbtutils.c | 4 +++-
src/backend/executor/execReplication.c | 8 ++++++--
src/include/access/nbtree.h | 1 +
6 files changed, 52 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..a9280415633 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -85,7 +85,8 @@ move right until we find a page whose right-link matches the page we
came from. (Actually, it's even harder than that; see page deletion
discussion below.)
-Page read locks are held only for as long as a scan is examining a page.
+Page read locks are held only for as long as a scan is examining a page
+(with an exception for SnapshotDirty and SnapshotSelf scans; see below).
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
@@ -103,6 +104,16 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite the mechanics described above, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This happens if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID on the
+same page. If the scan has already visited the page and cached its content in the
+backend-local storage, it might skip the old tuple due to deletion and miss the new
+tuple because the scan does not re-read the page. To address this issue, for
+SnapshotDirty and SnapshotSelf scans, we retain the read lock on the page until
+we're completely done processing all the tuples from that page, preventing
+concurrent modifications that could lead to inconsistent results.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index fdff960c130..bda2b821a51 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -393,10 +393,22 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
BTScanPosInvalidate(so->currPos);
}
+ /*
+ * For SnapshotDirty and SnapshotSelf scans, we don't unlock the buffer;
+ * the lock is kept until we're completely done with this page.
+ * This prevents concurrent modifications from causing inconsistent
+ * results during non-MVCC scans.
+ *
+ * See nbtree/README for information about SnapshotDirty and SnapshotSelf.
+ */
+ so->dropLock = scan->xs_snapshot->snapshot_type != SNAPSHOT_DIRTY
+ && scan->xs_snapshot->snapshot_type != SNAPSHOT_SELF;
/*
* We prefer to eagerly drop leaf page pins before btgettuple returns.
* This avoids making VACUUM wait to acquire a cleanup lock on the page.
@@ -420,7 +432,8 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
*
* Note: so->dropPin should never change across rescans.
*/
- so->dropPin = (!scan->xs_want_itup &&
+ so->dropPin = (so->dropLock &&
+ !scan->xs_want_itup &&
IsMVCCSnapshot(scan->xs_snapshot) &&
RelationNeedsWAL(scan->indexRelation) &&
scan->heapRelation != NULL);
@@ -477,6 +490,8 @@ btendscan(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
@@ -557,6 +572,8 @@ btrestrpos(IndexScanDesc scan)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
BTScanPosUnpinIfPinned(so->currPos);
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d69798795b4..f92dba17fa4 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -57,12 +57,14 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
/*
* _bt_drop_lock_and_maybe_pin()
*
- * Unlock so->currPos.buf. If scan is so->dropPin, drop the pin, too.
+ * Unlock so->currPos.buf if so->dropLock. If scan is so->dropPin, drop the pin, too.
* Dropping the pin prevents VACUUM from blocking on acquiring a cleanup lock.
*/
static inline void
_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so)
{
+ if (!so->dropLock)
+ return;
if (!so->dropPin)
{
/* Just drop the lock (not the pin) */
@@ -1579,7 +1581,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* _bt_next() -- Get the next item in a scan.
*
* On entry, so->currPos describes the current page, which may be pinned
- * but is not locked, and so->currPos.itemIndex identifies which item was
+ * but is not locked (except for SnapshotDirty and SnapshotSelf scans, where
+ * the page remains locked), and so->currPos.itemIndex identifies which item was
* previously returned.
*
* On success exit, so->currPos is updated as needed, and _bt_returnitem
@@ -2158,7 +2161,9 @@ _bt_returnitem(IndexScanDesc scan, BTScanOpaque so)
* Wrapper on _bt_readnextpage that performs final steps for the current page.
*
* On entry, so->currPos must be valid. Its buffer will be pinned, though
- * never locked. (Actually, when so->dropPin there won't even be a pin held,
+ * never locked, except for SnapshotDirty and SnapshotSelf scans, where the
+ * buffer remains locked until we're done with all tuples from the page.
+ * (Actually, when so->dropPin there won't even be a pin held,
* though so->currPos.currPage must still be set to a valid block number.)
*/
static bool
@@ -2173,6 +2178,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
/* Before leaving current page, deal with any killed items */
if (so->numKilled > 0)
_bt_killitems(scan);
+ else if (!so->dropLock) /* _bt_killitems always releases lock */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/*
* Before we modify currPos, make a copy of the page data if there was a
@@ -2312,7 +2319,8 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
}
/* There's no actually-matching data on the page in so->currPos.buf */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (so->dropLock)
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
/* Call _bt_readnextpage using its _bt_steppage wrapper function */
if (!_bt_steppage(scan, dir))
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index edfea2acaff..56d5bf44785 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3283,8 +3283,10 @@ _bt_killitems(IndexScanDesc scan)
* concurrent VACUUMs from recycling any of the TIDs on the page.
*/
Assert(BTScanPosIsPinned(so->currPos));
+ /* Re-acquire the lock only if the scan dropped it (so->dropLock). */
buf = so->currPos.buf;
- _bt_lockbuf(rel, buf, BT_READ);
+ if (so->dropLock)
+ _bt_lockbuf(rel, buf, BT_READ);
}
else
{
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index da0cbf41d6f..c2f5aa2ba5c 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -205,12 +205,11 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Start an index scan. */
scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ index_rescan(scan, skey, skey_attoff, NULL, 0);
retry:
found = false;
- index_rescan(scan, skey, skey_attoff, NULL, 0);
-
/* Try to find the tuple */
while (index_getnext_slot(scan, ForwardScanDirection, outslot))
{
@@ -238,6 +237,8 @@ retry:
*/
if (TransactionIdIsValid(xwait))
{
+ /* Rescan before waiting, to ensure all index page locks are released. */
+ index_rescan(scan, skey, skey_attoff, NULL, 0);
XactLockTableWait(xwait, NULL, NULL, XLTW_None);
goto retry;
}
@@ -266,7 +267,10 @@ retry:
PopActiveSnapshot();
if (should_refetch_tuple(res, &tmfd))
+ {
+ index_rescan(scan, skey, skey_attoff, NULL, 0);
goto retry;
+ }
}
index_endscan(scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ab467cb8fd..9c10931c8e2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1069,6 +1069,7 @@ typedef struct BTScanOpaqueData
/* info about killed items if any (killedItems is NULL if never used) */
int *killedItems; /* currPos.items indexes of killed items */
int numKilled; /* number of currently stored items */
+ bool dropLock; /* drop lock before btgettuple returns? */
bool dropPin; /* drop leaf pin before btgettuple returns? */
/*
--
2.43.0
v10-0001-This-patch-introduces-new-injection-points-and-T.patchapplication/octet-stream; name=v10-0001-This-patch-introduces-new-injection-points-and-T.patchDownload
From dacda92357f397354a63aa5418f9bae802af06d3 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v10 1/2] This patch introduces new injection points and TAP
tests to reproduce and verify conflict detection issues that arise during
SNAPSHOT_DIRTY index scans in logical replication and
check_exclusion_or_unique_constraint.
---
src/backend/access/index/indexam.c | 8 +
src/backend/executor/execIndexing.c | 3 +
src/test/modules/injection_points/Makefile | 2 +-
.../expected/dirty_index_scan.out | 27 ++++
src/test/modules/injection_points/meson.build | 1 +
.../specs/dirty_index_scan.spec | 37 +++++
src/test/subscription/Makefile | 1 +
src/test/subscription/meson.build | 8 +-
.../subscription/t/036_delete_missing_race.pl | 137 +++++++++++++++++
.../subscription/t/037_update_missing_race.pl | 139 +++++++++++++++++
.../t/038_update_missing_with_retain.pl | 141 ++++++++++++++++++
11 files changed, 502 insertions(+), 2 deletions(-)
create mode 100644 src/test/modules/injection_points/expected/dirty_index_scan.out
create mode 100644 src/test/modules/injection_points/specs/dirty_index_scan.spec
create mode 100644 src/test/subscription/t/036_delete_missing_race.pl
create mode 100644 src/test/subscription/t/037_update_missing_race.pl
create mode 100644 src/test/subscription/t/038_update_missing_with_retain.pl
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 1a4f36fe0a9..2e65750979e 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -57,6 +57,7 @@
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -741,6 +742,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (scan->xs_snapshot->snapshot_type == SNAPSHOT_DIRTY)
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..c07ba230946 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -943,6 +944,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..15f5e6d23d0 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,7 @@ PGFILEDESC = "injection_points - facility for injection points"
REGRESS = injection_points hashagg reindex_conc vacuum
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned dirty_index_scan
TAP_TESTS = 1
diff --git a/src/test/modules/injection_points/expected/dirty_index_scan.out b/src/test/modules/injection_points/expected/dirty_index_scan.out
new file mode 100644
index 00000000000..82d46397d61
--- /dev/null
+++ b/src/test/modules/injection_points/expected/dirty_index_scan.out
@@ -0,0 +1,27 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_s1 s2_s1 s3_s1
+injection_points_attach
+-----------------------
+
+(1 row)
+
+step s1_s1: INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; <waiting ...>
+step s2_s1: UPDATE test.tbl SET n = n + 1 WHERE i = 42; <waiting ...>
+step s3_s1:
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ <waiting ...>
+step s1_s1: <... completed>
+step s2_s1: <... completed>
+step s3_s1: <... completed>
+injection_points_detach
+-----------------------
+
+(1 row)
+
+injection_points_wakeup
+-----------------------
+
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 20390d6b4bf..a126fe20c2d 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -48,6 +48,7 @@ tests += {
'basic',
'inplace',
'syscache-update-pruned',
+ 'dirty_index_scan',
],
'runningcheck': false, # see syscache-update-pruned
},
diff --git a/src/test/modules/injection_points/specs/dirty_index_scan.spec b/src/test/modules/injection_points/specs/dirty_index_scan.spec
new file mode 100644
index 00000000000..91d20ab4612
--- /dev/null
+++ b/src/test/modules/injection_points/specs/dirty_index_scan.spec
@@ -0,0 +1,37 @@
+setup
+{
+ CREATE EXTENSION injection_points;
+ CREATE SCHEMA test;
+ CREATE UNLOGGED TABLE test.tbl(i int primary key, n int);
+ CREATE INDEX tbl_n_idx ON test.tbl(n);
+ INSERT INTO test.tbl VALUES(42,1);
+}
+
+teardown
+{
+ DROP SCHEMA test CASCADE;
+ DROP EXTENSION injection_points;
+}
+
+session s1
+setup {
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'error');
+ SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait');
+}
+
+step s1_s1 { INSERT INTO test.tbl VALUES(42, 1) on conflict(i) do update set n = EXCLUDED.n + 1; }
+
+session s2
+step s2_s1 { UPDATE test.tbl SET n = n + 1 WHERE i = 42; }
+
+session s3
+step s3_s1 {
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+}
+
+permutation
+ s1_s1
+ s2_s1(*)
+ s3_s1(s1_s1)
\ No newline at end of file
diff --git a/src/test/subscription/Makefile b/src/test/subscription/Makefile
index 50b65d8f6ea..51d28eca091 100644
--- a/src/test/subscription/Makefile
+++ b/src/test/subscription/Makefile
@@ -16,6 +16,7 @@ include $(top_builddir)/src/Makefile.global
EXTRA_INSTALL = contrib/hstore
export with_icu
+export enable_injection_points
check:
$(prove_check)
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 586ffba434e..49f52db4dd1 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -5,7 +5,10 @@ tests += {
'sd': meson.current_source_dir(),
'bd': meson.current_build_dir(),
'tap': {
- 'env': {'with_icu': icu.found() ? 'yes' : 'no'},
+ 'env': {
+ 'with_icu': icu.found() ? 'yes' : 'no',
+ 'enable_injection_points': get_option('injection_points') ? 'yes' : 'no'
+ },
'tests': [
't/001_rep_changes.pl',
't/002_types.pl',
@@ -42,6 +45,9 @@ tests += {
't/033_run_as_table_owner.pl',
't/034_temporal.pl',
't/035_conflicts.pl',
+ 't/036_delete_missing_race.pl',
+ 't/037_update_missing_race.pl',
+ 't/038_update_missing_with_retain.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/036_delete_missing_race.pl b/src/test/subscription/t/036_delete_missing_race.pl
new file mode 100644
index 00000000000..82e16af9be3
--- /dev/null
+++ b/src/test/subscription/t/036_delete_missing_race.pl
@@ -0,0 +1,137 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);
+ CREATE INDEX data_index ON conf_tab(data);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Delete tuple on publisher
+$node_publisher->safe_psql('postgres', "DELETE FROM conf_tab WHERE a=1;");
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker',
+ 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update tuple on subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# Tuple was updated, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# But the tuple should be deleted on the subscriber anyway
+is($node_subscriber->safe_psql('postgres', 'SELECT count(*) from conf_tab'), 0, 'record deleted on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation "public.conf_tab": conflict=delete_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/037_update_missing_race.pl b/src/test/subscription/t/037_update_missing_race.pl
new file mode 100644
index 00000000000..e29ad771d8e
--- /dev/null
+++ b/src/test/subscription/t/037_update_missing_race.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated - so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new column value must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/038_update_missing_with_retain.pl b/src/test/subscription/t/038_update_missing_with_retain.pl
new file mode 100644
index 00000000000..13769aa1c11
--- /dev/null
+++ b/src/test/subscription/t/038_update_missing_with_retain.pl
@@ -0,0 +1,141 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set it to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->append_conf('postgresql.conf',
+ qq(wal_level = 'replica'));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub WITH (retain_dead_tuples = true)");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start the search for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated - so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new column value must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_deleted/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
--
2.43.0
On Wed, Mar 12, 2025 at 6:36 AM Mihail Nikalayeu
<michail.nikolaev@gmail.com> wrote:
Hello, everyone and Peter!
Peter, I have added you because you may be interested in (or already know about) this btree-related issue.
Short description of the problem:
I noticed a concurrency issue in btree index scans that affects SnapshotDirty and SnapshotSelf scan types.
When using these non-MVCC snapshot types, a scan can miss tuples if a
concurrent transaction deletes existing tuples and inserts new ones with
different TIDs on the same page. The problem occurs because:
1. The scan reads a page and caches its tuples in backend-local storage
2. A concurrent transaction deletes a tuple and inserts a new one with a different TID
3. The scan skips the old tuple because it was already deleted by a committed transaction and does not pass the visibility check
4. The new version on the page is also missed, because it is not among the cached tuples
IIUC, the problem you are worried about can happen with DELETE+INSERT
in the same transaction on the subscriber, right? If so, this should
happen with DELETE and INSERT in a separate transaction as well. If
that happens, then we may not be able to detect such an INSERT anyway
if it happens on a page earlier than the current page.
BTW, since the update (or DELETE+INSERT) happens at a later time than the
publisher's update/delete, once we have the last_write_win
resolution strategy implemented, it is the subscriber operation that
will win. So, the current behavior shouldn't cause any problem.
--
With Regards,
Amit Kapila.
Hello, Amit,
IIUC, the problem you are worried about can happen with DELETE+INSERT
It seems there was some misunderstanding due to my bad explanation and wording.
I wrote "A concurrent transaction deletes a tuple and inserts a new
one with a different TID" - but I mean logical UPDATE causing new TID
in index page appear because HOT was applied...
Lets try again, I hope that explanation is better:
At the start, we have a table with a primary key and one extra index
(to disable HOT), and a tuple with i=13:
CREATE TABLE tbl (i int PRIMARY KEY, data text);
CREATE INDEX no_more_hot_data_index ON tbl (data);
INSERT INTO tbl (i, data) VALUES (13, 'data');
A btree scan using SnapshotDirty can miss tuples because of internal
locking logic. Here’s how the bug shows up:
1) we have a tuple in the index (i=13), committed long ago
2) transaction A starts an index search for that tuple using
SnapshotDirty (WHERE i = 13)
3) in parallel, transaction B updates that tuple (SET data='updated'
WHERE i=13) and commits (creating a new index entry because HOT is not
applied)
4) the scan from step 2 returns nothing at all - as if the tuple never existed
In other words, if you start a SnapshotDirty btree scan for i=13 and
update that row i=13 at the same physical moment, the scan may:
* return the TID of the pre‑update version - correct behavior
* return the TID of the post‑update version - also correct
* return nothing - this is the broken case
More broadly: any SnapshotDirty scan may completely miss existing data
when there are concurrent updates.
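To make the sequence concrete, here is a minimal sketch of the failing
code path, assuming the backend context of RelationFindReplTupleByIndex
in src/backend/executor/execReplication.c (the function and macro names
are the existing ones; the interleaving described in the comments is the
part being illustrated):

    /* Minimal sketch, not standalone code: assumes the surrounding
     * backend context (rel, idxrel, skey, skey_attoff, scanslot). */
    SnapshotData snap;
    IndexScanDesc scan;
    bool found = false;

    InitDirtySnapshot(snap);   /* non-MVCC: sees committed and in-progress rows */
    scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
    index_rescan(scan, skey, skey_attoff, NULL, 0);

    /*
     * The btree AM reads the leaf page once and caches all matching items
     * in backend-local memory (see _bt_readpage).  If transaction B
     * commits its UPDATE at this point, the old TID now fails the
     * visibility check (committed xmax) and the new TID sits on the same
     * page but is not among the cached items.
     */
    while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
    {
        found = true;          /* never reached in the broken interleaving */
        break;
    }
    /* found == false: the scan reports that i=13 does not exist */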
SnapshotDirty usage in Postgres is limited, so the impact isn’t huge,
but every case I found is reproducible with the tests from the first
commit of v10 in my previous email.
* check_exclusion_or_unique_constraint: only a minor performance
impact, handled by retry logic
* logical replication TAP tests: multiple scenarios fail because
RelationFindReplTupleByIndex cannot find existing committed tuples
These scenarios look like:
1) logical replication tries to apply a change for tuple X received
from the publisher
2) meanwhile, the subscriber updates the same tuple X and commits in a
parallel transaction
3) due to the bug, RelationFindReplTupleByIndex concludes that tuple X
does not exist at all, leading to bad outcomes, including:
 * incorrect conflict‑type messages (and, in the future,
potentially wrong conflict‑resolution choices)
 * lost updates (see scenario 2 from [0])
If you look at the tests and play with the $simulate_race_condition
flag, you can see the behavior directly. The second commit (a possible
fix) in v10 also includes documentation updates that try to explain
the issue in a more appropriate context.
I’m happy to provide additional reproducers or explanations if that would help.
[0]: /messages/by-id/CADzfLwWC49oanFSGPTf=6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ@mail.gmail.com
Best regards,
Mikhail.
Oh,
TID to appear in the index page because HOT was applied...
I meant "HOT was NOT applied" - sorry for the inconvenience.
Amit, a few more explanations related to your message.
IIUC, the problem you are worried about can happen with DELETE+INSERT
in the same transaction on the subscriber, right?
Technically, yes - this can occur during a single UPDATE, as well as a
DELETE followed by an INSERT of the same key within the same
transaction (which is effectively equivalent to an UPDATE). However,
it should NOT occur, because at no point in the timeline does a row
with that key fail to exist; therefore, no scan should return “there
is no such row in the index.”
If so, this should
happen with DELETE and INSERT in a separate transaction as well.
Yes, it may happen - and in that case, it is correct. This is because
there is a moment between the DELETE and the INSERT when the row does
not exist. Therefore, it is acceptable for a scan to check the index
at that particular moment and find nothing.
BTW, since the update (or DELETE+INSERT) happens at a later time than the
publisher's update/delete, once we have the last_write_win
resolution strategy implemented, it is the subscriber operation that
will win. So, the current behavior shouldn't cause any problem.
For the last_write_win and UPDATE vs UPDATE case - yes, probably, but
only by luck.
However, there are many scenarios that cannot be implemented
correctly, for example:
* DELETE always wins
* UPDATE with a higher version (column value) wins
* first_write_win
* etc.
Also, the cases from [0] are clearly wrong without any conflict
resolution. In particular, case 2 - there are no real conflicts at all
(since different sets of columns are involved), but an incorrect
result may still be produced.
[0]: /messages/by-id/CADzfLwWC49oanFSGPTf=6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ@mail.gmail.com
On Fri, Aug 22, 2025 at 9:12 PM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
BTW, since the update (or DELETE+INSERT) happens at a later time than the
publisher's update/delete, once we have the last_write_win
resolution strategy implemented, it is the subscriber operation that
will win. So, the current behavior shouldn't cause any problem.
For the last_write_win and UPDATE vs UPDATE case - yes, probably, but
only by luck.
Why only by luck?
However, there are many scenarios that cannot be implemented
correctly, for example:
* DELETE always wins
* UPDATE with a higher version (column value) wins
* first_write_win
* etc.
Then these may not lead to eventual consistency for such cases. So,
I am not sure one should rely on these anyway.
Also, the cases from [0] are clearly wrong without any conflict
resolution. In particular, case 2 - there are no real conflicts at all
(since different sets of columns are involved), but an incorrect
result may still be produced.
I think this questions whether we consider the SnapshotDirty results
correct or not. The case of logical replication giving wrong results
[0] is the concerning one. Now, I
would like to know the opinion of others who were involved in the
initial commit, so added Peter E. to see what he thinks of the same.
If we don't get an opinion here (say people missed reading it because
of the unrelated title), then I suggest you start a separate email
thread to discuss just that case and see what others think.
[0]: /messages/by-id/CADzfLwWC49oanFSGPTf=6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ@mail.gmail.com
--
With Regards,
Amit Kapila.
On Fri, Aug 22, 2025 at 9:12 PM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
Amit, a few more explanations related to your message.
IIUC, the problem you are worried about can happen with DELETE+INSERT
in the same transaction on the subscriber, right?
Technically, yes - this can occur during a single UPDATE, as well as a
DELETE followed by an INSERT of the same key within the same
transaction (which is effectively equivalent to an UPDATE).
BTW, then isn't it possible that INSERT happens on a different page?
--
With Regards,
Amit Kapila.
Hello!
Why only by luck?
I mean last_write_win provides the same results in the following cases:
* we found the tuple, detected a conflict, and decided to ignore the
update coming from the publisher
* we were unable to find the tuple, logged an error about it, and
ignored the update coming from the publisher
In both cases, the result is the same: the subscriber version remains
in the table.
Then these may not lead to eventual consistency for such cases. So,
I am not sure one should rely on these anyway.
But with the fixed snapshot dirty scan, it becomes possible to
implement such strategies.
Also, some strategies require some kind of merge function for tuples.
In my understanding, even last_write_win should probably compare
timestamps to determine which version is "newer" because time in
distributed systems can be tricky.
Therefore, we have to find the tuple if it exists.
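For illustration, a last_write_win resolver might be shaped like the
sketch below; nothing like this exists in core yet, and the function
name and enum are invented for this example (TimestampTz is the existing
type from utils/timestamp.h). The point is only that the resolver needs
the local tuple and its commit timestamp, which it cannot obtain if the
scan fails to find the tuple:

    /* Hypothetical resolver sketch; all names invented for illustration. */
    typedef enum ConflictResolution
    {
        RESOLUTION_KEEP_LOCAL,
        RESOLUTION_APPLY_REMOTE
    } ConflictResolution;

    static ConflictResolution
    resolve_update_last_write_wins(TimestampTz local_commit_ts,
                                   TimestampTz remote_commit_ts)
    {
        /*
         * This comparison is only possible if the local tuple was found;
         * otherwise there is no local commit timestamp, and the "tuple
         * not found" path silently behaves as if the local side had won.
         */
        if (remote_commit_ts > local_commit_ts)
            return RESOLUTION_APPLY_REMOTE;
        return RESOLUTION_KEEP_LOCAL;
    }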
BTW, then isn't it possible that INSERT happens on a different page?
Yes, it is possible - in that case, the bug does not occur. It only
happens if a new TID of some logical tuple is added to the same page.
Just to clarify, this is about B-tree pages, not the heap.
I think this questions whether we consider the SnapshotDirty results
correct or not.
In my understanding, this is clearly wrong:
* such behavior is not documented anywhere
* usage patterns assume that such things cannot happen
* new features struggle with it. For example, the new update_deleted
logging may fail to behave correctly
(038_update_missing_with_retain.pl in the patch) - so how should it be
used? It might be correct, but it also might not be...
Another option is to document the behavior and rename it to SnapshotMaybe :)
By the way, SnapshotSelf is also affected.
The case of logical replication giving wrong results
[0] is the behavior from the beginning of logical replication.
Logical replication was mainly focused on replication without any
concurrent updates on the subscriber side. So, I think this is why the
issue was overlooked.
Best regards,
Mikhail.
On Mon, Aug 25, 2025 at 4:19 PM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
Why only by luck?
I mean last_write_win provides the same results in the following cases:
* we found the tuple, detected a conflict, and decided to ignore the
update coming from the publisher
* we were unable to find the tuple, logged an error about it, and
ignored the update coming from the publisher
In both cases, the result is the same: the subscriber version remains
in the table.
Right, so we can say that it will be consistent.
Then these may not lead to eventual consistency for such cases. So,
I am not sure one should rely on these anyway.
But with the fixed snapshot dirty scan, it becomes possible to
implement such strategies.
Also, some strategies require some kind of merge function for tuples.
In my understanding, even last_write_win should probably compare
timestamps to determine which version is "newer" because time in
distributed systems can be tricky.
Therefore, we have to find the tuple if it exists.
BTW, then isn't it possible that INSERT happens on a different page?
Yes, it is possible - in that case, the bug does not occur. It only
happens if a new TID of some logical tuple is added to the same page.
What if the new insert happens on a page prior to the current page? I
mean that the scan won't encounter the page where the INSERT happens.
Just to clarify, this is about B-tree pages, not the heap.
I think this questions whether we consider the SnapshotDirty results
correct or not.
In my understanding, this is clearly wrong:
* such behavior is not documented anywhere
I agree. This is where we need inputs.
* usage patterns assume that such things cannot happen
* new features struggle with it. For example, the new update_deleted
logging may fail to behave correctly
(038_update_missing_with_retain.pl in the patch) - so how should it be
used? It might be correct, but it also might not be...
Another option is to document the behavior and rename it to SnapshotMaybe :)
By the way, SnapshotSelf is also affected.
BTW, do we know the reason behind using SnapshotDirty in the first
place? I don't see any comments in the nearby code unless I am missing
something.
The case of logical replication giving wrong results
[0] is the behavior from the beginning of logical replication.
Logical replication was mainly focused on replication without any
concurrent updates on the subscriber side. So, I think this is why the
issue was overlooked.
The other possibility is that, as this is a rare scenario, we didn't
consider it.
--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com>:
What if the new insert happens in a page prior to the current page? I
mean that the scan won't encounter the page where Insert happens.
Hmm... Yes - if the TID lands on a page to the left of the current
position, we’ll miss it as well.
A lock‑based solution (the version in v10) would require keeping all
pages with the same key under a read lock, which feels too expensive.
BTW, do we know the reason behind using SnapshotDirty in the first
place? I don't see any comments in the nearby code unless I am missing
something.
I think this is simply an attempt to lock the newest version of the
logical tuple, including INSERT cases.
For an existing tuple, the same can be achieved using an MVCC snapshot + retry.
However, in the case of a not-yet-committed INSERT, a different type
of snapshot is required.
But I'm not sure if it provides any advantages.
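For reference, this is the dance the dirty snapshot currently enables,
condensed from the existing RelationFindReplTupleByIndex (the code later
removed by the 0002 patch in this thread): the snapshot struct doubles
as an output argument reporting any in-progress xact touching the tuple,
which the caller then waits on before retrying:

    /* Condensed from the pre-patch code; not standalone. */
    InitDirtySnapshot(snap);
    /* ... index scan using &snap; when a match is found, snap.xmin and
     * snap.xmax report an in-progress inserter/deleter of that tuple ... */
    xwait = TransactionIdIsValid(snap.xmin) ? snap.xmin : snap.xmax;
    if (TransactionIdIsValid(xwait))
    {
        /* tuple is being changed by an in-progress xact: wait and retry */
        XactLockTableWait(xwait, NULL, NULL, XLTW_None);
        goto retry;
    }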
On Mon, Aug 25, 2025 at 7:02 PM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
Amit Kapila <amit.kapila16@gmail.com>:
What if the new insert happens on a page prior to the current page? I
mean that the scan won't encounter the page where the INSERT happens.
Hmm... Yes - if the TID lands on a page to the left of the current
position, we’ll miss it as well.
A lock‑based solution (the version in v10) would require keeping all
pages with the same key under a read lock, which feels too expensive.
Right.
BTW, do we know the reason behind using SnapshotDirty in the first
place? I don't see any comments in the nearby code unless I am missing
something.
I think this is simply an attempt to lock the newest version of the
logical tuple, including INSERT cases.
For an existing tuple, the same can be achieved using an MVCC snapshot + retry.
However, in the case of a not-yet-committed INSERT, a different type
of snapshot is required.
But I'm not sure if it provides any advantages.
I think it is better to document this race somewhere in the logical
replication documentation for now, unless we have a consensus on a way
to move forward.
--
With Regards,
Amit Kapila.
Hello, Amit!
Amit Kapila <amit.kapila16@gmail.com>:
Now, I
would like to know the opinion of others who were involved in the
initial commit, so added Peter E. to see what he thinks of the same.
Seems like you added another Peter in [0] - I added Peter Eisentraut :)
Hmm... Yes - if the TID lands on a page to the left of the current
position, we’ll miss it as well.
A lock‑based solution (the version in v10) would require keeping all
pages with the same key under a read lock, which feels too expensive.
Right.
I think it is possible to achieve the same guarantees and logic using
GetLatestSnapshot + HeapTupleSatisfiesDirty, but without the "tuple
not found" case - I'll try to experiment with it.
GetLatestSnapshot is called before the tuple lock anyway.
I think it is better to document this race somewhere in the logical
replication documentation for now, unless we have a consensus on a way
to move forward.
Yes, it is an option, but such documentation is going to be strange:
* there is a delete_missing conflict type in the stats/logs, but be
aware it may be wrong (actually it is delete_origin_differs)
* the same for update_missing vs update_origin_differs
* the same for update_deleted vs update_origin_differs
* also, a DELETE or UPDATE from the publisher may be missed in case of
an update on the subscriber, even if the update touches subscriber-only
columns
It looks like "if something is updating on the subscriber - no
guarantees". And the worst thing is that this is the actual state.
[0]: /messages/by-id/CAA4eK1LZxzORgAoDhix9MWrOqYOsNZuZLW2sTfGsJFM99yRgrg@mail.gmail.com
Best regards,
Mikhail.
Hello, Amit!
Now, I
would like to know the opinion of others who were involved in the
initial commit, so added Peter E. to see what he thinks of the same.
Peter answered in [0]:
I don’t remember. I was just the committer.
I’ve attached a new version of the proposed solution.
The first commit includes tests, some README updates, and an
additional pgbench test that reproduces the issue without explicitly
simulating the wait/resume race. This last test is heavy and isn't
intended to be committed.
Instead of adding extra locking in btree, a more lightweight approach
is used: since we already call GetLatestSnapshot before
table_tuple_lock, we can simply call it before each scan attempt and
use that snapshot for the scan.
As a result:
* MVCC scan will not miss updated tuples, while DirtyScan may
* in both cases, table_tuple_lock will wait for the updating
transaction to commit before retrying
* MVCC scan cannot see not-yet-committed new rows, while DirtyScan
can. However, this does not provide any stronger guarantee: in the
case of INSERT vs INSERT, two parallel inserts are still possible.
DirtyScan only slightly reduces the probability, but if the scan does
not find the row, there is still no guarantee that it won’t be
inserted immediately afterward.
Therefore, the MVCC version appears to provide the same guarantees,
without missing tuples, and with the same performance.
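Condensed, the retry loop in the attached 0002 patch takes the following
shape (trimmed to the snapshot handling; see the full diff below):

    /* Condensed from the patched RelationFindReplTupleByIndex; trimmed
     * to the snapshot handling, not standalone code. */
    scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);

    retry:
    found = false;
    PushActiveSnapshot(GetLatestSnapshot()); /* fresh MVCC snapshot per attempt */
    scan->xs_snapshot = GetActiveSnapshot();
    index_rescan(scan, skey, skey_attoff, NULL, 0);

    /* ... scan; an MVCC snapshot cannot skip a committed tuple version ... */

    res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
                           outslot, GetCurrentCommandId(false),
                           lockmode, LockWaitBlock,
                           0 /* don't follow updates */ , &tmfd);
    if (should_refetch_tuple(res, &tmfd))
    {
        PopActiveSnapshot();
        goto retry;         /* concurrent change: retry with a newer snapshot */
    }

    index_endscan(scan);
    PopActiveSnapshot();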
Best regards,
Mikhail.
[0]: https://discord.com/channels/1258108670710124574/1407753138991009913/1411303541900841090
Attachments:
v11-0002-Fix-logical-replication-conflict-detection-durin.patch (application/x-patch)
From 60dca743bf755b068ccdff4cb2f35467167f592a Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Wed, 3 Sep 2025 19:08:55 +0200
Subject: [PATCH v11 2/2] Fix logical replication conflict detection during
tuple lookup
SNAPSHOT_DIRTY scans could miss conflict detection with concurrent transactions during logical replication.
Replace the SNAPSHOT_DIRTY scan with a fresh MVCC snapshot obtained via GetLatestSnapshot in RelationFindReplTupleByIndex and RelationFindReplTupleSeq.
---
src/backend/executor/execReplication.c | 63 ++++++++------------------
1 file changed, 18 insertions(+), 45 deletions(-)
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index b409d4ecbf5..0de40aec733 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -186,8 +186,6 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
ScanKeyData skey[INDEX_MAX_KEYS];
int skey_attoff;
IndexScanDesc scan;
- SnapshotData snap;
- TransactionId xwait;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -198,17 +196,17 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
-
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ /* Start an index scan. SnapshotAny will be replaced below. */
+ scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
retry:
found = false;
-
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Update the actual scan snapshot each retry */
+ scan->xs_snapshot = GetActiveSnapshot();
index_rescan(scan, skey, skey_attoff, NULL, 0);
/* Try to find the tuple */
@@ -229,19 +227,6 @@ retry:
ExecMaterializeSlot(outslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
found = true;
break;
@@ -253,8 +238,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -263,13 +246,15 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
-
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
index_endscan(scan);
+ PopActiveSnapshot();
/* Don't release lock until commit. */
index_close(idxrel, NoLock);
@@ -370,9 +355,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
{
TupleTableSlot *scanslot;
TableScanDesc scan;
- SnapshotData snap;
TypeCacheEntry **eq;
- TransactionId xwait;
bool found;
TupleDesc desc PG_USED_FOR_ASSERTS_ONLY = RelationGetDescr(rel);
@@ -380,13 +363,15 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
- /* Start a heap scan. */
- InitDirtySnapshot(snap);
- scan = table_beginscan(rel, &snap, 0, NULL);
+ /* Start a heap scan. SnapshotAny will be replaced below. */
+ scan = table_beginscan(rel, SnapshotAny, 0, NULL);
scanslot = table_slot_create(rel, NULL);
retry:
found = false;
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Update the actual scan snapshot each retry */
+ scan->rs_snapshot = GetActiveSnapshot();
table_rescan(scan, NULL);
@@ -399,19 +384,6 @@ retry:
found = true;
ExecCopySlot(outslot, scanslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
break;
}
@@ -422,8 +394,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -432,13 +402,16 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
table_endscan(scan);
+ PopActiveSnapshot();
ExecDropSingleTupleTableSlot(scanslot);
return found;
--
2.48.1
v11-0001-This-patch-introduces-new-injection-points-and-T.patch (application/x-patch)
From e73046a0da7213332a3701123e040de1fd3f2f54 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v11 1/2] This patch introduces new injection points and TAP
tests to reproduce and verify conflict detection issues that arise during
SNAPSHOT_DIRTY index scans in logical replication.
---
src/backend/access/index/indexam.c | 9 ++
src/backend/access/nbtree/README | 9 ++
src/backend/executor/execIndexing.c | 7 +-
src/backend/replication/logical/worker.c | 4 +
src/include/utils/snapshot.h | 14 ++
src/test/subscription/Makefile | 1 +
src/test/subscription/meson.build | 9 +-
.../subscription/t/036_delete_missing_race.pl | 137 +++++++++++++++++
.../subscription/t/037_update_missing_race.pl | 139 +++++++++++++++++
.../t/038_update_missing_with_retain.pl | 141 ++++++++++++++++++
.../t/039_update_missing_simulation.pl | 123 +++++++++++++++
11 files changed, 591 insertions(+), 2 deletions(-)
create mode 100644 src/test/subscription/t/036_delete_missing_race.pl
create mode 100644 src/test/subscription/t/037_update_missing_race.pl
create mode 100644 src/test/subscription/t/038_update_missing_with_retain.pl
create mode 100644 src/test/subscription/t/039_update_missing_simulation.pl
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 86d11f4ec79..a503fa02ac5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -52,11 +52,13 @@
#include "catalog/pg_type.h"
#include "nodes/execnodes.h"
#include "pgstat.h"
+#include "replication/logicalworker.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -751,6 +753,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelation(scan->heapRelation) && IsLogicalWorker())
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..634a3d10bb1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -103,6 +103,15 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite the mechanics described above, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This issue can occur if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID on the
+same page, or to the left/right (depending on scan direction) of the current scan
+position. If the scan has already visited the page and cached its contents in
+backend-local storage, it might skip the old tuple due to the deletion and miss the
+new tuple because the scan does not re-read the page. Note that this affects not
+only btree scans but also heap scans.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..61a5097f789 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -780,7 +781,9 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
* tuples that aren't visible yet.
- */
+ * A dirty snapshot may miss some tuples in the case of parallel updates,
+ * but that is acceptable here.
+ */
InitDirtySnapshot(DirtySnapshot);
for (i = 0; i < indnkeyatts; i++)
@@ -943,6 +946,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 22ad9051db3..bb3aaf21d65 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -270,6 +270,7 @@
#include "utils/acl.h"
#include "utils/dynahash.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -2932,7 +2933,10 @@ apply_handle_update_internal(ApplyExecutionData *edata,
conflicttuple.origin != replorigin_session_origin)
type = CT_UPDATE_DELETED;
else
+ {
+ INJECTION_POINT("apply_handle_update_internal_update_missing", NULL);
type = CT_UPDATE_MISSING;
+ }
/* Store the new tuple for conflict reporting */
slot_store_data(newslot, relmapentry, newtup);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..189dfd71103 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -53,6 +53,13 @@ typedef enum SnapshotType
* - previous commands of this transaction
* - changes made by the current command
*
+ * Note: such a snapshot may miss an existing logical tuple in the case
+ * of a parallel update.
+ * If a new version of a tuple is inserted into an already-processed page
+ * while the old one is marked with a committed xmax, the snapshot will
+ * skip the old version and never encounter the new one during that scan,
+ * missing the tuple entirely.
+ *
* Does _not_ include:
* - in-progress transactions (as of the current instant)
* -------------------------------------------------------------------------
@@ -82,6 +89,13 @@ typedef enum SnapshotType
* transaction and committed/aborted xacts are concerned. However, it
* also includes the effects of other xacts still in progress.
*
+ * Note: such a snapshot may miss an existing logical tuple in the case
+ * of a parallel update.
+ * If a new version of a tuple is inserted into an already-processed page
+ * while the old one is marked with a committed/in-progress xmax, the
+ * snapshot will skip the old version and never encounter the new one
+ * during that scan, missing the tuple entirely.
+ *
* A special hack is that when a snapshot of this type is used to
* determine tuple visibility, the passed-in snapshot struct is used as an
* output argument to return the xids of concurrent xacts that affected
diff --git a/src/test/subscription/Makefile b/src/test/subscription/Makefile
index 50b65d8f6ea..51d28eca091 100644
--- a/src/test/subscription/Makefile
+++ b/src/test/subscription/Makefile
@@ -16,6 +16,7 @@ include $(top_builddir)/src/Makefile.global
EXTRA_INSTALL = contrib/hstore
export with_icu
+export enable_injection_points
check:
$(prove_check)
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 586ffba434e..8b24d76a247 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -5,7 +5,10 @@ tests += {
'sd': meson.current_source_dir(),
'bd': meson.current_build_dir(),
'tap': {
- 'env': {'with_icu': icu.found() ? 'yes' : 'no'},
+ 'env': {
+ 'with_icu': icu.found() ? 'yes' : 'no',
+ 'enable_injection_points': get_option('injection_points') ? 'yes' : 'no'
+ },
'tests': [
't/001_rep_changes.pl',
't/002_types.pl',
@@ -42,6 +45,10 @@ tests += {
't/033_run_as_table_owner.pl',
't/034_temporal.pl',
't/035_conflicts.pl',
+ 't/036_delete_missing_race.pl',
+ 't/037_update_missing_race.pl',
+ 't/038_update_missing_with_retain.pl',
+ 't/039_update_missing_simulation.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/036_delete_missing_race.pl b/src/test/subscription/t/036_delete_missing_race.pl
new file mode 100644
index 00000000000..a319513fd60
--- /dev/null
+++ b/src/test/subscription/t/036_delete_missing_race.pl
@@ -0,0 +1,137 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set it to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);
+ CREATE INDEX data_index ON conf_tab(data);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Delete tuple on publisher
+$node_publisher->safe_psql('postgres', "DELETE FROM conf_tab WHERE a=1;");
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start the search for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker',
+ 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update tuple on subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated - so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# But the tuple should be deleted on the subscriber anyway
+is($node_subscriber->safe_psql('postgres', 'SELECT count(*) from conf_tab'), 0, 'record deleted on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/037_update_missing_race.pl b/src/test/subscription/t/037_update_missing_race.pl
new file mode 100644
index 00000000000..b71fdc0c136
--- /dev/null
+++ b/src/test/subscription/t/037_update_missing_race.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set it to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start the search for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated - so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new column value must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/038_update_missing_with_retain.pl b/src/test/subscription/t/038_update_missing_with_retain.pl
new file mode 100644
index 00000000000..6f7dfd28d37
--- /dev/null
+++ b/src/test/subscription/t/038_update_missing_with_retain.pl
@@ -0,0 +1,141 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set it to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->append_conf('postgresql.conf',
+ qq(wal_level = 'replica'));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub WITH (retain_dead_tuples = true)");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start the search for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated - so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new column value must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_deleted/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/039_update_missing_simulation.pl b/src/test/subscription/t/039_update_missing_simulation.pl
new file mode 100644
index 00000000000..322e931c171
--- /dev/null
+++ b/src/test/subscription/t/039_update_missing_simulation.pl
@@ -0,0 +1,123 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use IPC::Run qw(start finish);
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int, data_sub int default 0);
+ CREATE INDEX data_index ON tbl(data_pub);");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE tbl");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+my $num_rows = 10;
+my $num_updates = 10000;
+my $num_clients = 10;
+$node_publisher->safe_psql('postgres', "INSERT INTO tbl SELECT i, i * i FROM generate_series(1,$num_rows) i");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+# Prepare small pgbench scripts as files
+my $sub_sql = $node_subscriber->basedir . '/sub_update.sql';
+my $pub_sql = $node_publisher->basedir . '/pub_delete.sql';
+
+open my $fh1, '>', $sub_sql or die $!;
+print $fh1 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_sub = data_sub + 1 WHERE a = :num;\n";
+close $fh1;
+
+open my $fh2, '>', $pub_sql or die $!;
+print $fh2 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_pub = data_pub + 1 WHERE a = :num;\n";
+close $fh2;
+
+my @sub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_subscriber->port, '-h', $node_subscriber->host, '-f', $sub_sql, 'postgres'
+);
+
+my @pub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_publisher->port, '-h', $node_publisher->host, '-f', $pub_sql, 'postgres'
+);
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# This injection point should never be reached
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('apply_handle_update_internal_update_missing', 'error')");
+my $log_offset = -s $node_subscriber->logfile;
+
+# Start both concurrently
+my ($sub_out, $sub_err, $pub_out, $pub_err) = ('', '', '', '');
+my $sub_h = start \@sub_cmd, '>', \$sub_out, '2>', \$sub_err;
+my $pub_h = start \@pub_cmd, '>', \$pub_out, '2>', \$pub_err;
+
+# Wait for completion
+finish $sub_h;
+finish $pub_h;
+
+like($sub_out, qr/actually processed/, 'subscriber pgbench completed');
+like($pub_out, qr/actually processed/, 'publisher pgbench completed');
+
+# Let subscription catch up, then check expectations
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
+
+ok(!$node_subscriber->log_contains(
+ qr/ERROR: error triggered for injection point apply_handle_update_internal_update_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+done_testing();
--
2.48.1
Rebased.
Also, a separate thread with some additional explanation is here:
/messages/by-id/CADzfLwXZVmbo11tFS_G2i+6TfFVwHU4VUUSeoqb+8UQfuoJs8A@mail.gmail.com
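In short, the 0002 patch makes the tuple lookup take a fresh MVCC snapshot on every retry instead of scanning with a single dirty snapshot. A condensed sketch of the resulting retry loop in RelationFindReplTupleByIndex (paraphrased from the attached diff, with declarations and the tuple-equality check elided):

    scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);

retry:
    found = false;
    PushActiveSnapshot(GetLatestSnapshot());    /* fresh MVCC snapshot */
    scan->xs_snapshot = GetActiveSnapshot();    /* used for this attempt */
    index_rescan(scan, skey, skey_attoff, NULL, 0);

    while (index_getnext_slot(scan, ForwardScanDirection, outslot))
    {
        /* compare the scan tuple with searchslot; on a match set found and break */
    }

    if (found)
    {
        res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
                               outslot, GetCurrentCommandId(false),
                               lockmode, LockWaitBlock,
                               0 /* don't follow updates */ , &tmfd);
        if (should_refetch_tuple(res, &tmfd))
        {
            PopActiveSnapshot();    /* drop the now-stale snapshot... */
            goto retry;             /* ...and retry with a newer one */
        }
    }

    index_endscan(scan);
    PopActiveSnapshot();

The XactLockTableWait() dance for tuples locked by in-progress transactions goes away entirely: table_tuple_lock() plus the retry loop already cover that case.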
Attachments:
v12-0002-Fix-logical-replication-conflict-detection-durin.patch (application/octet-stream)
From a6c8b8e7d8bd2f59e7ef45eaf521629dd2963085 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Wed, 3 Sep 2025 19:08:55 +0200
Subject: [PATCH v12 2/2] Fix logical replication conflict detection during
tuple lookup
SNAPSHOT_DIRTY scans could miss conflicts with concurrent transactions during logical replication.
Replace the SNAPSHOT_DIRTY scans with GetLatestSnapshot() in RelationFindReplTupleByIndex() and RelationFindReplTupleSeq().
---
src/backend/executor/execReplication.c | 63 ++++++++------------------
1 file changed, 18 insertions(+), 45 deletions(-)
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index b409d4ecbf5..0de40aec733 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -186,8 +186,6 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
ScanKeyData skey[INDEX_MAX_KEYS];
int skey_attoff;
IndexScanDesc scan;
- SnapshotData snap;
- TransactionId xwait;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -198,17 +196,17 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
-
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ /* Start an index scan. SnapshotAny will be replaced below. */
+ scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
retry:
found = false;
-
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Refresh the scan snapshot on each retry */
+ scan->xs_snapshot = GetActiveSnapshot();
index_rescan(scan, skey, skey_attoff, NULL, 0);
/* Try to find the tuple */
@@ -229,19 +227,6 @@ retry:
ExecMaterializeSlot(outslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
found = true;
break;
@@ -253,8 +238,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -263,13 +246,15 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
-
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
index_endscan(scan);
+ PopActiveSnapshot();
/* Don't release lock until commit. */
index_close(idxrel, NoLock);
@@ -370,9 +355,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
{
TupleTableSlot *scanslot;
TableScanDesc scan;
- SnapshotData snap;
TypeCacheEntry **eq;
- TransactionId xwait;
bool found;
TupleDesc desc PG_USED_FOR_ASSERTS_ONLY = RelationGetDescr(rel);
@@ -380,13 +363,15 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
- /* Start a heap scan. */
- InitDirtySnapshot(snap);
- scan = table_beginscan(rel, &snap, 0, NULL);
+ /* Start a heap scan. SnapshotAny will be replaced below. */
+ scan = table_beginscan(rel, SnapshotAny, 0, NULL);
scanslot = table_slot_create(rel, NULL);
retry:
found = false;
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Refresh the scan snapshot on each retry */
+ scan->rs_snapshot = GetActiveSnapshot();
table_rescan(scan, NULL);
@@ -399,19 +384,6 @@ retry:
found = true;
ExecCopySlot(outslot, scanslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
break;
}
@@ -422,8 +394,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -432,13 +402,16 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
table_endscan(scan);
+ PopActiveSnapshot();
ExecDropSingleTupleTableSlot(scanslot);
return found;
--
2.43.0
v12-0001-This-patch-introduces-new-injection-points-and-T.patch (application/octet-stream)
From ad4b8702ac905f598dbfcab1dcca53fa3be3c216 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v12 1/2] This patch introduces new injection points and TAP
tests to reproduce and verify conflict detection issues that arise during
SNAPSHOT_DIRTY index scans in logical replication.
---
src/backend/access/index/indexam.c | 9 ++
src/backend/access/nbtree/README | 9 ++
src/backend/executor/execIndexing.c | 7 +-
src/backend/replication/logical/worker.c | 4 +
src/include/utils/snapshot.h | 14 ++
src/test/subscription/meson.build | 4 +
.../subscription/t/036_delete_missing_race.pl | 137 +++++++++++++++++
.../subscription/t/037_update_missing_race.pl | 139 +++++++++++++++++
.../t/038_update_missing_with_retain.pl | 141 ++++++++++++++++++
.../t/039_update_missing_simulation.pl | 123 +++++++++++++++
10 files changed, 586 insertions(+), 1 deletion(-)
create mode 100644 src/test/subscription/t/036_delete_missing_race.pl
create mode 100644 src/test/subscription/t/037_update_missing_race.pl
create mode 100644 src/test/subscription/t/038_update_missing_with_retain.pl
create mode 100644 src/test/subscription/t/039_update_missing_simulation.pl
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 86d11f4ec79..a503fa02ac5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -52,11 +52,13 @@
#include "catalog/pg_type.h"
#include "nodes/execnodes.h"
#include "pgstat.h"
+#include "replication/logicalworker.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -751,6 +753,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelation(scan->heapRelation) && IsLogicalWorker())
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..634a3d10bb1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -103,6 +103,15 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite these mechanics, inconsistent results may still occur during
+non-MVCC scans (SnapshotDirty and SnapshotSelf). This can happen if a
+concurrent transaction deletes a tuple and inserts a new version with a new
+TID on the same page, or to the left/right (depending on scan direction) of
+the current scan position. If the scan has already visited that page and
+cached its contents in backend-local storage, it may skip the old tuple
+because of the deletion and miss the new one because the page is not
+re-read. Note that this affects not only btree scans but also heap scans.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..61a5097f789 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -780,7 +781,9 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
* tuples that aren't visible yet.
- */
+ * A dirty snapshot may miss some tuples in the case of concurrent
+ * updates, but that is acceptable here.
+ */
InitDirtySnapshot(DirtySnapshot);
for (i = 0; i < indnkeyatts; i++)
@@ -943,6 +946,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ee6ac22329f..cccbaeedfd7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -277,6 +277,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -2946,7 +2947,10 @@ apply_handle_update_internal(ApplyExecutionData *edata,
conflicttuple.origin != replorigin_session_origin)
type = CT_UPDATE_DELETED;
else
+ {
+ INJECTION_POINT("apply_handle_update_internal_update_missing", NULL);
type = CT_UPDATE_MISSING;
+ }
/* Store the new tuple for conflict reporting */
slot_store_data(newslot, relmapentry, newtup);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..189dfd71103 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -53,6 +53,13 @@ typedef enum SnapshotType
* - previous commands of this transaction
* - changes made by the current command
*
+ * Note: such a snapshot may miss an existing logical tuple when it is
+ * updated concurrently.  If the new version of the tuple is inserted
+ * into an already-processed page while the old version is marked with
+ * a committed xmax, the snapshot will skip the old version and never
+ * encounter the new one during that scan, so the tuple is missed
+ * entirely.
+ *
* Does _not_ include:
* - in-progress transactions (as of the current instant)
* -------------------------------------------------------------------------
@@ -82,6 +89,13 @@ typedef enum SnapshotType
* transaction and committed/aborted xacts are concerned. However, it
* also includes the effects of other xacts still in progress.
*
+ * Note: such a snapshot may miss an existing logical tuple when it is
+ * updated concurrently.  If the new version of the tuple is inserted
+ * into an already-processed page while the old version is marked with
+ * a committed or in-progress xmax, the snapshot will skip the old
+ * version and never encounter the new one during that scan, so the
+ * tuple is missed entirely.
+ *
* A special hack is that when a snapshot of this type is used to
* determine tuple visibility, the passed-in snapshot struct is used as an
* output argument to return the xids of concurrent xacts that affected
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 20b4e523d93..4f9a5c9209d 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -45,6 +45,10 @@ tests += {
't/033_run_as_table_owner.pl',
't/034_temporal.pl',
't/035_conflicts.pl',
+ 't/036_delete_missing_race.pl',
+ 't/037_update_missing_race.pl',
+ 't/038_update_missing_with_retain.pl',
+ 't/039_update_missing_simulation.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/036_delete_missing_race.pl b/src/test/subscription/t/036_delete_missing_race.pl
new file mode 100644
index 00000000000..a319513fd60
--- /dev/null
+++ b/src/test/subscription/t/036_delete_missing_race.pl
@@ -0,0 +1,137 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: remove before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);
+ CREATE INDEX data_index ON conf_tab(data);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Delete tuple on publisher
+$node_publisher->safe_psql('postgres', "DELETE FROM conf_tab WHERE a=1;");
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple via the index
+ $node_subscriber->wait_for_event('logical replication apply worker',
+ 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update the tuple on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# But the tuple should be deleted on the subscriber anyway
+is($node_subscriber->safe_psql('postgres', 'SELECT count(*) from conf_tab'), 0, 'record deleted on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/037_update_missing_race.pl b/src/test/subscription/t/037_update_missing_race.pl
new file mode 100644
index 00000000000..b71fdc0c136
--- /dev/null
+++ b/src/test/subscription/t/037_update_missing_race.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: remove before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple via the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new value from the publisher must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its locally updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/038_update_missing_with_retain.pl b/src/test/subscription/t/038_update_missing_with_retain.pl
new file mode 100644
index 00000000000..6f7dfd28d37
--- /dev/null
+++ b/src/test/subscription/t/038_update_missing_with_retain.pl
@@ -0,0 +1,141 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: remove before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->append_conf('postgresql.conf',
+ qq(wal_level = 'replica'));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub WITH (retain_dead_tuples = true)");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple via the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new value from the publisher must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its locally updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_deleted/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/039_update_missing_simulation.pl b/src/test/subscription/t/039_update_missing_simulation.pl
new file mode 100644
index 00000000000..322e931c171
--- /dev/null
+++ b/src/test/subscription/t/039_update_missing_simulation.pl
@@ -0,0 +1,123 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use IPC::Run qw(start finish);
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int, data_sub int default 0);
+ CREATE INDEX data_index ON tbl(data_pub);");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE tbl");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+my $num_rows = 10;
+my $num_updates = 10000;
+my $num_clients = 10;
+$node_publisher->safe_psql('postgres', "INSERT INTO tbl SELECT i, i * i FROM generate_series(1,$num_rows) i");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+# Prepare small pgbench scripts as files
+my $sub_sql = $node_subscriber->basedir . '/sub_update.sql';
+my $pub_sql = $node_publisher->basedir . '/pub_update.sql';
+
+open my $fh1, '>', $sub_sql or die $!;
+print $fh1 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_sub = data_sub + 1 WHERE a = :num;\n";
+close $fh1;
+
+open my $fh2, '>', $pub_sql or die $!;
+print $fh2 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_pub = data_pub + 1 WHERE a = :num;\n";
+close $fh2;
+
+my @sub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_subscriber->port, '-h', $node_subscriber->host, '-f', $sub_sql, 'postgres'
+);
+
+my @pub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_publisher->port, '-h', $node_publisher->host, '-f', $pub_sql, 'postgres'
+);
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# This injection point should never be reached
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('apply_handle_update_internal_update_missing', 'error')");
+my $log_offset = -s $node_subscriber->logfile;
+
+# Start both concurrently
+my ($sub_out, $sub_err, $pub_out, $pub_err) = ('', '', '', '');
+my $sub_h = start \@sub_cmd, '>', \$sub_out, '2>', \$sub_err;
+my $pub_h = start \@pub_cmd, '>', \$pub_out, '2>', \$pub_err;
+
+# Wait for completion
+finish $sub_h;
+finish $pub_h;
+
+like($sub_out, qr/actually processed/, 'subscriber pgbench completed');
+like($pub_out, qr/actually processed/, 'publisher pgbench completed');
+
+# Let subscription catch up, then check expectations
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
+
+ok(!$node_subscriber->log_contains(
+ qr/ERROR: error triggered for injection point apply_handle_update_internal_update_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+done_testing();
--
2.43.0
Rebased again.
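For reference, here is the race the 0001 tests reproduce, written out as a two-session timeline on the subscriber (illustrative only: the session labels are mine; the mechanism is the one described in the nbtree/README hunk):

    apply worker (SnapshotDirty scan)       local backend
    ---------------------------------       -------------
    reads the index leaf page for the
    key and caches its items; pauses at
    the injection point before the heap
    fetch
                                            UPDATE ...;
                                            -- old version gets xmax, new
                                            -- version lands at a new TID on
                                            -- the already-read page
                                            COMMIT;
    resumes: the old TID fails the
    dirty-snapshot visibility check
    (committed xmax), and the new TID is
    never returned because the page is
    not re-read
    => the tuple is reported as missing
    (update_missing/delete_missing)

The simulation test then shows the same miss occurring under plain concurrent pgbench load, without any injection-point pause.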
Attachments:
v13-0001-This-patch-introduces-new-injection-points-and-T.patch (application/x-patch)
From 69c2d56899f8729dc1d1476f10da50d879f177d9 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v13 1/2] This patch introduces new injection points and TAP
tests to reproduce and verify conflict detection issues that arise during
SNAPSHOT_DIRTY index scans in logical replication.
---
src/backend/access/index/indexam.c | 9 ++
src/backend/access/nbtree/README | 9 ++
src/backend/executor/execIndexing.c | 7 +-
src/backend/replication/logical/worker.c | 4 +
src/include/utils/snapshot.h | 14 ++
src/test/subscription/meson.build | 4 +
.../subscription/t/037_delete_missing_race.pl | 137 +++++++++++++++++
.../subscription/t/038_update_missing_race.pl | 139 +++++++++++++++++
.../t/039_update_missing_with_retain.pl | 141 ++++++++++++++++++
.../t/040_update_missing_simulation.pl | 123 +++++++++++++++
10 files changed, 586 insertions(+), 1 deletion(-)
create mode 100644 src/test/subscription/t/037_delete_missing_race.pl
create mode 100644 src/test/subscription/t/038_update_missing_race.pl
create mode 100644 src/test/subscription/t/039_update_missing_with_retain.pl
create mode 100644 src/test/subscription/t/040_update_missing_simulation.pl
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 0492d92d23b..5987d90ee08 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -52,11 +52,13 @@
#include "catalog/pg_type.h"
#include "nodes/execnodes.h"
#include "pgstat.h"
+#include "replication/logicalworker.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -751,6 +753,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelation(scan->heapRelation) && IsLogicalWorker())
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..634a3d10bb1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -103,6 +103,15 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite these mechanics, inconsistent results may still occur during
+non-MVCC scans (SnapshotDirty and SnapshotSelf). This can happen if a
+concurrent transaction deletes a tuple and inserts a new version with a new
+TID on the same page, or to the left/right (depending on scan direction) of
+the current scan position. If the scan has already visited that page and
+cached its contents in backend-local storage, it may skip the old tuple
+because of the deletion and miss the new one because the page is not
+re-read. Note that this affects not only btree scans but also heap scans.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..61a5097f789 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -780,7 +781,9 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
* tuples that aren't visible yet.
- */
+ * A dirty snapshot may miss some tuples in the case of concurrent
+ * updates, but that is acceptable here.
+ */
InitDirtySnapshot(DirtySnapshot);
for (i = 0; i < indnkeyatts; i++)
@@ -943,6 +946,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5df5a4612b6..4f6976b4af4 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -286,6 +286,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -2961,7 +2962,10 @@ apply_handle_update_internal(ApplyExecutionData *edata,
conflicttuple.origin != replorigin_session_origin)
type = CT_UPDATE_DELETED;
else
+ {
+ INJECTION_POINT("apply_handle_update_internal_update_missing", NULL);
type = CT_UPDATE_MISSING;
+ }
/* Store the new tuple for conflict reporting */
slot_store_data(newslot, relmapentry, newtup);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..189dfd71103 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -53,6 +53,13 @@ typedef enum SnapshotType
* - previous commands of this transaction
* - changes made by the current command
*
+ * Note: such a snapshot may miss an existing logical tuple when it is
+ * updated concurrently.  If the new version of the tuple is inserted
+ * into an already-processed page while the old version is marked with
+ * a committed xmax, the snapshot will skip the old version and never
+ * encounter the new one during that scan, so the tuple is missed
+ * entirely.
+ *
* Does _not_ include:
* - in-progress transactions (as of the current instant)
* -------------------------------------------------------------------------
@@ -82,6 +89,13 @@ typedef enum SnapshotType
* transaction and committed/aborted xacts are concerned. However, it
* also includes the effects of other xacts still in progress.
*
+ * Note: such a snapshot may miss an existing logical tuple when it is
+ * updated concurrently.  If the new version of the tuple is inserted
+ * into an already-processed page while the old version is marked with
+ * a committed or in-progress xmax, the snapshot will skip the old
+ * version and never encounter the new one during that scan, so the
+ * tuple is missed entirely.
+ *
* A special hack is that when a snapshot of this type is used to
* determine tuple visibility, the passed-in snapshot struct is used as an
* output argument to return the xids of concurrent xacts that affected
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 85d10a89994..7f29647b538 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -46,6 +46,10 @@ tests += {
't/034_temporal.pl',
't/035_conflicts.pl',
't/036_sequences.pl',
+ 't/037_delete_missing_race.pl',
+ 't/038_update_missing_race.pl',
+ 't/039_update_missing_with_retain.pl',
+ 't/040_update_missing_simulation.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/037_delete_missing_race.pl b/src/test/subscription/t/037_delete_missing_race.pl
new file mode 100644
index 00000000000..a319513fd60
--- /dev/null
+++ b/src/test/subscription/t/037_delete_missing_race.pl
@@ -0,0 +1,137 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: remove before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);
+ CREATE INDEX data_index ON conf_tab(data);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Delete tuple on publisher
+$node_publisher->safe_psql('postgres', "DELETE FROM conf_tab WHERE a=1;");
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple via the index
+ $node_subscriber->wait_for_event('logical replication apply worker',
+ 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update the tuple on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# But the tuple should be deleted on the subscriber anyway
+is($node_subscriber->safe_psql('postgres', 'SELECT count(*) from conf_tab'), 0, 'record deleted on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/038_update_missing_race.pl b/src/test/subscription/t/038_update_missing_race.pl
new file mode 100644
index 00000000000..b71fdc0c136
--- /dev/null
+++ b/src/test/subscription/t/038_update_missing_race.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: remove before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple via the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new value from the publisher must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its locally updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_missing/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/039_update_missing_with_retain.pl b/src/test/subscription/t/039_update_missing_with_retain.pl
new file mode 100644
index 00000000000..6f7dfd28d37
--- /dev/null
+++ b/src/test/subscription/t/039_update_missing_with_retain.pl
@@ -0,0 +1,141 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: remove before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->append_conf('postgresql.conf',
+ qq(wal_level = 'replica'));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub WITH (retain_dead_tuples = true)");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple via the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+# The new value from the publisher must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must retain its locally updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_deleted/,
+ $log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/040_update_missing_simulation.pl b/src/test/subscription/t/040_update_missing_simulation.pl
new file mode 100644
index 00000000000..322e931c171
--- /dev/null
+++ b/src/test/subscription/t/040_update_missing_simulation.pl
@@ -0,0 +1,123 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use IPC::Run qw(start finish);
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int, data_sub int default 0);
+ CREATE INDEX data_index ON tbl(data_pub);");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE tbl");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+my $num_rows = 10;
+my $num_updates = 10000;
+my $num_clients = 10;
+$node_publisher->safe_psql('postgres', "INSERT INTO tbl SELECT i, i * i FROM generate_series(1,$num_rows) i");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+# Prepare small pgbench scripts as files
+my $sub_sql = $node_subscriber->basedir . '/sub_update.sql';
+my $pub_sql = $node_publisher->basedir . '/pub_update.sql';
+
+open my $fh1, '>', $sub_sql or die $!;
+print $fh1 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_sub = data_sub + 1 WHERE a = :num;\n";
+close $fh1;
+
+open my $fh2, '>', $pub_sql or die $!;
+print $fh2 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_pub = data_pub + 1 WHERE a = :num;\n";
+close $fh2;
+
+my @sub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_subscriber->port, '-h', $node_subscriber->host, '-f', $sub_sql, 'postgres'
+);
+
+my @pub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_publisher->port, '-h', $node_publisher->host, '-f', $pub_sql, 'postgres'
+);
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# update_missing should never be reported, so make the injection point error out
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('apply_handle_update_internal_update_missing', 'error')");
+my $log_offset = -s $node_subscriber->logfile;
+
+# Start both concurrently
+my ($sub_out, $sub_err, $pub_out, $pub_err) = ('', '', '', '');
+my $sub_h = start \@sub_cmd, '>', \$sub_out, '2>', \$sub_err;
+my $pub_h = start \@pub_cmd, '>', \$pub_out, '2>', \$pub_err;
+
+# Wait for completion
+finish $sub_h;
+finish $pub_h;
+
+like($sub_out, qr/actually processed/, 'subscriber pgbench completed');
+like($pub_out, qr/actually processed/, 'publisher pgbench completed');
+
+# Let subscription catch up, then check expectations
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
+
+ok(!$node_subscriber->log_contains(
+ qr/ERROR: error triggered for injection point apply_handle_update_internal_update_missing/,
+	$log_offset), 'no invalid conflict detected');
+
+done_testing();
--
2.43.0
v13-0002-Fix-logical-replication-conflict-detection-durin.patch (application/x-patch)
From 26baa8be7cfaebef1af04b25a4ea7ca1b1e6d4eb Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Wed, 3 Sep 2025 19:08:55 +0200
Subject: [PATCH v13 2/2] Fix logical replication conflict detection during
tuple lookup
SNAPSHOT_DIRTY scans could fail to detect conflicts with concurrent transactions during logical replication.
Replace the SNAPSHOT_DIRTY scan with GetLatestSnapshot() in RelationFindReplTupleByIndex and RelationFindReplTupleSeq.
---
src/backend/executor/execReplication.c | 63 ++++++++------------------
1 file changed, 18 insertions(+), 45 deletions(-)
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index def32774c90..1e434ab697a 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -186,8 +186,6 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
ScanKeyData skey[INDEX_MAX_KEYS];
int skey_attoff;
IndexScanDesc scan;
- SnapshotData snap;
- TransactionId xwait;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -198,17 +196,17 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
-
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ /* Start an index scan. SnapshotAny will be replaced below. */
+ scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
retry:
found = false;
-
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Update the actual scan snapshot each retry */
+ scan->xs_snapshot = GetActiveSnapshot();
index_rescan(scan, skey, skey_attoff, NULL, 0);
/* Try to find the tuple */
@@ -229,19 +227,6 @@ retry:
ExecMaterializeSlot(outslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
found = true;
break;
@@ -253,8 +238,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -263,13 +246,15 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
-
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
index_endscan(scan);
+ PopActiveSnapshot();
/* Don't release lock until commit. */
index_close(idxrel, NoLock);
@@ -370,9 +355,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
{
TupleTableSlot *scanslot;
TableScanDesc scan;
- SnapshotData snap;
TypeCacheEntry **eq;
- TransactionId xwait;
bool found;
TupleDesc desc PG_USED_FOR_ASSERTS_ONLY = RelationGetDescr(rel);
@@ -380,13 +363,15 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
- /* Start a heap scan. */
- InitDirtySnapshot(snap);
- scan = table_beginscan(rel, &snap, 0, NULL);
+ /* Start a heap scan. SnapshotAny will be replaced below. */
+ scan = table_beginscan(rel, SnapshotAny, 0, NULL);
scanslot = table_slot_create(rel, NULL);
retry:
found = false;
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Update the actual scan snapshot each retry */
+ scan->rs_snapshot = GetActiveSnapshot();
table_rescan(scan, NULL);
@@ -399,19 +384,6 @@ retry:
found = true;
ExecCopySlot(outslot, scanslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
break;
}
@@ -422,8 +394,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -432,13 +402,16 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
table_endscan(scan);
+ PopActiveSnapshot();
ExecDropSingleTupleTableSlot(scanslot);
return found;
--
2.43.0
Fixed a race in the tests that caused
https://cirrus-ci.com/task/5815107659235328?logs=test_world#L324 to fail.
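For review convenience, here is the resulting lookup loop of 0002 in one piece, condensed by hand from the diff (the tuple-equality check, the sequential-scan variant and error handling are omitted, so treat it as a sketch rather than compilable code):

    /* RelationFindReplTupleByIndex after the patch, condensed */
    scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);

retry:
    found = false;

    /* Take a fresh MVCC snapshot on every retry instead of relying on a
     * single DirtySnapshot for the whole lookup. */
    PushActiveSnapshot(GetLatestSnapshot());
    scan->xs_snapshot = GetActiveSnapshot();
    index_rescan(scan, skey, skey_attoff, NULL, 0);

    while (index_getnext_slot(scan, ForwardScanDirection, scanslot))
    {
        /* ... compare against searchslot, materialize outslot, found = true ... */
    }

    if (found)
    {
        res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
                               outslot, GetCurrentCommandId(false),
                               lockmode, LockWaitBlock,
                               0 /* don't follow updates */ , &tmfd);

        if (should_refetch_tuple(res, &tmfd))
        {
            PopActiveSnapshot();
            goto retry;         /* concurrently updated, look again */
        }
    }

    index_endscan(scan);
    PopActiveSnapshot();

The explicit xwait/XactLockTableWait dance is gone on purpose: with an MVCC snapshot the scan returns the latest visible version, and table_tuple_lock() called with LockWaitBlock does the waiting and reports concurrent updates through should_refetch_tuple().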
Attachments:
v14-0002-Fix-logical-replication-conflict-detection-durin.patch (text/x-patch)
From 2a6a121c7cfe4823db0f8aec931c5dbcab672616 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Wed, 3 Sep 2025 19:08:55 +0200
Subject: [PATCH v14 2/2] Fix logical replication conflict detection during
tuple lookup
SNAPSHOT_DIRTY scans could fail to detect conflicts with concurrent transactions during logical replication.
Replace the SNAPSHOT_DIRTY scan with GetLatestSnapshot() in RelationFindReplTupleByIndex and RelationFindReplTupleSeq.
---
src/backend/executor/execReplication.c | 63 ++++++++------------------
1 file changed, 18 insertions(+), 45 deletions(-)
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index b409d4ecbf5..0de40aec733 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -186,8 +186,6 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
ScanKeyData skey[INDEX_MAX_KEYS];
int skey_attoff;
IndexScanDesc scan;
- SnapshotData snap;
- TransactionId xwait;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -198,17 +196,17 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
-
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ /* Start an index scan. SnapshotAny will be replaced below. */
+ scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
retry:
found = false;
-
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Update the actual scan snapshot each retry */
+ scan->xs_snapshot = GetActiveSnapshot();
index_rescan(scan, skey, skey_attoff, NULL, 0);
/* Try to find the tuple */
@@ -229,19 +227,6 @@ retry:
ExecMaterializeSlot(outslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
found = true;
break;
@@ -253,8 +238,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -263,13 +246,15 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
-
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
index_endscan(scan);
+ PopActiveSnapshot();
/* Don't release lock until commit. */
index_close(idxrel, NoLock);
@@ -370,9 +355,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
{
TupleTableSlot *scanslot;
TableScanDesc scan;
- SnapshotData snap;
TypeCacheEntry **eq;
- TransactionId xwait;
bool found;
TupleDesc desc PG_USED_FOR_ASSERTS_ONLY = RelationGetDescr(rel);
@@ -380,13 +363,15 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
- /* Start a heap scan. */
- InitDirtySnapshot(snap);
- scan = table_beginscan(rel, &snap, 0, NULL);
+ /* Start a heap scan. SnapshotAny will be replaced below. */
+ scan = table_beginscan(rel, SnapshotAny, 0, NULL);
scanslot = table_slot_create(rel, NULL);
retry:
found = false;
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Update the actual scan snapshot each retry */
+ scan->rs_snapshot = GetActiveSnapshot();
table_rescan(scan, NULL);
@@ -399,19 +384,6 @@ retry:
found = true;
ExecCopySlot(outslot, scanslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
break;
}
@@ -422,8 +394,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -432,13 +402,16 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
table_endscan(scan);
+ PopActiveSnapshot();
ExecDropSingleTupleTableSlot(scanslot);
return found;
--
2.48.1
v14-0001-This-patch-introduces-new-injection-points-and-T.patch (text/x-patch)
From 66ede0f9001d9e841e9249091869dbde58df4d5c Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v14 1/2] This patch introduces new injection points and TAP
tests to reproduce and verify conflict detection issues that arise during
SNAPSHOT_DIRTY index scans in logical replication.
---
src/backend/access/index/indexam.c | 9 ++
src/backend/access/nbtree/README | 9 ++
src/backend/executor/execIndexing.c | 7 +-
src/backend/replication/logical/worker.c | 4 +
src/include/utils/snapshot.h | 14 ++
src/test/subscription/meson.build | 4 +
.../subscription/t/036_delete_missing_race.pl | 139 +++++++++++++++++
.../subscription/t/037_update_missing_race.pl | 141 +++++++++++++++++
.../t/038_update_missing_with_retain.pl | 143 ++++++++++++++++++
.../t/039_update_missing_simulation.pl | 125 +++++++++++++++
10 files changed, 594 insertions(+), 1 deletion(-)
create mode 100644 src/test/subscription/t/036_delete_missing_race.pl
create mode 100644 src/test/subscription/t/037_update_missing_race.pl
create mode 100644 src/test/subscription/t/038_update_missing_with_retain.pl
create mode 100644 src/test/subscription/t/039_update_missing_simulation.pl
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 86d11f4ec79..a503fa02ac5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -52,11 +52,13 @@
#include "catalog/pg_type.h"
#include "nodes/execnodes.h"
#include "pgstat.h"
+#include "replication/logicalworker.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -751,6 +753,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelation(scan->heapRelation) && IsLogicalWorker())
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..634a3d10bb1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -103,6 +103,15 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite the mechanics described above, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This can happen if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID
+on the same page, or to the left/right (depending on scan direction) of the
+current scan position. If the scan has already visited the page and cached its
+contents in backend-local storage, it may skip the old tuple because of the
+deletion and miss the new tuple because the page is not re-read. Note that
+this affects not only btree scans but also heap scans.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..61a5097f789 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -780,7 +781,9 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
* tuples that aren't visible yet.
- */
+	 * A dirty snapshot may miss some tuples in the case of concurrent updates,
+	 * but that is acceptable here.
+ */
InitDirtySnapshot(DirtySnapshot);
for (i = 0; i < indnkeyatts; i++)
@@ -943,6 +946,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ee6ac22329f..cccbaeedfd7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -277,6 +277,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -2946,7 +2947,10 @@ apply_handle_update_internal(ApplyExecutionData *edata,
conflicttuple.origin != replorigin_session_origin)
type = CT_UPDATE_DELETED;
else
+ {
+ INJECTION_POINT("apply_handle_update_internal_update_missing", NULL);
type = CT_UPDATE_MISSING;
+ }
/* Store the new tuple for conflict reporting */
slot_store_data(newslot, relmapentry, newtup);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..189dfd71103 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -53,6 +53,13 @@ typedef enum SnapshotType
* - previous commands of this transaction
* - changes made by the current command
*
+ * Note: such a snapshot may miss an existing logical tuple in the case of a
+ * concurrent update.
+ * If a new version of a tuple is inserted into an already-processed page
+ * while the old one is marked with a committed xmax, the snapshot will skip
+ * the old version and never encounter the new one during that scan, so the
+ * logical tuple is missed entirely.
+ *
* Does _not_ include:
* - in-progress transactions (as of the current instant)
* -------------------------------------------------------------------------
@@ -82,6 +89,13 @@ typedef enum SnapshotType
* transaction and committed/aborted xacts are concerned. However, it
* also includes the effects of other xacts still in progress.
*
+ * Note: such a snapshot may miss an existing logical tuple in the case of a
+ * concurrent update.
+ * If a new version of a tuple is inserted into an already-processed page
+ * while the old one is marked with a committed/in-progress xmax, the
+ * snapshot will skip the old version and never encounter the new one during
+ * that scan, so the logical tuple is missed entirely.
+ *
* A special hack is that when a snapshot of this type is used to
* determine tuple visibility, the passed-in snapshot struct is used as an
* output argument to return the xids of concurrent xacts that affected
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 20b4e523d93..4f9a5c9209d 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -45,6 +45,10 @@ tests += {
't/033_run_as_table_owner.pl',
't/034_temporal.pl',
't/035_conflicts.pl',
+ 't/036_delete_missing_race.pl',
+ 't/037_update_missing_race.pl',
+ 't/038_update_missing_with_retain.pl',
+ 't/039_update_missing_simulation.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/036_delete_missing_race.pl b/src/test/subscription/t/036_delete_missing_race.pl
new file mode 100644
index 00000000000..51dd351dc10
--- /dev/null
+++ b/src/test/subscription/t/036_delete_missing_race.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);
+ CREATE INDEX data_index ON conf_tab(data);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Delete tuple on publisher
+$node_publisher->safe_psql('postgres', "DELETE FROM conf_tab WHERE a=1;");
+
+if ($simulate_race_condition)
+{
+	# Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker',
+ 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update the tuple on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+$node_publisher->wait_for_catchup($appname);
+
+# But the tuple should be deleted on the subscriber anyway
+is($node_subscriber->safe_psql('postgres', 'SELECT count(*) from conf_tab'), 0, 'record deleted on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_missing/,
+	$log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/037_update_missing_race.pl b/src/test/subscription/t/037_update_missing_race.pl
new file mode 100644
index 00000000000..1e120f74bbd
--- /dev/null
+++ b/src/test/subscription/t/037_update_missing_race.pl
@@ -0,0 +1,141 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+	# Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+$node_publisher->wait_for_catchup($appname);
+
+# The new column value must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must keep its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_missing/,
+	$log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/038_update_missing_with_retain.pl b/src/test/subscription/t/038_update_missing_with_retain.pl
new file mode 100644
index 00000000000..7b225d45f7f
--- /dev/null
+++ b/src/test/subscription/t/038_update_missing_with_retain.pl
@@ -0,0 +1,143 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->append_conf('postgresql.conf',
+ qq(wal_level = 'replica'));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub WITH (retain_dead_tuples = true)");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+	# Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+$node_publisher->wait_for_catchup($appname);
+
+# The new column value must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must keep its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_deleted/,
+	$log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/039_update_missing_simulation.pl b/src/test/subscription/t/039_update_missing_simulation.pl
new file mode 100644
index 00000000000..21fcd1ceb53
--- /dev/null
+++ b/src/test/subscription/t/039_update_missing_simulation.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+# Not intended to be committed because it is quite heavy;
+# included here to demonstrate reproducibility with pgbench
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use IPC::Run qw(start finish);
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int, data_sub int default 0);
+ CREATE INDEX data_index ON tbl(data_pub);");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE tbl");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+my $num_rows = 10;
+my $num_updates = 10000;
+my $num_clients = 10;
+$node_publisher->safe_psql('postgres', "INSERT INTO tbl SELECT i, i * i FROM generate_series(1,$num_rows) i");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+# Prepare small pgbench scripts as files
+my $sub_sql = $node_subscriber->basedir . '/sub_update.sql';
+my $pub_sql = $node_publisher->basedir . '/pub_update.sql';
+
+open my $fh1, '>', $sub_sql or die $!;
+print $fh1 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_sub = data_sub + 1 WHERE a = :num;\n";
+close $fh1;
+
+open my $fh2, '>', $pub_sql or die $!;
+print $fh2 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_pub = data_pub + 1 WHERE a = :num;\n";
+close $fh2;
+
+my @sub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_subscriber->port, '-h', $node_subscriber->host, '-f', $sub_sql, 'postgres'
+);
+
+my @pub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_publisher->port, '-h', $node_publisher->host, '-f', $pub_sql, 'postgres'
+);
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# update_missing should never be reported, so make the injection point error out
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('apply_handle_update_internal_update_missing', 'error')");
+my $log_offset = -s $node_subscriber->logfile;
+
+# Start both concurrently
+my ($sub_out, $sub_err, $pub_out, $pub_err) = ('', '', '', '');
+my $sub_h = start \@sub_cmd, '>', \$sub_out, '2>', \$sub_err;
+my $pub_h = start \@pub_cmd, '>', \$pub_out, '2>', \$pub_err;
+
+# Wait for completion
+finish $sub_h;
+finish $pub_h;
+
+like($sub_out, qr/actually processed/, 'subscriber pgbench completed');
+like($pub_out, qr/actually processed/, 'publisher pgbench completed');
+
+# Let subscription catch up, then check expectations
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
+
+ok(!$node_subscriber->log_contains(
+ qr/ERROR: error triggered for injection point apply_handle_update_internal_update_missing/,
+	$log_offset), 'no invalid conflict detected');
+
+done_testing();
--
2.48.1
And rebased :)
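As a recap of what 0001 instruments: the injection point sits in index_getnext_slot() in the window between locating the index entry and fetching the heap tuple, which is exactly where the TAP tests freeze the apply worker (condensed from the diff):

    /* index_getnext_slot, condensed */
    Assert(ItemPointerIsValid(&scan->xs_heaptid));

#ifdef USE_INJECTION_POINTS
    /* Pause only logical replication apply workers scanning user relations,
     * so a test can modify the heap tuple between the index probe and the
     * heap fetch. */
    if (!IsCatalogRelation(scan->heapRelation) && IsLogicalWorker())
        INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
#endif

    if (index_fetch_heap(scan, slot))
        return true;

While the worker sleeps there, the tests run an UPDATE on the subscriber side, so the TID found under the dirty snapshot points to a dead row version by the time the heap fetch resumes - and the dirty scan never sees the new version.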
Attachments:
v16-0001-This-patch-introduces-new-injection-points-and-T.patch (text/x-patch)
From 6a4e379df251d27e42b8191bd153691f4d2d9886 Mon Sep 17 00:00:00 2001
From: nkey <michail.nikolaev@gmail.com>
Date: Sat, 23 Nov 2024 13:25:11 +0100
Subject: [PATCH v16 1/2] This patch introduces new injection points and TAP
tests to reproduce and verify conflict detection issues that arise during
SNAPSHOT_DIRTY index scans in logical replication.
---
src/backend/access/index/indexam.c | 9 ++
src/backend/access/nbtree/README | 9 ++
src/backend/executor/execIndexing.c | 7 +-
src/backend/replication/logical/worker.c | 4 +
src/include/utils/snapshot.h | 14 ++
src/test/subscription/meson.build | 4 +
.../subscription/t/037_delete_missing_race.pl | 139 +++++++++++++++++
.../subscription/t/038_update_missing_race.pl | 141 +++++++++++++++++
.../t/039_update_missing_with_retain.pl | 143 ++++++++++++++++++
.../t/040_update_missing_simulation.pl | 125 +++++++++++++++
10 files changed, 594 insertions(+), 1 deletion(-)
create mode 100644 src/test/subscription/t/037_delete_missing_race.pl
create mode 100644 src/test/subscription/t/038_update_missing_race.pl
create mode 100644 src/test/subscription/t/039_update_missing_with_retain.pl
create mode 100644 src/test/subscription/t/040_update_missing_simulation.pl
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 0492d92d23b..5987d90ee08 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -52,11 +52,13 @@
#include "catalog/pg_type.h"
#include "nodes/execnodes.h"
#include "pgstat.h"
+#include "replication/logicalworker.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "utils/injection_point.h"
/* ----------------------------------------------------------------
@@ -751,6 +753,13 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
* the index.
*/
Assert(ItemPointerIsValid(&scan->xs_heaptid));
+#ifdef USE_INJECTION_POINTS
+ if (!IsCatalogRelation(scan->heapRelation) && IsLogicalWorker())
+ {
+ INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);
+ }
+#endif
+
if (index_fetch_heap(scan, slot))
return true;
}
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 53d4a61dc3f..634a3d10bb1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -103,6 +103,15 @@ We also remember the left-link, and follow it when the scan moves backwards
(though this requires extra handling to account for concurrent splits of
the left sibling; see detailed move-left algorithm below).
+Despite the mechanics described above, inconsistent results may still occur
+during non-MVCC scans (SnapshotDirty and SnapshotSelf). This can happen if a
+concurrent transaction deletes a tuple and inserts a new tuple with a new TID
+on the same page, or to the left/right (depending on scan direction) of the
+current scan position. If the scan has already visited the page and cached its
+contents in backend-local storage, it may skip the old tuple because of the
+deletion and miss the new tuple because the page is not re-read. Note that
+this affects not only btree scans but also heap scans.
+
In most cases we release our lock and pin on a page before attempting
to acquire pin and lock on the page we are moving to. In a few places
it is necessary to lock the next page before releasing the current one.
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..61a5097f789 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
#include "utils/multirangetypes.h"
#include "utils/rangetypes.h"
#include "utils/snapmgr.h"
+#include "utils/injection_point.h"
/* waitMode argument to check_exclusion_or_unique_constraint() */
typedef enum
@@ -780,7 +781,9 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* Search the tuples that are in the index for any violations, including
* tuples that aren't visible yet.
- */
+	 * A dirty snapshot may miss some tuples in the case of concurrent updates,
+	 * but that is acceptable here.
+ */
InitDirtySnapshot(DirtySnapshot);
for (i = 0; i < indnkeyatts; i++)
@@ -943,6 +946,8 @@ retry:
ExecDropSingleTupleTableSlot(existing_slot);
+ if (!conflict)
+ INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
return !conflict;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7edd1c9cf06..0f2ffc754c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -286,6 +286,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -2962,7 +2963,10 @@ apply_handle_update_internal(ApplyExecutionData *edata,
conflicttuple.origin != replorigin_session_origin)
type = CT_UPDATE_DELETED;
else
+ {
+ INJECTION_POINT("apply_handle_update_internal_update_missing", NULL);
type = CT_UPDATE_MISSING;
+ }
/* Store the new tuple for conflict reporting */
slot_store_data(newslot, relmapentry, newtup);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..189dfd71103 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -53,6 +53,13 @@ typedef enum SnapshotType
* - previous commands of this transaction
* - changes made by the current command
*
+ * Note: such a snapshot may miss an existing logical tuple in the case of a
+ * concurrent update.
+ * If a new version of a tuple is inserted into an already-processed page
+ * while the old one is marked with a committed xmax, the snapshot will skip
+ * the old version and never encounter the new one during that scan, so the
+ * logical tuple is missed entirely.
+ *
* Does _not_ include:
* - in-progress transactions (as of the current instant)
* -------------------------------------------------------------------------
@@ -82,6 +89,13 @@ typedef enum SnapshotType
* transaction and committed/aborted xacts are concerned. However, it
* also includes the effects of other xacts still in progress.
*
+ * Note: such a snapshot may miss an existing logical tuple in the case of a
+ * concurrent update.
+ * If a new version of a tuple is inserted into an already-processed page
+ * while the old one is marked with a committed/in-progress xmax, the
+ * snapshot will skip the old version and never encounter the new one during
+ * that scan, so the logical tuple is missed entirely.
+ *
* A special hack is that when a snapshot of this type is used to
* determine tuple visibility, the passed-in snapshot struct is used as an
* output argument to return the xids of concurrent xacts that affected
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 85d10a89994..b552ae60a88 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -46,6 +46,10 @@ tests += {
't/034_temporal.pl',
't/035_conflicts.pl',
't/036_sequences.pl',
+ 't/037_delete_missing_race.pl',
+ 't/038_update_missing_race.pl',
+ 't/039_update_missing_with_retain.pl',
+ 't/040_update_missing_simulation.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/037_delete_missing_race.pl b/src/test/subscription/t/037_delete_missing_race.pl
new file mode 100644
index 00000000000..51dd351dc10
--- /dev/null
+++ b/src/test/subscription/t/037_delete_missing_race.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);
+ CREATE INDEX data_index ON conf_tab(data);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Delete tuple on publisher
+$node_publisher->safe_psql('postgres', "DELETE FROM conf_tab WHERE a=1;");
+
+if ($simulate_race_condition)
+{
+	# Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker',
+ 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update the tuple on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET data = 'fromsubnew' WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+$node_publisher->wait_for_catchup($appname);
+
+# But the tuple should be deleted on the subscriber anyway
+is($node_subscriber->safe_psql('postgres', 'SELECT count(*) from conf_tab'), 0, 'record deleted on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_missing/,
+	$log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=delete_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/038_update_missing_race.pl b/src/test/subscription/t/038_update_missing_race.pl
new file mode 100644
index 00000000000..1e120f74bbd
--- /dev/null
+++ b/src/test/subscription/t/038_update_missing_race.pl
@@ -0,0 +1,141 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to make the test succeed; TODO: delete this before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create similar table on subscriber with additional index to disable HOT updates and additional column
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+	# Wait for the apply worker to start searching for the tuple using the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update additional(!) column on the subscriber
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so we have a conflict
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+$node_publisher->wait_for_catchup($appname);
+
+# The new column value must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must keep its updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'column record updated on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_missing/,
+	$log_offset), 'no invalid conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG: conflict detected on relation \"public.conf_tab\": conflict=update_origin_differs/,
+ $log_offset), 'correct conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/039_update_missing_with_retain.pl b/src/test/subscription/t/039_update_missing_with_retain.pl
new file mode 100644
index 00000000000..7b225d45f7f
--- /dev/null
+++ b/src/test/subscription/t/039_update_missing_with_retain.pl
@@ -0,0 +1,143 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+############################## Set to 0 to disable the race simulation (the test then passes); TODO: delete before commit
+my $simulate_race_condition = 1;
+##############################
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->append_conf('postgresql.conf',
+ qq(wal_level = 'replica'));
+$node_subscriber->start;
+
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text);");
+
+# Create a similar table on the subscriber, with an additional column and an index on it to prevent HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE conf_tab(a int PRIMARY key, data text, i int DEFAULT 0);
+ CREATE INDEX i_index ON conf_tab(i);");
+
+# Set up extension to simulate race condition
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE conf_tab");
+
+# Insert row to be updated later
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO conf_tab(a, data) VALUES (1,'frompub')");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub WITH (retain_dead_tuples = true)");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+############################################
+# Race condition because of DirtySnapshot
+############################################
+
+my $psql_session_subscriber = $node_subscriber->background_psql('postgres');
+if ($simulate_race_condition)
+{
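+ # Make the apply worker pause just before it fetches the heap tuple
+ # found via its dirty-snapshot index scan, leaving a window for a
+ # concurrent local UPDATE to slip in.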
+ $node_subscriber->safe_psql('postgres', "SELECT injection_points_attach('index_getnext_slot_before_fetch_apply_dirty', 'wait')");
+}
+
+my $log_offset = -s $node_subscriber->logfile;
+
+# Update tuple on publisher
+$node_publisher->safe_psql('postgres',
+ "UPDATE conf_tab SET data = 'frompubnew' WHERE (a=1);");
+
+
+if ($simulate_race_condition)
+{
+ # Wait for the apply worker to start searching for the tuple via the index
+ $node_subscriber->wait_for_event('logical replication apply worker', 'index_getnext_slot_before_fetch_apply_dirty');
+}
+
+# Update only the additional, subscriber-side column
+$psql_session_subscriber->query_until(
+ qr/start/, qq[
+ \\echo start
+ UPDATE conf_tab SET i = 1 WHERE (a=1);
+]);
+
+
+if ($simulate_race_condition)
+{
+ # Wake up apply worker
+ $node_subscriber->safe_psql('postgres',"
+ SELECT injection_points_detach('index_getnext_slot_before_fetch_apply_dirty');
+ SELECT injection_points_wakeup('index_getnext_slot_before_fetch_apply_dirty');
+ ");
+}
+
+# The tuple was updated concurrently, so a conflict must be detected
+$node_subscriber->wait_for_log(
+ qr/conflict detected on relation \"public.conf_tab\"/,
+ $log_offset);
+
+$node_publisher->wait_for_catchup($appname);
+
+# The new value from the publisher must be synced to the subscriber
+is($node_subscriber->safe_psql('postgres', 'SELECT data from conf_tab WHERE a = 1'), 'frompubnew', 'record updated on subscriber');
+# And the additional column must keep its locally updated value
+is($node_subscriber->safe_psql('postgres', 'SELECT i from conf_tab WHERE a = 1'), 1, 'additional column keeps its value on subscriber');
+
+ok(!$node_subscriber->log_contains(
+ qr/LOG:\s+conflict detected on relation "public\.conf_tab": conflict=update_deleted/,
+ $log_offset), 'no spurious update_deleted conflict detected');
+
+ok($node_subscriber->log_contains(
+ qr/LOG:\s+conflict detected on relation "public\.conf_tab": conflict=update_origin_differs/,
+ $log_offset), 'expected update_origin_differs conflict detected');
+
+done_testing();
diff --git a/src/test/subscription/t/040_update_missing_simulation.pl b/src/test/subscription/t/040_update_missing_simulation.pl
new file mode 100644
index 00000000000..21fcd1ceb53
--- /dev/null
+++ b/src/test/subscription/t/040_update_missing_simulation.pl
@@ -0,0 +1,125 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test the conflict detection and resolution in logical replication
+# Not intended to be committed because it is quite heavy;
+# included here to demonstrate reproducibility with pgbench.
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use IPC::Run qw(start finish);
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+###############################
+# Setup
+###############################
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_publisher->start;
+
+# Create subscriber node with track_commit_timestamp enabled
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ qq(track_commit_timestamp = on));
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+# Create table on publisher
+$node_publisher->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int);");
+
+# Create a similar table on the subscriber, with an extra column and an index on data_pub to prevent HOT updates
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE TABLE tbl(a int PRIMARY key, data_pub int, data_sub int default 0);
+ CREATE INDEX data_index ON tbl(data_pub);");
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION tap_pub FOR TABLE tbl");
+
+# Create the subscription
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql(
+ 'postgres',
+ "CREATE SUBSCRIPTION tap_sub
+ CONNECTION '$publisher_connstr application_name=$appname'
+ PUBLICATION tap_pub");
+
+my $num_rows = 10;
+my $num_updates = 10000;
+my $num_clients = 10;
+$node_publisher->safe_psql('postgres', "INSERT INTO tbl SELECT i, i * i FROM generate_series(1,$num_rows) i");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, $appname);
+
+# Prepare small pgbench scripts as files
+my $sub_sql = $node_subscriber->basedir . '/sub_update.sql';
+my $pub_sql = $node_publisher->basedir . '/pub_delete.sql';
+
+open my $fh1, '>', $sub_sql or die $!;
+print $fh1 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_sub = data_sub + 1 WHERE a = :num;\n";
+close $fh1;
+
+open my $fh2, '>', $pub_sql or die $!;
+print $fh2 "\\set num random(1,$num_rows)\nUPDATE tbl SET data_pub = data_pub + 1 WHERE a = :num;\n";
+close $fh2;
+
+my @sub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_subscriber->port, '-h', $node_subscriber->host, '-f', $sub_sql, 'postgres'
+);
+
+my @pub_cmd = (
+ 'pgbench',
+ '--no-vacuum', "--client=$num_clients", '--jobs=4', '--exit-on-abort', "--transactions=$num_updates",
+ '-p', $node_publisher->port, '-h', $node_publisher->host, '-f', $pub_sql, 'postgres'
+);
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+# The update_missing conflict should never be reported; make the injection point error out if it is
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('apply_handle_update_internal_update_missing', 'error')");
+my $log_offset = -s $node_subscriber->logfile;
+
+# Start both concurrently
+my ($sub_out, $sub_err, $pub_out, $pub_err) = ('', '', '', '');
+my $sub_h = start \@sub_cmd, '>', \$sub_out, '2>', \$sub_err;
+my $pub_h = start \@pub_cmd, '>', \$pub_out, '2>', \$pub_err;
+
+# Wait for completion
+finish $sub_h;
+finish $pub_h;
+
+like($sub_out, qr/actually processed/, 'subscriber pgbench completed');
+like($pub_out, qr/actually processed/, 'publisher pgbench completed');
+
+# Let subscription catch up, then check expectations
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
+
+ok(!$node_subscriber->log_contains(
+ qr/ERROR:\s+error triggered for injection point apply_handle_update_internal_update_missing/,
+ $log_offset), 'no update_missing conflict reported');
+
+done_testing();
--
2.48.1
v16-0002-Fix-logical-replication-conflict-detection-durin.patchtext/x-patch; charset=US-ASCII; name=v16-0002-Fix-logical-replication-conflict-detection-durin.patchDownload
From fe065bae15156cce590271b2977f516eb2ae7568 Mon Sep 17 00:00:00 2001
From: nkey <nkey@toloka.ai>
Date: Wed, 3 Sep 2025 19:08:55 +0200
Subject: [PATCH v16 2/2] Fix logical replication conflict detection during
tuple lookup
SNAPSHOT_DIRTY scans could miss concurrently modified tuples and thus fail to detect conflicts with concurrent transactions during logical replication.
Replace the SNAPSHOT_DIRTY scans in RelationFindReplTupleByIndex and RelationFindReplTupleSeq with scans under a fresh MVCC snapshot obtained via GetLatestSnapshot() for every lookup attempt.
---
src/backend/executor/execReplication.c | 63 ++++++++------------------
1 file changed, 18 insertions(+), 45 deletions(-)
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index def32774c90..1e434ab697a 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -186,8 +186,6 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
ScanKeyData skey[INDEX_MAX_KEYS];
int skey_attoff;
IndexScanDesc scan;
- SnapshotData snap;
- TransactionId xwait;
Relation idxrel;
bool found;
TypeCacheEntry **eq = NULL;
@@ -198,17 +196,17 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
isIdxSafeToSkipDuplicates = (GetRelationIdentityOrPK(rel) == idxoid);
- InitDirtySnapshot(snap);
-
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0);
+ /* Start an index scan. SnapshotAny will be replaced below. */
+ scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);
retry:
found = false;
-
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Point the scan at the fresh snapshot on each retry */
+ scan->xs_snapshot = GetActiveSnapshot();
index_rescan(scan, skey, skey_attoff, NULL, 0);
/* Try to find the tuple */
@@ -229,19 +227,6 @@ retry:
ExecMaterializeSlot(outslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
found = true;
break;
@@ -253,8 +238,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -263,13 +246,15 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
-
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
index_endscan(scan);
+ PopActiveSnapshot();
/* Don't release lock until commit. */
index_close(idxrel, NoLock);
@@ -370,9 +355,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
{
TupleTableSlot *scanslot;
TableScanDesc scan;
- SnapshotData snap;
TypeCacheEntry **eq;
- TransactionId xwait;
bool found;
TupleDesc desc PG_USED_FOR_ASSERTS_ONLY = RelationGetDescr(rel);
@@ -380,13 +363,15 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode,
eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
- /* Start a heap scan. */
- InitDirtySnapshot(snap);
- scan = table_beginscan(rel, &snap, 0, NULL);
+ /* Start a heap scan. SnapshotAny will be replaced below. */
+ scan = table_beginscan(rel, SnapshotAny, 0, NULL);
scanslot = table_slot_create(rel, NULL);
retry:
found = false;
+ PushActiveSnapshot(GetLatestSnapshot());
+ /* Point the scan at the fresh snapshot on each retry */
+ scan->rs_snapshot = GetActiveSnapshot();
table_rescan(scan, NULL);
@@ -399,19 +384,6 @@ retry:
found = true;
ExecCopySlot(outslot, scanslot);
- xwait = TransactionIdIsValid(snap.xmin) ?
- snap.xmin : snap.xmax;
-
- /*
- * If the tuple is locked, wait for locking transaction to finish and
- * retry.
- */
- if (TransactionIdIsValid(xwait))
- {
- XactLockTableWait(xwait, NULL, NULL, XLTW_None);
- goto retry;
- }
-
/* Found our tuple and it's not locked */
break;
}
@@ -422,8 +394,6 @@ retry:
TM_FailureData tmfd;
TM_Result res;
- PushActiveSnapshot(GetLatestSnapshot());
-
res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
outslot,
GetCurrentCommandId(false),
@@ -432,13 +402,16 @@ retry:
0 /* don't follow updates */ ,
&tmfd);
- PopActiveSnapshot();
if (should_refetch_tuple(res, &tmfd))
+ {
+ PopActiveSnapshot();
goto retry;
+ }
}
table_endscan(scan);
+ PopActiveSnapshot();
ExecDropSingleTupleTableSlot(scanslot);
return found;
--
2.48.1
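For reviewers skimming the diff above: the control flow after the patch boils down to the following sketch (the tuples_equal() filtering, the seq-scan variant, and error handling are omitted; names are as in execReplication.c):

/* Simplified sketch of RelationFindReplTupleByIndex() after the patch */
scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0);

retry:
	found = false;

	/*
	 * Take a fresh MVCC snapshot for every attempt instead of a dirty
	 * snapshot, so a tuple version committed by a concurrent transaction
	 * is seen on the next retry.
	 */
	PushActiveSnapshot(GetLatestSnapshot());
	scan->xs_snapshot = GetActiveSnapshot();
	index_rescan(scan, skey, skey_attoff, NULL, 0);

	while (index_getnext_slot(scan, ForwardScanDirection, outslot))
	{
		/* tuples_equal() filtering elided */
		ExecMaterializeSlot(outslot);
		found = true;
		break;
	}

	if (found)
	{
		res = table_tuple_lock(rel, &(outslot->tts_tid), GetActiveSnapshot(),
							   outslot, GetCurrentCommandId(false),
							   lockmode, LockWaitBlock,
							   0 /* don't follow updates */ , &tmfd);

		if (should_refetch_tuple(res, &tmfd))
		{
			PopActiveSnapshot();	/* drop the stale snapshot before retry */
			goto retry;
		}
	}

	index_endscan(scan);
	PopActiveSnapshot();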
Hello, Amit!
On Tue, Aug 26, 2025 at 1:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think it is better to document this race somewhere in a logical
> replication document for now unless we have a consensus on a way to
> move forward.
I have added a draft of the documentation patch in the related thread [0].
[0]: /messages/by-id/CADzfLwU_o+eL2z9ifkpW2+kSJEXEbXkTmomvoEw-UY9CzgOn1A@mail.gmail.com
Mikhail.
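P.S. The tests rely on an injection point named index_getnext_slot_before_fetch_apply_dirty, which is added elsewhere in the patch set (the relevant hunk is not quoted in this message). Conceptually it amounts to something like the sketch below; the exact placement inside index_getnext_slot() is an assumption on my part, and note that recent master passes an extra argument to the macro while older branches use the one-argument form:

#include "utils/injection_point.h"

	/*
	 * Hypothetical placement: in index_getnext_slot(), just before the
	 * heap-tuple fetch, reached by the apply worker's lookup.
	 */
	INJECTION_POINT("index_getnext_slot_before_fetch_apply_dirty", NULL);

With a build configured with injection points enabled, the new tests can be run the usual way:

make -C src/test/subscription check PROVE_TESTS='t/039_* t/040_*'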