Race conditions in logical decoding

Started by Antonin Houska5 days ago10 messages
#1Antonin Houska
ah@cybertec.at
1 attachment(s)

A stress test [1]/messages/by-id/CADzfLwU78as45To9a=-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng@mail.gmail.com for the REPACK patch [1]/messages/by-id/CADzfLwU78as45To9a=-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng@mail.gmail.com revealed data
corruption. Eventually I found out that the problem is in postgres core. In
particular, it can happen that a COMMIT record is decoded, but before the
commit could be recorded in CLOG, a snapshot that takes the commit into
account is created and even used. Visibility checks then work incorrectly
until the CLOG gets updated.

In logical replication, the consequences are not only wrong data on the
subscriber, but also corrutped table on publisher - this is due to incorrectly
set commit hint bits.

Attached is a spec file that demonstrates the issue. I did not add it to
Makefile because I don't expect the current version to be merged (see the
commit message for details.

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

[1]: /messages/by-id/CADzfLwU78as45To9a=-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng@mail.gmail.com
[2]: https://commitfest.postgresql.org/patch/5117/

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

0001-Demonstrate-possible-race-conditions-in-logical-decoding.patchtext/x-diffDownload
From ca765ee0a49ac1005c5eb7ebd1d29bea97a63552 Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 19 Jan 2026 12:54:55 +0100
Subject: [PATCH 1/2] Demonstrate possible race conditions in logical decoding.

The problem is that the snapshot builder can create a snapshot before CLOG has
been updated. That breaks visibility check that use such snapshot. For more
details, see startup_race.spec.

Success of the test means that the problem is present. Thus it would need to
be modified if it should be merged into the tree.

Another problem that currently prevents this test from being merged is that it
hard-wires the logical decoding setup into the SET TRANSACTION command. I
tried to modify the isolation tester so it can use the logical replication
protocol (in which case the test could use the "CREATE_REPLICATION_SNAPSHOT
... (SNAPSHOT 'use')" command), but the tester does things that are not
compatible with that protocol (e.g. it sets the application_name parameter).
---
 .../test_decoding/expected/startup_race.out   |  85 ++++++++++++
 contrib/test_decoding/specs/startup_race.spec | 126 ++++++++++++++++++
 src/backend/access/transam/xact.c             |   6 +
 src/backend/replication/logical/snapbuild.c   |   3 +
 src/backend/utils/time/snapmgr.c              |  66 +++++++++
 src/include/utils/snapmgr.h                   |   1 +
 6 files changed, 287 insertions(+)
 create mode 100644 contrib/test_decoding/expected/startup_race.out
 create mode 100644 contrib/test_decoding/specs/startup_race.spec

diff --git a/contrib/test_decoding/expected/startup_race.out b/contrib/test_decoding/expected/startup_race.out
new file mode 100644
index 00000000000..597fa617831
--- /dev/null
+++ b/contrib/test_decoding/expected/startup_race.out
@@ -0,0 +1,85 @@
+Parsed test spec with 6 sessions
+
+starting permutation: s1_assign_xid s2_set_snapshot s3_assign_xid s1_rollback s3_rollback s4_do_changes s5_wake_up_full_snapshot s2_scan s2_rollback s5_wake_up_before_clog s6_check
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s1_assign_xid: 
+	BEGIN;
+	CREATE TABLE b(i int);
+
+step s2_set_snapshot: 
+	BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
+	SET TRANSACTION SNAPSHOT 'from_slot';
+ <waiting ...>
+step s3_assign_xid: 
+	BEGIN;
+	CREATE TABLE c(i int);
+
+step s1_rollback: 
+	ROLLBACK;
+
+step s3_rollback: 
+	ROLLBACK;
+
+step s4_do_changes: 
+	INSERT INTO a(i, j) VALUES (3, 3);
+	UPDATE a SET j = j + 1 WHERE i = 1;
+	DELETE FROM a WHERE i = 2;
+ <waiting ...>
+step s5_wake_up_full_snapshot: 
+	SELECT injection_points_wakeup('snapbuild-full-snapshot');
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_set_snapshot: <... completed>
+step s2_scan: 
+	TABLE a;
+
+i|j
+-+-
+1|1
+2|2
+(2 rows)
+
+step s2_rollback: 
+	ROLLBACK;
+
+step s5_wake_up_before_clog: 
+	SELECT injection_points_wakeup('before-clog-update');
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_do_changes: <... completed>
+step s6_check: 
+	SELECT * FROM a ORDER BY i;
+
+i|j
+-+-
+1|1
+2|2
+(2 rows)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
diff --git a/contrib/test_decoding/specs/startup_race.spec b/contrib/test_decoding/specs/startup_race.spec
new file mode 100644
index 00000000000..8f67e07fa7a
--- /dev/null
+++ b/contrib/test_decoding/specs/startup_race.spec
@@ -0,0 +1,126 @@
+setup
+{
+	CREATE TABLE a(i int primary key, j int) WITH (autovacuum_enabled = off);
+	INSERT INTO a(i, j) VALUES (1, 1), (2, 2);
+	CREATE EXTENSION injection_points;
+}
+
+session s1
+step s1_assign_xid
+{
+	BEGIN;
+	CREATE TABLE b(i int);
+}
+step s1_rollback
+{
+	ROLLBACK;
+}
+
+session s2
+setup
+{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('snapbuild-full-snapshot', 'wait');
+}
+# Use special, hard-wired snapshot name to set the initial snapshot from
+# logical replication slot.
+step s2_set_snapshot
+{
+	BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
+	SET TRANSACTION SNAPSHOT 'from_slot';
+}
+# Perform the scan.
+step s2_scan
+{
+	TABLE a;
+}
+step s2_rollback
+{
+	ROLLBACK;
+}
+teardown
+{
+	SELECT injection_points_detach('snapbuild-full-snapshot');
+}
+
+session s3
+step s3_assign_xid
+{
+	BEGIN;
+	CREATE TABLE c(i int);
+}
+step s3_rollback
+{
+	ROLLBACK;
+}
+
+session s4
+setup
+{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('before-clog-update', 'wait');
+}
+step s4_do_changes
+{
+	INSERT INTO a(i, j) VALUES (3, 3);
+	UPDATE a SET j = j + 1 WHERE i = 1;
+	DELETE FROM a WHERE i = 2;
+}
+teardown
+{
+	SELECT injection_points_detach('before-clog-update');
+}
+
+session s5
+step s5_wake_up_full_snapshot
+{
+	SELECT injection_points_wakeup('snapbuild-full-snapshot');
+}
+step s5_wake_up_before_clog
+{
+	SELECT injection_points_wakeup('before-clog-update');
+}
+
+session s6
+step s6_check
+{
+	SELECT * FROM a ORDER BY i;
+}
+
+permutation
+# Let the snapshot builder go through all the states. The problematic case
+# happens in the FULL_SNAPSHOT.
+s1_assign_xid
+# This should leave the builder in BUILDING_SNAPSHOT, waiting for the active
+# transaction to end.
+s2_set_snapshot
+# Make sure that s1_rollback does not allow going to CONSISTENT directly.
+s3_assign_xid
+# Let the builder proceed to FULL_SNAPSHOT. It should stop at the injection
+# point 'snapbuild-full-snapshot'.
+s1_rollback
+# The transaction of s3 is not needed anymore, CONSISTENT should be the next
+# stage.
+s3_rollback
+# Perform data changes before the snapshot builder triggers creation of the
+# RUNNING_XACTS record. This will stop before setting transaction status in
+# CLOG.
+s4_do_changes
+# Unblock the injection point so that the snapshot can finally be created.
+s5_wake_up_full_snapshot
+# Use the snapshot for a scan. The snapshot will consider s4 not running
+# anymore, however CLOG is not aware of the commit yet. Thus
+# HeapTupleSatisfiesMVCC will consider the transaction aborted. In particular,
+# for UPDATE, if both xmax of the old version and xmin of the new version are
+# considered aborted, so the effects of the UPDATE are lost
+# altogether. Similarly, INSERT and DELETE have no effect because the xmin
+# transaction of the inserted tuple and xmax of the deleted tuple are
+# considered aborted
+s2_scan
+s2_rollback
+# CLOG can be updated now.
+s5_wake_up_before_clog
+# Scan the table again using a new transaction, with a normal transaction
+# snapshot. The results are still wrong due to hint bits set incorrectly.
+s6_check
+
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c857e23552f..2fea45b2fed 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -65,6 +65,7 @@
 #include "utils/builtins.h"
 #include "utils/combocid.h"
 #include "utils/guc.h"
+#include "utils/injection_point.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/relmapper.h"
@@ -1348,6 +1349,9 @@ RecordTransactionCommit(void)
 													 &RelcacheInitFileInval);
 	wrote_xlog = (XactLastRecEnd != 0);
 
+	/* Load the injection point before entering the critical section */
+	INJECTION_POINT_LOAD("before-clog-update");
+
 	/*
 	 * If we haven't been assigned an XID yet, we neither can, nor do we want
 	 * to write a COMMIT record.
@@ -1514,6 +1518,8 @@ RecordTransactionCommit(void)
 	{
 		XLogFlush(XactLastRecEnd);
 
+		INJECTION_POINT_CACHED("before-clog-update", NULL);
+
 		/*
 		 * Now we may update the CLOG, if we wrote a COMMIT record above
 		 */
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7f79621b57e..9b09dc8eac1 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -141,6 +141,7 @@
 #include "storage/procarray.h"
 #include "storage/standby.h"
 #include "utils/builtins.h"
+#include "utils/injection_point.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 #include "utils/snapshot.h"
@@ -1387,6 +1388,8 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				errdetail("Waiting for transactions (approximately %d) older than %u to end.",
 						  running->xcnt, running->nextXid));
 
+		INJECTION_POINT("snapbuild-full-snapshot", NULL);
+
 		SnapBuildWaitSnapshot(running, running->nextXid);
 	}
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2e6197f5f35..f327b779004 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -110,10 +110,13 @@
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "access/xlogutils.h"
 #include "datatype/timestamp.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
 #include "port/pg_lfind.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
 #include "storage/fd.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -1421,6 +1424,17 @@ ImportSnapshot(const char *idstr)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("a snapshot-importing transaction must have isolation level SERIALIZABLE or REPEATABLE READ")));
 
+	if (strcmp(idstr, "from_slot") == 0)
+	{
+		Snapshot	snap;
+
+		snap = create_test_snapshot();
+		/* XXX sourcevxid shouldn't be needed in this special case */
+		SetTransactionSnapshot(snap, NULL, MyProcPid, MyProc);
+
+		return;
+	}
+
 	/*
 	 * Verify the identifier: only 0-9, A-F and hyphens are allowed.  We do
 	 * this mainly to prevent reading arbitrary files.
@@ -1969,3 +1983,55 @@ ResOwnerReleaseSnapshot(Datum res)
 {
 	UnregisterSnapshotNoOwner((Snapshot) DatumGetPointer(res));
 }
+
+/*
+ * CreateReplicationSlot() with the CRS_USE_SNAPSHOT option would be useful
+ * for testing, but regression tests cannot speak both replication and query
+ * protocol at the same time. This function can be used instead to create a
+ * snapshot for special tests of logical replication.
+ */
+Snapshot
+create_test_snapshot(void)
+{
+	const	char	*slotname = "test_slot";
+	const	char *plugin;
+	LogicalDecodingContext *ctx;
+	Snapshot	snap;
+
+	/*
+	 * XXX Hard-wired values are fine for the special test that needs this
+	 * function.
+	 */
+	plugin = "test_decoding";
+
+	Assert(!MyReplicationSlot);
+
+	CheckLogicalDecodingRequirements();
+
+	ReplicationSlotCreate(slotname, true, RS_TEMPORARY,
+						  false, false, false);
+
+	/*
+	 * Ensure the logical decoding is enabled before initializing the
+	 * logical decoding context.
+	 */
+	EnsureLogicalDecodingEnabled();
+	Assert(IsLogicalDecodingEnabled());
+
+	ctx = CreateInitDecodingContext(plugin, NIL, true,
+									InvalidXLogRecPtr,
+									XL_ROUTINE(.page_read = read_local_xlog_page,
+											   .segment_open = wal_segment_open,
+											   .segment_close = wal_segment_close),
+									NULL, NULL, NULL);
+
+	/* build initial snapshot, might take a while */
+	DecodingContextFindStartpoint(ctx);
+
+	/* Do what the function is called for. */
+	snap = SnapBuildInitialSnapshot(ctx->snapshot_builder);
+	snap = CopySnapshot(snap);
+	FreeDecodingContext(ctx);
+
+	return snap;
+}
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b8c01a291a1..9f0d60ebc1b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -123,4 +123,5 @@ extern Snapshot RestoreSnapshot(char *start_address);
 struct PGPROC;
 extern void RestoreTransactionSnapshot(Snapshot snapshot, struct PGPROC *source_pgproc);
 
+extern Snapshot create_test_snapshot(void);
 #endif							/* SNAPMGR_H */
-- 
2.47.3

#2Antonin Houska
ah@cybertec.at
In reply to: Antonin Houska (#1)
1 attachment(s)
Re: Race conditions in logical decoding

Antonin Houska <ah@cybertec.at> wrote:

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

Unfortunately I have no idea right now how to test it using the isolation
tester. With the fix, the additional waiting makes the current test
block. (And if a step is added that unblock the session, it will not reliably
catch failure to wait.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

0001-Fix-race-conditions-during-the-setup-of-logical-deco.patchtext/x-diffDownload
From 5a6002215fb8ebeaf1dde120e5f6706bca7b62ae Mon Sep 17 00:00:00 2001
From: Antonin Houska <ah@cybertec.at>
Date: Mon, 19 Jan 2026 16:07:45 +0100
Subject: [PATCH] Fix race conditions during the setup of logical decoding.

Although it's rather unlikely, it can happen that the snapshot builder
considers transaction committed (according to WAL) before the commit could be
recorded in CLOG. In an extreme case, snapshot can even be created and used in
between. Since both snapshot and CLOG are needed for visibility checks, this
inconsistency can make them work incorrectly.

The typical symptom is that a transaction that the snapshot considers not
running anymore is (per CLOG) considered aborted instead of committed. Thus a
new tuple version can be evaluated as invisible (if xmin is incorrectly
considered aborted) or a deleted tuple version can be evaluated as visible (if
xmax is incorrectly considered aborted).

This patch fixes the problem by checking if all the XIDs that the new snapshot
considers committed are really committed per CLOG. If at least one is not, the
check is repeated after a short delay. However, a single check is sufficient
in almost all cases, so the performance impact should be minimal.
---
 src/backend/replication/logical/snapbuild.c   | 41 +++++++++++++++++++
 .../utils/activity/wait_event_names.txt       |  1 +
 2 files changed, 42 insertions(+)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9b09dc8eac1..c02d08f3417 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->xmin = builder->xmin;
 	snapshot->xmax = builder->xmax;
 
+	/*
+	 * Although it's very unlikely, it's possible that a commit WAL record was
+	 * decoded but CLOG is not aware of the commit yet. Should the CLOG update
+	 * be delayed even more, visibility checks that use this snapshot could
+	 * work incorrectly. Therefore we check the CLOG status here.
+	 */
+	while (true)
+	{
+		bool	found = false;
+
+		for (int i = 0; i < builder->committed.xcnt; i++)
+		{
+			/*
+			 * XXX Is it worth remembering the XIDs that appear to be
+			 * committed per CLOG and skipping them in the next iteration of
+			 * the outer loop? Not sure it's worth the effort - a single
+			 * iteration is enough in most cases.
+			 */
+			if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
+			{
+				found = true;
+
+				/*
+				 * Wait a bit before going to the next iteration of the outer
+				 * loop. The race conditions we address here is pretty rare,
+				 * so we shouldn't need to wait too long.
+				 */
+				(void) WaitLatch(MyLatch,
+								 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+								 10L,
+								 WAIT_EVENT_SNAPBUILD_CLOG);
+				ResetLatch(MyLatch);
+
+				break;
+			}
+		}
+
+		if (!found)
+			break;
+	}
+
 	/* store all transactions to be treated as committed by this snapshot */
 	snapshot->xip =
 		(TransactionId *) ((char *) snapshot + sizeof(SnapshotData));
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 5537a2d2530..b6318b0cf37 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -181,6 +181,7 @@ PG_SLEEP	"Waiting due to a call to <function>pg_sleep</function> or a sibling fu
 RECOVERY_APPLY_DELAY	"Waiting to apply WAL during recovery because of a delay setting."
 RECOVERY_RETRIEVE_RETRY_INTERVAL	"Waiting during recovery when WAL data is not available from any source (<filename>pg_wal</filename>, archive or stream)."
 REGISTER_SYNC_REQUEST	"Waiting while sending synchronization requests to the checkpointer, because the request queue is full."
+SNAPBUILD_CLOG	"Waiting for CLOG update before building snapshot."
 SPIN_DELAY	"Waiting while acquiring a contended spinlock."
 VACUUM_DELAY	"Waiting in a cost-based vacuum delay point."
 VACUUM_TRUNCATE	"Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
-- 
2.47.3

#3Andres Freund
andres@anarazel.de
In reply to: Antonin Houska (#2)
Re: Race conditions in logical decoding

Hi,

On 2026-01-20 09:30:33 +0100, Antonin Houska wrote:

Antonin Houska <ah@cybertec.at> wrote:

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

Greetings,

Andres

#4Mihail Nikalayeu
mihailnikalayeu@gmail.com
In reply to: Andres Freund (#3)
Re: Race conditions in logical decoding

Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:

I don't think that's enough - during non-timetravel visibility semantics,

you

can only look at the clog if the transaction isn't marked as in-progress

in

the procarray. ISTM that we need to do that here too?

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||

!TransactionIdDidCommit(builder->committed.xip[i])))
?

If so, yes, it feels correct to me.

Mikhail.

#5Antonin Houska
ah@cybertec.at
In reply to: Mihail Nikalayeu (#4)
Re: Race conditions in logical decoding

Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Antonin Houska (#5)
Re: Race conditions in logical decoding

On 2026-Jan-22, Antonin Houska wrote:

Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

The layering here is wild, but if I understand it correctly, these XIDs
are all added to an array by SnapBuildAddCommittedTxn(), which in turn
is only called by SnapBuildCommitTxn(), which is only called by
DecodeCommit(), which is only called by xact_decode() when it sees a
XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED record by reading WAL.

Crucially, RecordTransactionCommit() writes the WAL first, then marks
everything as committed in CLOG, and finally does the waiting for
the synchronous replica to ACK the commit if necessary. However, the
transaction is only removed from procarray after RecordTransactionCommit
has returned.

This means that DecodeCommit() could add a transaction to the
SnapBuilder (that needs to be waited for) while that transaction is
still shown as running in ProcArray. This sounds problematic in itself,
so I'm wondering whether we should do anything (namely, wait) on
DecodeCommit() instead of hacking SnapBuildBuildSnapshot() to patch it
up by waiting after the fact.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#6)
Re: Race conditions in logical decoding

On 2026-Jan-22, Álvaro Herrera wrote:

On 2026-Jan-22, Antonin Houska wrote:

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

BTW, the reason XactLockTableWait and TransactionIdIsInProgress() cause
a deadlock in the same way, is that they are using essentially the same
mechanism. The former uses the Lock object on the transaction, which is
released (by the ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) call in
CommitTransaction) after RecordTransactionCommit() has returned -- that
is, after the wait on a synchronous replica has happened.

XactLockTableWait does an _additional_ test for
TransactionIdIsInProgress, but that should be pretty much innocuous at
that point.

One thing that I just realized I don't know, is what exactly are the two
pieces that are deadlocking. I mean, one is this side that's decoding
commit. But how/why is that other side, the one trying to mark the
transaction as committed, waiting on the commit decoding? Maybe there's
something that we need to do to _that_ side to prevent the blockage, so
that we can use TransactionIdIsInProgress() here.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Antonin Houska (#2)
Re: Race conditions in logical decoding

On 2026-Jan-20, Antonin Houska wrote:

Antonin Houska <ah@cybertec.at> wrote:

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

Unfortunately I have no idea right now how to test it using the isolation
tester. With the fix, the additional waiting makes the current test
block. (And if a step is added that unblock the session, it will not reliably
catch failure to wait.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

@@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->xmin = builder->xmin;
snapshot->xmax = builder->xmax;

+	/*
+	 * Although it's very unlikely, it's possible that a commit WAL record was
+	 * decoded but CLOG is not aware of the commit yet. Should the CLOG update
+	 * be delayed even more, visibility checks that use this snapshot could
+	 * work incorrectly. Therefore we check the CLOG status here.
+	 */
+	while (true)
+	{
+		bool	found = false;
+
+		for (int i = 0; i < builder->committed.xcnt; i++)
+		{
+			/*
+			 * XXX Is it worth remembering the XIDs that appear to be
+			 * committed per CLOG and skipping them in the next iteration of
+			 * the outer loop? Not sure it's worth the effort - a single
+			 * iteration is enough in most cases.
+			 */
+			if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
+			{
+				found = true;
+
+				/*
+				 * Wait a bit before going to the next iteration of the outer
+				 * loop. The race conditions we address here is pretty rare,
+				 * so we shouldn't need to wait too long.
+				 */
+				(void) WaitLatch(MyLatch,
+								 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+								 10L,
+								 WAIT_EVENT_SNAPBUILD_CLOG);
+				ResetLatch(MyLatch);
+
+				break;
+			}
+		}
+
+		if (!found)
+			break;
+	}

I think this algorithm is strange -- if you do have to wait more than
once for one transaction, it would lead to doing the
TransactionIdDidCommit again times for _all_ transactions by starting
the inner loop from scratch, which sounds really wasteful. Why not nest
the for() loops the other way around? Something like this perhaps,

for (int i = 0; i < builder->committed.xcnt; i++)
{
for (;;)
{
if (TransactionIdDidCommit(builder->committed.xip[i]))
break;
else
{
(void) WaitLatch(MyLatch,
WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
10L,
WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
}
CHECK_FOR_INTERRUPTS();
}
}

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

I also wondered if it would make sense to get rid of the memcpy, given
that we're doing so much work per xid anyway it won't make any visible
difference (I believe), and do the copy per XID there, like

if (TransactionIdDidCommit(builder->committed.xip[i]))
{
snapshot->xip[i] = builder->committed.xip[i];
break;
}
else
...

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)

#9Antonin Houska
ah@cybertec.at
In reply to: Alvaro Herrera (#8)
Re: Race conditions in logical decoding

Álvaro Herrera <alvherre@kurilemu.de> wrote:

I think this algorithm is strange -- if you do have to wait more than
once for one transaction, it would lead to doing the
TransactionIdDidCommit again times for _all_ transactions by starting
the inner loop from scratch, which sounds really wasteful. Why not nest
the for() loops the other way around?

I'm quite sure I wanted to iterate through committed.xnt in the outer loop,
but probably got distracted by something else and messed things up.

Something like this perhaps,

for (int i = 0; i < builder->committed.xcnt; i++)
{
for (;;)
{
if (TransactionIdDidCommit(builder->committed.xip[i]))
break;
else
{
(void) WaitLatch(MyLatch,
WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
10L,
WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
}
CHECK_FOR_INTERRUPTS();
}
}

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

Sure, that's much beter. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#10Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#3)
Re: Race conditions in logical decoding

Andres Freund <andres@anarazel.de> wrote:

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

I understand that CLOG must be up-to-date by the time the snapshot is used for
visibility checks, but I think that - from the snapshot user POV - what
matters is "snapshot->xip vs CLOG" rather than "procarray vs CLOG".

For procarray-based snapshots, this consistency is ensured by 1) not removing
the XID from procarray until the status is set in CLOG and 2) getting the list
of running transactions from procarray. Thus if an MVCC snapshot does not have
particular XID in its "xip" array, it implies that it's no longer in procarray
and therefore it's been marked in CLOG.

As for logical decoding based snapshots (whether HISTORIC_MVCC or those
converted eventually to regular MVCC), we currently do not check if CLOG is
consistent with the transaction list in snapshot->xip. What I proposed is that
we enforce this consistency by checking CLOG (and possibly waiting) before we
finalize the snapshot. Thus the snapshot user can safely assume that the
snapshot->xip array is consistent with CLOG, as if the snapshot was based on
procarray.

Or is there another issue with the CLOG itself? I thought about wraparound
(i.e. getting the XID status from a CLOG slot which is still being used by old
transactions) but I wouldn't expect that (AFAICS, CLOG truncation takes place
during XID freezing). Concurrent access to the slot should neither be a
problem since only a single byte (which is atomic) needs to be fetched during
the XID status check.

Another hypothetical problem that occurs to me is memory access ordering,
i.e. one backend creates and exports the snapshot and another one imports it
before it can see the CLOG update. It's hard to imagine though.

Or are there other concerns?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com