Race conditions in logical decoding

Started by Antonin Houskaabout 2 months ago12 messages
Jump to latest
#1Antonin Houska
ah@cybertec.at

A stress test [1]/messages/by-id/CADzfLwU78as45To9a=-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng@mail.gmail.com for the REPACK patch [1]/messages/by-id/CADzfLwU78as45To9a=-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng@mail.gmail.com revealed data
corruption. Eventually I found out that the problem is in postgres core. In
particular, it can happen that a COMMIT record is decoded, but before the
commit could be recorded in CLOG, a snapshot that takes the commit into
account is created and even used. Visibility checks then work incorrectly
until the CLOG gets updated.

In logical replication, the consequences are not only wrong data on the
subscriber, but also corrutped table on publisher - this is due to incorrectly
set commit hint bits.

Attached is a spec file that demonstrates the issue. I did not add it to
Makefile because I don't expect the current version to be merged (see the
commit message for details.

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

[1]: /messages/by-id/CADzfLwU78as45To9a=-Qkr5jEg3tMxc5rUtdKy2MTv4r_SDGng@mail.gmail.com
[2]: https://commitfest.postgresql.org/patch/5117/

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

0001-Demonstrate-possible-race-conditions-in-logical-decoding.patchtext/x-diffDownload+287-1
#2Antonin Houska
ah@cybertec.at
In reply to: Antonin Houska (#1)
Re: Race conditions in logical decoding

Antonin Houska <ah@cybertec.at> wrote:

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

Unfortunately I have no idea right now how to test it using the isolation
tester. With the fix, the additional waiting makes the current test
block. (And if a step is added that unblock the session, it will not reliably
catch failure to wait.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachments:

0001-Fix-race-conditions-during-the-setup-of-logical-deco.patchtext/x-diffDownload+42-1
#3Andres Freund
andres@anarazel.de
In reply to: Antonin Houska (#2)
Re: Race conditions in logical decoding

Hi,

On 2026-01-20 09:30:33 +0100, Antonin Houska wrote:

Antonin Houska <ah@cybertec.at> wrote:

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

Greetings,

Andres

#4Mihail Nikalayeu
mihailnikalayeu@gmail.com
In reply to: Andres Freund (#3)
Re: Race conditions in logical decoding

Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:

I don't think that's enough - during non-timetravel visibility semantics,

you

can only look at the clog if the transaction isn't marked as in-progress

in

the procarray. ISTM that we need to do that here too?

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) ||

!TransactionIdDidCommit(builder->committed.xip[i])))
?

If so, yes, it feels correct to me.

Mikhail.

#5Antonin Houska
ah@cybertec.at
In reply to: Mihail Nikalayeu (#4)
Re: Race conditions in logical decoding

Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Antonin Houska (#5)
Re: Race conditions in logical decoding

On 2026-Jan-22, Antonin Houska wrote:

Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

Hello, Andres.

On Tue, Jan 20, 2026 at 6:50 PM Andres Freund <andres@anarazel.de> wrote:

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

The layering here is wild, but if I understand it correctly, these XIDs
are all added to an array by SnapBuildAddCommittedTxn(), which in turn
is only called by SnapBuildCommitTxn(), which is only called by
DecodeCommit(), which is only called by xact_decode() when it sees a
XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED record by reading WAL.

Crucially, RecordTransactionCommit() writes the WAL first, then marks
everything as committed in CLOG, and finally does the waiting for
the synchronous replica to ACK the commit if necessary. However, the
transaction is only removed from procarray after RecordTransactionCommit
has returned.

This means that DecodeCommit() could add a transaction to the
SnapBuilder (that needs to be waited for) while that transaction is
still shown as running in ProcArray. This sounds problematic in itself,
so I'm wondering whether we should do anything (namely, wait) on
DecodeCommit() instead of hacking SnapBuildBuildSnapshot() to patch it
up by waiting after the fact.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#7Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#6)
Re: Race conditions in logical decoding

On 2026-Jan-22, Álvaro Herrera wrote:

On 2026-Jan-22, Antonin Houska wrote:

Do you mean replace

if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))

to

if (unlikely(TransactionIdIsInProgress(builder->committed.xip[i]) || !TransactionIdDidCommit(builder->committed.xip[i])))

This way the synchronous replication gets stuck, as it did when I tried to use
XactLockTableWait(): subscriber cannot confirm replication of certain LSN
because publisher is not able to even finalize the commit (due to the waiting
for the subscriber's confirmation), and therefore publisher it's not able to
decode the data and send it to the subscriber.

BTW, the reason XactLockTableWait and TransactionIdIsInProgress() cause
a deadlock in the same way, is that they are using essentially the same
mechanism. The former uses the Lock object on the transaction, which is
released (by the ResourceOwnerRelease(RESOURCE_RELEASE_LOCKS) call in
CommitTransaction) after RecordTransactionCommit() has returned -- that
is, after the wait on a synchronous replica has happened.

XactLockTableWait does an _additional_ test for
TransactionIdIsInProgress, but that should be pretty much innocuous at
that point.

One thing that I just realized I don't know, is what exactly are the two
pieces that are deadlocking. I mean, one is this side that's decoding
commit. But how/why is that other side, the one trying to mark the
transaction as committed, waiting on the commit decoding? Maybe there's
something that we need to do to _that_ side to prevent the blockage, so
that we can use TransactionIdIsInProgress() here.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Antonin Houska (#2)
Re: Race conditions in logical decoding

On 2026-Jan-20, Antonin Houska wrote:

Antonin Houska <ah@cybertec.at> wrote:

I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
but it made at least one regression test (subscription/t/010_truncate.pl)
stuck - probably a deadlock. I can spend more time on it, but maybe someone
can come up with a good idea sooner than me.

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

Unfortunately I have no idea right now how to test it using the isolation
tester. With the fix, the additional waiting makes the current test
block. (And if a step is added that unblock the session, it will not reliably
catch failure to wait.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

@@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
snapshot->xmin = builder->xmin;
snapshot->xmax = builder->xmax;

+	/*
+	 * Although it's very unlikely, it's possible that a commit WAL record was
+	 * decoded but CLOG is not aware of the commit yet. Should the CLOG update
+	 * be delayed even more, visibility checks that use this snapshot could
+	 * work incorrectly. Therefore we check the CLOG status here.
+	 */
+	while (true)
+	{
+		bool	found = false;
+
+		for (int i = 0; i < builder->committed.xcnt; i++)
+		{
+			/*
+			 * XXX Is it worth remembering the XIDs that appear to be
+			 * committed per CLOG and skipping them in the next iteration of
+			 * the outer loop? Not sure it's worth the effort - a single
+			 * iteration is enough in most cases.
+			 */
+			if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
+			{
+				found = true;
+
+				/*
+				 * Wait a bit before going to the next iteration of the outer
+				 * loop. The race conditions we address here is pretty rare,
+				 * so we shouldn't need to wait too long.
+				 */
+				(void) WaitLatch(MyLatch,
+								 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+								 10L,
+								 WAIT_EVENT_SNAPBUILD_CLOG);
+				ResetLatch(MyLatch);
+
+				break;
+			}
+		}
+
+		if (!found)
+			break;
+	}

I think this algorithm is strange -- if you do have to wait more than
once for one transaction, it would lead to doing the
TransactionIdDidCommit again times for _all_ transactions by starting
the inner loop from scratch, which sounds really wasteful. Why not nest
the for() loops the other way around? Something like this perhaps,

for (int i = 0; i < builder->committed.xcnt; i++)
{
for (;;)
{
if (TransactionIdDidCommit(builder->committed.xip[i]))
break;
else
{
(void) WaitLatch(MyLatch,
WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
10L,
WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
}
CHECK_FOR_INTERRUPTS();
}
}

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

I also wondered if it would make sense to get rid of the memcpy, given
that we're doing so much work per xid anyway it won't make any visible
difference (I believe), and do the copy per XID there, like

if (TransactionIdDidCommit(builder->committed.xip[i]))
{
snapshot->xip[i] = builder->committed.xip[i];
break;
}
else
...

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)

#9Antonin Houska
ah@cybertec.at
In reply to: Alvaro Herrera (#8)
Re: Race conditions in logical decoding

Álvaro Herrera <alvherre@kurilemu.de> wrote:

I think this algorithm is strange -- if you do have to wait more than
once for one transaction, it would lead to doing the
TransactionIdDidCommit again times for _all_ transactions by starting
the inner loop from scratch, which sounds really wasteful. Why not nest
the for() loops the other way around?

I'm quite sure I wanted to iterate through committed.xnt in the outer loop,
but probably got distracted by something else and messed things up.

Something like this perhaps,

for (int i = 0; i < builder->committed.xcnt; i++)
{
for (;;)
{
if (TransactionIdDidCommit(builder->committed.xip[i]))
break;
else
{
(void) WaitLatch(MyLatch,
WL_LATCH_SET, WL_TIMEOUT, WL_EXIT_ON_PM_DEATH,
10L,
WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
}
CHECK_FOR_INTERRUPTS();
}
}

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

Sure, that's much beter. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#10Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#3)
Re: Race conditions in logical decoding

Andres Freund <andres@anarazel.de> wrote:

Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

I don't think that's enough - during non-timetravel visibility semantics, you
can only look at the clog if the transaction isn't marked as in-progress in
the procarray. ISTM that we need to do that here too?

I understand that CLOG must be up-to-date by the time the snapshot is used for
visibility checks, but I think that - from the snapshot user POV - what
matters is "snapshot->xip vs CLOG" rather than "procarray vs CLOG".

For procarray-based snapshots, this consistency is ensured by 1) not removing
the XID from procarray until the status is set in CLOG and 2) getting the list
of running transactions from procarray. Thus if an MVCC snapshot does not have
particular XID in its "xip" array, it implies that it's no longer in procarray
and therefore it's been marked in CLOG.

As for logical decoding based snapshots (whether HISTORIC_MVCC or those
converted eventually to regular MVCC), we currently do not check if CLOG is
consistent with the transaction list in snapshot->xip. What I proposed is that
we enforce this consistency by checking CLOG (and possibly waiting) before we
finalize the snapshot. Thus the snapshot user can safely assume that the
snapshot->xip array is consistent with CLOG, as if the snapshot was based on
procarray.

Or is there another issue with the CLOG itself? I thought about wraparound
(i.e. getting the XID status from a CLOG slot which is still being used by old
transactions) but I wouldn't expect that (AFAICS, CLOG truncation takes place
during XID freezing). Concurrent access to the slot should neither be a
problem since only a single byte (which is atomic) needs to be fetched during
the XID status check.

Another hypothetical problem that occurs to me is memory access ordering,
i.e. one backend creates and exports the snapshot and another one imports it
before it can see the CLOG update. It's hard to imagine though.

Or are there other concerns?

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Antonin Houska (#9)
Re: Race conditions in logical decoding

On 2026-Jan-23, Antonin Houska wrote:

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

Sure, that's much beter. Thanks.

Actually, I wonder if it would make sense to sleep just once after
testing all the transactions for whether they are marked committed (not
once per transaction); and after sleeping, we only test again those that
were not marked committed in the previous iteration. I think you would
end up doing less tests overall. Something like this

/*
* Although it's very unlikely, it's possible that a commit WAL record was
* decoded but CLOG is not aware of the commit yet. Should the CLOG update
* be delayed even more, visibility checks that use this snapshot could
* work incorrectly. Therefore we check the CLOG status here.
*/
{
TransactionId *stillrunning;
int nstillrunning = builder->committed.xcnt;

stillrunning = palloc(sizeof(TransactionId) * builder->committed.xcnt);
nstillrunning = builder->committed.xcnt;
memcpy(stillrunning, builder->committed.xip, sizeof(TransactionId) * nstillrunning);

for (;;)
{
int next = 0;

if (nstillrunning == 0)
break;
for (int i = 0; i < nstillrunning; i++)
{
if (!TransactionIdDidCommit(stillrunning[i]))
stillrunning[next++] = stillrunning[i];
}
if (next == 0)
break;
nstillrunning = next;
(void) WaitLatch(MyLatch,
WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L,
WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
pfree(stillrunning);
}

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"

#12Antonin Houska
ah@cybertec.at
In reply to: Alvaro Herrera (#11)
Re: Race conditions in logical decoding

Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2026-Jan-23, Antonin Houska wrote:

This way you wait repeatedly for one transaction until it is marked
committed; and once it does, you don't test it again.

Sure, that's much beter. Thanks.

Actually, I wonder if it would make sense to sleep just once after
testing all the transactions for whether they are marked committed (not
once per transaction); and after sleeping, we only test again those that
were not marked committed in the previous iteration. I think you would
end up doing less tests overall. Something like this

I suppose that TransactionIdDidCommit() returns false pretty rarely, so one
iteration is sufficient in almost all cases. Besides that, I preferred simpler
code because it's easier to test (It's not trivial to have debugger reach the
conditions.) However it's just my preference - no real objections to your
approach.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com