Assertion failure in SnapBuildInitialSnapshot()

/messages/by-id/CA+hUKG+WKsZpdoryeqM8_rk5uQPCqS2HGY92WiMGFsK2wVkcig@mail.gmail.com

andres@anarazel.de

about 3 years ago

In reply to: Amit Kapila (#1)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2022-11-10 16:04:40 +0530, Amit Kapila wrote:

I don't have any good ideas on how to proceed with this. Any thoughts
on this would be helpful?

One thing worth doing might be to convert the assertion path into an elog(),
mentioning both xids (or add a framework for things like AssertLT(), but that
seems hard). With the concrete values we could make a better guess at what's
going wrong.

It'd probably not hurt to just perform this check independent of
USE_ASSERT_CHECKING - compared to the cost of creating a slot it's neglegible.

That'll obviously only help us whenever we re-encounter the issue, which will
likely be a while...

Have you tried reproducing the issue by running the test in a loop?

One thing I noticed just now is that we don't assert
builder->building_full_snapshot==true. I think we should? That didn't use to
be an option, but now it is... It doesn't look to me like that's the issue,
but ...

Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
SnapBuildExportSnapshot()? Why aren't the error checks for
SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
transactions are we even in when we import the snapshot (cf.
SnapBuildExportSnapshot() doing a StartTransactionCommand()).

I'm also somewhat suspicious of calling RestoreTransactionSnapshot() with
source=MyProc. Looks like it works, but it'd be pretty easy to screw up, and
there's no comments in SetTransactionSnapshot() or
ProcArrayInstallRestoredXmin() warning that that might be the case.

Greetings,

Andres Freund

andres@anarazel.de

about 3 years ago

In reply to: Andres Freund (#2)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2022-11-14 17:25:31 -0800, Andres Freund wrote:

Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
SnapBuildExportSnapshot()? Why aren't the error checks for
SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
transactions are we even in when we import the snapshot (cf.
SnapBuildExportSnapshot() doing a StartTransactionCommand()).

Most of the checks for that are in CreateReplicationSlot() - but not al,
e.g. XactReadOnly isn't set, nor do we enforce in an obvious place that we
don't already hold a snapshot.

I first thought this might directly explain the problem, due to the
MyProc->xmin assignment in SnapBuildInitialSnapshot() overwriting a value that
could influence the return value for GetOldestSafeDecodingTransactionId(). But
that happens later, and we check that MyProc->xmin is invalid at the start.

But it still seems supicious. This will e.g. influence whether
StartupDecodingContext() sets PROC_IN_LOGICAL_DECODING. Which probably is
fine, but...

Another theory: I dimly remember Thomas mentioning that there's some different
behaviour of xlogreader during shutdown as part of the v15 changes. I don't
quite remember what the scenario leading up to that was. Thomas?

It's certainly interesting that we see stuff like:

2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"')
ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710

It could entirely be caused by postmaster slowly killing processes after the
assertion failure and that that is corrupting shared memory state though. But
it might also be related.

Greetings,

Andres Freund

Thomas Munro

thomas.munro@gmail.com

about 3 years ago

In reply to: Andres Freund (#3)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Nov 15, 2022 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-14 17:25:31 -0800, Andres Freund wrote:
Another theory: I dimly remember Thomas mentioning that there's some different
behaviour of xlogreader during shutdown as part of the v15 changes. I don't
quite remember what the scenario leading up to that was. Thomas?

Yeah. So as mentioned in:

I still have on my list to remove a new "missing contrecord" error
that can show up in a couple of different scenarios that aren't
exactly error conditions depending on how you think about it, but I
haven't done that yet. I am not currently aware of anything bad
happening because of those messages, but I could be wrong.

It's certainly interesting that we see stuff like:

2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"')
ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710

Right, so that might fit the case described in my email above:
logical_read_xlog_page() notices that it has been asked to shut down
when it is between reads of pages with a spanning contrecord. Before,
it would fail silently, so XLogReadRecord() returns NULL without
setting *errmsg, but now it complains about a missing contrecord. In
the case where it was showing up on that other thread, just a few
machines often seemed to log that error when shutting down --
peripatus for example -- I don't know why, but I assume something to
do with shutdown timing and page alignment.

It could entirely be caused by postmaster slowly killing processes after the
assertion failure and that that is corrupting shared memory state though. But
it might also be related.

Hmm.

Takamichi Osumi (Fujitsu)

osumi.takamichi@fujitsu.com

about 3 years ago

In reply to: Andres Freund (#2)

RE: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On Tuesday, November 15, 2022 10:26 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-10 16:04:40 +0530, Amit Kapila wrote:

I don't have any good ideas on how to proceed with this. Any thoughts
on this would be helpful?

One thing worth doing might be to convert the assertion path into an elog(),
mentioning both xids (or add a framework for things like AssertLT(), but that
seems hard). With the concrete values we could make a better guess at
what's going wrong.

It'd probably not hurt to just perform this check independent of
USE_ASSERT_CHECKING - compared to the cost of creating a slot it's
neglegible.

That'll obviously only help us whenever we re-encounter the issue, which will
likely be a while...

Have you tried reproducing the issue by running the test in a loop?

Just FYI, I've tried to reproduce this failure in a loop,
but I couldn't hit the same assertion failure.

I extracted the #16643 of 100_bugs.pl only and
executed the tests more than 500 times.

My env and test was done in rhel7.9 and gcc 4.8 with configure option of
--enable-cassert --enable-debug --enable-tap-tests --with-icu CFLAGS=-O0 and prefix.

Best Regards,
Takamichi Osumi

amit.kapila16@gmail.com

about 3 years ago

In reply to: Andres Freund (#3)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-14 17:25:31 -0800, Andres Freund wrote:

Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
SnapBuildExportSnapshot()? Why aren't the error checks for
SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
transactions are we even in when we import the snapshot (cf.
SnapBuildExportSnapshot() doing a StartTransactionCommand()).

Most of the checks for that are in CreateReplicationSlot() - but not al,
e.g. XactReadOnly isn't set,

Yeah, I think we can add the check for XactReadOnly along with other
checks in CreateReplicationSlot().

nor do we enforce in an obvious place that we
don't already hold a snapshot.

We have a check for (FirstXactSnapshot == NULL) in
RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
sufficient?

--
With Regards,
Amit Kapila.

amit.kapila16@gmail.com

about 3 years ago

In reply to: Andres Freund (#2)

1 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Nov 15, 2022 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-10 16:04:40 +0530, Amit Kapila wrote:

I don't have any good ideas on how to proceed with this. Any thoughts
on this would be helpful?

One thing worth doing might be to convert the assertion path into an elog(),
mentioning both xids (or add a framework for things like AssertLT(), but that
seems hard). With the concrete values we could make a better guess at what's
going wrong.

It'd probably not hurt to just perform this check independent of
USE_ASSERT_CHECKING - compared to the cost of creating a slot it's neglegible.

That'll obviously only help us whenever we re-encounter the issue, which will
likely be a while...

Agreed.

One thing I noticed just now is that we don't assert
builder->building_full_snapshot==true. I think we should? That didn't use to
be an option, but now it is... It doesn't look to me like that's the issue,
but ...

Agreed.

The attached patch contains both changes. It seems to me this issue
can happen, if somehow, either slot's effective_xmin increased after
we assign its initial value in CreateInitDecodingContext() or somehow
its value is InvalidTransactionId when we have invoked
SnapBuildInitialSnapshot(). The other possibility is that the
initial_xmin_horizon check in SnapBuildFindSnapshot() doesn't insulate
us from assigning builder->xmin value older than initial_xmin_horizon.
I am not able to see if any of this can be true.

--
With Regards,
Amit Kapila.

Attachments:

improve_checks_snapbuild_1.patchapplication/octet-stream; name=improve_checks_snapbuild_1.patchDownload

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5006a5c464..e85c75e0e6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -566,11 +566,13 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 {
 	Snapshot	snap;
 	TransactionId xid;
+	TransactionId safeXid;
 	TransactionId *newxip;
 	int			newxcnt = 0;
 
 	Assert(!FirstSnapshotSet);
 	Assert(XactIsoLevel == XACT_REPEATABLE_READ);
+	Assert(builder->building_full_snapshot);
 
 	if (builder->state != SNAPBUILD_CONSISTENT)
 		elog(ERROR, "cannot build an initial slot snapshot before reaching a consistent state");
@@ -589,17 +591,13 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	 * mechanism. Due to that we can do this without locks, we're only
 	 * changing our own value.
 	 */
-#ifdef USE_ASSERT_CHECKING
-	{
-		TransactionId safeXid;
-
-		LWLockAcquire(ProcArrayLock, LW_SHARED);
-		safeXid = GetOldestSafeDecodingTransactionId(false);
-		LWLockRelease(ProcArrayLock);
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	safeXid = GetOldestSafeDecodingTransactionId(false);
+	LWLockRelease(ProcArrayLock);
 
-		Assert(TransactionIdPrecedesOrEquals(safeXid, snap->xmin));
-	}
-#endif
+	if (TransactionIdFollows(safeXid, snap->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot when oldest safe xid %u follows snapshot's xmin %u",
+			 safeXid, snap->xmin);
 
 	MyProc->xmin = snap->xmin;

andres@anarazel.de

about 3 years ago

In reply to: Amit Kapila (#7)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

nor do we enforce in an obvious place that we
don't already hold a snapshot.

We have a check for (FirstXactSnapshot == NULL) in
RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
sufficient?

I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
still be bad. And I think checking in SetTransactionSnapshot() is too late,
we've already overwritten MyProc->xmin by that point.

On 2022-11-15 17:21:44 +0530, Amit Kapila wrote:

One thing I noticed just now is that we don't assert
builder->building_full_snapshot==true. I think we should? That didn't use to
be an option, but now it is... It doesn't look to me like that's the issue,
but ...

Agreed.

The attached patch contains both changes. It seems to me this issue
can happen, if somehow, either slot's effective_xmin increased after
we assign its initial value in CreateInitDecodingContext() or somehow
its value is InvalidTransactionId when we have invoked
SnapBuildInitialSnapshot(). The other possibility is that the
initial_xmin_horizon check in SnapBuildFindSnapshot() doesn't insulate
us from assigning builder->xmin value older than initial_xmin_horizon.
I am not able to see if any of this can be true.

Yea, I don't immediately see anything either. Given the discussion in
/messages/by-id/Yz2hivgyjS1RfMKs@depesz.com I am
starting to wonder if we've introduced a race in the slot machinery.

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5006a5c464..e85c75e0e6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -566,11 +566,13 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
{
Snapshot	snap;
TransactionId xid;
+	TransactionId safeXid;
TransactionId *newxip;
int			newxcnt = 0;
Assert(!FirstSnapshotSet);
Assert(XactIsoLevel == XACT_REPEATABLE_READ);
+ Assert(builder->building_full_snapshot);

if (builder->state != SNAPBUILD_CONSISTENT)
elog(ERROR, "cannot build an initial slot snapshot before reaching a consistent state");
@@ -589,17 +591,13 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
* mechanism. Due to that we can do this without locks, we're only
* changing our own value.
*/

Perhaps add something like "Creating a snapshot is expensive and an unenforced
xmin horizon would have bad consequences, therefore always double-check that
the horizon is enforced"?

-#ifdef USE_ASSERT_CHECKING
-	{
-		TransactionId safeXid;
-
-		LWLockAcquire(ProcArrayLock, LW_SHARED);
-		safeXid = GetOldestSafeDecodingTransactionId(false);
-		LWLockRelease(ProcArrayLock);
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	safeXid = GetOldestSafeDecodingTransactionId(false);
+	LWLockRelease(ProcArrayLock);

-		Assert(TransactionIdPrecedesOrEquals(safeXid, snap->xmin));
-	}
-#endif
+	if (TransactionIdFollows(safeXid, snap->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot when oldest safe xid %u follows snapshot's xmin %u",
+			 safeXid, snap->xmin);

MyProc->xmin = snap->xmin;

s/when/as/

Greetings,

Andres Freund

amit.kapila16@gmail.com

about 3 years ago

In reply to: Andres Freund (#8)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Wed, Nov 16, 2022 at 7:30 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

nor do we enforce in an obvious place that we
don't already hold a snapshot.

We have a check for (FirstXactSnapshot == NULL) in
RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
sufficient?

I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
still be bad. And I think checking in SetTransactionSnapshot() is too late,
we've already overwritten MyProc->xmin by that point.

So, shall we add the below Asserts in SnapBuildInitialSnapshot() after
we have the Assert for Assert(!FirstSnapshotSet)?
Assert(FirstXactSnapshot == NULL);
Assert(!HistoricSnapshotActive());

--
With Regards,
Amit Kapila.

#10

andres@anarazel.de

about 3 years ago

In reply to: Amit Kapila (#9)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:

On Wed, Nov 16, 2022 at 7:30 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

nor do we enforce in an obvious place that we
don't already hold a snapshot.

We have a check for (FirstXactSnapshot == NULL) in
RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
sufficient?

I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
still be bad. And I think checking in SetTransactionSnapshot() is too late,
we've already overwritten MyProc->xmin by that point.

So, shall we add the below Asserts in SnapBuildInitialSnapshot() after
we have the Assert for Assert(!FirstSnapshotSet)?
Assert(FirstXactSnapshot == NULL);
Assert(!HistoricSnapshotActive());

I don't think that'd catch a catalog snapshot. But perhaps the better answer
for the catalog snapshot is to just invalidate it explicitly. The user doesn't
have control over the catalog snapshot being taken, and it's not too hard to
imagine the walsender code triggering one somewhere.

So maybe we should add something like:

InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
if (HaveRegisteredOrActiveSnapshot())
elog(ERROR, "cannot build an initial slot snapshot when snapshots exist")
Assert(!HistoricSnapshotActive());

I think we'd not need to assert FirstXactSnapshot == NULL or !FirstSnapshotSet
with that, because those would show up in HaveRegisteredOrActiveSnapshot().

Greetings,

Andres Freund

#11

amit.kapila16@gmail.com

about 3 years ago

In reply to: Andres Freund (#10)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Wed, Nov 16, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:

On Wed, Nov 16, 2022 at 7:30 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

nor do we enforce in an obvious place that we
don't already hold a snapshot.

We have a check for (FirstXactSnapshot == NULL) in
RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
sufficient?

I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
still be bad. And I think checking in SetTransactionSnapshot() is too late,
we've already overwritten MyProc->xmin by that point.

So, shall we add the below Asserts in SnapBuildInitialSnapshot() after
we have the Assert for Assert(!FirstSnapshotSet)?
Assert(FirstXactSnapshot == NULL);
Assert(!HistoricSnapshotActive());

I don't think that'd catch a catalog snapshot. But perhaps the better answer
for the catalog snapshot is to just invalidate it explicitly. The user doesn't
have control over the catalog snapshot being taken, and it's not too hard to
imagine the walsender code triggering one somewhere.

So maybe we should add something like:

InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */

The comment "/* about to overwrite MyProc->xmin */" is unclear to me.
We already have a check (/* so we don't overwrite the existing value
*/
if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot()
which ensures that we don't overwrite MyProc->xmin, so the above
comment seems contradictory to me.

--
With Regards,
Amit Kapila.

#12

amit.kapila16@gmail.com

about 3 years ago

In reply to: Andres Freund (#10)

1 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Wed, Nov 16, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:

On Wed, Nov 16, 2022 at 7:30 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

nor do we enforce in an obvious place that we
don't already hold a snapshot.

We have a check for (FirstXactSnapshot == NULL) in
RestoreTransactionSnapshot->SetTransactionSnapshot. Won't that be
sufficient?

I don't think that'd e.g. catch a catalog snapshot being held, yet that'd
still be bad. And I think checking in SetTransactionSnapshot() is too late,
we've already overwritten MyProc->xmin by that point.

So, shall we add the below Asserts in SnapBuildInitialSnapshot() after
we have the Assert for Assert(!FirstSnapshotSet)?
Assert(FirstXactSnapshot == NULL);
Assert(!HistoricSnapshotActive());

I don't think that'd catch a catalog snapshot. But perhaps the better answer
for the catalog snapshot is to just invalidate it explicitly. The user doesn't
have control over the catalog snapshot being taken, and it's not too hard to
imagine the walsender code triggering one somewhere.

So maybe we should add something like:

InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
if (HaveRegisteredOrActiveSnapshot())
elog(ERROR, "cannot build an initial slot snapshot when snapshots exist")
Assert(!HistoricSnapshotActive());

I think we'd not need to assert FirstXactSnapshot == NULL or !FirstSnapshotSet
with that, because those would show up in HaveRegisteredOrActiveSnapshot().

In the attached patch, I have incorporated this change and other
feedback. I think this should probably help us find the reason for
this problem when we see it in the future.

--
With Regards,
Amit Kapila.

Attachments:

v2-0001-Add-additional-checks-while-creating-the-initial-.patchapplication/octet-stream; name=v2-0001-Add-additional-checks-while-creating-the-initial-.patchDownload

From fe4b96bb4a22104596142f787eb7aa43e5c3e246 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 17 Nov 2022 11:53:54 +0530
Subject: [PATCH v2] Add additional checks while creating the initial decoding
 snapshot.

As per one of the CI reports, there is an assertion failure which
indicates that we were trying to use an unenforced xmin horizon for
decoding snapshots. Though, we couldn't figure out the reason for
assertion failure these checks would help us in finding the reason if the
problem happens again in the future.

Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
---
 src/backend/replication/logical/snapbuild.c | 29 ++++++++++++++++++-----------
 src/backend/replication/walsender.c         |  5 +++++
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 5006a5c..ca437de 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -566,11 +566,12 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 {
 	Snapshot	snap;
 	TransactionId xid;
+	TransactionId safeXid;
 	TransactionId *newxip;
 	int			newxcnt = 0;
 
-	Assert(!FirstSnapshotSet);
 	Assert(XactIsoLevel == XACT_REPEATABLE_READ);
+	Assert(builder->building_full_snapshot);
 
 	if (builder->state != SNAPBUILD_CONSISTENT)
 		elog(ERROR, "cannot build an initial slot snapshot before reaching a consistent state");
@@ -582,24 +583,30 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	if (TransactionIdIsValid(MyProc->xmin))
 		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
+	/* don't allow older snapshots */
+	InvalidateCatalogSnapshot();
+	if (HaveRegisteredOrActiveSnapshot())
+		elog(ERROR, "cannot build an initial slot snapshot when snapshots exist");
+	Assert(!HistoricSnapshotActive());
+
 	snap = SnapBuildBuildSnapshot(builder);
 
 	/*
 	 * We know that snap->xmin is alive, enforced by the logical xmin
 	 * mechanism. Due to that we can do this without locks, we're only
 	 * changing our own value.
+	 *
+	 * Building an initial snapshot is expensive and an unenforced xmin
+	 * horizon would have bad consequences, therefore always double-check that
+	 * the horizon is enforced.
 	 */
-#ifdef USE_ASSERT_CHECKING
-	{
-		TransactionId safeXid;
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	safeXid = GetOldestSafeDecodingTransactionId(false);
+	LWLockRelease(ProcArrayLock);
 
-		LWLockAcquire(ProcArrayLock, LW_SHARED);
-		safeXid = GetOldestSafeDecodingTransactionId(false);
-		LWLockRelease(ProcArrayLock);
-
-		Assert(TransactionIdPrecedesOrEquals(safeXid, snap->xmin));
-	}
-#endif
+	if (TransactionIdFollows(safeXid, snap->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u",
+			 safeXid, snap->xmin);
 
 	MyProc->xmin = snap->xmin;
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a81ef6a..c11bb37 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1099,6 +1099,11 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 				/*- translator: %s is a CREATE_REPLICATION_SLOT statement */
 						(errmsg("%s must be called in REPEATABLE READ isolation mode transaction",
 								"CREATE_REPLICATION_SLOT ... (SNAPSHOT 'use')")));
+			if (!XactReadOnly)
+				ereport(ERROR,
+				/*- translator: %s is a CREATE_REPLICATION_SLOT statement */
+						(errmsg("%s must be called in a read only transaction",
+								"CREATE_REPLICATION_SLOT ... (SNAPSHOT 'use')")));
 
 			if (FirstSnapshotSet)
 				ereport(ERROR,
-- 
1.8.3.1

#13

andres@anarazel.de

about 3 years ago

In reply to: Amit Kapila (#11)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2022-11-17 10:44:18 +0530, Amit Kapila wrote:

On Wed, Nov 16, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:

On Wed, Nov 16, 2022 at 7:30 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

I don't think that'd catch a catalog snapshot. But perhaps the better answer
for the catalog snapshot is to just invalidate it explicitly. The user doesn't
have control over the catalog snapshot being taken, and it's not too hard to
imagine the walsender code triggering one somewhere.

So maybe we should add something like:

InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */

The comment "/* about to overwrite MyProc->xmin */" is unclear to me.
We already have a check (/* so we don't overwrite the existing value
*/
if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot()
which ensures that we don't overwrite MyProc->xmin, so the above
comment seems contradictory to me.

The point is that catalog snapshots could easily end up setting MyProc->xmin,
even though the caller hasn't done anything wrong. So the
InvalidateCatalogSnapshot() would avoid erroring out in a number of scenarios.

Greetings,

Andres Freund

#14

amit.kapila16@gmail.com

about 3 years ago

In reply to: Andres Freund (#13)

1 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, Nov 17, 2022 at 11:15 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-17 10:44:18 +0530, Amit Kapila wrote:

On Wed, Nov 16, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:

On Wed, Nov 16, 2022 at 7:30 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:

On Tue, Nov 15, 2022 at 8:08 AM Andres Freund <andres@anarazel.de> wrote:

I don't think that'd catch a catalog snapshot. But perhaps the better answer
for the catalog snapshot is to just invalidate it explicitly. The user doesn't
have control over the catalog snapshot being taken, and it's not too hard to
imagine the walsender code triggering one somewhere.

So maybe we should add something like:

InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */

The comment "/* about to overwrite MyProc->xmin */" is unclear to me.
We already have a check (/* so we don't overwrite the existing value
*/
if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot()
which ensures that we don't overwrite MyProc->xmin, so the above
comment seems contradictory to me.

The point is that catalog snapshots could easily end up setting MyProc->xmin,
even though the caller hasn't done anything wrong. So the
InvalidateCatalogSnapshot() would avoid erroring out in a number of scenarios.

Okay, updated the patch accordingly.

--
With Regards,
Amit Kapila.

Attachments:

v3-0001-Add-additional-checks-while-creating-the-initial-.patchapplication/octet-stream; name=v3-0001-Add-additional-checks-while-creating-the-initial-.patchDownload

From 025a32b2588e6be55b2251ec4e296152ba3b85ab Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 17 Nov 2022 11:53:54 +0530
Subject: [PATCH v3] Add additional checks while creating the initial decoding
 snapshot.

As per one of the CI reports, there is an assertion failure which
indicates that we were trying to use an unenforced xmin horizon for
decoding snapshots. Though, we couldn't figure out the reason for
assertion failure these checks would help us in finding the reason if the
problem happens again in the future.

Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
---
 src/backend/replication/logical/snapbuild.c | 29 +++++++++++++--------
 src/backend/replication/walsender.c         |  5 ++++
 2 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 556b7fcba3..a1fd1d92d6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -566,11 +566,18 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 {
 	Snapshot	snap;
 	TransactionId xid;
+	TransactionId safeXid;
 	TransactionId *newxip;
 	int			newxcnt = 0;
 
-	Assert(!FirstSnapshotSet);
 	Assert(XactIsoLevel == XACT_REPEATABLE_READ);
+	Assert(builder->building_full_snapshot);
+
+	/* don't allow older snapshots */
+	InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
+	if (HaveRegisteredOrActiveSnapshot())
+		elog(ERROR, "cannot build an initial slot snapshot when snapshots exist");
+	Assert(!HistoricSnapshotActive());
 
 	if (builder->state != SNAPBUILD_CONSISTENT)
 		elog(ERROR, "cannot build an initial slot snapshot before reaching a consistent state");
@@ -588,18 +595,18 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	 * We know that snap->xmin is alive, enforced by the logical xmin
 	 * mechanism. Due to that we can do this without locks, we're only
 	 * changing our own value.
+	 *
+	 * Building an initial snapshot is expensive and an unenforced xmin
+	 * horizon would have bad consequences, therefore always double-check that
+	 * the horizon is enforced.
 	 */
-#ifdef USE_ASSERT_CHECKING
-	{
-		TransactionId safeXid;
+	LWLockAcquire(ProcArrayLock, LW_SHARED);
+	safeXid = GetOldestSafeDecodingTransactionId(false);
+	LWLockRelease(ProcArrayLock);
 
-		LWLockAcquire(ProcArrayLock, LW_SHARED);
-		safeXid = GetOldestSafeDecodingTransactionId(false);
-		LWLockRelease(ProcArrayLock);
-
-		Assert(TransactionIdPrecedesOrEquals(safeXid, snap->xmin));
-	}
-#endif
+	if (TransactionIdFollows(safeXid, snap->xmin))
+		elog(ERROR, "cannot build an initial slot snapshot as oldest safe xid %u follows snapshot's xmin %u",
+			 safeXid, snap->xmin);
 
 	MyProc->xmin = snap->xmin;
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index a81ef6a201..c11bb3716f 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1099,6 +1099,11 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 				/*- translator: %s is a CREATE_REPLICATION_SLOT statement */
 						(errmsg("%s must be called in REPEATABLE READ isolation mode transaction",
 								"CREATE_REPLICATION_SLOT ... (SNAPSHOT 'use')")));
+			if (!XactReadOnly)
+				ereport(ERROR,
+				/*- translator: %s is a CREATE_REPLICATION_SLOT statement */
+						(errmsg("%s must be called in a read only transaction",
+								"CREATE_REPLICATION_SLOT ... (SNAPSHOT 'use')")));
 
 			if (FirstSnapshotSet)
 				ereport(ERROR,
-- 
2.28.0.windows.1

#15

andres@anarazel.de

about 3 years ago

In reply to: Amit Kapila (#14)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:

Okay, updated the patch accordingly.

Assuming it passes tests etc, this'd work for me.

- Andres

#16

amit.kapila16@gmail.com

about 3 years ago

In reply to: Andres Freund (#15)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Sat, Nov 19, 2022 at 6:35 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:

Okay, updated the patch accordingly.

Assuming it passes tests etc, this'd work for me.

Thanks, Pushed.

--
With Regards,
Amit Kapila.

#17

[1]: /messages/by-id/tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406@qq.com

sawada.mshk@gmail.com

about 3 years ago

In reply to: Amit Kapila (#16)

1 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Nov 19, 2022 at 6:35 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-11-18 11:20:36 +0530, Amit Kapila wrote:

Okay, updated the patch accordingly.

Assuming it passes tests etc, this'd work for me.

Thanks, Pushed.

The same assertion failure has been reported on another thread[1]/messages/by-id/tencent_7EB71DA5D7BA00EB0B429DCE45D0452B6406@qq.com.
Since I could reproduce this issue several times in my environment
I've investigated the root cause.

I think there is a race condition of updating
procArray->replication_slot_xmin by CreateInitDecodingContext() and
LogicalConfirmReceivedLocation().

What I observed in the test was that a walsender process called:
SnapBuildProcessRunningXacts()
LogicalIncreaseXminForSlot()
LogicalConfirmReceivedLocation()
ReplicationSlotsComputeRequiredXmin(false).

In ReplicationSlotsComputeRequiredXmin() it acquired the
ReplicationSlotControlLock and got 0 as the minimum xmin since there
was no wal sender having effective_xmin. Before calling
ProcArraySetReplicationSlotXmin() (i.e. before acquiring
ProcArrayLock), another walsender process called
CreateInitDecodingContext(), acquired ProcArrayLock, computed
slot->effective_catalog_xmin, called
ReplicationSlotsComputeRequiredXmin(true). Since its
effective_catalog_xmin had been set, it got 39968 as the minimum xmin,
and updated replication_slot_xmin. However, as soon as the second
walsender released ProcArrayLock, the first walsender updated the
replication_slot_xmin to 0. After that, the second walsender called
SnapBuildInitialSnapshot(), and GetOldestSafeDecodingTransactionId()
returned an XID newer than snap->xmin.

One idea to fix this issue is that in
ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin
while holding both ProcArrayLock and ReplicationSlotControlLock, and
release only ReplicationSlotsControlLock before updating the
replication_slot_xmin. I'm concerned it will increase the contention
on ProcArrayLock but I've attached the patch for discussion.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

fix_slot_xmin_race_condition.patchapplication/octet-stream; name=fix_slot_xmin_race_condition.patchDownload

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 899acfd912..b781d2ea81 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -840,6 +840,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -878,7 +881,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	LWLockRelease(ReplicationSlotControlLock);
 
-	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 0176f30270..eb74c34dbc 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3899,20 +3899,13 @@ TerminateOtherDBBackends(Oid databaseId)
  * replication slots.
  */
 void
-ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
-								bool already_locked)
+ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMe(ProcArrayLock));
 
 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
 
-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
-
 	elog(DEBUG1, "xmin required by slots: data %u, catalog %u",
 		 xmin, catalog_xmin);
 }
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 99a58fb162..f25b8c1be3 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -91,7 +91,7 @@ extern void XidCacheRemoveRunningXids(TransactionId xid,
 									  TransactionId latestXid);
 
 extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
-											TransactionId catalog_xmin, bool already_locked);
+											TransactionId catalog_xmin);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 											TransactionId *catalog_xmin);

#18

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#17)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, Dec 8, 2022 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The same assertion failure has been reported on another thread[1].
Since I could reproduce this issue several times in my environment
I've investigated the root cause.

I think there is a race condition of updating
procArray->replication_slot_xmin by CreateInitDecodingContext() and
LogicalConfirmReceivedLocation().

What I observed in the test was that a walsender process called:
SnapBuildProcessRunningXacts()
LogicalIncreaseXminForSlot()
LogicalConfirmReceivedLocation()
ReplicationSlotsComputeRequiredXmin(false).

In ReplicationSlotsComputeRequiredXmin() it acquired the
ReplicationSlotControlLock and got 0 as the minimum xmin since there
was no wal sender having effective_xmin.

What about the current walsender process which is processing
running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
slot's effective_xmin have a non-zero value? If not, then why?

--
With Regards,
Amit Kapila.

#19

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

almost 3 years ago

In reply to: Masahiko Sawada (#17)

RE: Assertion failure in SnapBuildInitialSnapshot()

Dear Sawada-san,

Thank you for making the patch! I'm still considering whether this approach is
correct, but I can put a comment to your patch anyway.

```
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMe(ProcArrayLock));
```

In this function, we regard that the ProcArrayLock has been already acquired as
exclusive mode and modify data. I think LWLockHeldByMeInMode() should be used
instead of LWLockHeldByMe().
I confirmed that there is only one caller that uses ReplicationSlotsComputeRequiredXmin(true)
and it acquires exclusive lock correctly, but it can avoid future bug.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#20

Hayato Kuroda (Fujitsu)

kuroda.hayato@fujitsu.com

almost 3 years ago

In reply to: Amit Kapila (#18)

RE: Assertion failure in SnapBuildInitialSnapshot()

Dear Amit, Sawada-san,

I have also reproduced the failure on PG15 with some debug log, and I agreed that
somebody changed procArray->replication_slot_xmin to InvalidTransactionId.

The same assertion failure has been reported on another thread[1].
Since I could reproduce this issue several times in my environment
I've investigated the root cause.

I think there is a race condition of updating
procArray->replication_slot_xmin by CreateInitDecodingContext() and
LogicalConfirmReceivedLocation().

What I observed in the test was that a walsender process called:
SnapBuildProcessRunningXacts()
LogicalIncreaseXminForSlot()
LogicalConfirmReceivedLocation()
ReplicationSlotsComputeRequiredXmin(false).

In ReplicationSlotsComputeRequiredXmin() it acquired the
ReplicationSlotControlLock and got 0 as the minimum xmin since there
was no wal sender having effective_xmin.

What about the current walsender process which is processing
running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
slot's effective_xmin have a non-zero value? If not, then why?

Normal walsenders which are not for tablesync create a replication slot with
NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
called with need_full_snapshot = false, and slot->effective_xmin is not updated.
It is set as InvalidTransactionId at ReplicationSlotCreate() and no functions update
that. Hence the slot acquired by the walsender may have Invalid effective_min.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#21

sawada.mshk@gmail.com

almost 3 years ago

In reply to: Hayato Kuroda (Fujitsu) (#20)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Sat, Jan 28, 2023 at 11:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Amit, Sawada-san,

I have also reproduced the failure on PG15 with some debug log, and I agreed that
somebody changed procArray->replication_slot_xmin to InvalidTransactionId.

The same assertion failure has been reported on another thread[1].
Since I could reproduce this issue several times in my environment
I've investigated the root cause.

I think there is a race condition of updating
procArray->replication_slot_xmin by CreateInitDecodingContext() and
LogicalConfirmReceivedLocation().

What I observed in the test was that a walsender process called:
SnapBuildProcessRunningXacts()
LogicalIncreaseXminForSlot()
LogicalConfirmReceivedLocation()
ReplicationSlotsComputeRequiredXmin(false).

In ReplicationSlotsComputeRequiredXmin() it acquired the
ReplicationSlotControlLock and got 0 as the minimum xmin since there
was no wal sender having effective_xmin.

What about the current walsender process which is processing
running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
slot's effective_xmin have a non-zero value? If not, then why?

Normal walsenders which are not for tablesync create a replication slot with
NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
called with need_full_snapshot = false, and slot->effective_xmin is not updated.

Right. This is how we create a slot used by an apply worker.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#22

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#21)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Sun, Jan 29, 2023 at 9:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Jan 28, 2023 at 11:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Amit, Sawada-san,

I have also reproduced the failure on PG15 with some debug log, and I agreed that
somebody changed procArray->replication_slot_xmin to InvalidTransactionId.

The same assertion failure has been reported on another thread[1].
Since I could reproduce this issue several times in my environment
I've investigated the root cause.

I think there is a race condition of updating
procArray->replication_slot_xmin by CreateInitDecodingContext() and
LogicalConfirmReceivedLocation().

What I observed in the test was that a walsender process called:
SnapBuildProcessRunningXacts()
LogicalIncreaseXminForSlot()
LogicalConfirmReceivedLocation()
ReplicationSlotsComputeRequiredXmin(false).

In ReplicationSlotsComputeRequiredXmin() it acquired the
ReplicationSlotControlLock and got 0 as the minimum xmin since there
was no wal sender having effective_xmin.

What about the current walsender process which is processing
running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
slot's effective_xmin have a non-zero value? If not, then why?

Normal walsenders which are not for tablesync create a replication slot with
NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
called with need_full_snapshot = false, and slot->effective_xmin is not updated.

Right. This is how we create a slot used by an apply worker.

I was thinking about how that led to this problem because
GetOldestSafeDecodingTransactionId() ignores InvalidTransactionId. It
seems that is possible when both builder->xmin and
replication_slot_catalog_xmin precede replication_slot_catalog_xmin.
Do you see any different reason for it?

--
With Regards,
Amit Kapila.

#23

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Amit Kapila (#22)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Jan 30, 2023 at 10:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Jan 29, 2023 at 9:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Jan 28, 2023 at 11:54 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Amit, Sawada-san,

I have also reproduced the failure on PG15 with some debug log, and I agreed that
somebody changed procArray->replication_slot_xmin to InvalidTransactionId.

The same assertion failure has been reported on another thread[1].
Since I could reproduce this issue several times in my environment
I've investigated the root cause.

I think there is a race condition of updating
procArray->replication_slot_xmin by CreateInitDecodingContext() and
LogicalConfirmReceivedLocation().

What I observed in the test was that a walsender process called:
SnapBuildProcessRunningXacts()
LogicalIncreaseXminForSlot()
LogicalConfirmReceivedLocation()
ReplicationSlotsComputeRequiredXmin(false).

In ReplicationSlotsComputeRequiredXmin() it acquired the
ReplicationSlotControlLock and got 0 as the minimum xmin since there
was no wal sender having effective_xmin.

What about the current walsender process which is processing
running_xacts via SnapBuildProcessRunningXacts()? Isn't that walsender
slot's effective_xmin have a non-zero value? If not, then why?

Normal walsenders which are not for tablesync create a replication slot with
NOEXPORT_SNAPSHOT option. I think in this case, CreateInitDecodingContext() is
called with need_full_snapshot = false, and slot->effective_xmin is not updated.

Right. This is how we create a slot used by an apply worker.

I was thinking about how that led to this problem because
GetOldestSafeDecodingTransactionId() ignores InvalidTransactionId.

I have reproduced it manually. For this, I had to manually make the
debugger call ReplicationSlotsComputeRequiredXmin(false) via path
SnapBuildProcessRunningXacts()->LogicalIncreaseXminForSlot()->LogicalConfirmReceivedLocation()
->ReplicationSlotsComputeRequiredXmin(false) for the apply worker. The
sequence of events is something like (a) the replication_slot_xmin for
tablesync worker is overridden by apply worker as zero as explained in
Sawada-San's email, (b) another transaction happened on the publisher
that will increase the value of ShmemVariableCache->nextXid (c)
tablesync worker invokes
SnapBuildInitialSnapshot()->GetOldestSafeDecodingTransactionId() which
will return an oldestSafeXid which is higher than snapshot's xmin.
This happens because replication_slot_xmin has an InvalidTransactionId
value and we won't consider replication_slot_catalog_xmin because
catalogOnly flag is false and there is no other open running
transaction. I think we should try to get a simplified test to
reproduce this problem if possible.

--
With Regards,
Amit Kapila.

#24

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#17)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, Dec 8, 2022 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

One idea to fix this issue is that in
ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin
while holding both ProcArrayLock and ReplicationSlotControlLock, and
release only ReplicationSlotsControlLock before updating the
replication_slot_xmin. I'm concerned it will increase the contention
on ProcArrayLock but I've attached the patch for discussion.

But what kind of workload are you worried about? This will be called
while processing XLOG_RUNNING_XACTS to update
procArray->replication_slot_xmin/procArray->replication_slot_catalog_xmin
only when required. So, if we want we can test some concurrent
workloads along with walsenders doing the decoding to check if it
impacts performance.

What other way we can fix this? Do you think we can try to avoid
retreating xmin values in ProcArraySetReplicationSlotXmin() to avoid
this problem? Personally, I think taking the lock as proposed by your
patch is a better idea. BTW, this problem seems to be only logical
replication specific, so if we are too worried then we can change this
locking only for logical replication.

--
With Regards,
Amit Kapila.

#25

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Amit Kapila (#23)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Jan 30, 2023 at 11:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I have reproduced it manually. For this, I had to manually make the
debugger call ReplicationSlotsComputeRequiredXmin(false) via path
SnapBuildProcessRunningXacts()->LogicalIncreaseXminForSlot()->LogicalConfirmReceivedLocation()
->ReplicationSlotsComputeRequiredXmin(false) for the apply worker. The
sequence of events is something like (a) the replication_slot_xmin for
tablesync worker is overridden by apply worker as zero as explained in
Sawada-San's email, (b) another transaction happened on the publisher
that will increase the value of ShmemVariableCache->nextXid (c)
tablesync worker invokes
SnapBuildInitialSnapshot()->GetOldestSafeDecodingTransactionId() which
will return an oldestSafeXid which is higher than snapshot's xmin.
This happens because replication_slot_xmin has an InvalidTransactionId
value and we won't consider replication_slot_catalog_xmin because
catalogOnly flag is false and there is no other open running
transaction. I think we should try to get a simplified test to
reproduce this problem if possible.

Here are steps to reproduce it manually with the help of a debugger:

Session-1
==========
select pg_create_logical_replication_slot('s', 'test_decoding');
create table t2(c1 int);
select pg_replication_slot_advance('s', pg_current_wal_lsn()); --
Debug this statement. Stop before taking procarraylock in
ProcArraySetReplicationSlotXmin.

Session-2
============
psql -d postgres
Begin;

Session-3
===========
psql -d "dbname=postgres replication=database"

begin transaction isolation level repeatable read read only;
CREATE_REPLICATION_SLOT slot1 LOGICAL test_decoding USE_SNAPSHOT;
--Debug this statement. Stop in SnapBuildInitialSnapshot() before
taking procarraylock

Session-1
==========
Continue debugging and finish execution of
ProcArraySetReplicationSlotXmin. Verify
procArray->replication_slot_xmin is zero.

Session-2
=========
Select txid_current();
Commit;

Session-3
==========
Continue debugging.
Verify that safeXid follows snap->xmin. This leads to assertion (in
back branches) or error (in HEAD).

--
With Regards,
Amit Kapila.

#26

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Hayato Kuroda (Fujitsu) (#19)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Fri, Jan 27, 2023 at 4:31 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Thank you for making the patch! I'm still considering whether this approach is
correct, but I can put a comment to your patch anyway.
```
-       Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-       if (!already_locked)
-               LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+       Assert(LWLockHeldByMe(ProcArrayLock));
```
In this function, we regard that the ProcArrayLock has been already acquired as
exclusive mode and modify data. I think LWLockHeldByMeInMode() should be used
instead of LWLockHeldByMe().

Right, this is even evident from the comments atop
ReplicationSlotsComputeRequiredXmin("If already_locked is true,
ProcArrayLock has already been acquired exclusively.". But, I am not
sure if it is a good idea to remove 'already_locked' parameter,
especially in back branches as this is an exposed API.

--
With Regards,
Amit Kapila.

#27

sawada.mshk@gmail.com

almost 3 years ago

In reply to: Amit Kapila (#26)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Jan 30, 2023 at 8:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 27, 2023 at 4:31 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Thank you for making the patch! I'm still considering whether this approach is
correct, but I can put a comment to your patch anyway.
```
-       Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-       if (!already_locked)
-               LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+       Assert(LWLockHeldByMe(ProcArrayLock));
```
In this function, we regard that the ProcArrayLock has been already acquired as
exclusive mode and modify data. I think LWLockHeldByMeInMode() should be used
instead of LWLockHeldByMe().
Right, this is even evident from the comments atop
ReplicationSlotsComputeRequiredXmin("If already_locked is true,
ProcArrayLock has already been acquired exclusively.".

Agreed, will fix in the next version patch.

But, I am not
sure if it is a good idea to remove 'already_locked' parameter,
especially in back branches as this is an exposed API.

Yes, we should not remove the already_locked parameter in
backbranches. So I was thinking of keeping it on back branches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#28

sawada.mshk@gmail.com

almost 3 years ago

In reply to: Amit Kapila (#24)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Jan 30, 2023 at 8:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Dec 8, 2022 at 8:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 21, 2022 at 4:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

One idea to fix this issue is that in
ReplicationSlotsComputeRequiredXmin(), we compute the minimum xmin
while holding both ProcArrayLock and ReplicationSlotControlLock, and
release only ReplicationSlotsControlLock before updating the
replication_slot_xmin. I'm concerned it will increase the contention
on ProcArrayLock but I've attached the patch for discussion.

But what kind of workload are you worried about? This will be called
while processing XLOG_RUNNING_XACTS to update
procArray->replication_slot_xmin/procArray->replication_slot_catalog_xmin
only when required. So, if we want we can test some concurrent
workloads along with walsenders doing the decoding to check if it
impacts performance.

I was slightly concerned about holding ProcArrayLock while iterating
over replication slots especially when there are many replication
slots in the system. But you're right; we need it only when processing
XLOG_RUNINNG_XACTS and it's not frequent. So it doesn't introduce
visible overhead or negligible overhead.

What other way we can fix this? Do you think we can try to avoid
retreating xmin values in ProcArraySetReplicationSlotXmin() to avoid
this problem? Personally, I think taking the lock as proposed by your
patch is a better idea.

Agreed.

BTW, this problem seems to be only logical
replication specific, so if we are too worried then we can change this
locking only for logical replication.

Yes, but I agree that there won't be a big overhead by this fix.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#29

sawada.mshk@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#27)

4 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Jan 30, 2023 at 9:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 30, 2023 at 8:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jan 27, 2023 at 4:31 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Thank you for making the patch! I'm still considering whether this approach is
correct, but I can put a comment to your patch anyway.
```
-       Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-       if (!already_locked)
-               LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+       Assert(LWLockHeldByMe(ProcArrayLock));
```
In this function, we regard that the ProcArrayLock has been already acquired as
exclusive mode and modify data. I think LWLockHeldByMeInMode() should be used
instead of LWLockHeldByMe().
Right, this is even evident from the comments atop
ReplicationSlotsComputeRequiredXmin("If already_locked is true,
ProcArrayLock has already been acquired exclusively.".
Agreed, will fix in the next version patch.

But, I am not
sure if it is a good idea to remove 'already_locked' parameter,
especially in back branches as this is an exposed API.

Yes, we should not remove the already_locked parameter in
backbranches. So I was thinking of keeping it on back branches.

I've attached patches for HEAD and backbranches. Please review them.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

master_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=master_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From f5c8b7755dfebf8de60d01d5e9aa227944bbc4bf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 10:25:57 +0900
Subject: [PATCH v2] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin. Therefore, if a process calls
ReplicationSlotsComputeRequiredXmin() with already_locked being false
and another process updates the replication slot xmin before the
process acquiring the lock, the slot xmin was overwritten with an old
value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.
---
 src/backend/replication/slot.c      |  8 +++++++-
 src/backend/storage/ipc/procarray.c | 13 +++----------
 src/include/storage/procarray.h     |  2 +-
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f286918f69..d7dda24645 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -840,6 +840,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -878,7 +881,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	LWLockRelease(ReplicationSlotControlLock);
 
-	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4340bf9641..a9e4f59440 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3896,23 +3896,16 @@ TerminateOtherDBBackends(Oid databaseId)
  *
  * Install limits to future computations of the xmin horizon to prevent vacuum
  * and HOT pruning from removing affected rows still needed by clients with
- * replication slots.
+ * replication slots. The caller must hold ProcArrayLock in exclusive mode.
  */
 void
-ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
-								bool already_locked)
+ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));
 
 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
 
-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
-
 	elog(DEBUG1, "xmin required by slots: data %u, catalog %u",
 		 xmin, catalog_xmin);
 }
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index d8cae3ce1c..b7554f1b53 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -91,7 +91,7 @@ extern void XidCacheRemoveRunningXids(TransactionId xid,
 									  TransactionId latestXid);
 
 extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
-											TransactionId catalog_xmin, bool already_locked);
+											TransactionId catalog_xmin);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 											TransactionId *catalog_xmin);
-- 
2.31.1

REL13-14_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=REL13-14_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From b09ca362cd5c748c5c7861aef60b52031629c174 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 10:58:26 +0900
Subject: [PATCH v2] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin. Therefore, if a process calls
ReplicationSlotsComputeRequiredXmin() with already_locked being false
and another process updates the replication slot xmin before the
process acquiring the lock, the slot xmin was overwritten with an old
value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.
---
 src/backend/replication/slot.c      |  6 ++++++
 src/backend/storage/ipc/procarray.c | 11 ++++-------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 037a347cba..b39c31d85b 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -771,6 +771,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -810,6 +813,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	LWLockRelease(ReplicationSlotControlLock);
 
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 33fe9c06a3..198b238304 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3942,22 +3942,19 @@ TerminateOtherDBBackends(Oid databaseId)
  * Install limits to future computations of the xmin horizon to prevent vacuum
  * and HOT pruning from removing affected rows still needed by clients with
  * replication slots.
+ *
+ * NB: the caller must hold ProcArrayLock in an exclusive mode regardless of
+ * already_locked which is unused now but kept for ABI compatibility.
  */
 void
 ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
 								bool already_locked)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));
 
 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
 
-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
-
 	elog(DEBUG1, "xmin required by slots: data %u, catalog %u",
 		 xmin, catalog_xmin);
 }
-- 
2.31.1

REL15_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=REL15_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From 1d131bbf3b260fffd8a5cfcaf9595839e681f4ba Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 10:58:26 +0900
Subject: [PATCH v2] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin. Therefore, if a process calls
ReplicationSlotsComputeRequiredXmin() with already_locked being false
and another process updates the replication slot xmin before the
process acquiring the lock, the slot xmin was overwritten with an old
value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.
---
 src/backend/replication/slot.c      |  6 ++++++
 src/backend/storage/ipc/procarray.c | 11 ++++-------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 329599e99d..cb726f9236 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -839,6 +839,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -871,6 +874,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	LWLockRelease(ReplicationSlotControlLock);
 
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index dadaa958a8..6c7c0fbb43 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3885,21 +3885,18 @@ TerminateOtherDBBackends(Oid databaseId)
  * Install limits to future computations of the xmin horizon to prevent vacuum
  * and HOT pruning from removing affected rows still needed by clients with
  * replication slots.
+ *
+ * NB: the caller must hold ProcArrayLock in an exclusive mode regardless of
+ * already_locked which is unused now but kept for ABI compatibility.
  */
 void
 ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
 								bool already_locked)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));
 
 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
-
-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
 }
 
 /*
-- 
2.31.1

REL11-12_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=REL11-12_v2-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From d37a1b097b4572cbbe8a76d39577cb9f96b5b21a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 10:58:26 +0900
Subject: [PATCH v2] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin. Therefore, if a process calls
ReplicationSlotsComputeRequiredXmin() with already_locked being false
and another process updates the replication slot xmin before the
process acquiring the lock, the slot xmin was overwritten with an old
value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.
---
 src/backend/replication/slot.c      |  6 ++++++
 src/backend/storage/ipc/procarray.c | 11 ++++-------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a2ca9b070e..98c9d72bd5 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -707,6 +707,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -739,6 +742,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	LWLockRelease(ReplicationSlotControlLock);
 
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bf44db1b6a..ee6be3afb1 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3077,21 +3077,18 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
  * Install limits to future computations of the xmin horizon to prevent vacuum
  * and HOT pruning from removing affected rows still needed by clients with
  * replication slots.
+ *
+ * NB: the caller must hold ProcArrayLock in an exclusive mode regardless of
+ * already_locked which is unused now but kept for ABI compatibility.
  */
 void
 ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
 								bool already_locked)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));
 
 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
-
-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
 }
 
 /*
-- 
2.31.1

#30

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#29)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Jan 31, 2023 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached patches for HEAD and backbranches. Please review them.

Shall we add a comment like the one below in
ReplicationSlotsComputeRequiredXmin()?
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f286918f69..e28d48bca7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -840,6 +840,13 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)

Assert(ReplicationSlotCtl != NULL);

+       /*
+        * It is possible that by the time we compute the agg_xmin
here and before
+        * updating replication_slot_xmin, the CreateInitDecodingContext() will
+        * compute and update replication_slot_xmin. So, we need to acquire
+        * ProcArrayLock here to avoid retreating the value of
replication_slot_xmin.
+        */
+

--
With Regards,
Amit Kapila.

#31

sawada.mshk@gmail.com

almost 3 years ago

In reply to: Amit Kapila (#30)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Jan 31, 2023 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 31, 2023 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached patches for HEAD and backbranches. Please review them.

Shall we add a comment like the one below in
ReplicationSlotsComputeRequiredXmin()?
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f286918f69..e28d48bca7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -840,6 +840,13 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)

Assert(ReplicationSlotCtl != NULL);

+       /*
+        * It is possible that by the time we compute the agg_xmin
here and before
+        * updating replication_slot_xmin, the CreateInitDecodingContext() will
+        * compute and update replication_slot_xmin. So, we need to acquire
+        * ProcArrayLock here to avoid retreating the value of
replication_slot_xmin.
+        */
+

Agreed. It looks good to me.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#32

sawada.mshk@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#31)

3 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Jan 31, 2023 at 3:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Jan 31, 2023 at 3:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jan 31, 2023 at 11:12 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Jan 30, 2023 at 9:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached patches for HEAD and backbranches. Please review them.
Shall we add a comment like the one below in
ReplicationSlotsComputeRequiredXmin()?
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f286918f69..e28d48bca7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -840,6 +840,13 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
Assert(ReplicationSlotCtl != NULL);
+       /*
+        * It is possible that by the time we compute the agg_xmin
here and before
+        * updating replication_slot_xmin, the CreateInitDecodingContext() will
+        * compute and update replication_slot_xmin. So, we need to acquire
+        * ProcArrayLock here to avoid retreating the value of
replication_slot_xmin.
+        */
+
Agreed. It looks good to me.

Attached updated patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

master_v3-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=master_v3-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From e06e44ce9930793f5a0383580db8ebb3e9b6a6b4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 10:25:57 +0900
Subject: [PATCH v3] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, if already_locked is false,
ReplicationSlotsComputeRequiredXmin() computed the oldest xmin across
all slots while not holding the ProcArrayLock and acquires the
ProcArrayLock just before updating the
replication_slot_xmin. Therefore, it was possible that by the time a
process computes the oldest xmin and before updating the
replication_slot_xmin, another process computes and updates it. As a
result, the replication_slot_xmin could be overwritten with an old
value and retreated.

In the reported failure, after a walsender who was creating a
replication slot updated the replication_slot_xmin via
CreateInitDecodingContext(), another walsender overwrote it with
InvalidTransactionId. Then the walsender creating the replication slot
ended up computing the oldest safe decoding transaction id without
considering the replication_slot_xmin. That led to an error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to
240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding the ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.

Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 11
---
 src/backend/replication/slot.c      | 15 ++++++++++++++-
 src/backend/storage/ipc/procarray.c | 13 +++----------
 src/include/storage/procarray.h     |  2 +-
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f286918f69..063f6aa95c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -840,6 +840,16 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	Assert(ReplicationSlotCtl != NULL);
 
+	/*
+	 * It is possible that by the time we compute the agg_xmin here and before
+	 * updating replication_slot_xmin, the CreateInitDecodingContext() will
+	 * compute and update replication_slot_xmin. So, we need to acquire
+	 * ProcArrayLock here to avoid retreating the value of
+	 * replication_slot_xmin.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
@@ -878,7 +888,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 
 	LWLockRelease(ReplicationSlotControlLock);
 
-	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }
 
 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4340bf9641..a9e4f59440 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3896,23 +3896,16 @@ TerminateOtherDBBackends(Oid databaseId)
  *
  * Install limits to future computations of the xmin horizon to prevent vacuum
  * and HOT pruning from removing affected rows still needed by clients with
- * replication slots.
+ * replication slots. The caller must hold ProcArrayLock in exclusive mode.
  */
 void
-ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
-								bool already_locked)
+ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));
 
 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
 
-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
-
 	elog(DEBUG1, "xmin required by slots: data %u, catalog %u",
 		 xmin, catalog_xmin);
 }
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index d8cae3ce1c..b7554f1b53 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -91,7 +91,7 @@ extern void XidCacheRemoveRunningXids(TransactionId xid,
 									  TransactionId latestXid);
 
 extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
-											TransactionId catalog_xmin, bool already_locked);
+											TransactionId catalog_xmin);
 
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
 											TransactionId *catalog_xmin);
-- 
2.31.1

REL11-12_v3-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=REL11-12_v3-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From 73f6cbef7125bd4e95b11397d58af182583979a8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 10:58:26 +0900
Subject: [PATCH v3] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, if already_locked is false,
ReplicationSlotsComputeRequiredXmin() computed the oldest xmin across
all slots while not holding the ProcArrayLock and acquires the
ProcArrayLock just before updating the
replication_slot_xmin. Therefore, it was possible that by the time a
process computes the oldest xmin and before updating the
replication_slot_xmin, another process computes and updates it. As a
result, the replication_slot_xmin could be overwritten with an old
value and retreated.

In the reported failure, after a walsender who was creating a
replication slot updated the replication_slot_xmin via
CreateInitDecodingContext(), another walsender overwrote it with
InvalidTransactionId. Then the walsender creating the replication slot
ended up computing the oldest safe decoding transaction id without
considering the replication_slot_xmin. That led to an error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to
240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding the ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.

Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 11
---
 src/backend/replication/slot.c      | 13 +++++++++++++
 src/backend/storage/ipc/procarray.c | 11 ++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index a2ca9b070e..f581247dd8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -707,6 +707,16 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)

 	Assert(ReplicationSlotCtl != NULL);

+	/*
+	 * It is possible that by the time we compute the agg_xmin here and before
+	 * updating replication_slot_xmin, the CreateInitDecodingContext() will
+	 * compute and update replication_slot_xmin. So, we need to acquire
+	 * ProcArrayLock here to avoid retreating the value of
+	 * replication_slot_xmin.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

 	for (i = 0; i < max_replication_slots; i++)
@@ -739,6 +749,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	LWLockRelease(ReplicationSlotControlLock);

 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }

 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bf44db1b6a..ee6be3afb1 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3077,21 +3077,18 @@ CountOtherDBBackends(Oid databaseId, int *nbackends, int *nprepared)
  * Install limits to future computations of the xmin horizon to prevent vacuum
  * and HOT pruning from removing affected rows still needed by clients with
  * replication slots.
+ *
+ * NB: the caller must hold ProcArrayLock in an exclusive mode regardless of
+ * already_locked which is unused now but kept for ABI compatibility.
  */
 void
 ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
 								bool already_locked)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));

 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;
-
-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
 }

 /*
-- 
2.31.1

REL13-15_v3-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=REL13-15_v3-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From 58fa583c5496390e1919b6a3fca975d8de45cb0e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 10:58:26 +0900
Subject: [PATCH v3] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, if already_locked is false,
ReplicationSlotsComputeRequiredXmin() computed the oldest xmin across
all slots while not holding the ProcArrayLock and acquires the
ProcArrayLock just before updating the
replication_slot_xmin. Therefore, it was possible that by the time a
process computes the oldest xmin and before updating the
replication_slot_xmin, another process computes and updates it. As a
result, the replication_slot_xmin could be overwritten with an old
value and retreated.

In the reported failure, after a walsender who was creating a
replication slot updated the replication_slot_xmin via
CreateInitDecodingContext(), another walsender overwrote it with
InvalidTransactionId. Then the walsender creating the replication slot
ended up computing the oldest safe decoding transaction id without
considering the replication_slot_xmin. That led to an error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to
240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding the ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.

Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 11
---
 src/backend/replication/slot.c      | 13 +++++++++++++
 src/backend/storage/ipc/procarray.c | 11 ++++-------
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 80d96db8eb..916190ae72 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -839,6 +839,16 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)

 	Assert(ReplicationSlotCtl != NULL);

+	/*
+	 * It is possible that by the time we compute the agg_xmin here and before
+	 * updating replication_slot_xmin, the CreateInitDecodingContext() will
+	 * compute and update replication_slot_xmin. So, we need to acquire
+	 * ProcArrayLock here to avoid retreating the value of
+	 * replication_slot_xmin.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

 	for (i = 0; i < max_replication_slots; i++)
@@ -878,6 +888,9 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	LWLockRelease(ReplicationSlotControlLock);

 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ProcArrayLock);
 }

 /*
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 655d11e2f9..4b9306f917 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -3896,22 +3896,19 @@ TerminateOtherDBBackends(Oid databaseId)
  * Install limits to future computations of the xmin horizon to prevent vacuum
  * and HOT pruning from removing affected rows still needed by clients with
  * replication slots.
+ *
+ * NB: the caller must hold ProcArrayLock in an exclusive mode regardless of
+ * already_locked which is unused now but kept for ABI compatibility.
  */
 void
 ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
 								bool already_locked)
 {
-	Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
-	if (!already_locked)
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));

 	procArray->replication_slot_xmin = xmin;
 	procArray->replication_slot_catalog_xmin = catalog_xmin;

-	if (!already_locked)
-		LWLockRelease(ProcArrayLock);
-
 	elog(DEBUG1, "xmin required by slots: data %u, catalog %u",
 		 xmin, catalog_xmin);
 }
-- 
2.31.1

#33

[1]: /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2]: /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#32)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1]/messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com and
Sawada-San also reproduced it, see [2]/messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com.

--
With Regards,
Amit Kapila.

#34

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Masahiko Sawada (#32)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

In back-branch patches, the change is as below:
+ *
+ * NB: the caller must hold ProcArrayLock in an exclusive mode regardless of
+ * already_locked which is unused now but kept for ABI compatibility.
  */
 void
 ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  bool already_locked)
 {
- Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));
-
- if (!already_locked)
- LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+ Assert(LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE));

This change looks odd to me. I think it would be better to pass
'already_locked' as true from the caller.

--
With Regards,
Amit Kapila.

#35

andres@anarazel.de

almost 3 years ago

In reply to: Amit Kapila (#33)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1] and
Sawada-San also reproduced it, see [2].

[1] - /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2] - /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

This is particularly not great because we need to acquire
ReplicationSlotControlLock while already holding ProcArrayLock.

But clearly there's a pretty large hole in the lock protection right now. I'm
a bit confused about why we (Robert and I, or just I) thought it's ok to do it
this way.

I wonder if we could instead invert the locks, and hold
ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin(). That'd mean
that already_locked = true callers have to do a bit more work (we have to be
sure the locks are always acquired in the same order, or we end up in
unresolved deadlock land), but I think we can live with that.

This would still allow concurrent invocations of
ReplicationSlotsComputeRequiredXmin() come up with slightly different values,
but that's possible with the proposed patch as well, as effective_xmin is
updated without any of the other locks. But I don't see a problem with that.

Greetings,

Andres Freund

#36

andres@anarazel.de

almost 3 years ago

In reply to: Andres Freund (#35)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi,

On 2023-02-07 11:49:03 -0800, Andres Freund wrote:

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1] and
Sawada-San also reproduced it, see [2].

[1] - /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2] - /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

Separately from this change:

I wonder if we ought to change the setup in CreateInitDecodingContext() to be a
bit less intricate. One idea:

Instead of having GetOldestSafeDecodingTransactionId() compute a value, that
we then enter into a slot, that then computes the global horizon via
ReplicationSlotsComputeRequiredXmin(), we could have a successor to
GetOldestSafeDecodingTransactionId() change procArray->replication_slot_xmin
(if needed).

As long as CreateInitDecodingContext() prevents a concurent
ReplicationSlotsComputeRequiredXmin(), by holding ReplicationSlotControlLock
exclusively, that should suffice to ensure that no "wrong" horizon was
determined / no needed rows have been removed. And we'd not need a lock nested
inside ProcArrayLock anymore.

Not sure if it's sufficiently better to be worth bothering with though :(

Greetings,

Andres Freund

#37

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Andres Freund (#35)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Wed, Feb 8, 2023 at 1:19 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1] and
Sawada-San also reproduced it, see [2].

[1] - /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2] - /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

This is particularly not great because we need to acquire
ReplicationSlotControlLock while already holding ProcArrayLock.

But clearly there's a pretty large hole in the lock protection right now. I'm
a bit confused about why we (Robert and I, or just I) thought it's ok to do it
this way.

I wonder if we could instead invert the locks, and hold
ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().

Along with inverting, doesn't this mean that we need to acquire
ReplicationSlotControlLock in Exclusive mode instead of acquiring it
in shared mode? My understanding of the above locking scheme is that
in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
release it after releasing ProcArrayLock. Then,
ReplicationSlotsComputeRequiredXmin() acquires
ReplicationSlotControlLock in Exclusive mode only when already_locked
is false and releases it after a call to
ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
won't change.

I don't think just inverting the order without changing the lock mode
will solve the problem because still apply worker will be able to
override the replication_slot_xmin value.

--
With Regards,
Amit Kapila.

#38

amit.kapila16@gmail.com

almost 3 years ago

In reply to: Andres Freund (#36)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Wed, Feb 8, 2023 at 1:35 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-07 11:49:03 -0800, Andres Freund wrote:

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1] and
Sawada-San also reproduced it, see [2].

[1] - /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2] - /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

Separately from this change:

I wonder if we ought to change the setup in CreateInitDecodingContext() to be a
bit less intricate. One idea:

Instead of having GetOldestSafeDecodingTransactionId() compute a value, that
we then enter into a slot, that then computes the global horizon via
ReplicationSlotsComputeRequiredXmin(), we could have a successor to
GetOldestSafeDecodingTransactionId() change procArray->replication_slot_xmin
(if needed).

As long as CreateInitDecodingContext() prevents a concurent
ReplicationSlotsComputeRequiredXmin(), by holding ReplicationSlotControlLock
exclusively, that should suffice to ensure that no "wrong" horizon was
determined / no needed rows have been removed. And we'd not need a lock nested
inside ProcArrayLock anymore.

Not sure if it's sufficiently better to be worth bothering with though :(

I am also not sure because it would improve concurrency for
CreateInitDecodingContext() which shouldn't be called at a higher
frequency. Also, to some extent, the current coding or the approach we
are discussing is easier to follow as we would always update
procArray->replication_slot_xmin after checking all the slots.

--
With Regards,
Amit Kapila.

#39

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-05-15%2020%3A55%3A17

sawada.mshk@gmail.com

almost 3 years ago

In reply to: Amit Kapila (#37)

1 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Wed, Feb 8, 2023 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 8, 2023 at 1:19 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1] and
Sawada-San also reproduced it, see [2].

[1] - /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2] - /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

This is particularly not great because we need to acquire
ReplicationSlotControlLock while already holding ProcArrayLock.

But clearly there's a pretty large hole in the lock protection right now. I'm
a bit confused about why we (Robert and I, or just I) thought it's ok to do it
this way.

I wonder if we could instead invert the locks, and hold
ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().

Along with inverting, doesn't this mean that we need to acquire
ReplicationSlotControlLock in Exclusive mode instead of acquiring it
in shared mode? My understanding of the above locking scheme is that
in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
release it after releasing ProcArrayLock. Then,
ReplicationSlotsComputeRequiredXmin() acquires
ReplicationSlotControlLock in Exclusive mode only when already_locked
is false and releases it after a call to
ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
won't change.

I've attached the patch of this idea for discussion. In
GetOldestSafeDecodingTransactionId() called by
CreateInitDecodingContext(), we hold ReplicationSlotControlLock,
ProcArrayLock, and XidGenLock at a time. So we would need to be
careful about the ordering.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

fix_concurrent_slot_xmin_update.patchapplication/octet-stream; name=fix_concurrent_slot_xmin_update.patchDownload

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1a58dd7649..607a605075 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -386,11 +386,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -403,6 +403,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -417,6 +418,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index f286918f69..852ec9564a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -828,8 +828,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and
+ * the ProcArrayLock have already been acquired exclusively.
+ *
+ * Note that the ReplicationSlotControlLock must be locked first to avoid
+ * deadlocks.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -839,8 +842,26 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock exclusive until after updating
+	 * the slot xmin values, so no backend can compute and update the new
+	 * value concurrently.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively,
+	 * compute the xmin values while holding the ReplicationSlotControlLock
+	 * in shared mode, and update the slot xmin values, but it could increase
+	 * lock contention on the ProcArrayLock, which is not great since this
+	 * function can be called at non-negligible frequency.
+	 *
+	 * We instead increase lock contention on the ReplicationSlotControlLock
+	 * but it would be less harmful.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -876,9 +897,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*

#40

Daniel Gustafsson

daniel@yesql.se

over 2 years ago

In reply to: Masahiko Sawada (#39)

Re: Assertion failure in SnapBuildInitialSnapshot()

On 9 Feb 2023, at 07:32, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I've attached the patch of this idea for discussion.

Amit, Andres: have you had a chance to look at the updated version of this
patch?

--
Daniel Gustafsson

#41

Robert Haas

robertmhaas@gmail.com

about 2 years ago

In reply to: Andres Freund (#35)

Re: Assertion failure in SnapBuildInitialSnapshot()

This thread has gone for about a year here without making any
progress, which isn't great.

On Tue, Feb 7, 2023 at 2:49 PM Andres Freund <andres@anarazel.de> wrote:

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

Maybe, but it would be good to have some data indicating whether this
is really an issue.

I wonder if we could instead invert the locks, and hold
ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin(). That'd mean
that already_locked = true callers have to do a bit more work (we have to be
sure the locks are always acquired in the same order, or we end up in
unresolved deadlock land), but I think we can live with that.

This seems like it could be made to work, but there's apparently a
shortage of people willing to write the patch.

As another thought, Masahiko-san writes in his proposed commit message:

"As a result, the replication_slot_xmin could be overwritten with an
old value and retreated."

But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does
nothing?

--
Robert Haas
EDB: http://www.enterprisedb.com

#42

vignesh C

vignesh21@gmail.com

about 2 years ago

In reply to: Masahiko Sawada (#39)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, 9 Feb 2023 at 12:02, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Feb 8, 2023 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 8, 2023 at 1:19 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1] and
Sawada-San also reproduced it, see [2].

[1] - /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2] - /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

This is particularly not great because we need to acquire
ReplicationSlotControlLock while already holding ProcArrayLock.

But clearly there's a pretty large hole in the lock protection right now. I'm
a bit confused about why we (Robert and I, or just I) thought it's ok to do it
this way.

I wonder if we could instead invert the locks, and hold
ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().

Along with inverting, doesn't this mean that we need to acquire
ReplicationSlotControlLock in Exclusive mode instead of acquiring it
in shared mode? My understanding of the above locking scheme is that
in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
release it after releasing ProcArrayLock. Then,
ReplicationSlotsComputeRequiredXmin() acquires
ReplicationSlotControlLock in Exclusive mode only when already_locked
is false and releases it after a call to
ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
won't change.

I've attached the patch of this idea for discussion. In
GetOldestSafeDecodingTransactionId() called by
CreateInitDecodingContext(), we hold ReplicationSlotControlLock,
ProcArrayLock, and XidGenLock at a time. So we would need to be
careful about the ordering.

I have changed the status of the patch to "Waiting on Author" as
Robert's issues were not addressed yet. Feel free to change the status
accordingly after addressing them.

Regards,
Vignesh

#43

vignesh C

vignesh21@gmail.com

almost 2 years ago

In reply to: vignesh C (#42)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, 11 Jan 2024 at 19:55, vignesh C <vignesh21@gmail.com> wrote:

On Thu, 9 Feb 2023 at 12:02, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Feb 8, 2023 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Feb 8, 2023 at 1:19 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-02-01 11:23:57 +0530, Amit Kapila wrote:

On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Attached updated patches.

Thanks, Andres, others, do you see a better way to fix this problem? I
have reproduced it manually and the steps are shared at [1] and
Sawada-San also reproduced it, see [2].

[1] - /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
[2] - /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com

Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over
the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass
already_locked=true worry me a lot less, because obviously that's not a very
frequent operation.

This is particularly not great because we need to acquire
ReplicationSlotControlLock while already holding ProcArrayLock.

But clearly there's a pretty large hole in the lock protection right now. I'm
a bit confused about why we (Robert and I, or just I) thought it's ok to do it
this way.

I wonder if we could instead invert the locks, and hold
ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().

Along with inverting, doesn't this mean that we need to acquire
ReplicationSlotControlLock in Exclusive mode instead of acquiring it
in shared mode? My understanding of the above locking scheme is that
in CreateInitDecodingContext(), we acquire ReplicationSlotControlLock
in Exclusive mode before acquiring ProcArrayLock in Exclusive mode and
release it after releasing ProcArrayLock. Then,
ReplicationSlotsComputeRequiredXmin() acquires
ReplicationSlotControlLock in Exclusive mode only when already_locked
is false and releases it after a call to
ProcArraySetReplicationSlotXmin(). ProcArraySetReplicationSlotXmin()
won't change.

I've attached the patch of this idea for discussion. In
GetOldestSafeDecodingTransactionId() called by
CreateInitDecodingContext(), we hold ReplicationSlotControlLock,
ProcArrayLock, and XidGenLock at a time. So we would need to be
careful about the ordering.

I have changed the status of the patch to "Waiting on Author" as
Robert's issues were not addressed yet. Feel free to change the status
accordingly after addressing them.

The patch which you submitted has been awaiting your attention for
quite some time now. As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible. Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh

#44

Alexander Lakhin

exclusion@gmail.com

over 1 year ago

In reply to: vignesh C (#43)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hello,

01.02.2024 21:20, vignesh C wrote:

The patch which you submitted has been awaiting your attention for
quite some time now. As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible. Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

While analyzing buildfarm failures, I found [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-05-15%2020%3A55%3A17, which demonstrates the
assertion failure discussed here:
---
031_column_list_publisher.log
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File:
"/home/bf/bf-build/skink/REL_15_STABLE/pgsql.build/../pgsql/src/backend/replication/logical/snapbuild.c", Line: 614,
PID: 1882382)
---

I've managed to reproduce the assertion failure on REL_15_STABLE with the
following modification:
@@ -3928,6 +3928,7 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
{
Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));

+pg_usleep(1000);
if (!already_locked)
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

using the script:
numjobs=100
createdb db
export PGDATABASE=db

for ((i=1;i<=100;i++)); do
echo "iteration $i"

for ((j=1;j<=numjobs;j++)); do
echo "
SELECT pg_create_logical_replication_slot('s$j', 'test_decoding');
SELECT txid_current();
" | psql >>/dev/null 2>&1 &

echo "
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT slot$j LOGICAL test_decoding USE_SNAPSHOT;
" | psql -d "dbname=db replication=database" >>/dev/null 2>&1 &
done
wait

for ((j=1;j<=numjobs;j++)); do
echo "
DROP_REPLICATION_SLOT slot$j;
" | psql -d "dbname=db replication=database" >/dev/null

echo "SELECT pg_drop_replication_slot('s$j');" | psql >/dev/null
done

grep 'TRAP' server.log && break;
done

(with
wal_level = logical
max_replication_slots = 200
max_wal_senders = 200
in postgresql.conf)

iteration 18
ERROR: replication slot "slot13" is active for PID 538431
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "snapbuild.c", Line: 614, PID: 538431)

I've also confirmed that fix_concurrent_slot_xmin_update.patch fixes the
issue.

Best regards,
Alexander

#45

Pradeep Kumar

spradeepkumar29@gmail.com

3 months ago

In reply to: Alexander Lakhin (#44)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi All,
In this thread
</messages/by-id/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com>
they
proposed fix_concurrent_slot_xmin_update.patch will solve this assert
failure. After applying this patch I execute pg_sync_replication_slots()
(which calls SyncReplicationSlots → synchronize_slots() →
synchronize_one_slot() → ReplicationSlotsComputeRequiredXmin(true)) can hit
an assertion failure in ReplicationSlotsComputeRequiredXmin() because the
ReplicationSlotControlLock is not held in that code path. By default
sync_replication_slots is off, so the background slot-sync worker is not
spawned; invoking the UDF directly exercises the path without the lock. I
have a small patch that acquires ReplicationSlotControlLock in the manual
sync path; that stops the assert.

Call Stack :
TRAP: failed Assert("!already_locked ||
(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.
c", Line: 1061, PID: 67056
0 postgres 0x000000010104aad4
ExceptionalCondition + 216
1 postgres 0x0000000100d8718c
ReplicationSlotsComputeRequiredXmin + 180
2 postgres 0x0000000100d6fba8
synchronize_one_slot + 1488
3 postgres 0x0000000100d6e8cc
synchronize_slots + 1480
4 postgres 0x0000000100d6efe4
SyncReplicationSlots + 164
5 postgres 0x0000000100d8da84
pg_sync_replication_slots + 476
6 postgres 0x0000000100b34c58 ExecInterpExpr +
2388
7 postgres 0x0000000100b33ee8
ExecInterpExprStillValid + 76
8 postgres 0x00000001008acd5c
ExecEvalExprSwitchContext + 64
9 postgres 0x0000000100b54d48 ExecProject + 76
10 postgres 0x0000000100b925d4 ExecResult + 312
11 postgres 0x0000000100b5083c
ExecProcNodeFirst + 92
12 postgres 0x0000000100b48b88 ExecProcNode + 60
13 postgres 0x0000000100b44410 ExecutePlan + 184
14 postgres 0x0000000100b442dc
standard_ExecutorRun + 644
15 postgres 0x0000000100b44048 ExecutorRun + 104
16 postgres 0x0000000100e3053c PortalRunSelect
+ 308
17 postgres 0x0000000100e2ff40 PortalRun + 736
18 postgres 0x0000000100e2b21c
exec_simple_query + 1368
19 postgres 0x0000000100e2a42c PostgresMain +
2508
20 postgres 0x0000000100e22ce4
BackendInitialize + 0
21 postgres 0x0000000100d1fd4c
postmaster_child_launch + 304
22 postgres 0x0000000100d26d9c BackendStartup +
448
23 postgres 0x0000000100d23f18 ServerLoop + 372
24 postgres 0x0000000100d22f18 PostmasterMain +
6396
25 postgres 0x0000000100bcffd4 init_locale + 0
26 dyld 0x0000000186d82b98 start + 6076

The assert is raised inside ReplicationSlotsComputeRequiredXmin() because
that function expects either that already_locked is false (and it will
acquire what it needs), or that callers already hold both
ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In
the manual-sync path called by the UDF, neither lock is held, so the
assertion trips.

Why this happens:
The background slot sync worker (spawned when sync_replication_slots = on)
acquires the necessary locks before calling the routines that
update/compute slot xmins, so the worker path is safe.The manual path
through the SQL-callable UDF does not take the same locks before calling
synchronize_slots()/synchronize_one_slot(). As a result the invariant
assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading
to the assert.

Proposed fix:
In synchronize_slots() (the code path used by
SyncReplicationSlots()/pg_sync_replication_slots()), acquire
ReplicationSlotControlLock before any call that can end up calling
ReplicationSlotsComputeRequiredXmin(true).

Thanks and Regards
Pradeep

On Mon, Oct 27, 2025 at 3:09 PM Alexander Lakhin <exclusion@gmail.com>
wrote:

Show quoted text

Hello,

01.02.2024 21:20, vignesh C wrote:

The patch which you submitted has been awaiting your attention for
quite some time now. As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible. Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

While analyzing buildfarm failures, I found [1], which demonstrates the
assertion failure discussed here:
---
031_column_list_publisher.log
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid,
snap->xmin)", File:
"/home/bf/bf-build/skink/REL_15_STABLE/pgsql.build/../pgsql/src/backend/replication/logical/snapbuild.c",
Line: 614,
PID: 1882382)
---

I've managed to reproduce the assertion failure on REL_15_STABLE with the
following modification:
@@ -3928,6 +3928,7 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin,
TransactionId catalog_xmin,
{
Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));

+pg_usleep(1000);
if (!already_locked)
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

using the script:
numjobs=100
createdb db
export PGDATABASE=db

for ((i=1;i<=100;i++)); do
echo "iteration $i"

for ((j=1;j<=numjobs;j++)); do
echo "
SELECT pg_create_logical_replication_slot('s$j', 'test_decoding');
SELECT txid_current();
" | psql >>/dev/null 2>&1 &

echo "
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT slot$j LOGICAL test_decoding USE_SNAPSHOT;
" | psql -d "dbname=db replication=database" >>/dev/null 2>&1 &
done
wait

for ((j=1;j<=numjobs;j++)); do
echo "
DROP_REPLICATION_SLOT slot$j;
" | psql -d "dbname=db replication=database" >/dev/null

echo "SELECT pg_drop_replication_slot('s$j');" | psql >/dev/null
done

grep 'TRAP' server.log && break;
done

(with
wal_level = logical
max_replication_slots = 200
max_wal_senders = 200
in postgresql.conf)

iteration 18
ERROR: replication slot "slot13" is active for PID 538431
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid,
snap->xmin)", File: "snapbuild.c", Line: 614, PID: 538431)

I've also confirmed that fix_concurrent_slot_xmin_update.patch fixes the
issue.

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-05-15%2020%3A55%3A17

Best regards,
Alexander

#46

[1]: /messages/by-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com

sawada.mshk@gmail.com

2 months ago

In reply to: Pradeep Kumar (#45)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar <spradeepkumar29@gmail.com> wrote:

Hi All,
In this thread they proposed fix_concurrent_slot_xmin_update.patch will solve this assert failure. After applying this patch I execute pg_sync_replication_slots() (which calls SyncReplicationSlots → synchronize_slots() → synchronize_one_slot() → ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in ReplicationSlotsComputeRequiredXmin() because the ReplicationSlotControlLock is not held in that code path. By default sync_replication_slots is off, so the background slot-sync worker is not spawned; invoking the UDF directly exercises the path without the lock. I have a small patch that acquires ReplicationSlotControlLock in the manual sync path; that stops the assert.

Call Stack :
TRAP: failed Assert("!already_locked || (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) && LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.
c", Line: 1061, PID: 67056
0 postgres 0x000000010104aad4 ExceptionalCondition + 216
1 postgres 0x0000000100d8718c ReplicationSlotsComputeRequiredXmin + 180
2 postgres 0x0000000100d6fba8 synchronize_one_slot + 1488
3 postgres 0x0000000100d6e8cc synchronize_slots + 1480
4 postgres 0x0000000100d6efe4 SyncReplicationSlots + 164
5 postgres 0x0000000100d8da84 pg_sync_replication_slots + 476
6 postgres 0x0000000100b34c58 ExecInterpExpr + 2388
7 postgres 0x0000000100b33ee8 ExecInterpExprStillValid + 76
8 postgres 0x00000001008acd5c ExecEvalExprSwitchContext + 64
9 postgres 0x0000000100b54d48 ExecProject + 76
10 postgres 0x0000000100b925d4 ExecResult + 312
11 postgres 0x0000000100b5083c ExecProcNodeFirst + 92
12 postgres 0x0000000100b48b88 ExecProcNode + 60
13 postgres 0x0000000100b44410 ExecutePlan + 184
14 postgres 0x0000000100b442dc standard_ExecutorRun + 644
15 postgres 0x0000000100b44048 ExecutorRun + 104
16 postgres 0x0000000100e3053c PortalRunSelect + 308
17 postgres 0x0000000100e2ff40 PortalRun + 736
18 postgres 0x0000000100e2b21c exec_simple_query + 1368
19 postgres 0x0000000100e2a42c PostgresMain + 2508
20 postgres 0x0000000100e22ce4 BackendInitialize + 0
21 postgres 0x0000000100d1fd4c postmaster_child_launch + 304
22 postgres 0x0000000100d26d9c BackendStartup + 448
23 postgres 0x0000000100d23f18 ServerLoop + 372
24 postgres 0x0000000100d22f18 PostmasterMain + 6396
25 postgres 0x0000000100bcffd4 init_locale + 0
26 dyld 0x0000000186d82b98 start + 6076

The assert is raised inside ReplicationSlotsComputeRequiredXmin() because that function expects either that already_locked is false (and it will acquire what it needs), or that callers already hold both ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the manual-sync path called by the UDF, neither lock is held, so the assertion trips.

Why this happens:
The background slot sync worker (spawned when sync_replication_slots = on) acquires the necessary locks before calling the routines that update/compute slot xmins, so the worker path is safe.The manual path through the SQL-callable UDF does not take the same locks before calling synchronize_slots()/synchronize_one_slot(). As a result the invariant assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading to the assert.

Proposed fix:
In synchronize_slots() (the code path used by SyncReplicationSlots()/pg_sync_replication_slots()), acquire ReplicationSlotControlLock before any call that can end up calling ReplicationSlotsComputeRequiredXmin(true).

It would be great if we have a test case for this issue possibly using
injection points.

Also, I think it's worth considering the idea Robert shared before[1]/messages/by-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com:

---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does
nothing?
---

We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582,
and it sounds reasonable to me that ProcArraySetReplicationSlotXmin()
refuses to retreat the values.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#47

Pradeep Kumar

spradeepkumar29@gmail.com

2 months ago

In reply to: Masahiko Sawada (#46)

1 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

Hi ,
Thanks for reviewing this issue, After applying the patch
fix_concurrent_slot_xmin_update that you have submitted before [1]/messages/by-id/CAD21AoDi1fGGpie3vpxaHNiRdbsac2pJBbZAiLBay+Q=WArbRg@mail.gmail.com on
REL_18_STABLE, Here are the below steps to reproduce this issue manually

Steps :
1) <postgres_bin_directory>/initdb -D primary
2) echo "wal_level=logical" > primary/postgresql.conf (Edit the
postgresql.conf to add wal_level = logical)
3) echo "hot_standby_feeback = on" > primary/postgresql.conf (Edit the
postgresql.conf to hot_standby_feedback = on)
4) Start the postgres server (Primary)
5) Connect Primary => <postgres_bin_directory>/psql postgres
6) Execute UDF "SELECT pg_create_physical_replication_slot('standby_slot');
" on primary
7) Execute UDF "SELECT
pg_create_logical_replication_slot('test_logical_slot', 'pgoutput', false,
false, true); " on primary
8) <postgres_bin_directory>/pg_basebackup -h localhost -p 5432 -D 'standby'
-R (get a basebackup of primary to attach as replica to the primary)
9) Edit standby/postgresql.conf as "wal_level=logical" ,
"primary_slot_name=standby_slot"
10) Edit standby/postgresql.auto.conf as "dbname=postgres" in
primary_conninfo GUC
11) Start the standby server
12) execute UDF "SELECT pg_sync_replication_slots();" => this will leads to
assert failure

Here I attached the updated patch to solve this issue.

Thanks and Regards
Pradeep

[1]: /messages/by-id/CAD21AoDi1fGGpie3vpxaHNiRdbsac2pJBbZAiLBay+Q=WArbRg@mail.gmail.com
/messages/by-id/CAD21AoDi1fGGpie3vpxaHNiRdbsac2pJBbZAiLBay+Q=WArbRg@mail.gmail.com

On Thu, Oct 30, 2025 at 4:31 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

Show quoted text

On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar <spradeepkumar29@gmail.com>
wrote:

Hi All,
In this thread they proposed fix_concurrent_slot_xmin_update.patch will

solve this assert failure. After applying this patch I execute
pg_sync_replication_slots() (which calls SyncReplicationSlots →
synchronize_slots() → synchronize_one_slot() →
ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in
ReplicationSlotsComputeRequiredXmin() because the
ReplicationSlotControlLock is not held in that code path. By default
sync_replication_slots is off, so the background slot-sync worker is not
spawned; invoking the UDF directly exercises the path without the lock. I
have a small patch that acquires ReplicationSlotControlLock in the manual
sync path; that stops the assert.

Call Stack :
TRAP: failed Assert("!already_locked ||

(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.

c", Line: 1061, PID: 67056
0 postgres 0x000000010104aad4

ExceptionalCondition + 216

1 postgres 0x0000000100d8718c

ReplicationSlotsComputeRequiredXmin + 180

2 postgres 0x0000000100d6fba8

synchronize_one_slot + 1488

3 postgres 0x0000000100d6e8cc

synchronize_slots + 1480

4 postgres 0x0000000100d6efe4

SyncReplicationSlots + 164

5 postgres 0x0000000100d8da84

pg_sync_replication_slots + 476

6 postgres 0x0000000100b34c58

ExecInterpExpr + 2388

7 postgres 0x0000000100b33ee8

ExecInterpExprStillValid + 76

8 postgres 0x00000001008acd5c

ExecEvalExprSwitchContext + 64

9 postgres 0x0000000100b54d48 ExecProject +

76

10 postgres 0x0000000100b925d4 ExecResult +

312

11 postgres 0x0000000100b5083c

ExecProcNodeFirst + 92

12 postgres 0x0000000100b48b88 ExecProcNode

+ 60

13 postgres 0x0000000100b44410 ExecutePlan +

184

14 postgres 0x0000000100b442dc

standard_ExecutorRun + 644

15 postgres 0x0000000100b44048 ExecutorRun +

104

16 postgres 0x0000000100e3053c

PortalRunSelect + 308

17 postgres 0x0000000100e2ff40 PortalRun +

736

18 postgres 0x0000000100e2b21c

exec_simple_query + 1368

19 postgres 0x0000000100e2a42c PostgresMain

+ 2508

20 postgres 0x0000000100e22ce4

BackendInitialize + 0

21 postgres 0x0000000100d1fd4c

postmaster_child_launch + 304

22 postgres 0x0000000100d26d9c

BackendStartup + 448

23 postgres 0x0000000100d23f18 ServerLoop +

372

24 postgres 0x0000000100d22f18

PostmasterMain + 6396

25 postgres 0x0000000100bcffd4 init_locale +

0

26 dyld 0x0000000186d82b98 start + 6076

The assert is raised inside ReplicationSlotsComputeRequiredXmin()

because that function expects either that already_locked is false (and it
will acquire what it needs), or that callers already hold both
ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In
the manual-sync path called by the UDF, neither lock is held, so the
assertion trips.

Why this happens:
The background slot sync worker (spawned when sync_replication_slots =

on) acquires the necessary locks before calling the routines that
update/compute slot xmins, so the worker path is safe.The manual path
through the SQL-callable UDF does not take the same locks before calling
synchronize_slots()/synchronize_one_slot(). As a result the invariant
assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading
to the assert.

Proposed fix:
In synchronize_slots() (the code path used by

SyncReplicationSlots()/pg_sync_replication_slots()), acquire
ReplicationSlotControlLock before any call that can end up calling
ReplicationSlotsComputeRequiredXmin(true).

It would be great if we have a test case for this issue possibly using
injection points.

Also, I think it's worth considering the idea Robert shared before[1]:

---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does
nothing?
---

We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582,
and it sounds reasonable to me that ProcArraySetReplicationSlotXmin()
refuses to retreat the values.

Regards,

[1]
/messages/by-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

fix_concurrent_slot_xmin_update_version2.patchapplication/octet-stream; name=fix_concurrent_slot_xmin_update_version2.patchDownload

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1eb798f3e9..e3a2a1ffdd4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -405,11 +405,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -422,6 +422,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -436,6 +437,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 61b2e9396aa..fee5aad7574 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -775,7 +775,8 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		SpinLockRelease(&slot->mutex);
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
-
+		
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -784,6 +785,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		update_and_persist_local_synced_slot(remote_slot, remote_dbid);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 101157ed8c9..303a40e61ea 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1129,8 +1129,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and
+ * the ProcArrayLock have already been acquired exclusively.
+ *
+ * Note that the ReplicationSlotControlLock must be locked first to avoid
+ * deadlocks.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1140,8 +1143,26 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock exclusive until after updating
+	 * the slot xmin values, so no backend can compute and update the new
+	 * value concurrently.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively,
+	 * compute the xmin values while holding the ReplicationSlotControlLock
+	 * in shared mode, and update the slot xmin values, but it could increase
+	 * lock contention on the ProcArrayLock, which is not great since this
+	 * function can be called at non-negligible frequency.
+	 *
+	 * We instead increase lock contention on the ReplicationSlotControlLock
+	 * but it would be less harmful.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1176,9 +1197,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*

#48

houzj.fnst@fujitsu.com

2 months ago

In reply to: Masahiko Sawada (#46)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Thursday, October 30, 2025 7:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar
<spradeepkumar29@gmail.com> wrote:

Hi All,
In this thread they proposed fix_concurrent_slot_xmin_update.patch will

solve this assert failure. After applying this patch I execute
pg_sync_replication_slots() (which calls SyncReplicationSlots →
synchronize_slots() → synchronize_one_slot() →
ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in
ReplicationSlotsComputeRequiredXmin() because the
ReplicationSlotControlLock is not held in that code path. By default
sync_replication_slots is off, so the background slot-sync worker is not
spawned; invoking the UDF directly exercises the path without the lock. I have
a small patch that acquires ReplicationSlotControlLock in the manual sync
path; that stops the assert.

The assert is raised inside ReplicationSlotsComputeRequiredXmin()

because that function expects either that already_locked is false (and it will
acquire what it needs), or that callers already hold both
ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the
manual-sync path called by the UDF, neither lock is held, so the assertion trips.

Why this happens:
The background slot sync worker (spawned when sync_replication_slots =

on) acquires the necessary locks before calling the routines that
update/compute slot xmins, so the worker path is safe.The manual path
through the SQL-callable UDF does not take the same locks before calling
synchronize_slots()/synchronize_one_slot(). As a result the invariant
assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading
to the assert.

Proposed fix:
In synchronize_slots() (the code path used by

SyncReplicationSlots()/pg_sync_replication_slots()), acquire
ReplicationSlotControlLock before any call that can end up calling
ReplicationSlotsComputeRequiredXmin(true).

It would be great if we have a test case for this issue possibly using injection
points.

Also, I think it's worth considering the idea Robert shared before[1]:

---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does nothing?
---

We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582, and it
sounds reasonable to me that ProcArraySetReplicationSlotXmin() refuses to
retreat the values.

I reviewed the thread and think that we could not straightforwardly apply a
similar strategy to prevent the retreat of xmin/catalog_xmin here. This is
because we maintain a central value
(replication_slot_xmin/replication_slot_catalog_xmin) in
ProcArraySetReplicationSlotXmin, where the value is expected to decrease when
certain slots are dropped or invalidated. Therefore, I think we might need to
continue with the original proposal to invert the lock and also address the code
path for slotsync.

[1]
/messages/by-id/CA+TgmoYLzJxCEa0aCan3KR7o
_25G52cbqw-90Q0VGRmV3a8XGQ%40mail.gmail.com

Best Regards,
Hou zj

#49

amit.kapila16@gmail.com

2 months ago

In reply to: Zhijie Hou (Fujitsu) (#48)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, Nov 6, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, October 30, 2025 7:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Also, I think it's worth considering the idea Robert shared before[1]:

---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does nothing?
---

We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582, and it
sounds reasonable to me that ProcArraySetReplicationSlotXmin() refuses to
retreat the values.

I reviewed the thread and think that we could not straightforwardly apply a
similar strategy to prevent the retreat of xmin/catalog_xmin here. This is
because we maintain a central value
(replication_slot_xmin/replication_slot_catalog_xmin) in
ProcArraySetReplicationSlotXmin, where the value is expected to decrease when
certain slots are dropped or invalidated.

Good point. This can happen when the last slot is invalidated or dropped.

Therefore, I think we might need to

continue with the original proposal to invert the lock and also address the code
path for slotsync.

+1.

--
With Regards,
Amit Kapila.

#50

sawada.mshk@gmail.com

2 months ago

In reply to: Amit Kapila (#49)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, Nov 6, 2025 at 2:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 6, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, October 30, 2025 7:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Also, I think it's worth considering the idea Robert shared before[1]:

---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does nothing?
---

We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582, and it
sounds reasonable to me that ProcArraySetReplicationSlotXmin() refuses to
retreat the values.

I reviewed the thread and think that we could not straightforwardly apply a
similar strategy to prevent the retreat of xmin/catalog_xmin here. This is
because we maintain a central value
(replication_slot_xmin/replication_slot_catalog_xmin) in
ProcArraySetReplicationSlotXmin, where the value is expected to decrease when
certain slots are dropped or invalidated.

Good point. This can happen when the last slot is invalidated or dropped.

After the last slot is invalidated or dropped, both slot_xmin and
slot_catalog_xmin values are set InvalidTransactionId. Then in this
case, these values are ignored when computing the oldest safe decoding
XID in GetOldestSafeDecodingTransactionId(), no? Or do you mean that
there is a case where slot_xmin and slot_catalog_xmin retreat to a
valid XID?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#51

[1]: /messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com
[2]: /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com

houzj.fnst@fujitsu.com

2 months ago

In reply to: Masahiko Sawada (#50)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Friday, November 7, 2025 2:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Nov 6, 2025 at 2:36 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Thu, Nov 6, 2025 at 12:03 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, October 30, 2025 7:01 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Also, I think it's worth considering the idea Robert shared before[1]:

---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does

nothing?

---

We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582,

and it

sounds reasonable to me that ProcArraySetReplicationSlotXmin()

refuses to

retreat the values.

I reviewed the thread and think that we could not straightforwardly apply a
similar strategy to prevent the retreat of xmin/catalog_xmin here. This is
because we maintain a central value
(replication_slot_xmin/replication_slot_catalog_xmin) in
ProcArraySetReplicationSlotXmin, where the value is expected to decrease

when

certain slots are dropped or invalidated.

Good point. This can happen when the last slot is invalidated or dropped.

After the last slot is invalidated or dropped, both slot_xmin and
slot_catalog_xmin values are set InvalidTransactionId. Then in this
case, these values are ignored when computing the oldest safe decoding
XID in GetOldestSafeDecodingTransactionId(), no? Or do you mean that
there is a case where slot_xmin and slot_catalog_xmin retreat to a
valid XID?

I think when replication_slot_xmin is invalid,
GetOldestSafeDecodingTransactionId would return nextXid, which can be greater
than the original snap.xmin if some transaction IDs have been assigned. After
reviewing the report [1]/messages/by-id/CAD21AoDKJBB6p4X-+057Vz44Xyc-zDFbWJ+g9FL6qAF5PC2iFg@mail.gmail.com, the bug appears reproducible when
replication_slot_xmin is set to InvalidTransactionId (specific reproduction
steps are detailed at [2]/messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com) as well. Therefore, if we adopt the approach to
prevent retreating these values, we need to somehow avoid resetting
replication_slot_xmin, but that seems conflict with the behavior of resetting
replication_slot_xmin when dropping the last slot.

Best Regards,
Hou zj

#52

amit.kapila16@gmail.com

2 months ago

In reply to: Zhijie Hou (Fujitsu) (#51)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Fri, Nov 7, 2025 at 8:30 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Friday, November 7, 2025 2:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Nov 6, 2025 at 2:36 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

Good point. This can happen when the last slot is invalidated or dropped.

After the last slot is invalidated or dropped, both slot_xmin and
slot_catalog_xmin values are set InvalidTransactionId. Then in this
case, these values are ignored when computing the oldest safe decoding
XID in GetOldestSafeDecodingTransactionId(), no? Or do you mean that
there is a case where slot_xmin and slot_catalog_xmin retreat to a
valid XID?

I think when replication_slot_xmin is invalid,
GetOldestSafeDecodingTransactionId would return nextXid, which can be greater
than the original snap.xmin if some transaction IDs have been assigned.

Won't we have a problem that values of
procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin won't be set to
InvalidTransactionId after last slot removal due to a new check unless
we do special treatment for drop/invalidation of a slot? And that
would lead to accumulating dead rows even when not required.

--
With Regards,
Amit Kapila.

#53

[1]: /messages/by-id/20230207194903.ws4acm7ake6ikacn@awork3.anarazel.de

sawada.mshk@gmail.com

2 months ago

In reply to: Amit Kapila (#52)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, Nov 6, 2025 at 8:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 7, 2025 at 8:30 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Friday, November 7, 2025 2:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Nov 6, 2025 at 2:36 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

Good point. This can happen when the last slot is invalidated or dropped.

After the last slot is invalidated or dropped, both slot_xmin and
slot_catalog_xmin values are set InvalidTransactionId. Then in this
case, these values are ignored when computing the oldest safe decoding
XID in GetOldestSafeDecodingTransactionId(), no? Or do you mean that
there is a case where slot_xmin and slot_catalog_xmin retreat to a
valid XID?

I think when replication_slot_xmin is invalid,
GetOldestSafeDecodingTransactionId would return nextXid, which can be greater
than the original snap.xmin if some transaction IDs have been assigned.

Won't we have a problem that values of
procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin won't be set to
InvalidTransactionId after last slot removal due to a new check unless
we do special treatment for drop/invalidation of a slot? And that
would lead to accumulating dead rows even when not required.

I understand Hou-san's point. Agreed. procArray->replication_slot_xmin
and replication_slot_catalog_xmin should not retreat to a valid XID
but could become 0 (invalid). Let's consider the idea of inverting the
locks as Andres proposed[1]/messages/by-id/20230207194903.ws4acm7ake6ikacn@awork3.anarazel.de.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#54

Pradeep Kumar

spradeepkumar29@gmail.com

2 months ago

In reply to: Masahiko Sawada (#53)

Re: Assertion failure in SnapBuildInitialSnapshot()

I've been investigating the assert failure in
ProcArraySetReplicationSlotXmin() and would like to share my approach and
get feedback. Instead of inverting the locks and what robert shared before
[1]: /messages/by-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com
Instead of unconditionally updating procArray->replication_slot_xmin in
ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates
conditional:
1) Only update if the incoming xmin is valid
2) Only update if it's older than the currently stored xmin
3) Do the same for procArray->replication_slot_catalog_xmin

void ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId
catalog_xmin, bool already_locked)
{
Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));

if (!already_locked)
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

if (TransactionIdIsValid(xmin))
{
if (!TransactionIdIsValid(procArray->replication_slot_xmin) ||
TransactionIdPrecedes(xmin, procArray->replication_slot_xmin))
procArray->replication_slot_xmin = xmin;
}
if (TransactionIdIsValid(catalog_xmin))
{
if
(!TransactionIdIsValid(procArray->replication_slot_catalog_xmin) ||
TransactionIdPrecedes(catalog_xmin,
procArray->replication_slot_catalog_xmin))
procArray->replication_slot_catalog_xmin = catalog_xmin;
}
if (!already_locked)
LWLockRelease(ProcArrayLock);

elog(DEBUG1, "xmin required by slots: data %u, catalog %u", xmin,
catalog_xmin);
}

In above block of code ensures we always track the minimum xmin across all
active replication slots without losing data. And also no need to worry
about locks. And also while reproducing this issue [2]/messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
In SnapBuildInitialSnapshot() while we computing safexid by
calling GetOldestSafeDecodingTransactionId(false) will enters into first
case and update the oldestSafeXid = procArray->replication_slot_xmin. So it
won't return nextXid. And also it solves this issue [2]/messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com.

[1]: /messages/by-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com
/messages/by-id/CA+TgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ@mail.gmail.com
[2]: /messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com
/messages/by-id/CAA4eK1KDFeh=ZbvSWPx=ir2QOXBxJbH0K8YqifDtG3xJENLR+w@mail.gmail.com

On Fri, Nov 7, 2025 at 11:05 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

Show quoted text

On Thu, Nov 6, 2025 at 8:05 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Nov 7, 2025 at 8:30 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Friday, November 7, 2025 2:36 AM Masahiko Sawada <

sawada.mshk@gmail.com> wrote:

On Thu, Nov 6, 2025 at 2:36 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

Good point. This can happen when the last slot is invalidated or

dropped.

After the last slot is invalidated or dropped, both slot_xmin and
slot_catalog_xmin values are set InvalidTransactionId. Then in this
case, these values are ignored when computing the oldest safe

decoding

XID in GetOldestSafeDecodingTransactionId(), no? Or do you mean that
there is a case where slot_xmin and slot_catalog_xmin retreat to a
valid XID?

I think when replication_slot_xmin is invalid,
GetOldestSafeDecodingTransactionId would return nextXid, which can be

greater

than the original snap.xmin if some transaction IDs have been assigned.

Won't we have a problem that values of
procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin won't be set to
InvalidTransactionId after last slot removal due to a new check unless
we do special treatment for drop/invalidation of a slot? And that
would lead to accumulating dead rows even when not required.

I understand Hou-san's point. Agreed. procArray->replication_slot_xmin
and replication_slot_catalog_xmin should not retreat to a valid XID
but could become 0 (invalid). Let's consider the idea of inverting the
locks as Andres proposed[1].

Regards,

[1]
/messages/by-id/20230207194903.ws4acm7ake6ikacn@awork3.anarazel.de

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#55

[1]: /messages/by-id/TY4PR01MB169070EE618FA2908B3D2F2AE94C3A@TY4PR01MB16907.jpnprd01.prod.outlook.com

houzj.fnst@fujitsu.com

2 months ago

In reply to: Pradeep Kumar (#54)

3 attachment(s)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Wednesday, November 12, 2025 7:27 PM Pradeep Kumar <spradeepkumar29@gmail.com> wote:

I've been investigating the assert failure in
ProcArraySetReplicationSlotXmin() and would like to share my approach and get
feedback. Instead of inverting the locks and what robert shared before [1].
Instead of unconditionally updating procArray->replication_slot_xmin in
ProcArraySetReplicationSlotXmin() in procarray.c, I made the updates
conditional:
1) Only update if the incoming xmin is valid
2) Only update if it's older than the currently stored xmin
3) Do the same for procArray->replication_slot_catalog_xmin

...

In above block of code ensures we always track the minimum xmin across all
active replication slots without losing data. And also no need to worry about
locks. And also while reproducing this issue [2] In SnapBuildInitialSnapshot()
while we computing safexid by calling GetOldestSafeDecodingTransactionId(false)
will enters into first case and update the oldestSafeXid =
procArray->replication_slot_xmin. So it won't return nextXid. And also it solves
this issue [2].

Thanks for evaluating new approach, but I think this approach could not work
because we expect replication_slot_xmin to be set to an invalid number when the
last slot is dropped, while this approach would disallow that, causing WALs to
be retained. For a detailed explanation, please refer to [1]/messages/by-id/TY4PR01MB169070EE618FA2908B3D2F2AE94C3A@TY4PR01MB16907.jpnprd01.prod.outlook.com.

While testing the patches across all branches, I noticed that an additional lock
needs to be added in the launcher.c where
ReplicationSlotsComputeRequiredXmin(true) was recently added for conflict
detection slot. I have modified the original patch accordingly.

BTW, I am not adding a test using an injection point because it does not seem
practical to insert an injection point inner
ReplicationSlotsComputeRequiredXmin. The reason is that the injection point
function internally calls CHECK_FOR_INTERRUPTS(), but the key functions in the
patch holds the lwlock, holding holds interrupts.

I am sharing the patches for all branches for reference.

Best Regards,
Hou zj

Attachments:

v3HEAD-0001-Fix-a-race-condition-of-updating-procArray-replic.patchapplication/octet-stream; name=v3HEAD-0001-Fix-a-race-condition-of-updating-procArray-replic.patchDownload

From ccc2e4c76779d765a104b0de69f0154ddb71e36a Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Tue, 11 Nov 2025 18:12:53 +0800
Subject: [PATCH v3] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin. Therefore, if a process calls
ReplicationSlotsComputeRequiredXmin() with already_locked being false
and another process updates the replication slot xmin before the
process acquiring the lock, the slot xmin was overwritten with an old
value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.
---
 src/backend/replication/logical/launcher.c |  2 ++
 src/backend/replication/logical/logical.c  | 12 ++++----
 src/backend/replication/logical/slotsync.c |  2 ++
 src/backend/replication/slot.c             | 32 ++++++++++++++++++----
 4 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 6214028eda9..86aced9bdf5 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1540,6 +1540,7 @@ init_conflict_slot_xmin(void)
 	Assert(MyReplicationSlot &&
 		   !TransactionIdIsValid(MyReplicationSlot->data.xmin));
 
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(false);
@@ -1552,6 +1553,7 @@ init_conflict_slot_xmin(void)
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	/* Write this slot to disk */
 	ReplicationSlotMarkDirty();
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 866f92cf799..e9a07e67a73 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -405,11 +405,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -422,6 +422,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -436,6 +437,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 8b4afd87dc9..84f1d62b572 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -775,6 +775,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
 
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -783,6 +784,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		update_and_persist_local_synced_slot(remote_slot, remote_dbid);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1ec1e997b27..6712c6ee7c7 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1170,8 +1170,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and
+ * the ProcArrayLock have already been acquired exclusively.
+ *
+ * Note that the ReplicationSlotControlLock must be locked first to avoid
+ * deadlocks.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1181,8 +1184,26 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock exclusive until after updating the
+	 * slot xmin values, so no backend can compute and update the new value
+	 * concurrently.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively, compute
+	 * the xmin values while holding the ReplicationSlotControlLock in shared
+	 * mode, and update the slot xmin values, but it could increase lock
+	 * contention on the ProcArrayLock, which is not great since this function
+	 * can be called at non-negligible frequency.
+	 *
+	 * We instead increase lock contention on the ReplicationSlotControlLock
+	 * but it would be less harmful.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1217,9 +1238,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.31.1

v3PG18-17-0001-Fix-a-race-condition-of-updating-procArray-re.patchapplication/octet-stream; name=v3PG18-17-0001-Fix-a-race-condition-of-updating-procArray-re.patchDownload

From f454252fd238bdd60e9291b55f85992a43d99470 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Thu, 13 Nov 2025 11:20:28 +0800
Subject: [PATCH v3PG18] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin. Therefore, if a process calls
ReplicationSlotsComputeRequiredXmin() with already_locked being false
and another process updates the replication slot xmin before the
process acquiring the lock, the slot xmin was overwritten with an old
value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.
---
 src/backend/replication/logical/logical.c  | 12 ++++----
 src/backend/replication/logical/slotsync.c |  2 ++
 src/backend/replication/slot.c             | 32 ++++++++++++++++++----
 3 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1eb798f3e9..e3a2a1ffdd4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -405,11 +405,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -422,6 +422,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -436,6 +437,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 61b2e9396aa..fb9f98cdd99 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -776,6 +776,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
 
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -784,6 +785,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		update_and_persist_local_synced_slot(remote_slot, remote_dbid);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90f9e8068a6..a9f650d96f3 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1129,8 +1129,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and
+ * the ProcArrayLock have already been acquired exclusively.
+ *
+ * Note that the ReplicationSlotControlLock must be locked first to avoid
+ * deadlocks.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1140,8 +1143,26 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock exclusive until after updating the
+	 * slot xmin values, so no backend can compute and update the new value
+	 * concurrently.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively, compute
+	 * the xmin values while holding the ReplicationSlotControlLock in shared
+	 * mode, and update the slot xmin values, but it could increase lock
+	 * contention on the ProcArrayLock, which is not great since this function
+	 * can be called at non-negligible frequency.
+	 *
+	 * We instead increase lock contention on the ReplicationSlotControlLock
+	 * but it would be less harmful.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1176,9 +1197,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.51.1.windows.1

v3PG16-13-0001-Fix-a-race-condition-of-updating-procArray-re.patchapplication/octet-stream; name=v3PG16-13-0001-Fix-a-race-condition-of-updating-procArray-re.patchDownload

From e1049a0fc58ebd837ebdd7745a52d5f47acd6bb3 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Thu, 13 Nov 2025 11:23:36 +0800
Subject: [PATCH v3PG16] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin. Therefore, if a process calls
ReplicationSlotsComputeRequiredXmin() with already_locked being false
and another process updates the replication slot xmin before the
process acquiring the lock, the slot xmin was overwritten with an old
value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

This commit changes ReplicationSlotsComputeRequiredXmin() so that it
computes the oldest xmin while holding ProcArrayLock in exclusive
mode. We keep already_locked parameter in
ProcArraySetReplicationSlotXmin() on backbranches to not break ABI
compatibility.
---
 src/backend/replication/logical/logical.c | 12 +++++----
 src/backend/replication/slot.c            | 32 +++++++++++++++++++----
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e1879d8149..057f62a8d8f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -404,11 +404,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -421,6 +421,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -435,6 +436,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d03a0556b5a..637b49ee2c4 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -858,8 +858,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and
+ * the ProcArrayLock have already been acquired exclusively.
+ *
+ * Note that the ReplicationSlotControlLock must be locked first to avoid
+ * deadlocks.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -869,8 +872,26 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock exclusive until after updating the
+	 * slot xmin values, so no backend can compute and update the new value
+	 * concurrently.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively, compute
+	 * the xmin values while holding the ReplicationSlotControlLock in shared
+	 * mode, and update the slot xmin values, but it could increase lock
+	 * contention on the ProcArrayLock, which is not great since this function
+	 * can be called at non-negligible frequency.
+	 *
+	 * We instead increase lock contention on the ReplicationSlotControlLock
+	 * but it would be less harmful.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -905,9 +926,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.51.1.windows.1

#56

houzj.fnst@fujitsu.com

about 2 months ago

In reply to: Zhijie Hou (Fujitsu) (#55)

1 attachment(s)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

While testing the patches across all branches, I noticed that an additional lock
needs to be added in the launcher.c where
ReplicationSlotsComputeRequiredXmin(true) was recently added for conflict
detection slot. I have modified the original patch accordingly.

BTW, I am not adding a test using an injection point because it does not seem
practical to insert an injection point inner
ReplicationSlotsComputeRequiredXmin. The reason is that the injection point
function internally calls CHECK_FOR_INTERRUPTS(), but the key functions in
the patch holds the lwlock, holding holds interrupts.

I am sharing the patches for all branches for reference.

I have been thinking if there a way to avoid holding ReplicationSlotControlLock
exclusively in ReplicationSlotsComputeRequiredXmin() because that could cause
lock contention when many slots exist and advancements occur frequently.

Given that the bug arises from a race condition between slot creation and
concurrent slot xmin computation, I think another way is that, we acquire the
ReplicationSlotControlLock exclusively only during slot creation to do the
initial update of the slot xmin. In ReplicationSlotsComputeRequiredXmin(), we
still hold the ReplicationSlotControlLock in shared mode until the global slot
xmin is updated in ProcArraySetReplicationSlotXmin(). This approach prevents
concurrent computations and updates of new xmin horizons by other backends
during the initial slot xmin update process, while it still permits concurrent
calls to ReplicationSlotsComputeRequiredXmin().

Here is an update patch for this approach on HEAD.

Best Regards,
Hou zj

Attachments:

v4HEAD-0001-Fix-a-race-condition-of-updating-procArray-re.patchapplication/octet-stream; name=v4HEAD-0001-Fix-a-race-condition-of-updating-procArray-re.patchDownload

From ad9e4b1484ed93afd86ef3017528d3bcb1b826c3 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Tue, 11 Nov 2025 18:12:53 +0800
Subject: [PATCH v4HEAD] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin.

This could lead to a race condition: if a backend creates a new slot and
attempts to initialize the slot.xmin, but meanwhile, another backend invokes
ReplicationSlotsComputeRequiredXmin() with already_locked set to false, the
global slot xmin may be initially updated by the newly created slot, only to be
subsequently overwritten by the backend running
ReplicationSlotsComputeRequiredXmin() with an invalid or new xid value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

To address the bug, we acquire the ReplicationSlotControlLock exclusively during
slot creation to do the initial update of the slot xmin. In
ReplicationSlotsComputeRequiredXmin(), we hold the ReplicationSlotControlLock in
shared mode until the global slot xmin is updated in
ProcArraySetReplicationSlotXmin(). This approach prevents concurrent
computations and updates of new xmin horizons by other backends during the
initial slot xmin update process, while it still permits concurrent calls to
ReplicationSlotsComputeRequiredXmin().
---
 src/backend/replication/logical/launcher.c |  2 ++
 src/backend/replication/logical/logical.c  | 12 +++++----
 src/backend/replication/logical/slotsync.c |  2 ++
 src/backend/replication/slot.c             | 31 ++++++++++++++++++----
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index fdf1ccad462..5592223a52b 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1540,6 +1540,7 @@ init_conflict_slot_xmin(void)
 	Assert(MyReplicationSlot &&
 		   !TransactionIdIsValid(MyReplicationSlot->data.xmin));
 
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(false);
@@ -1552,6 +1553,7 @@ init_conflict_slot_xmin(void)
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	/* Write this slot to disk */
 	ReplicationSlotMarkDirty();
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 866f92cf799..e9a07e67a73 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -405,11 +405,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -422,6 +422,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -436,6 +437,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 8b4afd87dc9..84f1d62b572 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -775,6 +775,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
 
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -783,6 +784,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		update_and_persist_local_synced_slot(remote_slot, remote_dbid);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1ec1e997b27..61c575fdd4b 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1170,8 +1170,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and
+ * the ProcArrayLock have already been acquired exclusively.
+ *
+ * Note that the ReplicationSlotControlLock must be locked first to avoid
+ * deadlocks.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1181,8 +1184,25 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently. This
+	 * is safe since an exclusive lock is taken during initial slot xmin update
+	 * in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and update
+	 * the slot xmin values, but it could increase lock contention on the
+	 * ProcArrayLock, which is not great since this function can be called at
+	 * non-negligible frequency.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1217,9 +1237,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.51.1.windows.1

#57

amit.kapila16@gmail.com

about 2 months ago

In reply to: Zhijie Hou (Fujitsu) (#56)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

I have been thinking if there a way to avoid holding ReplicationSlotControlLock
exclusively in ReplicationSlotsComputeRequiredXmin() because that could cause
lock contention when many slots exist and advancements occur frequently.

Given that the bug arises from a race condition between slot creation and
concurrent slot xmin computation, I think another way is that, we acquire the
ReplicationSlotControlLock exclusively only during slot creation to do the
initial update of the slot xmin. In ReplicationSlotsComputeRequiredXmin(), we
still hold the ReplicationSlotControlLock in shared mode until the global slot
xmin is updated in ProcArraySetReplicationSlotXmin(). This approach prevents
concurrent computations and updates of new xmin horizons by other backends
during the initial slot xmin update process, while it still permits concurrent
calls to ReplicationSlotsComputeRequiredXmin().

Yeah, this seems to work.

Here is an update patch for this approach on HEAD.

Thanks for the patch.

Sawada-San, are you planning to look into this? Otherwise, I can take
care of it.

--
With Regards,
Amit Kapila.

#58

sawada.mshk@gmail.com

about 2 months ago

In reply to: Amit Kapila (#57)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

I have been thinking if there a way to avoid holding ReplicationSlotControlLock
exclusively in ReplicationSlotsComputeRequiredXmin() because that could cause
lock contention when many slots exist and advancements occur frequently.

Given that the bug arises from a race condition between slot creation and
concurrent slot xmin computation, I think another way is that, we acquire the
ReplicationSlotControlLock exclusively only during slot creation to do the
initial update of the slot xmin. In ReplicationSlotsComputeRequiredXmin(), we
still hold the ReplicationSlotControlLock in shared mode until the global slot
xmin is updated in ProcArraySetReplicationSlotXmin(). This approach prevents
concurrent computations and updates of new xmin horizons by other backends
during the initial slot xmin update process, while it still permits concurrent
calls to ReplicationSlotsComputeRequiredXmin().

Yeah, this seems to work.

Here is an update patch for this approach on HEAD.

Thanks for the patch.

Sawada-San, are you planning to look into this? Otherwise, I can take
care of it.

Yes, I'll review the patch and share some comments soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#59

sawada.mshk@gmail.com

about 2 months ago

In reply to: Masahiko Sawada (#58)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Mon, Nov 24, 2025 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

I have been thinking if there a way to avoid holding ReplicationSlotControlLock
exclusively in ReplicationSlotsComputeRequiredXmin() because that could cause
lock contention when many slots exist and advancements occur frequently.

Given that the bug arises from a race condition between slot creation and
concurrent slot xmin computation, I think another way is that, we acquire the
ReplicationSlotControlLock exclusively only during slot creation to do the
initial update of the slot xmin. In ReplicationSlotsComputeRequiredXmin(), we
still hold the ReplicationSlotControlLock in shared mode until the global slot
xmin is updated in ProcArraySetReplicationSlotXmin(). This approach prevents
concurrent computations and updates of new xmin horizons by other backends
during the initial slot xmin update process, while it still permits concurrent
calls to ReplicationSlotsComputeRequiredXmin().

Yeah, this seems to work.

+1

Given that the computation of xmin and catalog_xmin among all slots
could be executed concurrently, could the following scenario happen
where procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin are retreat to a non-invalid
XID?

1. Suppose the initial value procArray->replication_slot_catalog_xmin is 50.
2. Process-A updates its owned slot's catalog_xmin to 100, and
computes the new catalog_xmin as 100 while holding
ReplicationSlotControlLock in a shared mode in
ReplicationSlotsComputeRequiredLSN(). But it doesn't update the
procArray's catalog_xmin value yet.
3. Process-B updates its owned slot's catalog_xmin to 150, and
computes the new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates the procArray->repilcation_slot_catalog_xmin to
100, which was 150.

It might be worth adding an assertion to
ProcArraySetReplicationSlotXmin(), checking if the new xmin and
catalog_xmin values are either >= the current values or an
InvalidTransactionId.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#60

houzj.fnst@fujitsu.com

about 2 months ago

In reply to: Masahiko Sawada (#59)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 24, 2025 at 10:48 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu)

<houzj.fnst@fujitsu.com> wrote:

I have been thinking if there a way to avoid holding
ReplicationSlotControlLock exclusively in
ReplicationSlotsComputeRequiredXmin() because that could cause lock

contention when many slots exist and advancements occur frequently.

Given that the bug arises from a race condition between slot
creation and concurrent slot xmin computation, I think another way
is that, we acquire the ReplicationSlotControlLock exclusively
only during slot creation to do the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we still hold the
ReplicationSlotControlLock in shared mode until the global slot
xmin is updated in ProcArraySetReplicationSlotXmin(). This
approach prevents concurrent computations and updates of new xmin
horizons by other backends during the initial slot xmin update process,

while it still permits concurrent calls to
ReplicationSlotsComputeRequiredXmin().

Yeah, this seems to work.

+1

Given that the computation of xmin and catalog_xmin among all slots could
be executed concurrently, could the following scenario happen where
procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin are retreat to a non-invalid
XID?

1. Suppose the initial value procArray->replication_slot_catalog_xmin is 50.
2. Process-A updates its owned slot's catalog_xmin to 100, and computes the
new catalog_xmin as 100 while holding ReplicationSlotControlLock in a shared
mode in ReplicationSlotsComputeRequiredLSN(). But it doesn't update the
procArray's catalog_xmin value yet.
3. Process-B updates its owned slot's catalog_xmin to 150, and computes the
new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates the procArray->repilcation_slot_catalog_xmin to 100,
which was 150.

After further investigation, I think that steps 3 and 4 cannot occur because
Process-B must have already encountered the catalog_xmin maintained by
Process-A, either 50 or 100. Consequently, Process-B will refrain from updating
the catalog_xmin to a more recent value, such as 150.

It might be worth adding an assertion to ProcArraySetReplicationSlotXmin(),
checking if the new xmin and catalog_xmin values are either >= the current
values or an InvalidTransactionId.

I considered this scenario and identified a potential exception in the
copy_replication_slot(). This function uses a two-phase copy process, the
original restart_lsn is directly copied to the new slot during the first phase.
However, the original slot.restart_lsn might advance between phases.
Consequently, the newly created slot initially uses the outdated restart_lsn,
which could cause the procArray->replication_slot_catalog_xmin to retreat. I
think this behavior isn't harmful, as explained in the comments, because the new
restart_lsn will be updated in the created slot during the second phase.

Best Regards,
Hou zj

#61

sawada.mshk@gmail.com

about 2 months ago

In reply to: Zhijie Hou (Fujitsu) (#60)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Nov 24, 2025 at 10:48 AM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

On Mon, Nov 24, 2025 at 1:46 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

On Fri, Nov 21, 2025 at 9:17 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Thursday, November 13, 2025 12:56 PM Zhijie Hou (Fujitsu)

<houzj.fnst@fujitsu.com> wrote:

I have been thinking if there a way to avoid holding
ReplicationSlotControlLock exclusively in
ReplicationSlotsComputeRequiredXmin() because that could cause lock

contention when many slots exist and advancements occur frequently.

Given that the bug arises from a race condition between slot
creation and concurrent slot xmin computation, I think another way
is that, we acquire the ReplicationSlotControlLock exclusively
only during slot creation to do the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we still hold the
ReplicationSlotControlLock in shared mode until the global slot
xmin is updated in ProcArraySetReplicationSlotXmin(). This
approach prevents concurrent computations and updates of new xmin
horizons by other backends during the initial slot xmin update process,

while it still permits concurrent calls to
ReplicationSlotsComputeRequiredXmin().

Yeah, this seems to work.

+1

Given that the computation of xmin and catalog_xmin among all slots could
be executed concurrently, could the following scenario happen where
procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin are retreat to a non-invalid
XID?

1. Suppose the initial value procArray->replication_slot_catalog_xmin is 50.
2. Process-A updates its owned slot's catalog_xmin to 100, and computes the
new catalog_xmin as 100 while holding ReplicationSlotControlLock in a shared
mode in ReplicationSlotsComputeRequiredLSN(). But it doesn't update the
procArray's catalog_xmin value yet.
3. Process-B updates its owned slot's catalog_xmin to 150, and computes the
new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates the procArray->repilcation_slot_catalog_xmin to 100,
which was 150.

After further investigation, I think that steps 3 and 4 cannot occur because
Process-B must have already encountered the catalog_xmin maintained by
Process-A, either 50 or 100. Consequently, Process-B will refrain from updating
the catalog_xmin to a more recent value, such as 150.

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes the
new catalog_xmin as 100 because process-B slot still has
effective_catalog_xmin = 100.
3. Process-B updates effective_catalog_xmin to 150, and computes the
new catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates procArray->replication_slot_catalog_xmin to 100.

It might be worth adding an assertion to ProcArraySetReplicationSlotXmin(),
checking if the new xmin and catalog_xmin values are either >= the current
values or an InvalidTransactionId.

I considered this scenario and identified a potential exception in the
copy_replication_slot(). This function uses a two-phase copy process, the
original restart_lsn is directly copied to the new slot during the first phase.
However, the original slot.restart_lsn might advance between phases.
Consequently, the newly created slot initially uses the outdated restart_lsn,
which could cause the procArray->replication_slot_catalog_xmin to retreat. I
think this behavior isn't harmful, as explained in the comments, because the new
restart_lsn will be updated in the created slot during the second phase.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#62

houzj.fnst@fujitsu.com

about 2 months ago

In reply to: Masahiko Sawada (#61)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Given that the computation of xmin and catalog_xmin among all slots
could be executed concurrently, could the following scenario happen
where
procArray->replication_slot_xmin and replication_slot_catalog_xmin
procArray->are retreat to a non-invalid
XID?

1. Suppose the initial value procArray->replication_slot_catalog_xmin is

50.

2. Process-A updates its owned slot's catalog_xmin to 100, and
computes the new catalog_xmin as 100 while holding
ReplicationSlotControlLock in a shared mode in
ReplicationSlotsComputeRequiredLSN(). But it doesn't update the

procArray's catalog_xmin value yet.

3. Process-B updates its owned slot's catalog_xmin to 150, and
computes the new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to

150.

5. Process-A updates the procArray->repilcation_slot_catalog_xmin to
100, which was 150.

After further investigation, I think that steps 3 and 4 cannot occur
because Process-B must have already encountered the catalog_xmin
maintained by Process-A, either 50 or 100. Consequently, Process-B
will refrain from updating the catalog_xmin to a more recent value, such as

150.

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes the new
catalog_xmin as 100 because process-B slot still has effective_catalog_xmin =
100.
3. Process-B updates effective_catalog_xmin to 150, and computes the new
catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates procArray->replication_slot_catalog_xmin to 100.

I think this scenario can occur, but is not harmful. Because the catalog rows
removed prior to xid:150 would no longer be used, as both slots have advanced
their catalog_xmin and flushed the value to disk. Therefore, even if
replication_slot_catalog_xmin regresses, it should be OK.

Considering all above, I think allowing concurrent xmin computation, as the
patch does, is acceptable. What do you think ?

Best Regards,
Hou zj

#63

sawada.mshk@gmail.com

about 1 month ago

In reply to: Zhijie Hou (Fujitsu) (#62)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Given that the computation of xmin and catalog_xmin among all slots
could be executed concurrently, could the following scenario happen
where
procArray->replication_slot_xmin and replication_slot_catalog_xmin
procArray->are retreat to a non-invalid
XID?

1. Suppose the initial value procArray->replication_slot_catalog_xmin is

50.

2. Process-A updates its owned slot's catalog_xmin to 100, and
computes the new catalog_xmin as 100 while holding
ReplicationSlotControlLock in a shared mode in
ReplicationSlotsComputeRequiredLSN(). But it doesn't update the

procArray's catalog_xmin value yet.

3. Process-B updates its owned slot's catalog_xmin to 150, and
computes the new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to

150.

5. Process-A updates the procArray->repilcation_slot_catalog_xmin to
100, which was 150.

After further investigation, I think that steps 3 and 4 cannot occur
because Process-B must have already encountered the catalog_xmin
maintained by Process-A, either 50 or 100. Consequently, Process-B
will refrain from updating the catalog_xmin to a more recent value, such as

150.

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes the new
catalog_xmin as 100 because process-B slot still has effective_catalog_xmin =
100.
3. Process-B updates effective_catalog_xmin to 150, and computes the new
catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates procArray->replication_slot_catalog_xmin to 100.

I think this scenario can occur, but is not harmful. Because the catalog rows
removed prior to xid:150 would no longer be used, as both slots have advanced
their catalog_xmin and flushed the value to disk. Therefore, even if
replication_slot_catalog_xmin regresses, it should be OK.

Considering all above, I think allowing concurrent xmin computation, as the
patch does, is acceptable. What do you think ?

I agree with your analysis. Another thing I'd like to confirm is that
in an extreme case, if the server crashes suddenly after removing
catalog tuples older than XID 100 and logical decoding restarts, it
ends up missing necessary catalog tuples? I think it's not a problem
as long as the subscriber knows the next commit LSN they want but
could it be problematic if the user switches to use the logical
decoding SQL API? I might be worrying too much, though.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#64

houzj.fnst@fujitsu.com

about 1 month ago

In reply to: Masahiko Sawada (#63)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu)

<houzj.fnst@fujitsu.com>

wrote:

On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Given that the computation of xmin and catalog_xmin among all slots
could be executed concurrently, could the following scenario happen
where
procArray->replication_slot_xmin and replication_slot_catalog_xmin
procArray->are retreat to a non-invalid
XID?

1. Suppose the initial value procArray->replication_slot_catalog_xmin

is

50.

2. Process-A updates its owned slot's catalog_xmin to 100, and
computes the new catalog_xmin as 100 while holding
ReplicationSlotControlLock in a shared mode in
ReplicationSlotsComputeRequiredLSN(). But it doesn't update the

procArray's catalog_xmin value yet.

3. Process-B updates its owned slot's catalog_xmin to 150, and
computes the new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to

150.

5. Process-A updates the procArray->repilcation_slot_catalog_xmin to
100, which was 150.

After further investigation, I think that steps 3 and 4 cannot occur
because Process-B must have already encountered the catalog_xmin
maintained by Process-A, either 50 or 100. Consequently, Process-B
will refrain from updating the catalog_xmin to a more recent value, such

as

150.

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes the

new

catalog_xmin as 100 because process-B slot still has

effective_catalog_xmin =

100.
3. Process-B updates effective_catalog_xmin to 150, and computes the

new

catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates procArray->replication_slot_catalog_xmin to 100.

I think this scenario can occur, but is not harmful. Because the catalog rows
removed prior to xid:150 would no longer be used, as both slots have

advanced

their catalog_xmin and flushed the value to disk. Therefore, even if
replication_slot_catalog_xmin regresses, it should be OK.

Considering all above, I think allowing concurrent xmin computation, as the
patch does, is acceptable. What do you think ?

I agree with your analysis. Another thing I'd like to confirm is that
in an extreme case, if the server crashes suddenly after removing
catalog tuples older than XID 100 and logical decoding restarts, it
ends up missing necessary catalog tuples? I think it's not a problem
as long as the subscriber knows the next commit LSN they want but
could it be problematic if the user switches to use the logical
decoding SQL API? I might be worrying too much, though.

I think this case is not a problem because:

In LogicalConfirmReceivedLocation, the updated restart_lsn and catalog_xmin are
flushed to disk before the effective_catalog_xmin is updated. Thus, once
replication_slot_catalog_xmin advances to a certain value, even in the event of
a crash, users won't encounter any removed tuples when consuming from slots
after a restart. This is because all slots have their updated restart_lsn
flushed to disk, ensuring that upon restarting, changes are decoded from the
updated position where older catalog tuples are no longer needed.

BTW, I assume you meant catalog tuples older than XID 150 are removed, since in
the previous example, tuples older than XID 100 are already not useful.

Best Regards,
Hou zj

#65

sawada.mshk@gmail.com

24 days ago

In reply to: Zhijie Hou (Fujitsu) (#64)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 4:02 AM Zhijie Hou (Fujitsu)

<houzj.fnst@fujitsu.com>

wrote:

On Tuesday, November 25, 2025 3:30 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Given that the computation of xmin and catalog_xmin among all slots
could be executed concurrently, could the following scenario happen
where
procArray->replication_slot_xmin and replication_slot_catalog_xmin
procArray->are retreat to a non-invalid
XID?

1. Suppose the initial value procArray->replication_slot_catalog_xmin

is

50.

2. Process-A updates its owned slot's catalog_xmin to 100, and
computes the new catalog_xmin as 100 while holding
ReplicationSlotControlLock in a shared mode in
ReplicationSlotsComputeRequiredLSN(). But it doesn't update the

procArray's catalog_xmin value yet.

3. Process-B updates its owned slot's catalog_xmin to 150, and
computes the new catalog_xmin as 150.
4. Process-B updates the procArray->replication_slot_catalog_xmin to

150.

5. Process-A updates the procArray->repilcation_slot_catalog_xmin to
100, which was 150.

After further investigation, I think that steps 3 and 4 cannot occur
because Process-B must have already encountered the catalog_xmin
maintained by Process-A, either 50 or 100. Consequently, Process-B
will refrain from updating the catalog_xmin to a more recent value, such

as

150.

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes the

new

catalog_xmin as 100 because process-B slot still has

effective_catalog_xmin =

100.
3. Process-B updates effective_catalog_xmin to 150, and computes the

new

catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to 150.
5. Process-A updates procArray->replication_slot_catalog_xmin to 100.

I think this scenario can occur, but is not harmful. Because the catalog rows
removed prior to xid:150 would no longer be used, as both slots have

advanced

their catalog_xmin and flushed the value to disk. Therefore, even if
replication_slot_catalog_xmin regresses, it should be OK.

Considering all above, I think allowing concurrent xmin computation, as the
patch does, is acceptable. What do you think ?

I agree with your analysis. Another thing I'd like to confirm is that
in an extreme case, if the server crashes suddenly after removing
catalog tuples older than XID 100 and logical decoding restarts, it
ends up missing necessary catalog tuples? I think it's not a problem
as long as the subscriber knows the next commit LSN they want but
could it be problematic if the user switches to use the logical
decoding SQL API? I might be worrying too much, though.

I think this case is not a problem because:

In LogicalConfirmReceivedLocation, the updated restart_lsn and catalog_xmin are
flushed to disk before the effective_catalog_xmin is updated. Thus, once
replication_slot_catalog_xmin advances to a certain value, even in the event of
a crash, users won't encounter any removed tuples when consuming from slots
after a restart. This is because all slots have their updated restart_lsn
flushed to disk, ensuring that upon restarting, changes are decoded from the
updated position where older catalog tuples are no longer needed.

Agreed.

BTW, I assume you meant catalog tuples older than XID 150 are removed, since in
the previous example, tuples older than XID 100 are already not useful.

Right. Thank you for pointing this out.

I think we can proceed with the idea proposed by Hou-san. Regarding
the patch, while it mostly looks good, it might be worth considering
adding more comments:

- If the caller passes already_locked=true to
ReplicationSlotsComputeRequiredXmin(), the lock order should also be
considered (must acquire RepliationSlotControlLock and then
ProcArrayLock).
- ReplicationSlotsComputeRequiredXmin() can concurrently run by
multiple process, resulting in temporarily moving
procArray->replication_slot_catalog_xmin backward, but it's harmless
because a smaller catalog_xmin is conservative: it merely prevents
VACUUM from removing catalog tuples that could otherwise be pruned. It
does not lead to premature deletion of required data.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#66

houzj.fnst@fujitsu.com

24 days ago

In reply to: Masahiko Sawada (#65)

1 attachment(s)

RE: Assertion failure in SnapBuildInitialSnapshot()

On Friday, December 19, 2025 3:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes
the

new

catalog_xmin as 100 because process-B slot still has

effective_catalog_xmin =

100.
3. Process-B updates effective_catalog_xmin to 150, and computes
the

new

catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to

150.

5. Process-A updates procArray->replication_slot_catalog_xmin to

100.

I think this scenario can occur, but is not harmful. Because the
catalog rows removed prior to xid:150 would no longer be used, as
both slots have

advanced

their catalog_xmin and flushed the value to disk. Therefore, even
if replication_slot_catalog_xmin regresses, it should be OK.

Considering all above, I think allowing concurrent xmin
computation, as the patch does, is acceptable. What do you think ?

I agree with your analysis. Another thing I'd like to confirm is
that in an extreme case, if the server crashes suddenly after
removing catalog tuples older than XID 100 and logical decoding
restarts, it ends up missing necessary catalog tuples? I think it's
not a problem as long as the subscriber knows the next commit LSN
they want but could it be problematic if the user switches to use
the logical decoding SQL API? I might be worrying too much, though.

I think this case is not a problem because:

In LogicalConfirmReceivedLocation, the updated restart_lsn and
catalog_xmin are flushed to disk before the effective_catalog_xmin is
updated. Thus, once replication_slot_catalog_xmin advances to a
certain value, even in the event of a crash, users won't encounter any
removed tuples when consuming from slots after a restart. This is
because all slots have their updated restart_lsn flushed to disk,
ensuring that upon restarting, changes are decoded from the updated

position where older catalog tuples are no longer needed.

Agreed.

BTW, I assume you meant catalog tuples older than XID 150 are removed,
since in the previous example, tuples older than XID 100 are already not

useful.

Right. Thank you for pointing this out.

I think we can proceed with the idea proposed by Hou-san. Regarding the
patch, while it mostly looks good, it might be worth considering adding more
comments:

- If the caller passes already_locked=true to
ReplicationSlotsComputeRequiredXmin(), the lock order should also be
considered (must acquire RepliationSlotControlLock and then ProcArrayLock).
- ReplicationSlotsComputeRequiredXmin() can concurrently run by multiple
process, resulting in temporarily moving
procArray->replication_slot_catalog_xmin backward, but it's harmless
because a smaller catalog_xmin is conservative: it merely prevents VACUUM
from removing catalog tuples that could otherwise be pruned. It does not lead
to premature deletion of required data.

Thanks for the comments. I added some more comments as suggested.

Here is the updated patch.

Best Regards,
Hou zj

Attachments:

v5HEAD-0001-Fix-a-race-condition-of-updating-procArray-re.patchapplication/octet-stream; name=v5HEAD-0001-Fix-a-race-condition-of-updating-procArray-re.patchDownload

From 8cc50c4464f24da10c49e5218ddb6b0f38a58c02 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 19 Dec 2025 10:51:20 +0800
Subject: [PATCH v5HEAD] Fix a race condition of updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots while not holding ProcArrayLock if
already_locked is false, and acquires the ProcArrayLock just before
updating the replication slot xmin.

This could lead to a race condition: if a backend creates a new slot and
attempts to initialize the slot.xmin, but meanwhile, another backend invokes
ReplicationSlotsComputeRequiredXmin() with already_locked set to false, the
global slot xmin may be initially updated by the newly created slot, only to be
subsequently overwritten by the backend running
ReplicationSlotsComputeRequiredXmin() with an invalid or new xid value.

In the reported failure, a walsender for an apply worker computes
InvalidTransaction as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker with this value. Then the walsender for a tablesync worker
ended up computing the transaction id by
GetOldestSafeDecodingTransactionId() without considering replication
slot xmin. That led to an error ""cannot build an initial slot
snapshot as oldest safe xid %u follows snapshot's xmin %u", which was
an assertion failure prior to 240e0dbacd3.

To address the bug, we acquire the ReplicationSlotControlLock exclusively during
slot creation to do the initial update of the slot xmin. In
ReplicationSlotsComputeRequiredXmin(), we hold the ReplicationSlotControlLock in
shared mode until the global slot xmin is updated in
ProcArraySetReplicationSlotXmin(). This approach prevents concurrent
computations and updates of new xmin horizons by other backends during the
initial slot xmin update process, while it still permits concurrent calls to
ReplicationSlotsComputeRequiredXmin().
---
 src/backend/replication/logical/launcher.c |  2 ++
 src/backend/replication/logical/logical.c  | 12 ++++---
 src/backend/replication/logical/slotsync.c |  2 ++
 src/backend/replication/slot.c             | 42 +++++++++++++++++++---
 4 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 3991e1495d4..b9780c7bc99 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1540,6 +1540,7 @@ init_conflict_slot_xmin(void)
 	Assert(MyReplicationSlot &&
 		   !TransactionIdIsValid(MyReplicationSlot->data.xmin));
 
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(false);
@@ -1552,6 +1553,7 @@ init_conflict_slot_xmin(void)
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	/* Write this slot to disk */
 	ReplicationSlotMarkDirty();
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1b11ed63dc6..eef95a5b4c4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -405,11 +405,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -422,6 +422,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -436,6 +437,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index bf50317b443..afcb1f3487d 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -857,6 +857,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid,
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
 
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -865,6 +866,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid,
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		/*
 		 * Make sure that concerned WAL is received and flushed before syncing
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 682eccd116c..d69be3fe701 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1171,8 +1171,14 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks due to incorrect lock
+ * ordering.
+ *
+ * Note that the ReplicationSlotControlLock must be locked first to avoid
+ * deadlocks.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1182,8 +1188,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1218,9 +1249,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.51.1.windows.1

#67

sawada.mshk@gmail.com

13 days ago

In reply to: Zhijie Hou (Fujitsu) (#66)

7 attachment(s)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Thu, Dec 18, 2025 at 7:19 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Friday, December 19, 2025 3:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes
the

new

catalog_xmin as 100 because process-B slot still has

effective_catalog_xmin =

100.
3. Process-B updates effective_catalog_xmin to 150, and computes
the

new

catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to

150.

5. Process-A updates procArray->replication_slot_catalog_xmin to

100.

I think this scenario can occur, but is not harmful. Because the
catalog rows removed prior to xid:150 would no longer be used, as
both slots have

advanced

their catalog_xmin and flushed the value to disk. Therefore, even
if replication_slot_catalog_xmin regresses, it should be OK.

Considering all above, I think allowing concurrent xmin
computation, as the patch does, is acceptable. What do you think ?

I agree with your analysis. Another thing I'd like to confirm is
that in an extreme case, if the server crashes suddenly after
removing catalog tuples older than XID 100 and logical decoding
restarts, it ends up missing necessary catalog tuples? I think it's
not a problem as long as the subscriber knows the next commit LSN
they want but could it be problematic if the user switches to use
the logical decoding SQL API? I might be worrying too much, though.

I think this case is not a problem because:

In LogicalConfirmReceivedLocation, the updated restart_lsn and
catalog_xmin are flushed to disk before the effective_catalog_xmin is
updated. Thus, once replication_slot_catalog_xmin advances to a
certain value, even in the event of a crash, users won't encounter any
removed tuples when consuming from slots after a restart. This is
because all slots have their updated restart_lsn flushed to disk,
ensuring that upon restarting, changes are decoded from the updated

position where older catalog tuples are no longer needed.

Agreed.

BTW, I assume you meant catalog tuples older than XID 150 are removed,
since in the previous example, tuples older than XID 100 are already not

useful.

Right. Thank you for pointing this out.

I think we can proceed with the idea proposed by Hou-san. Regarding the
patch, while it mostly looks good, it might be worth considering adding more
comments:

- If the caller passes already_locked=true to
ReplicationSlotsComputeRequiredXmin(), the lock order should also be
considered (must acquire RepliationSlotControlLock and then ProcArrayLock).
- ReplicationSlotsComputeRequiredXmin() can concurrently run by multiple
process, resulting in temporarily moving
procArray->replication_slot_catalog_xmin backward, but it's harmless
because a smaller catalog_xmin is conservative: it merely prevents VACUUM
from removing catalog tuples that could otherwise be pruned. It does not lead
to premature deletion of required data.

Thanks for the comments. I added some more comments as suggested.

Here is the updated patch.

Thank you for updating the patch! The patch looks good to me.

I've made minor changes to the comment and commit message and created
patches for backbranches. I'm going to push them, barring any
objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchapplication/octet-stream; name=master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchDownload

From b10a655ed8000457da9160c13cc7832c46154347 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 19 Dec 2025 10:51:20 +0800
Subject: [PATCH] Fix a race condition in updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd3.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/logical/launcher.c |  2 ++
 src/backend/replication/logical/logical.c  | 12 ++++---
 src/backend/replication/logical/slotsync.c |  2 ++
 src/backend/replication/slot.c             | 39 +++++++++++++++++++---
 4 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 3991e1495d4..b9780c7bc99 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -1540,6 +1540,7 @@ init_conflict_slot_xmin(void)
 	Assert(MyReplicationSlot &&
 		   !TransactionIdIsValid(MyReplicationSlot->data.xmin));
 
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(false);
@@ -1552,6 +1553,7 @@ init_conflict_slot_xmin(void)
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	/* Write this slot to disk */
 	ReplicationSlotMarkDirty();
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index c8858e06616..ed117536c7c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -394,11 +394,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -411,6 +411,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -425,6 +426,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 2aea776352d..0feebffd431 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -857,6 +857,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid,
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
 
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -865,6 +866,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid,
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		/*
 		 * Make sure that concerned WAL is received and flushed before syncing
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 58c41d45516..75967580550 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1201,8 +1201,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1212,8 +1215,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1248,9 +1276,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.47.3

REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchapplication/octet-stream; name=REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchDownload

From dd6dfdcbaafb4ac41428d0840df3828fc72ae77a Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 19 Dec 2025 10:51:20 +0800
Subject: [PATCH] Fix a race condition in updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd3.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c  | 12 ++++---
 src/backend/replication/logical/slotsync.c |  2 ++
 src/backend/replication/slot.c             | 39 +++++++++++++++++++---
 3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f1eb798f3e9..e3a2a1ffdd4 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -405,11 +405,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -422,6 +422,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -436,6 +437,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 35be111aa73..64db4fedbe8 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -776,6 +776,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
 
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -784,6 +785,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		update_and_persist_local_synced_slot(remote_slot, remote_dbid);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index da8e813f81b..61f02088e4c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1129,8 +1129,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1140,8 +1143,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1176,9 +1204,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.47.3

REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchapplication/octet-stream; name=REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchDownload

From 26b9f3bdfdd04c7e815a041d08f7dd82512c496c Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 19 Dec 2025 10:51:20 +0800
Subject: [PATCH] Fix a race condition in updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd3.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c  | 12 ++++---
 src/backend/replication/logical/slotsync.c |  2 ++
 src/backend/replication/slot.c             | 39 +++++++++++++++++++---
 3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 206fb932484..7f23009da08 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -405,11 +405,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -422,6 +422,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -436,6 +437,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 27e262ecbf2..35874e6f1bf 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -760,6 +760,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 		reserve_wal_for_local_slot(remote_slot->restart_lsn);
 
+		LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		xmin_horizon = GetOldestSafeDecodingTransactionId(true);
 		SpinLockAcquire(&slot->mutex);
@@ -768,6 +769,7 @@ synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 		SpinLockRelease(&slot->mutex);
 		ReplicationSlotsComputeRequiredXmin(true);
 		LWLockRelease(ProcArrayLock);
+		LWLockRelease(ReplicationSlotControlLock);
 
 		update_and_persist_local_synced_slot(remote_slot, remote_dbid);
 
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b7da21a94..ff06a5e0eb9 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1071,8 +1071,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -1082,8 +1085,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -1118,9 +1146,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.47.3

REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchapplication/octet-stream; name=REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchDownload

From 9ae9a5edac584d6be07cd6d39c5cb245db229756 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Mon, 29 Dec 2025 14:06:32 -0800
Subject: [PATCH] Fix a race condition in updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd3.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c | 12 ++++---
 src/backend/replication/slot.c            | 39 ++++++++++++++++++++---
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 6e1879d8149..057f62a8d8f 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -404,11 +404,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -421,6 +421,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -435,6 +436,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d03a0556b5a..5f6298e9f1c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -858,8 +858,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -869,8 +872,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -905,9 +933,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.47.3

REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchapplication/octet-stream; name=REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchDownload

From 82f409d6711ccdbf13caf31b7f644cdd6ab3c3fc Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Mon, 29 Dec 2025 14:06:32 -0800
Subject: [PATCH] Fix a race condition in updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd3.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c | 12 ++++---
 src/backend/replication/slot.c            | 39 ++++++++++++++++++++---
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index f40b84592a6..f670cb169a2 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -388,11 +388,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -405,6 +405,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -419,6 +420,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c9eb1836a64..d0e85b776ab 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -788,8 +788,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -799,8 +802,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -836,9 +864,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.47.3

REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchapplication/octet-stream; name=REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchDownload

From b773a93e1195fa37414b2d831427ed0afe8c1508 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Mon, 29 Dec 2025 14:06:32 -0800
Subject: [PATCH] Fix a race condition in updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd3.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c | 12 ++++---
 src/backend/replication/slot.c            | 39 ++++++++++++++++++++---
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index e9105edaef5..b64904d9dd7 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -389,11 +389,11 @@ CreateInitDecodingContext(const char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -406,6 +406,7 @@ CreateInitDecodingContext(const char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -420,6 +421,7 @@ CreateInitDecodingContext(const char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 7aa1dcca6d7..b167f34cad2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -856,8 +856,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -867,8 +870,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -904,9 +932,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.47.3

REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchapplication/octet-stream; name=REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patchDownload

From d0c848c40bb721e61579fa4b232865eb3232f3ea Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <msawada@postgresql.orig>
Date: Mon, 29 Dec 2025 14:06:32 -0800
Subject: [PATCH] Fix a race condition in updating
 procArray->replication_slot_xmin.

Previously, ReplicationSlotsComputeRequiredXmin() computed the oldest
xmin across all slots without holding ProcArrayLock (when
already_locked is false), acquiring the lock just before updating the
replication slot xmin.

This could lead to a race condition: if a backend created a new slot
and updates the global replication slot xmin, another backend
concurrently running ReplicationSlotsComputeRequiredXmin() could
overwrite that update with an invalid or stale value. This happens
because the concurrent backend might have computed the aggregate xmin
before the new slot was accounted for, but applied the update after
the new slot had already updated the global value.

In the reported failure, a walsender for an apply worker computed
InvalidTransactionId as the oldest xmin and overwrote a valid
replication slot xmin value computed by a walsender for a tablesync
worker. Consequently, the tablesync worker computed a transaction ID
via GetOldestSafeDecodingTransactionId() effectively without
considering the replication slot xmin. This led to the error "cannot
build an initial slot snapshot as oldest safe xid %u follows
snapshot's xmin %u", which was an assertion failure prior to commit
240e0dbacd3.

To fix this, we acquire ReplicationSlotControlLock in exclusive mode
during slot creation to perform the initial update of the slot
xmin. In ReplicationSlotsComputeRequiredXmin(), we hold
ReplicationSlotControlLock in shared mode until the global slot xmin
is updated in ProcArraySetReplicationSlotXmin(). This prevents
concurrent computations and updates of the global xmin by other
backends during the initial slot xmin update process, while still
permitting concurrent calls to ReplicationSlotsComputeRequiredXmin().

Backpatch to all supported versions.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Pradeep Kumar <spradeepkumar29@gmail.com>
Reviewed-by: Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Backpatch-through: 13
---
 src/backend/replication/logical/logical.c | 12 ++++---
 src/backend/replication/slot.c            | 39 ++++++++++++++++++++---
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 25695809037..bec6a9df521 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -294,11 +294,11 @@ CreateInitDecodingContext(char *plugin,
 	 * without further interlock its return value might immediately be out of
 	 * date.
 	 *
-	 * So we have to acquire the ProcArrayLock to prevent computation of new
-	 * xmin horizons by other backends, get the safe decoding xid, and inform
-	 * the slot machinery about the new limit. Once that's done the
-	 * ProcArrayLock can be released as the slot machinery now is
-	 * protecting against vacuum.
+	 * So we have to acquire both the ReplicationSlotControlLock and the
+	 * ProcArrayLock to prevent concurrent computation and update of new xmin
+	 * horizons by other backends, get the safe decoding xid, and inform the
+	 * slot machinery about the new limit. Once that's done the both locks
+	 * can be released as the slot machinery now is protecting against vacuum.
 	 *
 	 * Note that, temporarily, the data, not just the catalog, xmin has to be
 	 * reserved if a data snapshot is to be exported.  Otherwise the initial
@@ -311,6 +311,7 @@ CreateInitDecodingContext(char *plugin,
 	 *
 	 * ----
 	 */
+	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
 	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 
 	xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
@@ -325,6 +326,7 @@ CreateInitDecodingContext(char *plugin,
 	ReplicationSlotsComputeRequiredXmin(true);
 
 	LWLockRelease(ProcArrayLock);
+	LWLockRelease(ReplicationSlotControlLock);
 
 	ReplicationSlotMarkDirty();
 	ReplicationSlotSave();
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index bc3f99cacef..0d67371c514 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -776,8 +776,11 @@ ReplicationSlotPersist(void)
 /*
  * Compute the oldest xmin across all slots and store it in the ProcArray.
  *
- * If already_locked is true, ProcArrayLock has already been acquired
- * exclusively.
+ * If already_locked is true, both the ReplicationSlotControlLock and the
+ * ProcArrayLock have already been acquired exclusively. It is crucial that the
+ * caller first acquires the ReplicationSlotControlLock, followed by the
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
  */
 void
 ReplicationSlotsComputeRequiredXmin(bool already_locked)
@@ -787,8 +790,33 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 	TransactionId agg_catalog_xmin = InvalidTransactionId;
 
 	Assert(ReplicationSlotCtl != NULL);
+	Assert(!already_locked ||
+		   (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
+			LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE)));
 
-	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+	/*
+	 * Hold the ReplicationSlotControlLock until after updating the slot xmin
+	 * values, so no backend update the initial xmin for newly created slot
+	 * concurrently. A shared lock is used here to minimize lock contention,
+	 * especially when many slots exist and advancements occur frequently.
+	 * This is safe since an exclusive lock is taken during initial slot xmin
+	 * update in slot creation.
+	 *
+	 * One might think that we can hold the ProcArrayLock exclusively and
+	 * update the slot xmin values, but it could increase lock contention on
+	 * the ProcArrayLock, which is not great since this function can be called
+	 * at non-negligible frequency.
+	 *
+	 * Concurrent invocation of this function may cause the computed slot xmin
+	 * to regress. However, this is harmless because tuples prior to the most
+	 * recent xmin are no longer useful once advancement occurs (see
+	 * LogicalConfirmReceivedLocation where the slot's xmin value is flushed
+	 * before updating the effective_xmin). Thus, such regression merely
+	 * prevents VACUUM from prematurely removing tuples without causing the
+	 * early deletion of required data.
+	 */
+	if (!already_locked)
+		LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 
 	for (i = 0; i < max_replication_slots; i++)
 	{
@@ -824,9 +852,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
 			agg_catalog_xmin = effective_catalog_xmin;
 	}
 
-	LWLockRelease(ReplicationSlotControlLock);
-
 	ProcArraySetReplicationSlotXmin(agg_xmin, agg_catalog_xmin, already_locked);
+
+	if (!already_locked)
+		LWLockRelease(ReplicationSlotControlLock);
 }
 
 /*
-- 
2.47.3

#68

Chao Li

li.evan.chao@gmail.com

13 days ago

In reply to: Masahiko Sawada (#67)

Re: Assertion failure in SnapBuildInitialSnapshot()

On Dec 30, 2025, at 06:14, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Dec 18, 2025 at 7:19 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Friday, December 19, 2025 3:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Dec 9, 2025 at 7:32 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
wrote:

On Wednesday, December 10, 2025 7:25 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

On Tue, Nov 25, 2025 at 10:25 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wednesday, November 26, 2025 2:57 AM Masahiko Sawada

<sawada.mshk@gmail.com> wrote:

Right. But the following scenario seems to happen:

1. Both processes have a slot with effective_catalog_xmin = 100.
2. Process-A updates effective_catalog_xmin to 150, and computes
the

new

catalog_xmin as 100 because process-B slot still has

effective_catalog_xmin =

100.
3. Process-B updates effective_catalog_xmin to 150, and computes
the

new

catalog_xmin as 150.
4. Process-B updates procArray->replication_slot_catalog_xmin to

150.

5. Process-A updates procArray->replication_slot_catalog_xmin to

100.

I think this scenario can occur, but is not harmful. Because the
catalog rows removed prior to xid:150 would no longer be used, as
both slots have

advanced

their catalog_xmin and flushed the value to disk. Therefore, even
if replication_slot_catalog_xmin regresses, it should be OK.

Considering all above, I think allowing concurrent xmin
computation, as the patch does, is acceptable. What do you think ?

I agree with your analysis. Another thing I'd like to confirm is
that in an extreme case, if the server crashes suddenly after
removing catalog tuples older than XID 100 and logical decoding
restarts, it ends up missing necessary catalog tuples? I think it's
not a problem as long as the subscriber knows the next commit LSN
they want but could it be problematic if the user switches to use
the logical decoding SQL API? I might be worrying too much, though.

I think this case is not a problem because:

In LogicalConfirmReceivedLocation, the updated restart_lsn and
catalog_xmin are flushed to disk before the effective_catalog_xmin is
updated. Thus, once replication_slot_catalog_xmin advances to a
certain value, even in the event of a crash, users won't encounter any
removed tuples when consuming from slots after a restart. This is
because all slots have their updated restart_lsn flushed to disk,
ensuring that upon restarting, changes are decoded from the updated

position where older catalog tuples are no longer needed.

Agreed.

BTW, I assume you meant catalog tuples older than XID 150 are removed,
since in the previous example, tuples older than XID 100 are already not

useful.

Right. Thank you for pointing this out.

I think we can proceed with the idea proposed by Hou-san. Regarding the
patch, while it mostly looks good, it might be worth considering adding more
comments:

- If the caller passes already_locked=true to
ReplicationSlotsComputeRequiredXmin(), the lock order should also be
considered (must acquire RepliationSlotControlLock and then ProcArrayLock).
- ReplicationSlotsComputeRequiredXmin() can concurrently run by multiple
process, resulting in temporarily moving
procArray->replication_slot_catalog_xmin backward, but it's harmless
because a smaller catalog_xmin is conservative: it merely prevents VACUUM
from removing catalog tuples that could otherwise be pruned. It does not lead
to premature deletion of required data.

Thanks for the comments. I added some more comments as suggested.

Here is the updated patch.

Thank you for updating the patch! The patch looks good to me.

I've made minor changes to the comment and commit message and created
patches for backbranches. I'm going to push them, barring any
objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
<master_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_18_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_17_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_16_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_14_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_15_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch><REL_13_0001-Fix-a-race-condition-in-updating-procArray-replicati.patch>

I’ve just looked through the patch for master. The fix itself looks solid to me. I only noticed a few minor comment nits:

1
```
+ * ProcArrayLock, to prevent any undetectable deadlocks since this function
+ * acquire them in that order.
```

acquire -> acquires

2
```
+ * values, so no backend update the initial xmin for newly created slot
```

Update -> updates

3
```
+ * slot machinery about the new limit. Once that's done the both locks
```

“The both locks”, feels like “the” is not needed.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#69