024_add_drop_pub.pl might fail due to deadlock
Hello hackers,
The recent buildfarm failure [1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58 on REL_15_STABLE with the following
diagnostics:
# Looks like your test exited with 29 just after 1.
t/024_add_drop_pub.pl ..............
Dubious, test returned 29 (wstat 7424, 0x1d00)
pgsql.build/src/test/subscription/tmp_check/log/regress_log_024_add_drop_pub
[21:01:34.406](16.501s) ok 1 - check initial data is copied to subscriber
error running SQL: 'psql:<stdin>:1: ERROR: deadlock detected
DETAIL: Process 219632 waits for ExclusiveLock on relation 6000 of database 0; blocked by process 218369.
Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632.
HINT: See server log for query details.'
while running 'psql -XAtq -d port=14957 host=/home/bf/bf-build/petalura/tmp/bGI6HuRtfa dbname='postgres' -f - -v
ON_ERROR_STOP=1' with sql 'ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1' at
/home/bf/bf-build/petalura/REL_15_STABLE/pgsql.build/../pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 1951.
pgsql.build/src/test/subscription/tmp_check/log/024_add_drop_pub_subscriber.log
2025-07-01 21:01:32.682 CEST [218369][logical replication worker][3/6:0] LOG: logical replication apply worker for
subscription "tap_sub" has started
...
2025-07-01 21:01:34.771 CEST [219632][client backend][4/14:0] LOG: statement: ALTER SUBSCRIPTION tap_sub DROP
PUBLICATION tap_pub_1
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] ERROR: deadlock detected
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] DETAIL: Process 219632 waits for ExclusiveLock on
relation 6000 of database 0; blocked by process 218369.
Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632.
Process 219632: ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1
Process 218369: <command string not enabled>
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] HINT: See server log for query details.
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] STATEMENT: ALTER SUBSCRIPTION tap_sub DROP PUBLICATION
tap_pub_1
shows that the test can fail due to deadlock on accessing
pg_replication_origin (relation 6000).
This failure can be easily reproduced with:
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
* the specific origin and then re-check if the origin still exists.
*/
rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+pg_usleep(300000);
Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
024_add_drop_pub.pl was added in v15).
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
Best regards,
Alexander
On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
--- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) * the specific origin and then re-check if the origin still exists. */ rel = table_open(ReplicationOriginRelationId, ExclusiveLock); +pg_usleep(300000);Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
024_add_drop_pub.pl was added in v15).[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
Best regards,
Alexander
Hi Alexander,
Yes, the problem can be reproduced by the changes you suggested. I
will look into what is happening and how we can fix this.
regards,
Ajin Cherian
Fujitsu Australia
On Mon, Jul 7, 2025 at 8:15 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
--- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) * the specific origin and then re-check if the origin still exists. */ rel = table_open(ReplicationOriginRelationId, ExclusiveLock); +pg_usleep(300000);Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
024_add_drop_pub.pl was added in v15).[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58
Best regards,
AlexanderHi Alexander,
Yes, the problem can be reproduced by the changes you suggested. I
will look into what is happening and how we can fix this.
The issue appears to be a deadlock caused by inconsistent lock
acquisition order between two processes:
Process A (executing ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1):
In AlterSubscription_refresh(), it first acquires an
AccessExclusiveLock on SubscriptionRelRelationId (resource 1), then
later tries to acquire an ExclusiveLock on ReplicationOriginRelationId
(resource 2).
Process B (apply worker):
In process_syncing_tables_for_apply(), it first acquires an
ExclusiveLock on ReplicationOriginRelationId (resource 2), then calls
UpdateSubscriptionRelState(), which tries to acquire a AccessShareLock
on SubscriptionRelRelationId (resource 1).
This leads to a deadlock:
Process A holds a lock on resource 1 and waits for resource 2, while
process B holds a lock on resource 2 and waits for resource 1.
Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.
Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.
regards,
Ajin Cherian
Fujitsu Australia.
Attachments:
0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patchapplication/octet-stream; name=0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patchDownload
From a1c3d9a2706c03790debbf7acd1ba70dcaa254ae Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Tue, 8 Jul 2025 06:18:23 -0400
Subject: [PATCH] Fix a deadlock during ALTER SUBSCRIPTION ... DROP PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelRelationId and
ReplicationOriginRelationId are inverted between the
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get an AccessExclusiveLock on SubscriptionRelRelationId in
process_syncing_tables_for_apply() in advance.
---
src/backend/replication/logical/tablesync.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e6159ac..b2f9bad 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -379,6 +379,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -470,7 +471,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Also Lock pg_subscription_rel with AccessExclusiveLock to
+ * prevent any deadlocks with user concurrently performing
+ * refresh on the subscription.
*/
+
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -483,6 +491,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
rstate->lsn);
+
+ /* close and unlock table */
+ table_close(rel, AccessExclusiveLock);
}
}
else
--
1.8.3.1
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.
While debugging this further, I found that there is another lock taken
prior to this in AlterSubscription(),
/* Lock the subscription so nobody else can do anything with it. */
LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);
Since this is the first lock taken while altering subscription, that
should also be taken first by the tablesync worker to avoid deadlock.
So, attaching a modified patch.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
0002-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patchapplication/octet-stream; name=0002-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patchDownload
From 5bee6b435a89a3d313137472c6d7d30ff818925b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Tue, 8 Jul 2025 06:18:23 -0400
Subject: [PATCH] Fix a deadlock during ALTER SUBSCRIPTION ... DROP PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get an AccessShareLock on SubscriptionRelationId in
process_syncing_tables_for_apply() in advance.
---
src/backend/replication/logical/tablesync.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e6159ac..4fd0bb3 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -470,7 +470,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Also Lock SubscriptionRelationId with AccessShareLock to
+ * prevent any deadlocks with user concurrently performing
+ * refresh on the subscription.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
--
1.8.3.1
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.
While debugging if this problem can occur on HEAD, I found out that on
head, it is mostly the tablesync worker that drops the origin on HEAD
and since the tablesysnc worker does not attempt to update the
SubscriptionRel state in that process, there doesn't seem to be the
possibility of a deadlock. But there is a rare situation where the
tablesync worker could crash or get an error just prior to dropping
the origin, then the origin is dropped in the apply worker (this is
explained in the comments in process_syncing_tables_for_sync()). If
the origin has to be dropped in the apply worker, then the same
deadlock can happen in HEAD code as well. I was able to simulate this
by using an injection point to create an error on the tablesync worker
and then the similar deadlock happens on HEAD as well. Attaching a
patch for fixing this on HEAD as well.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
HEAD-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION-.-.patchapplication/octet-stream; name=HEAD-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION-.-.patchDownload
From c1a32b49fcfa01d3c2fc4fc904388815b9cee418 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Mon, 14 Jul 2025 06:04:38 -0400
Subject: [PATCH] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, the apply worker will later find the origin and drop it. During this
time if the user simultaneously does ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken. The ALTER SUBSCRIPTION code takes a AccessExclusiveLock
on SubscriptionRelationId initially and now the fix is for the apply worker to also
take an AccessShareLock on SubscriptionRelationId prior to dropping any origins.
---
src/backend/replication/logical/tablesync.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e4fd634..6a98c7e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -122,6 +122,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/usercontext.h"
+#include "utils/injection_point.h"
typedef enum
{
@@ -492,7 +493,13 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Also lock SubscriptionRelationId with AccessShareLock to
+ * prevent any deadlocks with the user concurrently performing
+ * refresh on the subscription.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
--
1.8.3.1
On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.While debugging if this problem can occur on HEAD, I found out that on
head, it is mostly the tablesync worker that drops the origin on HEAD
and since the tablesysnc worker does not attempt to update the
SubscriptionRel state in that process, there doesn't seem to be the
possibility of a deadlock. But there is a rare situation where the
tablesync worker could crash or get an error just prior to dropping
the origin, then the origin is dropped in the apply worker (this is
explained in the comments in process_syncing_tables_for_sync()). If
the origin has to be dropped in the apply worker, then the same
deadlock can happen in HEAD code as well. I was able to simulate this
by using an injection point to create an error on the tablesync worker
and then the similar deadlock happens on HEAD as well. Attaching a
patch for fixing this on HEAD as well.
I was able to reproduce the deadlock on HEAD as well using the
attached patch, which introduces a delay in the tablesync worker
before dropping the replication origin by adding a sleep of a few
seconds. During this delay, the apply worker also attempts to drop the
replication origin. If an ALTER SUBSCRIPTION command is executed
concurrently, a deadlock frequently occurs:
2025-07-14 15:59:53.572 IST [141100] DETAIL: Process 141100 waits for
AccessExclusiveLock on object 2 of class 6000 of database 0; blocked
by process 140974.
Process 140974 waits for AccessShareLock on object 16396 of class 6100
of database 0; blocked by process 141100.
Process 141100: alter subscription sub1 drop publication pub1
Process 140974: <command string not enabled>
After apply the attached patch, create the logical replication setup
for a publication pub1 having table t1 and then run the following
commands in a loop:
alter subscription sub1 drop publication pub1;
alter subscription sub1 add publication pub1;
sleep 4
Regards,
Vignesh
Attachments:
deadlock_simulate_add_drop_pub.patchapplication/octet-stream; name=deadlock_simulate_add_drop_pub.patchDownload
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index e23b0de7242..22d0a2626eb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -990,6 +990,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data,
*/
ReplicationOriginNameForLogicalRep(sub->oid, relid, originname,
sizeof(originname));
+ sleep(3);
replorigin_drop_by_name(originname, true, false);
}
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 87f10e50dcc..e1a58c7970d 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -445,6 +445,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
Assert(IsTransactionState());
rel = table_open(ReplicationOriginRelationId, RowExclusiveLock);
+ sleep(2);
roident = replorigin_by_name(name, missing_ok);
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e4fd6347fd1..4f501bf1b55 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -376,6 +376,7 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
replorigin_session_origin_lsn = InvalidXLogRecPtr;
replorigin_session_origin_timestamp = 0;
+ sleep(6);
/*
* Drop the tablesync's origin tracking if exists.
*
On Mon, 14 Jul 2025 at 16:15, vignesh C <vignesh21@gmail.com> wrote:
On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:
Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.While debugging if this problem can occur on HEAD, I found out that on
head, it is mostly the tablesync worker that drops the origin on HEAD
and since the tablesysnc worker does not attempt to update the
SubscriptionRel state in that process, there doesn't seem to be the
possibility of a deadlock. But there is a rare situation where the
tablesync worker could crash or get an error just prior to dropping
the origin, then the origin is dropped in the apply worker (this is
explained in the comments in process_syncing_tables_for_sync()). If
the origin has to be dropped in the apply worker, then the same
deadlock can happen in HEAD code as well. I was able to simulate this
by using an injection point to create an error on the tablesync worker
and then the similar deadlock happens on HEAD as well. Attaching a
patch for fixing this on HEAD as well.I was able to reproduce the deadlock on HEAD as well using the
attached patch, which introduces a delay in the tablesync worker
before dropping the replication origin by adding a sleep of a few
seconds. During this delay, the apply worker also attempts to drop the
replication origin. If an ALTER SUBSCRIPTION command is executed
concurrently, a deadlock frequently occurs:
2025-07-14 15:59:53.572 IST [141100] DETAIL: Process 141100 waits for
AccessExclusiveLock on object 2 of class 6000 of database 0; blocked
by process 140974.
Process 140974 waits for AccessShareLock on object 16396 of class 6100
of database 0; blocked by process 141100.
Process 141100: alter subscription sub1 drop publication pub1
Process 140974: <command string not enabled>After apply the attached patch, create the logical replication setup
for a publication pub1 having table t1 and then run the following
commands in a loop:
alter subscription sub1 drop publication pub1;
alter subscription sub1 add publication pub1;
sleep 4
Attached is the script used to reproduce the issue and the deadlock
logs for the same. Your patch fixes the issue.
Couple of comments:
1) This change is not required:
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/usercontext.h"
+#include "utils/injection_point.h"
2) This can not only happen in error case but also in normal cases
where the tablesync worker is slower as shown in the script to
reproduce, we can update the commit message accordingly:
In most situations the tablesync worker will drop the corresponding
origin before it
finishes executing, but if an error causes the tablesync worker to
fail just prior to
dropping the origin, the apply worker will later find the origin and drop it.
Regards,
Vignesh
Attachments:
On Tue, Jul 15, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:
Couple of comments:
1) This change is not required:
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/usercontext.h"
+#include "utils/injection_point.h"2) This can not only happen in error case but also in normal cases
where the tablesync worker is slower as shown in the script to
reproduce, we can update the commit message accordingly:
In most situations the tablesync worker will drop the corresponding
origin before it
finishes executing, but if an error causes the tablesync worker to
fail just prior to
dropping the origin, the apply worker will later find the origin and drop it.
Thanks for the test and confirming the fix. Fixed the comments.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
HEAD-v2-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchapplication/octet-stream; name=HEAD-v2-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchDownload
From cec0a51f91175343525900b6b8c0985f0b3e3385 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Mon, 14 Jul 2025 06:04:38 -0400
Subject: [PATCH v2] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, or if it is delayed or it didn't get cpu time, the apply worker
could find the origin and attempt to drop the origin. During this time if the
user simultaneously issues an ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken. The ALTER SUBSCRIPTION code takes a AccessExclusiveLock
on SubscriptionRelationId initially and now the fix is for the apply worker to also
take an AccessShareLock on SubscriptionRelationId prior to dropping any origins.
---
src/backend/replication/logical/tablesync.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e4fd634..c410bde 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -492,7 +492,13 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Also lock SubscriptionRelationId with AccessShareLock to
+ * prevent any deadlocks with the user concurrently performing
+ * refresh on the subscription.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
--
1.8.3.1
On Wed, Jul 16, 2025 at 8:38 AM Ajin Cherian <itsajin@gmail.com> wrote:
Thanks for the test and confirming the fix. Fixed the comments.
* origin. So passing missing_ok = true.
+ *
+ * Also lock SubscriptionRelationId with AccessShareLock to
+ * prevent any deadlocks with the user concurrently performing
+ * refresh on the subscription.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
It seems the patch assumes that the above lock is sufficient because
AlterSubscription will take an AcessExclusiveLock on the same
subscription. So, with this proposal, if both of those become
serialized then the other locking order may not matter. Am I correct
or is there any other theory you have in mind?
If that is the theory then I think we are missing cases where the
apply worker and Alter subscription operates on different
subscriptions.
Consider AlterSubscription_refresh() takes first AccessExclusiveLock
on SubscriptionRelRelationId and then ExclusiveLock on
ReplicationOriginRelationId via replorigin_drop_by_name() . The apply
worker first takes ExclusiveLock on ReplicationOriginRelationId via
replorigin_drop_by_name() and then RowExclusiveLock on
SubscriptionRelRelationId via UpdateSubscriptionRelState(). Won't such
a scenario taking conflicting locks in reverse order can lead to
deadlock at least in PG15?
--
With Regards,
Amit Kapila.
On Thu, Jul 17, 2025 at 4:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
It seems the patch assumes that the above lock is sufficient because
AlterSubscription will take an AcessExclusiveLock on the same
subscription. So, with this proposal, if both of those become
serialized then the other locking order may not matter. Am I correct
or is there any other theory you have in mind?If that is the theory then I think we are missing cases where the
apply worker and Alter subscription operates on different
subscriptions.Consider AlterSubscription_refresh() takes first AccessExclusiveLock
on SubscriptionRelRelationId and then ExclusiveLock on
ReplicationOriginRelationId via replorigin_drop_by_name() . The apply
worker first takes ExclusiveLock on ReplicationOriginRelationId via
replorigin_drop_by_name() and then RowExclusiveLock on
SubscriptionRelRelationId via UpdateSubscriptionRelState(). Won't such
a scenario taking conflicting locks in reverse order can lead to
deadlock at least in PG15?
Yes, this is correct. I have also verified this in my testing that
when there is a second subscription, a deadlock can still happen with
my previous patch. I have now modified the patch in tablesync worker
to take locks on both SubscriptionRelationId and
SubscriptionRelRelationId prior to taking lock on
ReplicationOriginRelationId. I have also found that there is a similar
problem in DropSubscription() where lock on SubscriptionRelRelationId
is not taken before dropping the tracking origin. I've also modified
the signature of UpdateSubscriptionRelState to take a bool
"lock_needed" which if false, the SubscriptionRelationId is not
locked. I've made a new version of the patch on PG_15.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
v3-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=v3-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From 615f51df20cb5c56ad7b2894d3e4e8b0d899abe8 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Mon, 21 Jul 2025 09:16:59 -0400
Subject: [PATCH v3] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 5 ++--
src/backend/commands/subscriptioncmds.c | 11 ++++++++-
src/backend/replication/logical/tablesync.c | 37 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
4 files changed, 46 insertions(+), 9 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index add51ca..518a25e 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -324,7 +324,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool lock_needed)
{
Relation rel;
HeapTuple tup;
@@ -332,7 +332,8 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ if (lock_needed)
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 334717c..c0df1e5 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1323,7 +1323,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
void
DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
{
- Relation rel;
+ Relation rel, sub_rel;
ObjectAddress myself;
HeapTuple tup;
Oid subid;
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
* Note that the state can't change because we have already stopped both
* the apply and tablesync workers and they can't restart because of
* exclusive lock on the subscription.
+ *
+ * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+ * deadlock with apply workers of other subscriptions trying
+ * to drop tracking origin.
*/
+ sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
rstates = GetSubscriptionNotReadyRelations(subid);
foreach(lc, rstates)
{
@@ -1493,6 +1499,9 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
snprintf(originname, sizeof(originname), "pg_%u", subid);
replorigin_drop_by_name(originname, true, false);
+ /* Once the origin tracking has been dropped, we can release lock */
+ table_close(sub_rel, AccessExclusiveLock);
+
/*
* Tell the cumulative stats system that the subscription is getting
* dropped.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e6159ac..9aee210 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -314,7 +314,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ true);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
finish_sync_worker();
}
else
@@ -379,6 +379,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -470,7 +471,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take AccessExclusiveLock on SubscriptionRelRelationId to
+ * prevent any deadlocks with user concurrently performing
+ * refresh on the subscription.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -482,7 +493,8 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, false);
+
}
}
else
@@ -533,7 +545,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, AccessExclusiveLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -593,6 +612,12 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* close and unlock table if opened*/
+ if (rel)
+ {
+ table_close(rel, AccessExclusiveLock);
+ }
+
if (started_tx)
{
CommitTransactionCommand();
@@ -1310,7 +1335,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ true);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1431,7 +1457,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ true);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 9df99c3..a1c5bb1 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -83,7 +83,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool lock_needed);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
locked. I've made a new version of the patch on PG_15.
I've made a similar fix on HEAD just so that the code is now
consistent. I don't think the similar problem (deadlock between two
different subscriptions trying to drop tracking origin) occurs on
HEAD with my previous patch, as the way origins are dropped are
different on HEAD. On HEAD, while dropping origin, a RowExclusiveLock
lock is taken on ReplicationOriginRelationId and then an
AccessExclusiveLock is taken on the particular origin. Since the
particular origin will be different on different subscriptions, the
similar deadlock will not happen. But just to keep code consistent, I
have made a similar fix.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
HEAD-v3-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchapplication/octet-stream; name=HEAD-v3-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchDownload
From 79176f05def67a3cd2236c1ad9465959860d26cc Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Tue, 22 Jul 2025 07:45:26 -0400
Subject: [PATCH v3] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, or if it is delayed or it didn't get cpu time, the apply worker
could find the origin and attempt to drop the origin. During this time if the
user simultaneously issues an ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken.
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 5 +++--
src/backend/commands/subscriptioncmds.c | 10 ++++++++-
src/backend/replication/logical/tablesync.c | 35 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
4 files changed, 44 insertions(+), 8 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 1395032..7894314 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -319,7 +319,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool lock_needed)
{
Relation rel;
HeapTuple tup;
@@ -327,7 +327,8 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ if (lock_needed)
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index e23b0de..cbcede1 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1625,7 +1625,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
void
DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
{
- Relation rel;
+ Relation rel, sub_rel;
ObjectAddress myself;
HeapTuple tup;
Oid subid;
@@ -1772,7 +1772,12 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
* Note that the state can't change because we have already stopped both
* the apply and tablesync workers and they can't restart because of
* exclusive lock on the subscription.
+ *
+ * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+ * deadlock with apply workers of other subscriptions trying to drop
+ * tracking origin.
*/
+ sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
rstates = GetSubscriptionRelations(subid, true);
foreach(lc, rstates)
{
@@ -1795,6 +1800,9 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
replorigin_drop_by_name(originname, true, false);
}
+ /* Once the origin tracking has been dropped, we can release lock */
+ table_close(sub_rel, AccessExclusiveLock);
+
/* Clean up dependencies */
deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e4fd634..10299d6 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -316,7 +316,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ true);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -425,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +494,18 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Also lock SubscriptionRelationId with AccessShareLock and
+ * take AccessExclusiveLock on SubscriptionRelRelationId to
+ * prevent any deadlocks with the user concurrently performing
+ * refresh on the subscription.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -504,7 +517,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, false);
}
}
else
@@ -555,7 +568,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, AccessExclusiveLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -622,6 +642,11 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* close and unlock table if opened */
+ if (rel)
+ table_close(rel, AccessExclusiveLock);
+
+
if (started_tx)
{
/*
@@ -1413,7 +1438,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ true);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1546,7 +1572,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ true);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index c91797c..d29a608 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -85,7 +85,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool lock_needed);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
Yes, this is correct. I have also verified this in my testing that
when there is a second subscription, a deadlock can still happen with
my previous patch. I have now modified the patch in tablesync worker
to take locks on both SubscriptionRelationId and
SubscriptionRelRelationId prior to taking lock on
ReplicationOriginRelationId. I have also found that there is a similar
problem in DropSubscription() where lock on SubscriptionRelRelationId
is not taken before dropping the tracking origin. I've also modified
the signature of UpdateSubscriptionRelState to take a bool
"lock_needed" which if false, the SubscriptionRelationId is not
locked. I've made a new version of the patch on PG_15.
Review comments:
================
1.
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
Why did you acquire AccessExclusiveLock here when the current code has
RowExclusiveLock? It should be RowExclusiveLock.
2.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take AccessExclusiveLock on SubscriptionRelRelationId to
+ * prevent any deadlocks with user concurrently performing
+ * refresh on the subscription.
*/
Try to tell in the comments that we want to keep the locking order
same as DDL commands to prevent deadlocks.
3.
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, AccessExclusiveLock);
+ rel = NULL;
We don't need to explicitly release lock on table_close, it will be
done at transaction end, so use NoLock here as we do in current HEAD
code.
4.
DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
{
- Relation rel;
+ Relation rel, sub_rel;
ObjectAddress myself;
HeapTuple tup;
Oid subid;
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt,
bool isTopLevel)
* Note that the state can't change because we have already stopped both
* the apply and tablesync workers and they can't restart because of
* exclusive lock on the subscription.
+ *
+ * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+ * deadlock with apply workers of other subscriptions trying
+ * to drop tracking origin.
*/
+ sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
in DropSubscription. Kindly test again after fixing the first comment
above.
--
With Regards,
Amit Kapila.
Dear Ajin,
Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:
Order target level
1 pg_subscription row exclusive
2 pg_subscription, my tuple access exclusive
3 pg_subscription_rel access exclusive
4 pg_replication_origin excluive
In contrast, apply worker works like:
Order target level
1 pg_replication_origin excluive
2 pg_subscription, my tuple access share
3 pg_subscrition_rel row exclusive
Thus a backend may wait at step 4, and apply worker can stuck at step 2 or 3.
Below are my comments:
```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
```
This change is not needed.
```
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
```
I feel it is too strong, isn't it enough to use row exclusive as initially used?
```
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool lock_needed)
```
I feel the name `lock_needed` is bit misleading, because the function still opens
the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?
```
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
* Note that the state can't change because we have already stopped both
* the apply and tablesync workers and they can't restart because of
* exclusive lock on the subscription.
+ *
+ * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+ * deadlock with apply workers of other subscriptions trying
+ * to drop tracking origin.
*/
+ sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
```
Hmm. Per my understanding, DropSubscription acquires below locks till it reaches
replorigin_drop_by_name().
Order target level
1 pg_subscription access exclusive
2 pg_subscription, my tuple access exclusive
3 pg_replication_origin excluive
IIUC we must preserve the ordering, not the target of locks.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Jul 23, 2025 at 2:42 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Ajin,
Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:Order target level
1 pg_subscription row exclusive
2 pg_subscription, my tuple access exclusive
3 pg_subscription_rel access exclusive
4 pg_replication_origin excluiveIn contrast, apply worker works like:
Order target level
1 pg_replication_origin excluive
2 pg_subscription, my tuple access share
3 pg_subscrition_rel row exclusiveThus a backend may wait at step 4, and apply worker can stuck at step 2 or 3.
Below are my comments:
```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
```
This change is not needed.``` + if (!rel) + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); + ```I feel it is too strong, isn't it enough to use row exclusive as initially used?
``` UpdateSubscriptionRelState(Oid subid, Oid relid, char state, - XLogRecPtr sublsn) + XLogRecPtr sublsn, bool lock_needed) ```I feel the name `lock_needed` is bit misleading, because the function still opens
the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?
I think if we lock in a caller, we don't need to use any lock during
table_open. We can use the parameter name as already_locked as we do
at some other places in the code.
--
With Regards,
Amit Kapila.
Dear Ajin,
Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:
I forgot to confirm one point. For which branch should be backpatch? Initially
it was reported only on PG15 [1]/messages/by-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com, but I found 021_alter_sub_pub could fail on PG14.
Regarding the PG13, it may not be affected because the replication origin seemed
not to be used for the table sync. It was introduced for ce0fdbfe97.
[1]: /messages/by-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Jul 23, 2025 at 4:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
Yes, this is correct. I have also verified this in my testing that
when there is a second subscription, a deadlock can still happen with
my previous patch. I have now modified the patch in tablesync worker
to take locks on both SubscriptionRelationId and
SubscriptionRelRelationId prior to taking lock on
ReplicationOriginRelationId. I have also found that there is a similar
problem in DropSubscription() where lock on SubscriptionRelRelationId
is not taken before dropping the tracking origin. I've also modified
the signature of UpdateSubscriptionRelState to take a bool
"lock_needed" which if false, the SubscriptionRelationId is not
locked. I've made a new version of the patch on PG_15.Review comments: ================ 1. + if (!rel) + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);Why did you acquire AccessExclusiveLock here when the current code has
RowExclusiveLock? It should be RowExclusiveLock.
Yes, you are correct. I have replaced it with RowExclusiveLock.
2. + * + * Lock SubscriptionRelationId with AccessShareLock and + * take AccessExclusiveLock on SubscriptionRelRelationId to + * prevent any deadlocks with user concurrently performing + * refresh on the subscription. */Try to tell in the comments that we want to keep the locking order
same as DDL commands to prevent deadlocks.
Modified.
3. + * Also close any tables prior to the commit. */ + if (rel) + { + table_close(rel, AccessExclusiveLock); + rel = NULL;We don't need to explicitly release lock on table_close, it will be
done at transaction end, so use NoLock here as we do in current HEAD
code.
Done.
4. DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) { - Relation rel; + Relation rel, sub_rel; ObjectAddress myself; HeapTuple tup; Oid subid; @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) * Note that the state can't change because we have already stopped both * the apply and tablesync workers and they can't restart because of * exclusive lock on the subscription. + * + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any + * deadlock with apply workers of other subscriptions trying + * to drop tracking origin. */ + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
in DropSubscription. Kindly test again after fixing the first comment
above.
--
Yes, it was failing because I was taking AccessExclusiveLock in the
apply worker, and that was causing the deadlock in my testing. I
tested this worked.
On Wed, Jul 23, 2025 at 7:12 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Ajin,
Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:Order target level
1 pg_subscription row exclusive
2 pg_subscription, my tuple access exclusive
3 pg_subscription_rel access exclusive
4 pg_replication_origin excluiveIn contrast, apply worker works like:
Order target level
1 pg_replication_origin excluive
2 pg_subscription, my tuple access share
3 pg_subscrition_rel row exclusiveThus a backend may wait at step 4, and apply worker can stuck at step 2 or 3.
Yes, that is correct.
Below are my comments:
```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
```
This change is not needed.
Removed.
``` + if (!rel) + rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); + ```I feel it is too strong, isn't it enough to use row exclusive as initially used?
Yes, agree. Fixed.
``` UpdateSubscriptionRelState(Oid subid, Oid relid, char state, - XLogRecPtr sublsn) + XLogRecPtr sublsn, bool lock_needed) ```I feel the name `lock_needed` is bit misleading, because the function still opens
the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?
I have modified it to not take lock while table_open as well and
changed the name of the variable to already_locked.
``` @@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) * Note that the state can't change because we have already stopped both * the apply and tablesync workers and they can't restart because of * exclusive lock on the subscription. + * + * Lock pg_subscription_rel with AccessExclusiveLock to prevent any + * deadlock with apply workers of other subscriptions trying + * to drop tracking origin. */ + sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock); ```Hmm. Per my understanding, DropSubscription acquires below locks till it reaches
replorigin_drop_by_name().Order target level
1 pg_subscription access exclusive
2 pg_subscription, my tuple access exclusive
3 pg_replication_origin excluiveIIUC we must preserve the ordering, not the target of locks.
I have removed this change as this does not now conflict with the apply worker.
Two patches attached. One for HEAD and the other for PG_15.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
HEAD-v4-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchapplication/x-patch; name=HEAD-v4-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchDownload
From 74d9751501b0f7de40636bfe2beeb6ba65b19df5 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Wed, 23 Jul 2025 07:06:36 -0400
Subject: [PATCH v4] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, or if it is delayed or it didn't get cpu time, the apply worker
could find the origin and attempt to drop the origin. During this time if the
user simultaneously issues an ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken.
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 12 ++++++----
src/backend/replication/logical/tablesync.c | 35 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
3 files changed, 40 insertions(+), 9 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 63c2992..18ffbe7 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -320,7 +320,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 3fea0a0..80d29d2 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -316,7 +316,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -425,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +494,18 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take RowExclusiveLock on SubscriptionRelRelationId to
+ * keep the same locking order as refresh from an
+ * user issued DDL command to prevent any deadlocks.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -504,7 +517,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
@@ -555,7 +568,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -623,6 +643,11 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
+
if (started_tx)
{
/*
@@ -1414,7 +1439,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1547,7 +1573,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index c91797c..f458447 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -85,7 +85,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_15-v4-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/x-patch; name=PG_15-v4-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From 3b6e923614a93040e1487bd2b19f5523d1ff81ed Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Mon, 21 Jul 2025 09:16:59 -0400
Subject: [PATCH v4] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 12 ++++++----
src/backend/replication/logical/tablesync.c | 36 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
3 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index add51ca..d48ed20 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -324,7 +324,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -332,9 +332,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e6159ac..1015b2b 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -314,7 +314,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
finish_sync_worker();
}
else
@@ -379,6 +379,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -470,7 +471,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take RowExclusiveLock on SubscriptionRelRelationId to
+ * keep the same locking order as refresh from an
+ * user issued DDL command to prevent any deadlocks.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -482,7 +493,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
@@ -533,7 +544,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -593,6 +611,12 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ }
+
if (started_tx)
{
CommitTransactionCommand();
@@ -1310,7 +1334,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1431,7 +1456,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 9df99c3..0fcff92 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -83,7 +83,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Ajin,
Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:I forgot to confirm one point. For which branch should be backpatch? Initially
it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14.
Yes, here's a patch for PG14 as well, based on REL_14_STABLE.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
PG_14-v4-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_14-v4-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From 52039bba820d633bbb69615994d7bb78a512ae4e Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Thu, 24 Jul 2025 07:54:27 -0400
Subject: [PATCH v4] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 12 ++++++----
src/backend/replication/logical/tablesync.c | 36 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
3 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..cd799d3 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -282,7 +282,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -290,9 +290,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 26b71de..2471609 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -300,7 +300,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -366,6 +367,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -463,7 +465,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take RowExclusiveLock on SubscriptionRelRelationId to
+ * keep the same locking order as refresh from an
+ * user issued DDL command to prevent any deadlocks.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -475,7 +487,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
@@ -526,7 +538,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -586,6 +605,12 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ }
+
if (started_tx)
{
CommitTransactionCommand();
@@ -1041,7 +1066,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
pgstat_report_stat(false);
@@ -1136,8 +1162,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
-
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
copy_table_done:
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..7baeda5 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -84,7 +84,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
Dear Ajin,
Thanks for patches! While checking, I recalled that the backpatch policy [1]https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-API-ABI-STABILITY-GUIDANCE.
We must not modify definitions of opened functions but this does. Can you define
another function like UpdateSubscriptionRelStateEx or something on the back
branches?
Another comment:
```
@@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ rel = table_open(SubscriptionRelRelationId, NoLock);
```
Can we assert that RowExclusiveLock for pg_subscription_rel has already been
acquired, by using CheckRelationOidLockedByMe() family?
Also, I'm now bit confusing for which branches are really affected. Can you
create all patches for branches, attach at the same e-mail and add some summary
what you want to fix?
E.g., you provided a patch for HEAD/PG15/PG14, what about PG18, 17, 16 and 13?
If not needed, why?
[1]: https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-API-ABI-STABILITY-GUIDANCE
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, 24 Jul 2025 at 17:45, Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:Dear Ajin,
Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:I forgot to confirm one point. For which branch should be backpatch? Initially
it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14.Yes, here's a patch for PG14 as well, based on REL_14_STABLE.
I believe the patch is trying the address the following issues reported:
1) 024_add_drop_pub.pl test failure reported on REL_16_STABLE at [1]/messages/by-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com
2) Different variation of the above issue on head with the script
attached at [2]/messages/by-id/CALDaNm3PrTkVc2uxMyQTkqw0sg7O6i0EXe1jJo9CzOyW2gFS+Q@mail.gmail.com
3) Amit reported different variant of it for PG15 with the patch at [3]/messages/by-id/CAA4eK1KPa1dJrcd=XfOWx-r37eZudKQRqct0tY1R7vnUw0OabQ@mail.gmail.com
I felt these issues are not applicable to the PG13 branch as
Replication origin creation for table sync is not there in the PG13
branch. So the fix is required from master to PG14 branches.
The patch does not apply on the PG16 branch.
In PG15 you have the following code:
+ /* Close table if opened */
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ }
In master branch you have the following code:
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
+
We can keep the fix consistent in both cases and additional newlines
not required in the master branch.
[1]: /messages/by-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com
[2]: /messages/by-id/CALDaNm3PrTkVc2uxMyQTkqw0sg7O6i0EXe1jJo9CzOyW2gFS+Q@mail.gmail.com
[3]: /messages/by-id/CAA4eK1KPa1dJrcd=XfOWx-r37eZudKQRqct0tY1R7vnUw0OabQ@mail.gmail.com
Regards,
Vignesh
On Fri, Jul 25, 2025 at 6:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Ajin,
Thanks for patches! While checking, I recalled that the backpatch policy [1].
We must not modify definitions of opened functions but this does. Can you define
another function like UpdateSubscriptionRelStateEx or something on the back
branches?Another comment:
```
@@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock); - - rel = table_open(SubscriptionRelRelationId, RowExclusiveLock); + if (already_locked) + rel = table_open(SubscriptionRelRelationId, NoLock); ```Can we assert that RowExclusiveLock for pg_subscription_rel has already been
acquired, by using CheckRelationOidLockedByMe() family?
Thanks for your review Kuroda-san, I have changed the logic to not use
already_locked and instead check if the locks are taken inside
UpdateSubscriptionRelState itself. I've tested this and this works
fine. If this logic is acceptable to the reviewers I can update the
other patches also in a similar way.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
PG_15-v5-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_15-v5-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From bf90fe696e3384885f8a993ab03821ab62c80e72 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Mon, 28 Jul 2025 08:00:23 -0400
Subject: [PATCH v5] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 22 ++++++++++++++++++++--
src/backend/replication/logical/tablesync.c | 22 ++++++++++++++++++++++
2 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index add51ca..657d6ff 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -331,10 +331,28 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
bool nulls[Natts_pg_subscription_rel];
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
+ bool sub_rel_locked, sub_locked;
+ LOCKTAG tag;
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ /*
+ * Check if we already hold locks for SubscriptionRelRelationId
+ * and SubscriptionRelationId. If we do, we don't need to take
+ * them again.
+ */
+ sub_rel_locked = CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true);
+
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ sub_locked = LockHeldByMe(&tag, AccessShareLock);
+
+ if (sub_rel_locked && sub_locked)
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e6159ac..6d5024b 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -379,6 +379,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -470,7 +471,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take RowExclusiveLock on SubscriptionRelRelationId to
+ * keep the same locking order as refresh from an
+ * user issued DDL command to prevent any deadlocks.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -533,7 +544,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -593,6 +611,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
CommitTransactionCommand();
--
1.8.3.1
On Fri, Jul 25, 2025 at 6:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Also, I'm now bit confusing for which branches are really affected. Can you
create all patches for branches, attach at the same e-mail and add some summary
what you want to fix?
E.g., you provided a patch for HEAD/PG15/PG14, what about PG18, 17, 16 and 13?
If not needed, why?
Yes, patches are needed for all these versions except PG13 because
replicating origin is not there in PG13. If this logic is fine, I will
create patches for all these branches as well and send them tomorrow.
regards,
Ajin Cherian
Fujitsu Australia
Dear Ajin,
Thanks for your review Kuroda-san, I have changed the logic to not use
already_locked and instead check if the locks are taken inside
UpdateSubscriptionRelState itself. I've tested this and this works
fine. If this logic is acceptable to the reviewers I can update the
other patches also in a similar way.
Thanks for updates.
However, I found that functions like LockHeldByMe(), CheckRelationOidLockedByMe()
and LWLockHeldByMe() have been used only for the debug build. Functions like
ProcArraySetReplicationSlotXmin() and MarkAsPrepared() can remove the flag from
the argument but they are retained till now.
Based on that, I suggest adding new argument (or add new Ex function for bank branches)
and do the assertion check when the assertion is enabled in this build. Thought?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Tue, Jul 29, 2025 at 7:23 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Thanks for your review Kuroda-san, I have changed the logic to not use
already_locked and instead check if the locks are taken inside
UpdateSubscriptionRelState itself. I've tested this and this works
fine. If this logic is acceptable to the reviewers I can update the
other patches also in a similar way.Thanks for updates.
However, I found that functions like LockHeldByMe(), CheckRelationOidLockedByMe()
and LWLockHeldByMe() have been used only for the debug build. Functions like
ProcArraySetReplicationSlotXmin() and MarkAsPrepared() can remove the flag from
the argument but they are retained till now.
Based on that, I suggest adding new argument (or add new Ex function for bank branches)
and do the assertion check when the assertion is enabled in this build. Thought?
Yes, that makes sense to me. For HEAD and PG18, we can still add a new
argument to the API. For other bank branches, it is better to use a
new Ex function as suggested by Kuroda-San.
--
With Regards,
Amit Kapila.
On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Yes, that makes sense to me. For HEAD and PG18, we can still add a new
argument to the API. For other bank branches, it is better to use a
new Ex function as suggested by Kuroda-San.
Here are the updated patches.
I have added "Ex functions" for back branches (PG_17 and earlier) ,
which also have Asserts for making sure locks are taken. For PG_18 and
HEAD, I've used the extra parameter already_locked.
PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patch
is for both PG_14 and PG_15 and
PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION
is for both PG_18 and HEAD.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From 82cb39948b604474e8a5d66d11608f45cf13061b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Mon, 28 Jul 2025 08:00:23 -0400
Subject: [PATCH v6] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 32 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 +++++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 54 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index add51ca..01f7858 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -323,18 +323,23 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
{
Relation rel;
HeapTuple tup;
bool nulls[Natts_pg_subscription_rel];
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
+ LOCKTAG tag PG_USED_FOR_ASSERTS_ONLY;
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -369,6 +374,25 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ Relation rel;
+
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn);
+
+ /* close table */
+ table_close(rel, NoLock);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e6159ac..440a0bb 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -379,6 +379,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -470,7 +471,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -480,9 +490,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn);
}
}
else
@@ -533,7 +543,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -593,6 +610,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 9df99c3..8913e17 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -84,6 +84,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_16-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_16-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From 6a9954df6811f8e8096410746b77091eff186c41 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Tue, 29 Jul 2025 03:25:50 -0400
Subject: [PATCH v6] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 32 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 +++++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 54 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index d07f88c..63ae485 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -273,18 +273,23 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
{
Relation rel;
HeapTuple tup;
bool nulls[Natts_pg_subscription_rel];
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
+ LOCKTAG tag PG_USED_FOR_ASSERTS_ONLY;
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -319,6 +324,25 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ Relation rel;
+
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn);
+
+ /* close table */
+ table_close(rel, NoLock);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ca88133..67b5ea9 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -425,6 +425,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +493,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -502,9 +512,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn);
}
}
else
@@ -555,7 +565,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -621,6 +638,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
/*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 60a2bcc..df09c75 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -84,6 +84,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchapplication/octet-stream; name=PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchDownload
From f5febcd8bfbfb83ceeb19f83605c6b125852cfc6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Tue, 29 Jul 2025 04:19:15 -0400
Subject: [PATCH v6] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, or if it is delayed or it didn't get cpu time, the apply worker
could find the origin and attempt to drop the origin. During this time if the
user simultaneously issues an ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken.
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 12 ++++++----
src/backend/replication/logical/tablesync.c | 34 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
3 files changed, 39 insertions(+), 9 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 1395032..a61d734 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -319,7 +319,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -327,9 +327,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index c90f23e..9ea7b2c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -316,7 +316,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -425,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +494,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -504,7 +516,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
@@ -555,7 +567,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -622,6 +641,11 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
+
if (started_tx)
{
/*
@@ -1413,7 +1437,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1546,7 +1571,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index c91797c..f458447 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -85,7 +85,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_17-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_17-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From ad42f956486ab72c8669614fa97e3b08147cdac4 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Tue, 29 Jul 2025 03:57:02 -0400
Subject: [PATCH v6] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 32 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 +++++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 54 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 9efc915..e23a7e2 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -287,18 +287,23 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
{
Relation rel;
HeapTuple tup;
bool nulls[Natts_pg_subscription_rel];
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
+ LOCKTAG tag PG_USED_FOR_ASSERTS_ONLY;
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -333,6 +338,25 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ Relation rel;
+
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn);
+
+ /* close table */
+ table_close(rel, NoLock);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index c8893ff..157f417 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -426,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -493,7 +494,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -503,9 +513,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn);
}
}
else
@@ -556,7 +566,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -623,6 +640,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
/*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 8244ad5..7a28742 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -86,6 +86,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Yes, that makes sense to me. For HEAD and PG18, we can still add a new
argument to the API. For other bank branches, it is better to use a
new Ex function as suggested by Kuroda-San.Here are the updated patches.
I noticed the order of LockSharedObject and table lock is different
here compared to disable subscription:
@@ -492,7 +494,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin
tracking as if there is any
* error while dropping we won't
restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in
the same order as we
+ * are doing during DDL commands to
avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+
LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0,
AccessShareLock);
+
+ if (!rel)
+ rel =
table_open(SubscriptionRelRelationId, RowExclusiveLock);
DisableSubscription(Oid subid)
{
....
rel = table_open(SubscriptionRelationId, RowExclusiveLock);
tup = SearchSysCacheCopy1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid));
if (!HeapTupleIsValid(tup))
elog(ERROR, "cache lookup failed for subscription %u", subid);
LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
....
Do we need to enforce consistent lock ordering here, or is it safe to
ignore because we're only using AccessShareLock?
Regards,
Vignesh
On Tue, Jul 29, 2025 at 4:26 PM vignesh C <vignesh21@gmail.com> wrote:
On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Yes, that makes sense to me. For HEAD and PG18, we can still add a new
argument to the API. For other bank branches, it is better to use a
new Ex function as suggested by Kuroda-San.Here are the updated patches.
I noticed the order of LockSharedObject and table lock is different
here compared to disable subscription:
Note that catalog tables are not the same. The DisableSubscription()
takes lock on pg_subscrition catalog and then on a particular
subscription using subscription_id. Here, we are first taking lock on
a particular subscription and then on pg_subscription_rel. So, this
seems to follow exactly the order which we should follow and I don't
see any problem here. Please let us know if you have a specific
concern.
--
With Regards,
Amit Kapila.
Dear Ajin,
I have added "Ex functions" for back branches (PG_17 and earlier) ,
which also have Asserts for making sure locks are taken. For PG_18 and
HEAD, I've used the extra parameter already_locked.
PG_14_15-v6-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.p
atch
is for both PG_14 and PG_15 and
PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION
is for both PG_18 and HEAD.
Thanks for creating the patch!
I feel the existing function here should be the wrapper of Ex function, so attached
patch did not match the expectation. table_open() and table_close() are called
both in UpdateSubscriptionRelState() and UpdateSubscriptionRelStateEx().
Another issue is that the variable "tag" is used by SET_LOCKTAG_OBJECT() even
without the debug build.
How do you feel the .diff file can be applied atop PG17 patch? It is mainly
same as v4 patch but has some assertion.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
kuroda.diffsapplication/octet-stream; name=kuroda.diffsDownload
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index e23a7e2ee97..35d1baf3725 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -288,22 +288,33 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked);
{
Relation rel;
HeapTuple tup;
bool nulls[Natts_pg_subscription_rel];
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LOCKTAG tag PG_USED_FOR_ASSERTS_ONLY;
- Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
- RowExclusiveLock, true));
-
- SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
- Assert(LockHeldByMe(&tag, AccessShareLock, true));
-
- rel = table_open(SubscriptionRelRelationId, NoLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
+#endif
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+#ifdef USE_ASSERT_CHECKING
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -344,16 +355,7 @@ void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn)
{
- Relation rel;
-
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
-
- UpdateSubscriptionRelStateEx(subid, relid, state, sublsn);
-
- /* close table */
- table_close(rel, NoLock);
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
}
/*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 7a287423d75..0afda883219 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,7 +87,7 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
How do you feel the .diff file can be applied atop PG17 patch? It is mainly
same as v4 patch but has some assertion.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
How do you feel the .diff file can be applied atop PG17 patch? It is mainly
same as v4 patch but has some assertion.
Sorry for my interrupted message. I noticed only I attached old version patch.
PSA the correct version.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
kuroda.diffsapplication/octet-stream; name=kuroda.diffsDownload
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index e23a7e2ee97..f1611312e13 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -288,22 +288,34 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
bool nulls[Natts_pg_subscription_rel];
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LOCKTAG tag PG_USED_FOR_ASSERTS_ONLY;
- Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
- RowExclusiveLock, true));
-
- SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
- Assert(LockHeldByMe(&tag, AccessShareLock, true));
-
- rel = table_open(SubscriptionRelRelationId, NoLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
+#endif
+
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+#ifdef USE_ASSERT_CHECKING
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -344,16 +356,7 @@ void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn)
{
- Relation rel;
-
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
-
- UpdateSubscriptionRelStateEx(subid, relid, state, sublsn);
-
- /* close table */
- table_close(rel, NoLock);
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
}
/*
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 157f41720c5..1da5a7e7ef8 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -515,7 +515,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 7a287423d75..0afda883219 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -87,7 +87,7 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
On Tue, 29 Jul 2025 at 14:46, Ajin Cherian <itsajin@gmail.com> wrote:
On Tue, Jul 29, 2025 at 1:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Yes, that makes sense to me. For HEAD and PG18, we can still add a new
argument to the API. For other bank branches, it is better to use a
new Ex function as suggested by Kuroda-San.Here are the updated patches.
The only case where the ordering of lock now is different is in
DropSubscription, but in this case we take an AccessExclusiveLock on
pg_subscription which should prevent this deadlock from occurring:
...
replorigin_drop_by_name(originname, true, false);
}
/* Clean up dependencies */
deleteSharedDependencyRecordsFor(SubscriptionRelationId, subid, 0);
/* Remove any associated relation synchronization states. */
RemoveSubscriptionRel(subid, InvalidOid);
...
Regards,
Vignesh
On Tue, Jul 29, 2025 at 10:45 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
How do you feel the .diff file can be applied atop PG17 patch? It is mainly
same as v4 patch but has some assertion.Sorry for my interrupted message. I noticed only I attached old version patch.
PSA the correct version.
Attaching the updated patches with the changes you requested. I've
also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
so that everything is together in one mail.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
PG_17-v7-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_17-v7-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From c7ccd6618afc6d020bc2d8b81000a435df3bfa8b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Wed, 30 Jul 2025 05:29:24 -0400
Subject: [PATCH v7] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 37 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 ++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 58 insertions(+), 8 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 9efc915..f161131 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -287,8 +287,8 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -296,9 +296,26 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
+#endif
+
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+#ifdef USE_ASSERT_CHECKING
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -333,6 +350,16 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index c8893ff..1da5a7e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -426,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -493,7 +494,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -503,9 +513,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn, true);
}
}
else
@@ -556,7 +566,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -623,6 +640,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
/*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 8244ad5..0afda88 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -86,6 +86,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_14-15-v7-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_14-15-v7-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From 9c40e8474622208fec7a6ccc94da2d82fbb2999c Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Wed, 30 Jul 2025 02:17:44 -0400
Subject: [PATCH v7] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 37 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 ++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 58 insertions(+), 8 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 29fc421..288f970 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -281,8 +281,8 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -290,9 +290,26 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
+#endif
+
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+#ifdef USE_ASSERT_CHECKING
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock));
+#endif
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -327,6 +344,16 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 26b71de..3e9de50 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -366,6 +366,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -463,7 +464,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -473,9 +483,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn, true);
}
}
else
@@ -526,7 +536,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -586,6 +603,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index ed94f57..f68fa2d 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -85,6 +85,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_16-v7-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_16-v7-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From 7ebce29cea14c878c46baa15029bc382634dc986 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Wed, 30 Jul 2025 05:04:44 -0400
Subject: [PATCH v7] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 37 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 ++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 58 insertions(+), 8 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index d07f88c..d054327 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -273,8 +273,8 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -282,9 +282,26 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
+#endif
+
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+#ifdef USE_ASSERT_CHECKING
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock));
+#endif
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -319,6 +336,16 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ca88133..3fa921a 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -425,6 +425,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +493,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -502,9 +512,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn, true);
}
}
else
@@ -555,7 +565,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -621,6 +638,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
/*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 60a2bcc..7817a30 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -84,6 +84,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchapplication/octet-stream; name=PG_18_HEAD-v6-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchDownload
From f5febcd8bfbfb83ceeb19f83605c6b125852cfc6 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Tue, 29 Jul 2025 04:19:15 -0400
Subject: [PATCH v6] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, or if it is delayed or it didn't get cpu time, the apply worker
could find the origin and attempt to drop the origin. During this time if the
user simultaneously issues an ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken.
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 12 ++++++----
src/backend/replication/logical/tablesync.c | 34 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
3 files changed, 39 insertions(+), 9 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 1395032..a61d734 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -319,7 +319,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -327,9 +327,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index c90f23e..9ea7b2c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -316,7 +316,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -425,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +494,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -504,7 +516,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
@@ -555,7 +567,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -622,6 +641,11 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
+
if (started_tx)
{
/*
@@ -1413,7 +1437,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1546,7 +1571,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index c91797c..f458447 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -85,7 +85,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
Dear Ajin,
Attaching the updated patches with the changes you requested. I've
also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
so that everything is together in one mail.
Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
Can you modify like others do?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Ajin,
Attaching the updated patches with the changes you requested. I've
also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
so that everything is together in one mail.Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
Can you modify like others do?
Updated patch for PG18/HEAD.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
PG_18_HEAD_v7-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchapplication/octet-stream; name=PG_18_HEAD_v7-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchDownload
From 41ebae84557e6c6863c79fb58cca372063f923c9 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Wed, 30 Jul 2025 22:32:55 -0400
Subject: [PATCH v7] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, or if it is delayed or it didn't get cpu time, the apply worker
could find the origin and attempt to drop the origin. During this time if the
user simultaneously issues an ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken.
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 25 +++++++++++++++++----
src/backend/replication/logical/tablesync.c | 34 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
3 files changed, 52 insertions(+), 9 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 63c2992..24534da 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -320,7 +320,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -328,9 +328,26 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
+#endif
+
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+#ifdef USE_ASSERT_CHECKING
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 3fea0a0..d3356bc 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -316,7 +316,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -425,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +494,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -504,7 +516,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
@@ -555,7 +567,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -623,6 +642,11 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
+
if (started_tx)
{
/*
@@ -1414,7 +1438,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1547,7 +1572,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index c91797c..f458447 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -85,7 +85,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
Dear Ajin,
Thanks for updates. I confirmed that reported issues could be fixed by your patch.
I have no comments anymore.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, 31 Jul 2025 at 08:23, Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:Dear Ajin,
Attaching the updated patches with the changes you requested. I've
also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
so that everything is together in one mail.Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
Can you modify like others do?Updated patch for PG18/HEAD.
How about we change the below:
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
+#endif
+
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+
RowExclusiveLock, true));
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+#ifdef USE_ASSERT_CHECKING
+ SET_LOCKTAG_OBJECT(tag, InvalidOid,
SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif
to:
#ifdef USE_ASSERT_CHECKING
LOCKTAG tag;
Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
RowExclusiveLock, true));
SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
Assert(LockHeldByMe(&tag, AccessShareLock, true));
#endif
rel = table_open(SubscriptionRelRelationId, NoLock);
}
The rest looks good to me.
Regards,
Vignesh
On Thu, Jul 31, 2025 at 2:37 PM vignesh C <vignesh21@gmail.com> wrote:
How about we change the below: +#ifdef USE_ASSERT_CHECKING + LOCKTAG tag; +#endif + + Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, + RowExclusiveLock, true)); + + rel = table_open(SubscriptionRelRelationId, NoLock); +#ifdef USE_ASSERT_CHECKING + SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0); + Assert(LockHeldByMe(&tag, AccessShareLock, true)); +#endifto:
#ifdef USE_ASSERT_CHECKING
LOCKTAG tag;
Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
RowExclusiveLock, true));
SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
Assert(LockHeldByMe(&tag, AccessShareLock, true));
#endifrel = table_open(SubscriptionRelRelationId, NoLock);
}
Your suggested change looks better to me.
--
With Regards,
Amit Kapila.
On Thu, Jul 31, 2025 at 7:07 PM vignesh C <vignesh21@gmail.com> wrote:
On Thu, 31 Jul 2025 at 08:23, Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Jul 30, 2025 at 10:33 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:Dear Ajin,
Attaching the updated patches with the changes you requested. I've
also added the unchanged patches for PG_18 and HEAD (PG_18_HEAD-v6*),
so that everything is together in one mail.Thanks for update, but the patch for PG18/HEAD seemed not to have Assert().
Can you modify like others do?Updated patch for PG18/HEAD.
How about we change the below: +#ifdef USE_ASSERT_CHECKING + LOCKTAG tag; +#endif + + Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId, + RowExclusiveLock, true)); + + rel = table_open(SubscriptionRelRelationId, NoLock); +#ifdef USE_ASSERT_CHECKING + SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0); + Assert(LockHeldByMe(&tag, AccessShareLock, true)); +#endifto:
#ifdef USE_ASSERT_CHECKING
LOCKTAG tag;
Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
RowExclusiveLock, true));
SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
Assert(LockHeldByMe(&tag, AccessShareLock, true));
#endifrel = table_open(SubscriptionRelRelationId, NoLock);
}The rest looks good to me.
I've fixed the patches accordingly for all branches.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
PG_14_15-v8-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_14_15-v8-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From d2e2212b3fd0c8ebd8e6f53e052667e4486fc39b Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Thu, 31 Jul 2025 06:45:44 -0400
Subject: [PATCH v8] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 33 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 ++++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 55 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index add51ca..b9e2c39 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -323,8 +323,8 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -332,9 +332,24 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock));
+#endif
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -369,6 +384,16 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index e6159ac..dc0526e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -379,6 +379,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
static HTAB *last_start_times = NULL;
ListCell *lc;
bool started_tx = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -470,7 +471,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* refresh for the subscription where we remove the table
* state and its origin and by this time the origin might be
* already removed. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -480,9 +490,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn, true);
}
}
else
@@ -533,7 +543,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -593,6 +610,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 9df99c3..101a1fe 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -84,6 +84,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_18_HEAD-v8-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchapplication/octet-stream; name=PG_18_HEAD-v8-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patchDownload
From fcd5b0e39c11e270fd2e0f5b34dce33304908212 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Thu, 31 Jul 2025 06:05:13 -0400
Subject: [PATCH v8] Fix a possible deadlock during ALTER SUBSCRIPTION ... DROP
PUBLICATION
In most situations the tablesync worker will drop the corresponding origin before it
finishes executing, but if an error causes the tablesync worker to fail just prior to
dropping the origin, or if it is delayed or it didn't get cpu time, the apply worker
could find the origin and attempt to drop the origin. During this time if the
user simultaneously issues an ALTER SUBSCRIPTION ... DROP PUBLICATION, there is a
possibility of a deadlock between the apply worker and the user process, because of the
order in which the locks are taken.
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 21 +++++++++++++++---
src/backend/replication/logical/tablesync.c | 34 +++++++++++++++++++++++++----
src/include/catalog/pg_subscription_rel.h | 2 +-
3 files changed, 49 insertions(+), 8 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 63c2992..eb99b89 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -320,7 +320,7 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
*/
void
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -328,9 +328,24 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 3fea0a0..d3356bc 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -316,7 +316,8 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
/*
* End streaming so that LogRepWorkerWalRcvConn can be used to drop
@@ -425,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +494,17 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -504,7 +516,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
*/
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
rstate->relid, rstate->state,
- rstate->lsn);
+ rstate->lsn, true);
}
}
else
@@ -555,7 +567,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -623,6 +642,11 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
+
if (started_tx)
{
/*
@@ -1414,7 +1438,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
MyLogicalRepWorker->relstate,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
pgstat_report_stat(true);
@@ -1547,7 +1572,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
MyLogicalRepWorker->relid,
SUBREL_STATE_FINISHEDCOPY,
- MyLogicalRepWorker->relstate_lsn);
+ MyLogicalRepWorker->relstate_lsn,
+ false);
CommitTransactionCommand();
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index c91797c..f458447 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -85,7 +85,7 @@ typedef struct SubscriptionRelState
extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn);
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_16-v8-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_16-v8-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From e8aed96c7f0a1552dccd27a1fce817fb8a5ca1fd Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Thu, 31 Jul 2025 06:35:42 -0400
Subject: [PATCH v8] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 33 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 ++++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 55 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index d07f88c..49cb378 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -273,8 +273,8 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -282,9 +282,24 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock));
+#endif
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -319,6 +334,16 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ca88133..3fa921a 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -425,6 +425,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -492,7 +493,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -502,9 +512,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn, true);
}
}
else
@@ -555,7 +565,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -621,6 +638,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
/*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 60a2bcc..7817a30 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -84,6 +84,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1
PG_17-v8-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchapplication/octet-stream; name=PG_17-v8-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patchDownload
From b1c9387871c8eb765c412124f767feecce39f547 Mon Sep 17 00:00:00 2001
From: Ajin Cherian <itsajin@gmail.com>
Date: Thu, 31 Jul 2025 06:18:08 -0400
Subject: [PATCH v8] Fix a deadlock during ALTER SUBSCRIPTION... DROP
PUBLICATION
When user drops a publication from a subscription, this will result in a
publication refresh on the subscriber which will try and drop any pending origins.
Meanwhile the apply worker could also be trying to cleanup origins. There could be
a deadlock if the order of locking of SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId are not aligned between functions
process_syncing_tables_for_apply() and AlterSubscription_refresh().
The fix is to get locks on SubscriptionRelationId, SubscriptionRelRelationId and
ReplicationOriginRelationId in that order when dropping tracking origins.
---
src/backend/catalog/pg_subscription.c | 33 +++++++++++++++++++++++++----
src/backend/replication/logical/tablesync.c | 27 ++++++++++++++++++++---
src/include/catalog/pg_subscription_rel.h | 2 ++
3 files changed, 55 insertions(+), 7 deletions(-)
diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c
index 9efc915..68e32c4 100644
--- a/src/backend/catalog/pg_subscription.c
+++ b/src/backend/catalog/pg_subscription.c
@@ -287,8 +287,8 @@ AddSubscriptionRelState(Oid subid, Oid relid, char state,
* Update the state of a subscription table.
*/
void
-UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
- XLogRecPtr sublsn)
+UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked)
{
Relation rel;
HeapTuple tup;
@@ -296,9 +296,24 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];
- LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ if (already_locked)
+ {
+#ifdef USE_ASSERT_CHECKING
+ LOCKTAG tag;
- rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ Assert(CheckRelationOidLockedByMe(SubscriptionRelRelationId,
+ RowExclusiveLock, true));
+ SET_LOCKTAG_OBJECT(tag, InvalidOid, SubscriptionRelationId, subid, 0);
+ Assert(LockHeldByMe(&tag, AccessShareLock, true));
+#endif
+
+ rel = table_open(SubscriptionRelRelationId, NoLock);
+ }
+ else
+ {
+ LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+ }
/* Try finding existing mapping. */
tup = SearchSysCacheCopy2(SUBSCRIPTIONRELMAP,
@@ -333,6 +348,16 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
}
/*
+ * Update the state of a subscription table.
+ */
+void
+UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn)
+{
+ UpdateSubscriptionRelStateEx(subid, relid, state, sublsn, false);
+}
+
+/*
* Get state of subscription table.
*
* Returns SUBREL_STATE_UNKNOWN when the table is not in the subscription.
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index c8893ff..1da5a7e 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -426,6 +426,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
ListCell *lc;
bool started_tx = false;
bool should_exit = false;
+ Relation rel = NULL;
Assert(!IsTransactionState());
@@ -493,7 +494,16 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* worker to remove the origin tracking as if there is any
* error while dropping we won't restart it to drop the
* origin. So passing missing_ok = true.
+ *
+ * Lock the subscription and origin in the same order as we
+ * are doing during DDL commands to avoid deadlocks. See
+ * AlterSubscription_refresh.
*/
+ LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+ 0, AccessShareLock);
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+
ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid,
rstate->relid,
originname,
@@ -503,9 +513,9 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
/*
* Update the state to READY only after the origin cleanup.
*/
- UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
- rstate->relid, rstate->state,
- rstate->lsn);
+ UpdateSubscriptionRelStateEx(MyLogicalRepWorker->subid,
+ rstate->relid, rstate->state,
+ rstate->lsn, true);
}
}
else
@@ -556,7 +566,14 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
* This is required to avoid any undetected deadlocks
* due to any existing lock as deadlock detector won't
* be able to detect the waits on the latch.
+ *
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, NoLock);
+ rel = NULL;
+ }
CommitTransactionCommand();
pgstat_report_stat(false);
}
@@ -623,6 +640,10 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
}
}
+ /* Close table if opened */
+ if (rel)
+ table_close(rel, NoLock);
+
if (started_tx)
{
/*
diff --git a/src/include/catalog/pg_subscription_rel.h b/src/include/catalog/pg_subscription_rel.h
index 8244ad5..0afda88 100644
--- a/src/include/catalog/pg_subscription_rel.h
+++ b/src/include/catalog/pg_subscription_rel.h
@@ -86,6 +86,8 @@ extern void AddSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn, bool retain_lock);
extern void UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
XLogRecPtr sublsn);
+extern void UpdateSubscriptionRelStateEx(Oid subid, Oid relid, char state,
+ XLogRecPtr sublsn, bool already_locked);
extern char GetSubscriptionRelState(Oid subid, Oid relid, XLogRecPtr *sublsn);
extern void RemoveSubscriptionRel(Oid subid, Oid relid);
--
1.8.3.1