024_add_drop_pub.pl might fail due to deadlock

Started by Alexander Lakhin · 8 months ago · 38 messages
#1 Alexander Lakhin
exclusion@gmail.com

Hello hackers,

The recent buildfarm failure [1] on REL_15_STABLE with the following
diagnostics:
# Looks like your test exited with 29 just after 1.
t/024_add_drop_pub.pl ..............
Dubious, test returned 29 (wstat 7424, 0x1d00)

pgsql.build/src/test/subscription/tmp_check/log/regress_log_024_add_drop_pub
[21:01:34.406](16.501s) ok 1 - check initial data is copied to subscriber
error running SQL: 'psql:<stdin>:1: ERROR:  deadlock detected
DETAIL:  Process 219632 waits for ExclusiveLock on relation 6000 of database 0; blocked by process 218369.
Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632.
HINT:  See server log for query details.'
while running 'psql -XAtq -d port=14957 host=/home/bf/bf-build/petalura/tmp/bGI6HuRtfa dbname='postgres' -f - -v
ON_ERROR_STOP=1' with sql 'ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1' at
/home/bf/bf-build/petalura/REL_15_STABLE/pgsql.build/../pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 1951.

pgsql.build/src/test/subscription/tmp_check/log/024_add_drop_pub_subscriber.log
2025-07-01 21:01:32.682 CEST [218369][logical replication worker][3/6:0] LOG:  logical replication apply worker for
subscription "tap_sub" has started
...
2025-07-01 21:01:34.771 CEST [219632][client backend][4/14:0] LOG: statement: ALTER SUBSCRIPTION tap_sub DROP
PUBLICATION tap_pub_1
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] ERROR:  deadlock detected
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] DETAIL:  Process 219632 waits for ExclusiveLock on
relation 6000 of database 0; blocked by process 218369.
    Process 218369 waits for AccessShareLock on object 16387 of class 6100 of database 0; blocked by process 219632.
    Process 219632: ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1
    Process 218369: <command string not enabled>
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] HINT:  See server log for query details.
2025-07-01 21:01:37.355 CEST [219632][client backend][4/14:731] STATEMENT:  ALTER SUBSCRIPTION tap_sub DROP PUBLICATION
tap_pub_1

shows that the test can fail due to a deadlock while accessing
pg_replication_origin (relation 6000).

This failure can be easily reproduced with:
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
          * the specific origin and then re-check if the origin still exists.
          */
         rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+pg_usleep(300000);

Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
024_add_drop_pub.pl was added in v15).

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58

Best regards,
Alexander

#2 Ajin Cherian
itsajin@gmail.com
In reply to: Alexander Lakhin (#1)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
* the specific origin and then re-check if the origin still exists.
*/
rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+pg_usleep(300000);

Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
024_add_drop_pub.pl was added in v15).

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58

Best regards,
Alexander

Hi Alexander,

Yes, the problem can be reproduced with the change you suggested. I
will look into what is happening and how we can fix it.

regards,
Ajin Cherian
Fujitsu Australia

#3 Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#2)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Mon, Jul 7, 2025 at 8:15 PM Ajin Cherian <itsajin@gmail.com> wrote:

On Sun, Jul 6, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -428,6 +428,7 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
* the specific origin and then re-check if the origin still exists.
*/
rel = table_open(ReplicationOriginRelationId, ExclusiveLock);
+pg_usleep(300000);

Not reproduced on REL_16_STABLE (since f6c5edb8a), nor in v14- (because
024_add_drop_pub.pl was added in v15).

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=petalura&dt=2025-07-01%2018%3A00%3A58

Best regards,
Alexander

Hi Alexander,

Yes, the problem can be reproduced with the change you suggested. I
will look into what is happening and how we can fix it.

The issue appears to be a deadlock caused by inconsistent lock
acquisition order between two processes:

Process A (executing ALTER SUBSCRIPTION tap_sub DROP PUBLICATION tap_pub_1):
In AlterSubscription_refresh(), it first acquires an
AccessExclusiveLock on SubscriptionRelRelationId (resource 1), then
later tries to acquire an ExclusiveLock on ReplicationOriginRelationId
(resource 2).

Process B (apply worker):
In process_syncing_tables_for_apply(), it first acquires an
ExclusiveLock on ReplicationOriginRelationId (resource 2), then calls
UpdateSubscriptionRelState(), which tries to acquire an AccessShareLock
on SubscriptionRelRelationId (resource 1).

This leads to a deadlock:
Process A holds a lock on resource 1 and waits for resource 2, while
process B holds a lock on resource 2 and waits for resource 1.

Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

regards,
Ajin Cherian
Fujitsu Australia.

Attachments:

0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patch (application/octet-stream, +11 -1)
#4 Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#3)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:

Proposed fix:
In process_syncing_tables_for_apply(), acquire an AccessExclusiveLock
on SubscriptionRelRelationId before acquiring the lock on
ReplicationOriginRelationId.

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

While debugging this further, I found that there is another lock taken
prior to this in AlterSubscription(),
/* Lock the subscription so nobody else can do anything with it. */
LockSharedObject(SubscriptionRelationId, subid, 0, AccessExclusiveLock);

Since this is the first lock taken while altering a subscription, it
should also be taken first by the tablesync worker to avoid the
deadlock. So, I'm attaching a modified patch.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

0002-Fix-a-deadlock-during-ALTER-SUBSCRIPTION-.-DROP-PUBL.patch (application/octet-stream, +7 -1)
#5 Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#3)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

While investigating whether this problem can occur on HEAD, I found
that on HEAD it is mostly the tablesync worker that drops the origin,
and since the tablesync worker does not attempt to update the
SubscriptionRel state in that process, there does not seem to be any
possibility of a deadlock. But there is a rare situation where the
tablesync worker crashes or errors out just prior to dropping the
origin; the origin is then dropped by the apply worker (this is
explained in the comments in process_syncing_tables_for_sync()). If
the origin has to be dropped by the apply worker, the same deadlock
can happen on HEAD as well. I was able to simulate this by using an
injection point to raise an error in the tablesync worker, and a
similar deadlock then occurred on HEAD. Attaching a patch to fix this
on HEAD as well.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

HEAD-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION-.-.patch (application/octet-stream, +7 -1)
#6 vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#5)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

While investigating whether this problem can occur on HEAD, I found
that on HEAD it is mostly the tablesync worker that drops the origin,
and since the tablesync worker does not attempt to update the
SubscriptionRel state in that process, there does not seem to be any
possibility of a deadlock. But there is a rare situation where the
tablesync worker crashes or errors out just prior to dropping the
origin; the origin is then dropped by the apply worker (this is
explained in the comments in process_syncing_tables_for_sync()). If
the origin has to be dropped by the apply worker, the same deadlock
can happen on HEAD as well. I was able to simulate this by using an
injection point to raise an error in the tablesync worker, and a
similar deadlock then occurred on HEAD. Attaching a patch to fix this
on HEAD as well.

I was able to reproduce the deadlock on HEAD as well using the
attached patch, which introduces a delay in the tablesync worker
before dropping the replication origin by adding a sleep of a few
seconds. During this delay, the apply worker also attempts to drop the
replication origin. If an ALTER SUBSCRIPTION command is executed
concurrently, a deadlock frequently occurs:
2025-07-14 15:59:53.572 IST [141100] DETAIL: Process 141100 waits for
AccessExclusiveLock on object 2 of class 6000 of database 0; blocked
by process 140974.
Process 140974 waits for AccessShareLock on object 16396 of class 6100
of database 0; blocked by process 141100.
Process 141100: alter subscription sub1 drop publication pub1
Process 140974: <command string not enabled>

After applying the attached patch, create the logical replication
setup for a publication pub1 having table t1, and then run the
following commands in a loop:
alter subscription sub1 drop publication pub1;
alter subscription sub1 add publication pub1;
sleep 4

Regards,
Vignesh

Attachments:

deadlock_simulate_add_drop_pub.patch (application/octet-stream, +3 -0)
#7 vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#6)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Mon, 14 Jul 2025 at 16:15, vignesh C <vignesh21@gmail.com> wrote:

On Mon, 14 Jul 2025 at 15:46, Ajin Cherian <itsajin@gmail.com> wrote:

On Tue, Jul 8, 2025 at 8:41 PM Ajin Cherian <itsajin@gmail.com> wrote:

Patch with fix attached.
I'll continue investigating whether this issue also affects HEAD.

While investigating whether this problem can occur on HEAD, I found
that on HEAD it is mostly the tablesync worker that drops the origin,
and since the tablesync worker does not attempt to update the
SubscriptionRel state in that process, there does not seem to be any
possibility of a deadlock. But there is a rare situation where the
tablesync worker crashes or errors out just prior to dropping the
origin; the origin is then dropped by the apply worker (this is
explained in the comments in process_syncing_tables_for_sync()). If
the origin has to be dropped by the apply worker, the same deadlock
can happen on HEAD as well. I was able to simulate this by using an
injection point to raise an error in the tablesync worker, and a
similar deadlock then occurred on HEAD. Attaching a patch to fix this
on HEAD as well.

I was able to reproduce the deadlock on HEAD as well using the
attached patch, which introduces a delay in the tablesync worker
before dropping the replication origin by adding a sleep of a few
seconds. During this delay, the apply worker also attempts to drop the
replication origin. If an ALTER SUBSCRIPTION command is executed
concurrently, a deadlock frequently occurs:
2025-07-14 15:59:53.572 IST [141100] DETAIL: Process 141100 waits for
AccessExclusiveLock on object 2 of class 6000 of database 0; blocked
by process 140974.
Process 140974 waits for AccessShareLock on object 16396 of class 6100
of database 0; blocked by process 141100.
Process 141100: alter subscription sub1 drop publication pub1
Process 140974: <command string not enabled>

After applying the attached patch, create the logical replication
setup for a publication pub1 having table t1, and then run the
following commands in a loop:
alter subscription sub1 drop publication pub1;
alter subscription sub1 add publication pub1;
sleep 4

Attached is the script used to reproduce the issue and the deadlock
logs for the same. Your patch fixes the issue.
A couple of comments:
1) This change is not required:
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/usercontext.h"
+#include "utils/injection_point.h"

2) This can happen not only in the error case but also in normal
cases where the tablesync worker is slower, as shown in the
reproduction script; we can update the commit message accordingly:
In most situations the tablesync worker will drop the corresponding
origin before it finishes executing, but if an error causes the
tablesync worker to fail just prior to dropping the origin, the apply
worker will later find the origin and drop it.

Regards,
Vignesh

Attachments:

deadlock_script_log.zip (application/x-zip-compressed)
#8 Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#7)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Tue, Jul 15, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote:

A couple of comments:
1) This change is not required:
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/usercontext.h"
+#include "utils/injection_point.h"

2) This can happen not only in the error case but also in normal
cases where the tablesync worker is slower, as shown in the
reproduction script; we can update the commit message accordingly:
In most situations the tablesync worker will drop the corresponding
origin before it finishes executing, but if an error causes the
tablesync worker to fail just prior to dropping the origin, the apply
worker will later find the origin and drop it.

Thanks for the test and confirming the fix. Fixed the comments.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

HEAD-v2-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patch (application/octet-stream, +6 -1)
#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#8)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Wed, Jul 16, 2025 at 8:38 AM Ajin Cherian <itsajin@gmail.com> wrote:

Thanks for the test and confirming the fix. Fixed the comments.

         * origin. So passing missing_ok = true.
+         *
+         * Also lock SubscriptionRelationId with AccessShareLock to
+         * prevent any deadlocks with the user concurrently performing
+         * refresh on the subscription.
         */
+        LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid,
+                 0, AccessShareLock);

It seems the patch assumes that the above lock is sufficient because
AlterSubscription will take an AccessExclusiveLock on the same
subscription. So, with this proposal, if both of those become
serialized, then the other locking order may not matter. Am I correct,
or is there some other theory you have in mind?

If that is the theory, then I think we are missing cases where the
apply worker and ALTER SUBSCRIPTION operate on different
subscriptions.

Consider that AlterSubscription_refresh() first takes
AccessExclusiveLock on SubscriptionRelRelationId and then
ExclusiveLock on ReplicationOriginRelationId via
replorigin_drop_by_name(). The apply worker first takes ExclusiveLock
on ReplicationOriginRelationId via replorigin_drop_by_name() and then
RowExclusiveLock on SubscriptionRelRelationId via
UpdateSubscriptionRelState(). Won't such a scenario, taking
conflicting locks in reverse order, lead to a deadlock at least in
PG15?

--
With Regards,
Amit Kapila.

#10 Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#9)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Thu, Jul 17, 2025 at 4:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

It seems the patch assumes that the above lock is sufficient because
AlterSubscription will take an AccessExclusiveLock on the same
subscription. So, with this proposal, if both of those become
serialized, then the other locking order may not matter. Am I correct,
or is there some other theory you have in mind?

If that is the theory, then I think we are missing cases where the
apply worker and ALTER SUBSCRIPTION operate on different
subscriptions.

Consider that AlterSubscription_refresh() first takes
AccessExclusiveLock on SubscriptionRelRelationId and then
ExclusiveLock on ReplicationOriginRelationId via
replorigin_drop_by_name(). The apply worker first takes ExclusiveLock
on ReplicationOriginRelationId via replorigin_drop_by_name() and then
RowExclusiveLock on SubscriptionRelRelationId via
UpdateSubscriptionRelState(). Won't such a scenario, taking
conflicting locks in reverse order, lead to a deadlock at least in
PG15?

Yes, this is correct. I have also verified in my testing that when
there is a second subscription, a deadlock can still happen with my
previous patch. I have now modified the patch so that the tablesync
worker takes locks on both SubscriptionRelationId and
SubscriptionRelRelationId prior to taking the lock on
ReplicationOriginRelationId. I have also found a similar problem in
DropSubscription(), where the lock on SubscriptionRelRelationId is not
taken before dropping the tracking origin. I've also modified the
signature of UpdateSubscriptionRelState() to take a bool
"lock_needed"; if it is false, SubscriptionRelationId is not locked.
I've made a new version of the patch on PG_15.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

v3-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patch (application/octet-stream, +46 -10)
#11 Ajin Cherian
itsajin@gmail.com
In reply to: Ajin Cherian (#10)
Re: 024_add_drop_pub.pl might fail due to deadlock

locked. I've made a new version of the patch on PG_15.

I've made a similar fix on HEAD just so that the code is now
consistent. I don't think the similar problem (a deadlock between two
different subscriptions trying to drop the tracking origin) occurs on
HEAD with my previous patch, as the way origins are dropped is
different on HEAD. On HEAD, while dropping an origin, a
RowExclusiveLock is taken on ReplicationOriginRelationId and then an
AccessExclusiveLock is taken on the particular origin. Since the
particular origin will differ between subscriptions, a similar
deadlock will not happen. But just to keep the code consistent, I
have made a similar fix.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

HEAD-v3-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patch (application/octet-stream, +44 -9)
#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#10)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote:

Yes, this is correct. I have also verified in my testing that when
there is a second subscription, a deadlock can still happen with my
previous patch. I have now modified the patch so that the tablesync
worker takes locks on both SubscriptionRelationId and
SubscriptionRelRelationId prior to taking the lock on
ReplicationOriginRelationId. I have also found a similar problem in
DropSubscription(), where the lock on SubscriptionRelRelationId is not
taken before dropping the tracking origin. I've also modified the
signature of UpdateSubscriptionRelState() to take a bool
"lock_needed"; if it is false, SubscriptionRelationId is not locked.
I've made a new version of the patch on PG_15.

Review comments:
================
1.
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

Why did you acquire AccessExclusiveLock here when the current code has
RowExclusiveLock? It should be RowExclusiveLock.

2.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take AccessExclusiveLock on SubscriptionRelRelationId to
+ * prevent any deadlocks with user concurrently performing
+ * refresh on the subscription.
  */

Try to say in the comments that we want to keep the locking order the
same as in the DDL commands, to prevent deadlocks.

3.
+ * Also close any tables prior to the commit.
  */
+ if (rel)
+ {
+ table_close(rel, AccessExclusiveLock);
+ rel = NULL;

We don't need to explicitly release the lock at table_close; it will
be released at transaction end, so use NoLock here, as we do in the
current HEAD code.

4.
DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
 {
- Relation rel;
+ Relation rel, sub_rel;
  ObjectAddress myself;
  HeapTuple tup;
  Oid subid;
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt,
bool isTopLevel)
  * Note that the state can't change because we have already stopped both
  * the apply and tablesync workers and they can't restart because of
  * exclusive lock on the subscription.
+ *
+ * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+ * deadlock with apply workers of other subscriptions trying
+ * to drop tracking origin.
  */
+ sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
in DropSubscription. Kindly test again after fixing the first comment
above.
--
With Regards,
Amit Kapila.

#13 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Ajin Cherian (#10)
RE: 024_add_drop_pub.pl might fail due to deadlock

Dear Ajin,

Thanks for the patch. First, let me confirm my understanding. While
altering the subscription, locks are acquired in the following order:

Order  Target                      Level
1      pg_subscription             row exclusive
2      pg_subscription, my tuple   access exclusive
3      pg_subscription_rel         access exclusive
4      pg_replication_origin       exclusive

In contrast, the apply worker works like this:

Order  Target                      Level
1      pg_replication_origin       exclusive
2      pg_subscription, my tuple   access share
3      pg_subscription_rel         row exclusive

Thus a backend may wait at step 4, while the apply worker can get
stuck at step 2 or 3.

Below are my comments:

```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
```
This change is not needed.

```
+                               if (!rel)
+                                       rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
```

I feel it is too strong; isn't it enough to use row exclusive, as initially used?

```
 UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
-                                                  XLogRecPtr sublsn)
+                                                  XLogRecPtr sublsn, bool lock_needed)
```

I feel the name `lock_needed` is a bit misleading, because the function still opens
the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?

```
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
         * Note that the state can't change because we have already stopped both
         * the apply and tablesync workers and they can't restart because of
         * exclusive lock on the subscription.
+        *
+        * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+        * deadlock with apply workers of other subscriptions trying
+        * to drop tracking origin.
         */
+       sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
```

Hmm. Per my understanding, DropSubscription acquires the following
locks before it reaches replorigin_drop_by_name():

Order  Target                      Level
1      pg_subscription             access exclusive
2      pg_subscription, my tuple   access exclusive
3      pg_replication_origin       exclusive

IIUC we must preserve the ordering, not the targets, of the locks.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#14 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#13)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Wed, Jul 23, 2025 at 2:42 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Ajin,

Thanks for the patch. First, let me confirm my understanding. While
altering the subscription, locks are acquired in the following order:

Order  Target                      Level
1      pg_subscription             row exclusive
2      pg_subscription, my tuple   access exclusive
3      pg_subscription_rel         access exclusive
4      pg_replication_origin       exclusive

In contrast, the apply worker works like this:

Order  Target                      Level
1      pg_replication_origin       exclusive
2      pg_subscription, my tuple   access share
3      pg_subscription_rel         row exclusive

Thus a backend may wait at step 4, while the apply worker can get
stuck at step 2 or 3.

Below are my comments:

```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
```
This change is not needed.

```
+                               if (!rel)
+                                       rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
```

I feel it is too strong; isn't it enough to use row exclusive, as initially used?

```
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
-                                                  XLogRecPtr sublsn)
+                                                  XLogRecPtr sublsn, bool lock_needed)
```

I feel the name `lock_needed` is a bit misleading, because the function still opens
the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?

I think if we lock in the caller, we don't need to take any lock
during table_open. We can name the parameter already_locked, as we do
at some other places in the code.

--
With Regards,
Amit Kapila.

#15 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#13)
RE: 024_add_drop_pub.pl might fail due to deadlock

Dear Ajin,

Thanks for the patch. Firstly let me confirm my understanding. While altering the
subscription, locks are acquired with below ordering:

I forgot to confirm one point: to which branches should this be
backpatched? Initially it was reported only on PG15 [1], but I found
that 021_alter_sub_pub could fail on PG14 as well. PG13 may not be
affected, because the replication origin seemed not to be used for
table sync there; it was introduced by commit ce0fdbfe97.

[1]: /messages/by-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#16 Ajin Cherian
itsajin@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#13)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Wed, Jul 23, 2025 at 4:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jul 21, 2025 at 6:59 PM Ajin Cherian <itsajin@gmail.com> wrote:

Yes, this is correct. I have also verified in my testing that when
there is a second subscription, a deadlock can still happen with my
previous patch. I have now modified the patch so that the tablesync
worker takes locks on both SubscriptionRelationId and
SubscriptionRelRelationId prior to taking the lock on
ReplicationOriginRelationId. I have also found a similar problem in
DropSubscription(), where the lock on SubscriptionRelRelationId is not
taken before dropping the tracking origin. I've also modified the
signature of UpdateSubscriptionRelState() to take a bool
"lock_needed"; if it is false, SubscriptionRelationId is not locked.
I've made a new version of the patch on PG_15.

Review comments:
================
1.
+ if (!rel)
+ rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

Why did you acquire AccessExclusiveLock here when the current code has
RowExclusiveLock? It should be RowExclusiveLock.

Yes, you are correct. I have replaced it with RowExclusiveLock.

2.
+ *
+ * Lock SubscriptionRelationId with AccessShareLock and
+ * take AccessExclusiveLock on SubscriptionRelRelationId to
+ * prevent any deadlocks with user concurrently performing
+ * refresh on the subscription.
*/

Try to say in the comments that we want to keep the locking order the
same as in the DDL commands, to prevent deadlocks.

Modified.

3.
+ * Also close any tables prior to the commit.
*/
+ if (rel)
+ {
+ table_close(rel, AccessExclusiveLock);
+ rel = NULL;

We don't need to explicitly release the lock at table_close; it will
be released at transaction end, so use NoLock here, as we do in the
current HEAD code.

Done.

4.
DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
{
- Relation rel;
+ Relation rel, sub_rel;
ObjectAddress myself;
HeapTuple tup;
Oid subid;
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt,
bool isTopLevel)
* Note that the state can't change because we have already stopped both
* the apply and tablesync workers and they can't restart because of
* exclusive lock on the subscription.
+ *
+ * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+ * deadlock with apply workers of other subscriptions trying
+ * to drop tracking origin.
*/
+ sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);

I don't think we need AccessExclusiveLock on SubscriptionRelRelationId
in DropSubscription. Kindly test again after fixing the first comment
above.
--

Yes, it was failing because I was taking AccessExclusiveLock in the
apply worker, and that was causing the deadlock in my testing. I have
tested that this now works.

On Wed, Jul 23, 2025 at 7:12 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Ajin,

Thanks for the patch. First, let me confirm my understanding. While
altering the subscription, locks are acquired in the following order:

Order  Target                      Level
1      pg_subscription             row exclusive
2      pg_subscription, my tuple   access exclusive
3      pg_subscription_rel         access exclusive
4      pg_replication_origin       exclusive

In contrast, the apply worker works like this:

Order  Target                      Level
1      pg_replication_origin       exclusive
2      pg_subscription, my tuple   access share
3      pg_subscription_rel         row exclusive

Thus a backend may wait at step 4, while the apply worker can get
stuck at step 2 or 3.

Yes, that is correct.

Below are my comments:

```
@@ -340,7 +341,6 @@ process_syncing_tables_for_sync(XLogRecPtr current_lsn)
* is dropped. So passing missing_ok = false.
*/
ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, syncslotname, false);
-
```
This change is not needed.

Removed.

```
+                               if (!rel)
+                                       rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
+
```

I feel it is too strong; isn't it enough to use row exclusive, as initially used?

Yes, agreed. Fixed.

```
UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
-                                                  XLogRecPtr sublsn)
+                                                  XLogRecPtr sublsn, bool lock_needed)
```

I feel the name `lock_needed` is a bit misleading, because the function still opens
the pg_subscription_rel catalog with row exclusive. How about "lock_shared_object"?

I have modified it to not take the lock during table_open as well, and
changed the name of the variable to already_locked.

```
@@ -1460,7 +1460,13 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
* Note that the state can't change because we have already stopped both
* the apply and tablesync workers and they can't restart because of
* exclusive lock on the subscription.
+        *
+        * Lock pg_subscription_rel with AccessExclusiveLock to prevent any
+        * deadlock with apply workers of other subscriptions trying
+        * to drop tracking origin.
*/
+       sub_rel = table_open(SubscriptionRelRelationId, AccessExclusiveLock);
```

Hmm. Per my understanding, DropSubscription acquires the below locks until it reaches
replorigin_drop_by_name().

Order  Target                     Level
1      pg_subscription            access exclusive
2      pg_subscription, my tuple  access exclusive
3      pg_replication_origin      exclusive

IIUC we must preserve the ordering, not the target of locks.

I have removed this change as this does not now conflict with the apply worker.

Two patches attached. One for HEAD and the other for PG_15.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

HEAD-v4-0001-Fix-a-possible-deadlock-during-ALTER-SUBSCRIPTION.patch (+40 -10)
PG_15-v4-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patch (+40 -11)
#17Ajin Cherian
itsajin@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#15)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Ajin,

Thanks for the patch. First, let me confirm my understanding. While altering the
subscription, locks are acquired in the following order:

I forgot to confirm one point. To which branches should this be backpatched? Initially
it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14.

Yes, here's a patch for PG14 as well, based on REL_14_STABLE.

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

PG_14-v4-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patch (+40 -11)
#18Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Ajin Cherian (#17)
RE: 024_add_drop_pub.pl might fail due to deadlock

Dear Ajin,

Thanks for the patches! While checking, I recalled the backpatch policy [1].
We must not modify the signatures of exported functions on back branches, but this
patch does. Can you define another function like UpdateSubscriptionRelStateEx or
something on the back branches?
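The suggested back-branch pattern can be sketched as follows (a Python stand-in with hypothetical names; the real functions are C and take the actual catalog arguments): the existing entry point keeps its exact signature and forwards to a new extended variant that carries the extra parameter.

```python
# Illustrative sketch (not PostgreSQL source) of the ABI-preserving
# pattern: the old function is unchanged for existing callers, and the
# new "Ex" variant carries the already_locked flag. The returned action
# list merely models the behavior for demonstration.

def update_subscription_rel_state_ex(subid, relid, state, already_locked):
    # Extended implementation: skip taking the catalog lock when the
    # caller already holds it.
    actions = []
    if not already_locked:
        actions.append("lock pg_subscription_rel")
    actions.append(f"set state of ({subid},{relid}) to {state}")
    return actions

def update_subscription_rel_state(subid, relid, state):
    # Old entry point, kept intact for ABI compatibility on back
    # branches: behaves exactly as before by always taking the lock.
    return update_subscription_rel_state_ex(subid, relid, state,
                                            already_locked=False)
```

This keeps extensions compiled against the old signature working, while in-core callers can switch to the extended variant.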

Another comment:
```
@@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];

-	LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
-	rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+	if (already_locked)
+		rel = table_open(SubscriptionRelRelationId, NoLock);
```

Can we assert that RowExclusiveLock on pg_subscription_rel has already been
acquired, using the CheckRelationOidLockedByMe() family?

Also, I'm now a bit confused about which branches are really affected. Can you
create patches for all branches, attach them to the same e-mail, and add a summary
of what you want to fix?
E.g., you provided patches for HEAD/PG15/PG14; what about PG18, 17, 16 and 13?
If not needed, why?

[1]: https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-API-ABI-STABILITY-GUIDANCE

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#19vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#17)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Thu, 24 Jul 2025 at 17:45, Ajin Cherian <itsajin@gmail.com> wrote:

On Wed, Jul 23, 2025 at 8:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Ajin,

Thanks for the patch. First, let me confirm my understanding. While altering the
subscription, locks are acquired in the following order:

I forgot to confirm one point. To which branches should this be backpatched? Initially
it was reported only on PG15 [1], but I found 021_alter_sub_pub could fail on PG14.

Yes, here's a patch for PG14 as well, based on REL_14_STABLE.

I believe the patch is trying to address the following reported issues:
1) 024_add_drop_pub.pl test failure reported on REL_16_STABLE at [1]
2) A different variation of the above issue on HEAD, with the script
attached at [2]
3) A different variant reported by Amit for PG15, with the patch at [3]

I felt these issues are not applicable to the PG13 branch, as replication
origin creation for table sync does not exist in PG13. So the fix is
required on the branches from master down to PG14.

The patch does not apply on the PG16 branch.

In PG15 you have the following code:
+       /* Close table if opened */
+       if (rel)
+       {
+               table_close(rel, NoLock);
+       }
In the master branch you have the following code:
+       /* Close table if opened */
+       if (rel)
+               table_close(rel, NoLock);
+
+

We can keep the fix consistent in both cases; the additional newlines are
not required in the master branch.

[1]: /messages/by-id/bab95e12-6cc5-4ebb-80a8-3e41956aa297@gmail.com
[2]: /messages/by-id/CALDaNm3PrTkVc2uxMyQTkqw0sg7O6i0EXe1jJo9CzOyW2gFS+Q@mail.gmail.com
[3]: /messages/by-id/CAA4eK1KPa1dJrcd=XfOWx-r37eZudKQRqct0tY1R7vnUw0OabQ@mail.gmail.com

Regards,
Vignesh

#20Ajin Cherian
itsajin@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#18)
Re: 024_add_drop_pub.pl might fail due to deadlock

On Fri, Jul 25, 2025 at 6:01 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Ajin,

Thanks for the patches! While checking, I recalled the backpatch policy [1].
We must not modify the signatures of exported functions on back branches, but this
patch does. Can you define another function like UpdateSubscriptionRelStateEx or
something on the back branches?

Another comment:
```
@@ -328,9 +328,13 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char state,
Datum values[Natts_pg_subscription_rel];
bool replaces[Natts_pg_subscription_rel];

-       LockSharedObject(SubscriptionRelationId, subid, 0, AccessShareLock);
-
-       rel = table_open(SubscriptionRelRelationId, RowExclusiveLock);
+       if (already_locked)
+               rel = table_open(SubscriptionRelRelationId, NoLock);
```

Can we assert that RowExclusiveLock on pg_subscription_rel has already been
acquired, using the CheckRelationOidLockedByMe() family?

Thanks for your review, Kuroda-san. I have changed the logic to not use
already_locked and instead to check, inside UpdateSubscriptionRelState
itself, whether the locks are already taken. I've tested this and it
works fine. If this logic is acceptable to the reviewers, I can update
the other patches in a similar way.
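That check-then-lock approach can be sketched in miniature (Python stand-in; threading.Lock plays the role of the heavyweight lock manager, and all names here are illustrative): the function asks whether the current thread already holds the lock and acquires it only if not, so callers no longer need an already_locked flag.

```python
import threading

class TrackedLock:
    """A lock that remembers its owner, standing in for the server's
    ability to ask "do I already hold this lock?" (the role played by
    CheckRelationOidLockedByMe() in the discussion above)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def held_by_me(self):
        return self._owner == threading.get_ident()

    def acquire(self):
        self._lock.acquire()
        self._owner = threading.get_ident()

    def release(self):
        self._owner = None
        self._lock.release()

catalog_lock = TrackedLock()  # stand-in for the pg_subscription_rel lock

def update_rel_state(state, applied):
    # Acquire the catalog lock only if this caller does not already hold
    # it, and release it only if we acquired it here.
    acquired = False
    if not catalog_lock.held_by_me():
        catalog_lock.acquire()
        acquired = True
    applied.append(state)  # stand-in for the catalog update
    if acquired:
        catalog_lock.release()
```

Calling update_rel_state behaves the same whether or not the caller already holds catalog_lock, which is the property the revised patch relies on.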

regards,
Ajin Cherian
Fujitsu Australia

Attachments:

PG_15-v5-0001-Fix-a-deadlock-during-ALTER-SUBSCRIPTION.-DROP-PU.patch (+42 -3)
#21Ajin Cherian
itsajin@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#18)
#22Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Ajin Cherian (#20)
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#22)
#24Ajin Cherian
itsajin@gmail.com
In reply to: Amit Kapila (#23)
#25vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#24)
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#25)
#27Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Ajin Cherian (#24)
#28Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#27)
#29Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#28)
#30vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#24)
#31Ajin Cherian
itsajin@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#29)
#32Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Ajin Cherian (#31)
#33Ajin Cherian
itsajin@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#32)
#34Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Ajin Cherian (#33)
#35vignesh C
vignesh21@gmail.com
In reply to: Ajin Cherian (#33)
#36Amit Kapila
amit.kapila16@gmail.com
In reply to: vignesh C (#35)
#37Ajin Cherian
itsajin@gmail.com
In reply to: vignesh C (#35)
#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Ajin Cherian (#37)