BUG #18988: DROP SUBSCRIPTION locks not-yet-accessed database
The following bug has been logged on the website:
Bug reference: 18988
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 18beta1
Operating system: Ubuntu 24.04
Description:
The following script:
createdb ndb
echo "
CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH
(connect = false);
" | psql
echo "
DROP SUBSCRIPTION testsub
" | psql &
sleep 1
timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT"
makes DROP SUBSCRIPTION stuck on waiting for a connection to drop a slot,
while this connection is waiting for a lock for relation 6100
(pg_subscription), locked by DROP SUBSCRIPTION:
law 1545967 1545946 0 21:10 ? 00:00:00 postgres: law regression
[local] DROP SUBSCRIPTION
law 1545968 1545946 0 21:10 ? 00:00:00 postgres: walsender law
ndb [local] startup waiting
With debug_discard_caches = 1 or under some lucky circumstances (I
encountered these), this leads to inability to connect to any database.
Reproduced on REL_13_STABLE .. master.
On Thu, 17 Jul 2025 at 00:36, PG Bug reporting form
<noreply@postgresql.org> wrote:
The following bug has been logged on the website:
Bug reference: 18988
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 18beta1
Operating system: Ubuntu 24.04
Description:The following script:
createdb ndb
echo "
CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH
(connect = false);
" | psqlecho "
DROP SUBSCRIPTION testsub
" | psql &
sleep 1
timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT"makes DROP SUBSCRIPTION stuck on waiting for a connection to drop a slot,
while this connection is waiting for a lock for relation 6100
(pg_subscription), locked by DROP SUBSCRIPTION:
law 1545967 1545946 0 21:10 ? 00:00:00 postgres: law regression
[local] DROP SUBSCRIPTION
law 1545968 1545946 0 21:10 ? 00:00:00 postgres: walsender law
ndb [local] startup waitingWith debug_discard_caches = 1 or under some lucky circumstances (I
encountered these), this leads to inability to connect to any database.Reproduced on REL_13_STABLE .. master.
Thanks, I was able to reproduce the issue using the steps provided.
The problem occurs because: When dropping a subscription, it takes an
AccessExclusiveLock on the pg_subscription system tables to prevent
the launcher from restarting the worker. During this process, it also
attempts to connect to the publisher in order to drop the replication
slot. As we are connecting to a newly created database, it may not yet
have initialized its catalog caches. As part of the backend startup,
it attempts to build the cache hierarchy via:
RelationCacheInitializePhase3 → InitCatalogCachePhase2 →
InitCatCachePhase2 This cache initialization requires acquiring a
shared lock on pg_subscription, since it is one of the syscache-backed
catalog tables. But that shared lock is blocked by the
AccessExclusiveLock already held by the dropping process. As a result,
the new backend hangs waiting for the lock, and the original DROP
SUBSCRIPTION process cannot proceed, leading to a self-blocking
scenario.
In this specific case, no replication slot was created during
subscription creation as the connect option was specified as false.
Therefore, I believe the system should skip connecting to the
publisher when dropping the subscription. I've attached a patch that
addresses this behavior. Thoughts?
Regards,
Vignesh
On Wed, Jul 30, 2025 at 4:24 PM vignesh C <vignesh21@gmail.com> wrote:
On Thu, 17 Jul 2025 at 00:36, PG Bug reporting form
<noreply@postgresql.org> wrote:The following bug has been logged on the website:
Bug reference: 18988
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 18beta1
Operating system: Ubuntu 24.04
Description:The following script:
createdb ndb
echo "
CREATE SUBSCRIPTION testsub CONNECTION 'dbname=ndb' PUBLICATION testpub WITH
(connect = false);
" | psqlecho "
DROP SUBSCRIPTION testsub
" | psql &
sleep 1
timeout 30 psql ndb -c "SELECT 1" || echo "TIMEOUT"makes DROP SUBSCRIPTION stuck on waiting for a connection to drop a slot,
while this connection is waiting for a lock for relation 6100
(pg_subscription), locked by DROP SUBSCRIPTION:
law 1545967 1545946 0 21:10 ? 00:00:00 postgres: law regression
[local] DROP SUBSCRIPTION
law 1545968 1545946 0 21:10 ? 00:00:00 postgres: walsender law
ndb [local] startup waitingWith debug_discard_caches = 1 or under some lucky circumstances (I
encountered these), this leads to inability to connect to any database.Reproduced on REL_13_STABLE .. master.
Thanks, I was able to reproduce the issue using the steps provided.
The problem occurs because: When dropping a subscription, it takes an
AccessExclusiveLock on the pg_subscription system tables to prevent
the launcher from restarting the worker. During this process, it also
attempts to connect to the publisher in order to drop the replication
slot. As we are connecting to a newly created database, it may not yet
have initialized its catalog caches. As part of the backend startup,
it attempts to build the cache hierarchy via:
RelationCacheInitializePhase3 → InitCatalogCachePhase2 →
InitCatCachePhase2 This cache initialization requires acquiring a
shared lock on pg_subscription, since it is one of the syscache-backed
catalog tables. But that shared lock is blocked by the
AccessExclusiveLock already held by the dropping process. As a result,
the new backend hangs waiting for the lock, and the original DROP
SUBSCRIPTION process cannot proceed, leading to a self-blocking
scenario.In this specific case, no replication slot was created during
subscription creation as the connect option was specified as false.
Therefore, I believe the system should skip connecting to the
publisher when dropping the subscription. I've attached a patch that
addresses this behavior. Thoughts?
I think this fix looks correct, while testing I realize that now we
need an additional step before we can enable the subscription, so I
think we should put that additional step in the error hint, as
attached, this top up patch can be applied on top of your patch.
--
Regards,
Dilip Kumar
Google
Attachments:
additinal_errhint.patchapplication/octet-stream; name=additinal_errhint.patchDownload+13-13
Dear Vignesh, Dilip,
I found another corner case:
```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false, slot_name=sub);
WARNING: subscription was created, but is not connected
HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```
Because still can explicitly specify the slot_name while creating the subscription.
Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the
CREATE SUBSCRIPTION WITH (connect=false);.
Should we fix the case? If so, how?
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thu, Jul 31, 2025 at 2:34 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Vignesh, Dilip,
I found another corner case:
```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false, slot_name=sub);
WARNING: subscription was created, but is not connected
HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```Because still can explicitly specify the slot_name while creating the subscription.
Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the
CREATE SUBSCRIPTION WITH (connect=false);.Should we fix the case? If so, how?
IMHO, what we should do is to set a flag in subscription that whether
the connect is true or not, and drop subscription should not try to
drop the slot if the connect is false, thoughts?
--
Regards,
Dilip Kumar
Google
On Thu, Jul 31, 2025 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jul 31, 2025 at 2:34 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:Dear Vignesh, Dilip,
I found another corner case:
```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false, slot_name=sub);
WARNING: subscription was created, but is not connected
HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```Because still can explicitly specify the slot_name while creating the subscription.
Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the
CREATE SUBSCRIPTION WITH (connect=false);.Should we fix the case? If so, how?
IMHO, what we should do is to set a flag in subscription that whether
the connect is true or not, and drop subscription should not try to
drop the slot if the connect is false, thoughts?
Maybe not, since this exists in previous branches as well and we
cannot alter the system catalog in back branches. Let me put more
thoughts on this.
--
Regards,
Dilip Kumar
Google
On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Vignesh, Dilip,
I found another corner case:
```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false, slot_name=sub);
WARNING: subscription was created, but is not connected
HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```Because still can explicitly specify the slot_name while creating the subscription.
Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the
CREATE SUBSCRIPTION WITH (connect=false);.Should we fix the case? If so, how?
An alternative approach would be to acquire an AccessShareLock on
pg_subscription, and then explicitly lock the target subscription row
using LockSharedObject with AccessExclusiveLock while dropping the
subscription. On the launcher side, before starting a worker, it
should similarly acquire an AccessExclusiveLock on the corresponding
subscription row using LockSharedObject. Once the lock is acquired, it
must revalidate that the subscription still exists, to prevent race
conditions with concurrent drops, before proceeding to start the
worker.
Thoughts?
Regards,
Vignesh
On Mon, Aug 4, 2025 at 9:07 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:Dear Vignesh, Dilip,
I found another corner case:
```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false, slot_name=sub);
WARNING: subscription was created, but is not connected
HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```Because still can explicitly specify the slot_name while creating the subscription.
Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the
CREATE SUBSCRIPTION WITH (connect=false);.Should we fix the case? If so, how?
An alternative approach would be to acquire an AccessShareLock on
pg_subscription, and then explicitly lock the target subscription row
using LockSharedObject with AccessExclusiveLock while dropping the
subscription. On the launcher side, before starting a worker, it
should similarly acquire an AccessExclusiveLock on the corresponding
subscription row using LockSharedObject. Once the lock is acquired, it
must revalidate that the subscription still exists, to prevent race
conditions with concurrent drops, before proceeding to start the
worker.
Thanks for the idea, so currently during drop subscription we are
already doing LockSharedObject on the subscription in
AccessExclusiveLock. So now we should try to acquire
AccessExclusiveLock on the launcher side and then revalidate the
object existence, let me try this and post a patch in a while?
--
Regards,
Dilip Kumar
Google
On Mon, Aug 4, 2025 at 9:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Aug 4, 2025 at 9:07 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, 31 Jul 2025 at 14:34, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:Dear Vignesh, Dilip,
I found another corner case:
```
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=db1 user=postgres port=5431' PUBLICATION pub1 WITH (connect=false, slot_name=sub);
WARNING: subscription was created, but is not connected
HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION
postgres=# DROP SUBSCRIPTION sub ;
... (won't return)
```Because still can explicitly specify the slot_name while creating the subscription.
Another pattern is to run ALTER SUBSCRIPTION SET (slot_name) command after the
CREATE SUBSCRIPTION WITH (connect=false);.Should we fix the case? If so, how?
An alternative approach would be to acquire an AccessShareLock on
pg_subscription, and then explicitly lock the target subscription row
using LockSharedObject with AccessExclusiveLock while dropping the
subscription. On the launcher side, before starting a worker, it
should similarly acquire an AccessExclusiveLock on the corresponding
subscription row using LockSharedObject. Once the lock is acquired, it
must revalidate that the subscription still exists, to prevent race
conditions with concurrent drops, before proceeding to start the
worker.Thanks for the idea, so currently during drop subscription we are
already doing LockSharedObject on the subscription in
AccessExclusiveLock. So now we should try to acquire
AccessExclusiveLock on the launcher side and then revalidate the
object existence, let me try this and post a patch in a while?
I have worked on this and produced a first version of patch, let's see
what others think about this idea. It would have been better if we
could use SysCache for rechecking the subscription, but since we are
not connected to the database in the launcher we can not use the
SysCache, at least that's what I think.
--
Regards,
Dilip Kumar
Google
Attachments:
fix_drop_subscription_hang.patchapplication/octet-stream; name=fix_drop_subscription_hang.patchDownload+54-1
On 2025-Aug-04, Dilip Kumar wrote:
I have worked on this and produced a first version of patch, let's see
what others think about this idea. It would have been better if we
could use SysCache for rechecking the subscription, but since we are
not connected to the database in the launcher we can not use the
SysCache, at least that's what I think.
I think it's reasonable to recheck after locking. There's a comment in
DropSubscription that says we get AEL, which is no longer true. In
is_subscription_exists() you should use the index on OID instead of
seqscanning the catalog without a scankey; also I think the name ought
to be "does" rather than "is". I think it's really odd that that
function opens and closes a transaction; sounds to me that something
like that really belongs in the caller (frankly the same is true with
the other function that your comment references). Why isn't
systable_beginscan being used to scan the catalog?
I think with this coding, the resource owner for this new lock is NULL.
Is this really a good approach? Maybe there should be a resowner here.
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine." (Julien PUYDT)
On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Aug-04, Dilip Kumar wrote:
I have worked on this and produced a first version of patch, let's see
what others think about this idea. It would have been better if we
could use SysCache for rechecking the subscription, but since we are
not connected to the database in the launcher we can not use the
SysCache, at least that's what I think.I think it's reasonable to recheck after locking. There's a comment in
DropSubscription that says we get AEL, which is no longer true.
Right, will remove that.
In
is_subscription_exists() you should use the index on OID instead of
seqscanning the catalog without a scankey;
I thought since launcher is not connected to the database we will not
be able to open the index relation. Otherwise we may just call
SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); Maybe this
is not connected because it was not required so far and we can just
connect it to template1 ?
also I think the name ought
to be "does" rather than "is".
Okay
I think it's really odd that that
function opens and closes a transaction; sounds to me that something
like that really belongs in the caller (frankly the same is true with
the other function that your comment references). Why isn't
systable_beginscan being used to scan the catalog?
You mean for this function or for get_subscription_list() as well,
yeah logically systable_beginscan() sounds better.
I think with this coding, the resource owner for this new lock is NULL.
Is this really a good approach? Maybe there should be a resowner here.
As you suggested we should move the transaction to the caller and
start it before LockSharedObject() so that we will acquire the lock
under the TopTransactionResourceOwner ?
--
Regards,
Dilip Kumar
Google
On Mon, Aug 4, 2025 at 7:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Aug-04, Dilip Kumar wrote:
I have worked on this and produced a first version of patch, let's see
what others think about this idea. It would have been better if we
could use SysCache for rechecking the subscription, but since we are
not connected to the database in the launcher we can not use the
SysCache, at least that's what I think.I think it's reasonable to recheck after locking. There's a comment in
DropSubscription that says we get AEL, which is no longer true.Right, will remove that.
In
is_subscription_exists() you should use the index on OID instead of
seqscanning the catalog without a scankey;I thought since launcher is not connected to the database we will not
be able to open the index relation. Otherwise we may just call
SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); Maybe this
is not connected because it was not required so far and we can just
connect it to template1 ?also I think the name ought
to be "does" rather than "is".
Okay
I think it's really odd that that
function opens and closes a transaction; sounds to me that something
like that really belongs in the caller (frankly the same is true with
the other function that your comment references). Why isn't
systable_beginscan being used to scan the catalog?You mean for this function or for get_subscription_list() as well,
yeah logically systable_beginscan() sounds better.I think with this coding, the resource owner for this new lock is NULL.
Is this really a good approach? Maybe there should be a resowner here.As you suggested we should move the transaction to the caller and
start it before LockSharedObject() so that we will acquire the lock
under the TopTransactionResourceOwner ?
Here is revised version based on what I proposed here
- I have removed the comment in DropSubscription where we acquire the
lock, as mentioning the ASL is not interesting anymore, instead I am
explaining in launcher why we are acquiring shared object lock.
- Connected launcher to "postgres" database so that we can do syscache lookup
- got rid of "is_subscription_exists" as we are directly validating
using syscache lookup
- Didn't do anything with existing function i.e. get_subscription_list
--
Regards,
Dilip Kumar
Google
Attachments:
v1-0001-Fix-drop-subcription-deadlock-with-create-databas.patchapplication/octet-stream; name=v1-0001-Fix-drop-subcription-deadlock-with-create-databas.patchDownload+28-11
On Tue, Aug 5, 2025 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Aug 4, 2025 at 7:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Aug 4, 2025 at 6:20 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Aug-04, Dilip Kumar wrote:
I have worked on this and produced a first version of patch, let's see
what others think about this idea. It would have been better if we
could use SysCache for rechecking the subscription, but since we are
not connected to the database in the launcher we can not use the
SysCache, at least that's what I think.I think it's reasonable to recheck after locking. There's a comment in
DropSubscription that says we get AEL, which is no longer true.Right, will remove that.
In
is_subscription_exists() you should use the index on OID instead of
seqscanning the catalog without a scankey;I thought since launcher is not connected to the database we will not
be able to open the index relation. Otherwise we may just call
SearchSysCache1(SUBSCRIPTIONOID, ObjectIdGetDatum(subid)); Maybe this
is not connected because it was not required so far and we can just
connect it to template1 ?also I think the name ought
to be "does" rather than "is".
Okay
I think it's really odd that that
function opens and closes a transaction; sounds to me that something
like that really belongs in the caller (frankly the same is true with
the other function that your comment references). Why isn't
systable_beginscan being used to scan the catalog?You mean for this function or for get_subscription_list() as well,
yeah logically systable_beginscan() sounds better.I think with this coding, the resource owner for this new lock is NULL.
Is this really a good approach? Maybe there should be a resowner here.As you suggested we should move the transaction to the caller and
start it before LockSharedObject() so that we will acquire the lock
under the TopTransactionResourceOwner ?Here is revised version based on what I proposed here
- I have removed the comment in DropSubscription where we acquire the
lock, as mentioning the ASL is not interesting anymore, instead I am
explaining in launcher why we are acquiring shared object lock.
- Connected launcher to "postgres" database so that we can do syscache lookup
Won't that add a new restriction that one can't drop postgres database
after this?
BTW, isn't it better to acquire the share_lock on subscription during
worker initialization (InitializeLogRepWorker()) where we already
checking whether the subscription is dropped by that time? I think if
we don't acquire the lock on subscription during launcher or during
InitializeLogRepWorker(), there is a risk that DropSubscription would
have reached the place where it would have tried to stop the workers
and launcher will launch the worker after that point. Then there is a
possibility of dangling worker because the GetSubscription() can still
return valid value as the transaction in which we are dropping a
subscription could still be in-progress.
--
With Regards,
Amit Kapila.
On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Won't that add a new restriction that one can't drop postgres database
after this?
That's correct, so maybe this is not acceptable IMHO.
BTW, isn't it better to acquire the share_lock on subscription during
worker initialization (InitializeLogRepWorker()) where we already
checking whether the subscription is dropped by that time? I think if
we don't acquire the lock on subscription during launcher or during
InitializeLogRepWorker(), there is a risk that DropSubscription would
have reached the place where it would have tried to stop the workers
and launcher will launch the worker after that point. Then there is a
possibility of dangling worker because the GetSubscription() can still
return valid value as the transaction in which we are dropping a
subscription could still be in-progress.
Here is my thought on these 2 approaches, so if we lock in launcher
and check there we avoid launching the worker altogether but for that
we will have to revalidate the subscription using sequence scan on
pg_subcription as we can not open additional indexes as we are not
connected to any database OTOH if we acquire lock in
InitializeLogRepWorker() we don't need to do any additional scan but
we will have to launch the worker and after that if subscription
recheck shows its dropped we will exit the worker.
I think either way it's fine because this is not a very common
performance path, I prefer what you proposed as we will have to add
less code in this case, because we are already checking the
subscription and just need to acquire the shared object lock in
InitializeLogRepWorker(), lets see what other thinks, meanwhile I will
modify the code with this approach as well.
--
Regards,
Dilip Kumar
Google
On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Won't that add a new restriction that one can't drop postgres database
after this?That's correct, so maybe this is not acceptable IMHO.
BTW, isn't it better to acquire the share_lock on subscription during
worker initialization (InitializeLogRepWorker()) where we already
checking whether the subscription is dropped by that time? I think if
we don't acquire the lock on subscription during launcher or during
InitializeLogRepWorker(), there is a risk that DropSubscription would
have reached the place where it would have tried to stop the workers
and launcher will launch the worker after that point. Then there is a
possibility of dangling worker because the GetSubscription() can still
return valid value as the transaction in which we are dropping a
subscription could still be in-progress.Here is my thought on these 2 approaches, so if we lock in launcher
and check there we avoid launching the worker altogether but for that
we will have to revalidate the subscription using sequence scan on
pg_subcription as we can not open additional indexes as we are not
connected to any database OTOH if we acquire lock in
InitializeLogRepWorker() we don't need to do any additional scan but
we will have to launch the worker and after that if subscription
recheck shows its dropped we will exit the worker.I think either way it's fine because this is not a very common
performance path, I prefer what you proposed as we will have to add
less code in this case, because we are already checking the
subscription and just need to acquire the shared object lock in
InitializeLogRepWorker(), lets see what other thinks, meanwhile I will
modify the code with this approach as well.
Here is a patch with a different approach. IMHO with this approach
the code change is much simpler.
--
Regards,
Dilip Kumar
Google
Attachments:
v2-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patchapplication/octet-stream; name=v2-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patchDownload+13-6
On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Won't that add a new restriction that one can't drop postgres database
after this?That's correct, so maybe this is not acceptable IMHO.
BTW, isn't it better to acquire the share_lock on subscription during
worker initialization (InitializeLogRepWorker()) where we already
checking whether the subscription is dropped by that time? I think if
we don't acquire the lock on subscription during launcher or during
InitializeLogRepWorker(), there is a risk that DropSubscription would
have reached the place where it would have tried to stop the workers
and launcher will launch the worker after that point. Then there is a
possibility of dangling worker because the GetSubscription() can still
return valid value as the transaction in which we are dropping a
subscription could still be in-progress.Here is my thought on these 2 approaches, so if we lock in launcher
and check there we avoid launching the worker altogether but for that
we will have to revalidate the subscription using sequence scan on
pg_subcription as we can not open additional indexes as we are not
connected to any database OTOH if we acquire lock in
InitializeLogRepWorker() we don't need to do any additional scan but
we will have to launch the worker and after that if subscription
recheck shows its dropped we will exit the worker.I think either way it's fine because this is not a very common
performance path, I prefer what you proposed as we will have to add
less code in this case, because we are already checking the
subscription and just need to acquire the shared object lock in
InitializeLogRepWorker(), lets see what other thinks, meanwhile I will
modify the code with this approach as well.Here is a patch with a different approach. IMHO with this approach
the code change is much simpler.
I have slightly modified the patch, basically removing explicitly
unlocking the shared object as that will be taken care by transaction
commit / proc_exit()
--
Regards,
Dilip Kumar
Google
Attachments:
v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patchapplication/octet-stream; name=v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patchDownload+16-6
On Thu, 7 Aug 2025 at 19:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Won't that add a new restriction that one can't drop postgres database
after this?That's correct, so maybe this is not acceptable IMHO.
BTW, isn't it better to acquire the share_lock on subscription during
worker initialization (InitializeLogRepWorker()) where we already
checking whether the subscription is dropped by that time? I think if
we don't acquire the lock on subscription during launcher or during
InitializeLogRepWorker(), there is a risk that DropSubscription would
have reached the place where it would have tried to stop the workers
and launcher will launch the worker after that point. Then there is a
possibility of dangling worker because the GetSubscription() can still
return valid value as the transaction in which we are dropping a
subscription could still be in-progress.Here is my thought on these 2 approaches, so if we lock in launcher
and check there we avoid launching the worker altogether but for that
we will have to revalidate the subscription using sequence scan on
pg_subcription as we can not open additional indexes as we are not
connected to any database OTOH if we acquire lock in
InitializeLogRepWorker() we don't need to do any additional scan but
we will have to launch the worker and after that if subscription
recheck shows its dropped we will exit the worker.I think either way it's fine because this is not a very common
performance path, I prefer what you proposed as we will have to add
less code in this case, because we are already checking the
subscription and just need to acquire the shared object lock in
InitializeLogRepWorker(), lets see what other thinks, meanwhile I will
modify the code with this approach as well.Here is a patch with a different approach. IMHO with this approach
the code change is much simpler.I have slightly modified the patch, basically removing explicitly
unlocking the shared object as that will be taken care by transaction
commit / proc_exit()
Thanks for the updated patch. This change is required for the back
branches too. Currently it is not applying for PG16 and below
branches.
git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch
Applying: Fix DROP SUBSCRIPTION deadlock with new database creation
error: patch failed: src/backend/commands/subscriptioncmds.c:1571
error: src/backend/commands/subscriptioncmds.c: patch does not apply
error: patch failed: src/backend/replication/logical/worker.c:4624
error: src/backend/replication/logical/worker.c: patch does not apply
Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation
Regards,
Vignesh
On Wed, Aug 13, 2025 at 11:16 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, 7 Aug 2025 at 19:18, Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 11:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 6, 2025 at 10:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Won't that add a new restriction that one can't drop postgres database
after this?That's correct, so maybe this is not acceptable IMHO.
BTW, isn't it better to acquire the share_lock on subscription during
worker initialization (InitializeLogRepWorker()) where we already
checking whether the subscription is dropped by that time? I think if
we don't acquire the lock on subscription during launcher or during
InitializeLogRepWorker(), there is a risk that DropSubscription would
have reached the place where it would have tried to stop the workers
and launcher will launch the worker after that point. Then there is a
possibility of dangling worker because the GetSubscription() can still
return valid value as the transaction in which we are dropping a
subscription could still be in-progress.Here is my thought on these 2 approaches, so if we lock in launcher
and check there we avoid launching the worker altogether but for that
we will have to revalidate the subscription using sequence scan on
pg_subcription as we can not open additional indexes as we are not
connected to any database OTOH if we acquire lock in
InitializeLogRepWorker() we don't need to do any additional scan but
we will have to launch the worker and after that if subscription
recheck shows its dropped we will exit the worker.I think either way it's fine because this is not a very common
performance path, I prefer what you proposed as we will have to add
less code in this case, because we are already checking the
subscription and just need to acquire the shared object lock in
InitializeLogRepWorker(), lets see what other thinks, meanwhile I will
modify the code with this approach as well.Here is a patch with a different approach. IMHO with this approach
the code change is much simpler.I have slightly modified the patch, basically removing explicitly
unlocking the shared object as that will be taken care by transaction
commit / proc_exit()Thanks for the updated patch. This change is required for the back
branches too. Currently it is not applying for PG16 and below
branches.
git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch
Applying: Fix DROP SUBSCRIPTION deadlock with new database creation
error: patch failed: src/backend/commands/subscriptioncmds.c:1571
error: src/backend/commands/subscriptioncmds.c: patch does not apply
error: patch failed: src/backend/replication/logical/worker.c:4624
error: src/backend/replication/logical/worker.c: patch does not apply
Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creation
Right, I was waiting for consensus for the fix, but anyway I will send
the patches for back branches, thanks.
--
Regards,
Dilip Kumar
Google
On Wed, Aug 13, 2025 at 11:20 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Aug 13, 2025 at 11:16 AM vignesh C <vignesh21@gmail.com> wrote:
Thanks for the updated patch. This change is required for the back
branches too. Currently it is not applying for PG16 and below
branches.
git am v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-.patch
Applying: Fix DROP SUBSCRIPTION deadlock with new database creation
error: patch failed: src/backend/commands/subscriptioncmds.c:1571
error: src/backend/commands/subscriptioncmds.c: patch does not apply
error: patch failed: src/backend/replication/logical/worker.c:4624
error: src/backend/replication/logical/worker.c: patch does not apply
Patch failed at 0001 Fix DROP SUBSCRIPTION deadlock with new database creationRight, I was waiting for consensus for the fix, but anyway I will send
the patches for back branches, thanks.
I have prepared back branch patches.
--
Regards,
Dilip Kumar
Google
Attachments:
v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V17-18.patchapplication/octet-stream; name=v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V17-18.patchDownload+16-6
v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V15.patchapplication/octet-stream; name=v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V15.patchDownload+16-6
v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v14.patchapplication/octet-stream; name=v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-v14.patchDownload+16-6
v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-Master.patchapplication/octet-stream; name=v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-Master.patchDownload+16-6
v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V13.patchapplication/octet-stream; name=v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V13.patchDownload+16-6
v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V16.patchapplication/octet-stream; name=v3-0001-Fix-DROP-SUBSCRIPTION-deadlock-with-new-database-V16.patchDownload+17-7
On Wed, Aug 13, 2025 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I have prepared back branch patches.
@@ -1802,11 +1802,7 @@ DropSubscription(DropSubscriptionStmt *stmt,
bool isTopLevel)
...
- /*
- * Lock pg_subscription with AccessExclusiveLock to ensure that the
- * launcher doesn't restart new worker during dropping the subscription
- */
- rel = table_open(SubscriptionRelationId, AccessExclusiveLock);
+ rel = table_open(SubscriptionRelationId, AccessShareLock);
We need to acquire RowExclusiveLock here as we are deleting the
subscription row during this operation. See docs [1]https://www.postgresql.org/docs/devel/explicit-locking.html -- With Regards, Amit Kapila. and
AlterSubscription() code for reference. Attached is a top-up patch to
fix this for HEAD. Additionally, I have edited a few comments. If you
agree with these changes then please prepare backbranch patches
accordingly.
[1]: https://www.postgresql.org/docs/devel/explicit-locking.html -- With Regards, Amit Kapila.
--
With Regards,
Amit Kapila.