Fix LOCK_TIMEOUT handling in slotsync worker

Started by Zhijie Hou (Fujitsu)3 days ago11 messages
#1Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
1 attachment(s)

Hi,

Previously, the slotsync worker used SIGINT to receive a graceful shutdown
signal from the startup process on promotion. However, SIGINT is also used by
the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
slotsync worker can access and lock catalog tables while parsing libpq tuples,
this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
signals and consequently waiting indefinitely on locks.

I can reproduce the issue by:

1) create a failover replication slot for slotsync on primary.
2) start slotsync worker on standby and uses gdb to make the slotsync
worker block before accessing pg_type catalog via walrcv_exec -> libpqrcv_exec ->
libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) take ACCESS EXCLUSIVE lock on pg_type on primary.
4) log standby snapshot to replicate the lock to standby.
5) release the slotsync worker, it will start waiting for the lock on pg_type to
be released. And on HEAD, it would not be canceled by the lock_timeout
setting.

Here is a patch to resolve this by replacing the current signal handler with the
appropriate StatementCancelHandler for SIGINT within the slotsync worker.
Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
slotsync of the need to stop during promotion. The slotsync worker now stops
upon detecting that the shared memory flag (stopSignaled) is set to true.

I did not add a tap-test in the patch for now. Although feasible, it requires
a strong lock on a catalog and an injection point to control the
process.

Best Regards,
Hou zj

Attachments:

v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patchapplication/octet-stream; name=v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch
#2shveta malik
shveta malik
shveta.malik@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#1)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Mon, Dec 8, 2025 at 7:34 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

Hi,

Previously, the slotsync worker used SIGINT to receive a graceful shutdown
signal from the startup process on promotion. However, SIGINT is also used by
the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
slotsync worker can access and lock catalog tables while parsing libpq tuples,
this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
signals and consequently waiting indefinitely on locks.

I can reproduce the issue by:

1) create a failover replication slot for slotsync on primary.
2) start slotsync worker on standby and uses gdb to make the slotsync
worker block before accessing pg_type catalog via walrcv_exec -> libpqrcv_exec ->
libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) take ACCESS EXCLUSIVE lock on pg_type on primary.
4) log standby snapshot to replicate the lock to standby.
5) release the slotsync worker, it will start waiting for the lock on pg_type to
be released. And on HEAD, it would not be canceled by the lock_timeout
setting.

Here is a patch to resolve this by replacing the current signal handler with the
appropriate StatementCancelHandler for SIGINT within the slotsync worker.
Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
slotsync of the need to stop during promotion. The slotsync worker now stops
upon detecting that the shared memory flag (stopSignaled) is set to true.

I did not add a tap-test in the patch for now. Although feasible, it requires
a strong lock on a catalog and an injection point to control the
process.

Thanks for the patch. I agree with the issue mentioned and can
reproduce it on HEAD; verified that the patch fixes it.
The patch looks good to me.

thanks
Shveta

#3Chao Li
Chao Li
li.evan.chao@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#1)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

Hi Zhijie,

Thanks for the patch. The change looks reasonable. ShutDownSlotSync() has set SlotSyncCtx->stopSignaled, and SIGUSR1’s procsignal_sigusr1_handler will do SetLatch(MyLatch) that will in turn wake up ProcessSlotSyncInterrupts(), then hit if (SlotSyncCtx->stopSignaled) and the slotsync worker terminates.

On Dec 8, 2025, at 10:04, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:

<v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch>

I have nit comment:

```
-	if (ShutdownRequestPending)
+	if (SlotSyncCtx->stopSignaled)
 	{
 		ereport(LOG,
-				errmsg("replication slot synchronization worker is shutting down on receiving SIGINT"));
+				errmsg("replication slot synchronization worker is shutting down because promotion is triggered"));
```

In the error message, “because promotion is triggered" sound a little redundant, can be just:

"replication slot synchronization worker is shutting down due to promotion"

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#4Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Chao Li (#3)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Mon, Dec 8, 2025 at 1:06 PM Chao Li <li.evan.chao@gmail.com> wrote:

I have nit comment:

```
-       if (ShutdownRequestPending)
+       if (SlotSyncCtx->stopSignaled)
{
ereport(LOG,
-                               errmsg("replication slot synchronization worker is shutting down on receiving SIGINT"));
+                               errmsg("replication slot synchronization worker is shutting down because promotion is triggered"));
```

In the error message, “because promotion is triggered" sound a little redundant, can be just:

"replication slot synchronization worker is shutting down due to promotion"

We have a number of existing similar messages like: "logical
replication parallel apply worker for subscription \"%s\" will stop
because of a parameter change". So, how about: "replication slot
synchronization worker will stop because promotion is triggered"?

--
With Regards,
Amit Kapila.

#5Chao Li
Chao Li
li.evan.chao@gmail.com
In reply to: Amit Kapila (#4)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Dec 9, 2025, at 13:34, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 8, 2025 at 1:06 PM Chao Li <li.evan.chao@gmail.com> wrote:

I have nit comment:

```
-       if (ShutdownRequestPending)
+       if (SlotSyncCtx->stopSignaled)
{
ereport(LOG,
-                               errmsg("replication slot synchronization worker is shutting down on receiving SIGINT"));
+                               errmsg("replication slot synchronization worker is shutting down because promotion is triggered"));
```

In the error message, “because promotion is triggered" sound a little redundant, can be just:

"replication slot synchronization worker is shutting down due to promotion"

We have a number of existing similar messages like: "logical
replication parallel apply worker for subscription \"%s\" will stop
because of a parameter change". So, how about: "replication slot
synchronization worker will stop because promotion is triggered"?

--
With Regards,
Amit Kapila.

Yeah, I just searched and see similar messages:

```
logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's superuser privileges have been revoked

logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges have been revoked
```

I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#6Zhijie Hou (Fujitsu)
Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Amit Kapila (#4)
1 attachment(s)
RE: Fix LOCK_TIMEOUT handling in slotsync worker

On Tuesday, December 9, 2025 1:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Dec 8, 2025 at 1:06 PM Chao Li <li.evan.chao@gmail.com> wrote:

I have nit comment:

```
-       if (ShutdownRequestPending)
+       if (SlotSyncCtx->stopSignaled)
{
ereport(LOG,
-                               errmsg("replication slot synchronization worker is shutting

down on receiving SIGINT"));

+                               errmsg("replication slot
+ synchronization worker is shutting down because promotion is
+ triggered"));
```

In the error message, “because promotion is triggered" sound a little

redundant, can be just:

"replication slot synchronization worker is shutting down due to promotion"

We have a number of existing similar messages like: "logical replication
parallel apply worker for subscription \"%s\" will stop because of a parameter
change". So, how about: "replication slot synchronization worker will stop
because promotion is triggered"?

The suggested message looks better. Here is the updated patch which
can be applied to all branches that supports slotsync.

Best Regards,
Hou zj

Attachments:

v2-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patchapplication/octet-stream; name=v2-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch
#7Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Chao Li (#5)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:

Yeah, I just searched and see similar messages:

```
logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's superuser privileges have been revoked

logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges have been revoked
```

I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?

My AI tool says:

Both options are grammatically correct, but the nuance differs:
"will stop because promotion is triggered"
This uses the present tense ("is triggered"), which suggests the
promotion event is happening right now, concurrently with the stopping
action.
"will stop because promotion has been triggered"
This uses the present perfect tense ("has been triggered"), which
implies the promotion event already occurred and is the reason for the
upcoming stop.

In this case, because ShutDownSlotSync() will wait for the slotsync
worker to exit, so the first one ("will stop because promotion is
triggered") fits better.

--
With Regards,
Amit Kapila.

#8Chao Li
Chao Li
li.evan.chao@gmail.com
In reply to: Amit Kapila (#7)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:

Yeah, I just searched and see similar messages:

```
logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's superuser privileges have been revoked

logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges have been revoked
```

I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?

My AI tool says:

Both options are grammatically correct, but the nuance differs:
"will stop because promotion is triggered"
This uses the present tense ("is triggered"), which suggests the
promotion event is happening right now, concurrently with the stopping
action.
"will stop because promotion has been triggered"
This uses the present perfect tense ("has been triggered"), which
implies the promotion event already occurred and is the reason for the
upcoming stop.

In this case, because ShutDownSlotSync() will wait for the slotsync
worker to exit, so the first one ("will stop because promotion is
triggered") fits better.

Make sense. Then Zhijie’s v2 looks good to me.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#9Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Chao Li (#8)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Tue, Dec 9, 2025 at 11:50 AM Chao Li <li.evan.chao@gmail.com> wrote:

On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:

Yeah, I just searched and see similar messages:

```
logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's superuser privileges have been revoked

logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges have been revoked
```

I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?

My AI tool says:

Both options are grammatically correct, but the nuance differs:
"will stop because promotion is triggered"
This uses the present tense ("is triggered"), which suggests the
promotion event is happening right now, concurrently with the stopping
action.
"will stop because promotion has been triggered"
This uses the present perfect tense ("has been triggered"), which
implies the promotion event already occurred and is the reason for the
upcoming stop.

In this case, because ShutDownSlotSync() will wait for the slotsync
worker to exit, so the first one ("will stop because promotion is
triggered") fits better.

Make sense. Then Zhijie’s v2 looks good to me.

Thanks for the review. Pushed.

--
With Regards,
Amit Kapila.

#10Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Chao Li (#8)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Tue, Dec 9, 2025 at 11:50 AM Chao Li <li.evan.chao@gmail.com> wrote:

On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:

Yeah, I just searched and see similar messages:

```
logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's superuser privileges have been revoked

logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges have been revoked
```

I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?

My AI tool says:

Both options are grammatically correct, but the nuance differs:
"will stop because promotion is triggered"
This uses the present tense ("is triggered"), which suggests the
promotion event is happening right now, concurrently with the stopping
action.
"will stop because promotion has been triggered"
This uses the present perfect tense ("has been triggered"), which
implies the promotion event already occurred and is the reason for the
upcoming stop.

In this case, because ShutDownSlotSync() will wait for the slotsync
worker to exit, so the first one ("will stop because promotion is
triggered") fits better.

Make sense. Then Zhijie’s v2 looks good to me.

BTW, by mistake, I ended up pushing 0001 which I think in itself is
not a bad idea. However, we can improve it at least in HEAD as part of
patch[1]/messages/by-id/CAFPTHDYHjqq53f1Cbata2MrV2nRBDe6XgxXfqv4tw4rcT2-Y8Q@mail.gmail.com where we are making changes in the same part of code. Do you
think that is okay?

[1]: /messages/by-id/CAFPTHDYHjqq53f1Cbata2MrV2nRBDe6XgxXfqv4tw4rcT2-Y8Q@mail.gmail.com

--
With Regards,
Amit Kapila.

#11Chao Li
Chao Li
li.evan.chao@gmail.com
In reply to: Amit Kapila (#10)
Re: Fix LOCK_TIMEOUT handling in slotsync worker

On Dec 9, 2025, at 19:39, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 9, 2025 at 11:50 AM Chao Li <li.evan.chao@gmail.com> wrote:

On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:

Yeah, I just searched and see similar messages:

```
logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's superuser privileges have been revoked

logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges have been revoked
```

I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?

My AI tool says:

Both options are grammatically correct, but the nuance differs:
"will stop because promotion is triggered"
This uses the present tense ("is triggered"), which suggests the
promotion event is happening right now, concurrently with the stopping
action.
"will stop because promotion has been triggered"
This uses the present perfect tense ("has been triggered"), which
implies the promotion event already occurred and is the reason for the
upcoming stop.

In this case, because ShutDownSlotSync() will wait for the slotsync
worker to exit, so the first one ("will stop because promotion is
triggered") fits better.

Make sense. Then Zhijie’s v2 looks good to me.

BTW, by mistake, I ended up pushing 0001 which I think in itself is
not a bad idea. However, we can improve it at least in HEAD as part of
patch[1] where we are making changes in the same part of code. Do you
think that is okay?

[1] - /messages/by-id/CAFPTHDYHjqq53f1Cbata2MrV2nRBDe6XgxXfqv4tw4rcT2-Y8Q@mail.gmail.com

Sure, no problem. Thanks for taking care of my comment.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/