Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

Started by Fujii Masao | 21 days ago | 42 messages | hackers
#1 Fujii Masao
masao.fujii@gmail.com

Hi,

I noticed that during standby promotion the startup process sends SIGUSR1 to
the slotsync worker to make it exit. Is there a reason for using SIGUSR1?

If the slotsync worker is blocked waiting for input from the primary (e.g.,
due to a network outage between the primary and standby), SIGUSR1 won't
interrupt the wait. As a result, the worker can remain stuck and delay
promotion for a long time.

Would it make sense to send SIGTERM instead, so the worker can exit promptly
even while waiting? I've attached a WIP patch that does this. I haven't updated
the source comments yet, but I can do so if we agree on the approach.

SIGTERM alone is not sufficient, though. A new slotsync worker could start
immediately after the old one exits and block promotion again. To address this,
the patch makes a newly started worker exit immediately if promotion is
in progress.

Thoughts?

Regards,

--
Fujii Masao

Attachments:

v1-0001-Use-SIGTERM-to-stop-slotsync-worker-during-standb.patch (application/octet-stream, +7 -23)

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#1)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

Fujii Masao <masao.fujii@gmail.com> writes:

I noticed that during standby promotion the startup process sends SIGUSR1 to
the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
Would it make sense to send SIGTERM instead, so the worker can exit promptly
even while waiting?

One consideration here is that we expect all processes to receive
SIGTERM from init at the beginning of an operating system shutdown
sequence. Background workers should exit at that point only if their
services will not be needed during database shutdown. While it
sounds plausible that a slotsync worker should exit immediately,
I'm not quite sure if that's what we want.

regards, tom lane

#3 Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#1)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Thu, Mar 19, 2026 at 1:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Fujii Masao <masao.fujii@gmail.com> writes:

I noticed that during standby promotion the startup process sends SIGUSR1 to
the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
Would it make sense to send SIGTERM instead, so the worker can exit promptly
even while waiting?

One consideration here is that we expect all processes to receive
SIGTERM from init at the beginning of an operating system shutdown
sequence. Background workers should exit at that point only if their
services will not be needed during database shutdown. While it
sounds plausible that a slotsync worker should exit immediately,
I'm not quite sure if that's what we want.

Currently, when the slotsync worker receives SIGUSR1 during promotion,
it exits at the next interrupt check (i.e., in ProcessSlotSyncInterrupts()).
There's no additional termination handling, so it seems the worker is expected
to exit promptly once the startup process requests it.

Given that, using SIGTERM to make the worker exit immediately seems OK to me...

With the patch, on SIGTERM, the worker basically still exits at the next
interrupt check. The difference is that if it's waiting for input from
the primary (i.e., DoingCommandRead = true), it calls ProcessInterrupts()
in the SIGTERM signal handler (die()) and exits immediately.

Regards.

--
Fujii Masao

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#1)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <masao.fujii@gmail.com> wrote:

I noticed that during standby promotion the startup process sends SIGUSR1 to
the slotsync worker to make it exit. Is there a reason for using SIGUSR1?

IIRC, this same signal is used for both the backend executing
pg_sync_replication_slots() and the slotsync worker. We want the worker to
exit and the backend to error out. Using SIGTERM for the backend could
cause it to exit. Also, we want the last slotsync cycle to complete before
promotion so that subscribers that fail over or switch over
to the new primary have a better chance of finding failover slots
sync-ready.

--
With Regards,
Amit Kapila.

#5 Fujii Masao
masao.fujii@gmail.com
In reply to: Amit Kapila (#4)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <masao.fujii@gmail.com> wrote:

I noticed that during standby promotion the startup process sends SIGUSR1 to
the slotsync worker to make it exit. Is there a reason for using SIGUSR1?

IIRC, this same signal is used for both the backend executing
pg_sync_replication_slots() and the slotsync worker. We want the worker to
exit and the backend to error out. Using SIGTERM for the backend could
cause it to exit.

Why do we want the backend running pg_sync_replication_slots() to throw
an error here, rather than just exit? If emitting an error is really required,
another option would be to store the process type in SlotSyncCtx and send
different signals accordingly, for example, SIGTERM for the slotsync worker
and another signal for a backend. But it seems simpler and sufficient to have
the backend exit in this case as well.

Also, we want the last slotsync cycle to complete before
promotion so that subscribers that fail over or switch over
to the new primary have a better chance of finding failover slots
sync-ready.

I'm not sure how much this behavior helps in failover/switchover scenarios.
But the main issue is that if a primary crash triggers standby promotion,
that last slotsync cycle can get stuck waiting for input from the primary,
which delays promotion. IOW, failover time can become unnecessarily long
due to the slotsync worker. I'd like to address that problem.

Regards,

--
Fujii Masao

#6 Nisha Moond
nisha.moond412@gmail.com
In reply to: Fujii Masao (#5)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <masao.fujii@gmail.com> wrote:

I'm not sure how much this behavior helps in failover/switchover scenarios.
But the main issue is that if a primary crash triggers standby promotion,
that last slotsync cycle can get stuck waiting for input from the primary,
which delays promotion. IOW, failover time can become unnecessarily long
due to the slotsync worker. I'd like to address that problem.

Hi Fujii-san,

I tried reproducing the wait scenario as you mentioned, but could not
reproduce it.
Steps I followed:
1) Attach a debugger to the slotsync worker and hold it at
fetch_remote_slots() ... -> libpqsrv_get_result()
2) Kill the primary.
3) Trigger promotion of the standby and release the debugger from the slotsync worker.

The slot sync worker stops when the promotion is triggered and then
restarts, but fails to connect to the primary. The promotion happens
immediately.
```
LOG: received promote request
LOG: redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system: 0.02 s, elapsed: 4574.89 s
LOG: last completed transaction was at log time 2026-03-23 17:13:15.782313+05:30
LOG: replication slot synchronization worker will stop because promotion is triggered
LOG: slot sync worker started
ERROR: synchronization worker "slotsync worker" could not connect to the primary server: connection to server at "127.0.0.1", port 9933 failed: Connection refused
    Is the server running on that host and accepting TCP/IP connections?
```

I’ll debug this further to understand it better.
In the meantime, please let me know if I’m missing any step, or if you
followed a specific setup/script to reproduce this scenario.

--
Thanks,
Nisha

#7 Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#6)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <nisha.moond412@gmail.com> wrote:

I’ll debug this further to understand it better.
In the meantime, please let me know if I’m missing any step, or if you
followed a specific setup/script to reproduce this scenario.

Thanks for testing!

If you killed the primary with a signal like SIGTERM, an RST packet might have
been sent to the slotsync worker at that moment. That would have allowed the
worker to detect the connection loss and exit the wait state, so promotion
could complete as expected.

To reproduce the issue, you'll need a scenario where the worker cannot detect
the connection loss. For example, you could block network traffic (e.g., with
iptables) between the primary and the slotsync worker. The key is to create
a situation where the worker remains stuck waiting for input for a long time.

Regards,

--
Fujii Masao

#8 Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#7)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao <masao.fujii@gmail.com> wrote:

To reproduce the issue, you'll need a scenario where the worker cannot detect
the connection loss. For example, you could block network traffic (e.g., with
iptables) between the primary and the slotsync worker. The key is to create
a situation where the worker remains stuck waiting for input for a long time.

Here's one way to reproduce the issue using iptables:

----------------------------------------------------
[Set up slot synchronization environment]

initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot
pg_recvlogical --create-slot -S logical_slot -P pgoutput \
    --enable-failover -d postgres
psql -c "CREATE PUBLICATION mypub"

pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres" \
    -h 127.0.0.1
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
EOF
pg_ctl -D sby1 start

psql -c "SELECT pg_logical_emit_message(true, 'abc', 'xyz')"

[Block network traffic used by slot synchronization]
su -
iptables -A INPUT -p tcp --sport 5432 -j DROP
iptables -A OUTPUT -p tcp --dport 5432 -j DROP

[Promote the standby]
# wait a few seconds
pg_ctl -D sby1 promote
----------------------------------------------------

In my tests on master, promotion got stuck in this scenario.
With the patch, promotion completed promptly.

After testing, you can remove the network block with:

iptables -D INPUT -p tcp --sport 5432 -j DROP
iptables -D OUTPUT -p tcp --dport 5432 -j DROP

Regards,

--
Fujii Masao

#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#5)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <masao.fujii@gmail.com> wrote:

Why do we want the backend running pg_sync_replication_slots() to throw
an error here, rather than just exit?

I think it was because the backends remain connected after promotion
and if we make them exit that will change the existing behavior.

--
With Regards,
Amit Kapila.

#10 Nisha Moond
nisha.moond412@gmail.com
In reply to: Fujii Masao (#8)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Tue, Mar 24, 2026 at 2:45 PM Fujii Masao <masao.fujii@gmail.com> wrote:

Here's one way to reproduce the issue using iptables:

Thank you, Fujii-san, for sharing the steps. I am now able to
reproduce the behavior where promotion gets stuck because the slot
sync worker remains in a wait loop.

As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
(using slightly higher values for debugging). With this setting, the
TCP stack terminates the connection if data sent to the primary
remains unacknowledged beyond the configured timeout (e.g., due to a
network drop). In such cases the slot sync worker exits instead of
waiting indefinitely. With an appropriately tuned timeout, this could
help avoid the promotion issue by ensuring the worker does not remain
stuck when the connection to the primary is lost.

Thanks,
Nisha

#11 Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#10)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Wed, Mar 25, 2026 at 1:51 AM Nisha Moond <nisha.moond412@gmail.com> wrote:

Thank you, Fujii-san, for sharing the steps. I am now able to
reproduce the behavior where promotion gets stuck because the slot
sync worker remains in a wait loop.

Thanks for the test!

As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
(using slightly higher values for debugging). With this setting, the
TCP stack terminates the connection if data sent to the primary
remains unacknowledged beyond the configured timeout (e.g., due to a
network drop). In such cases the slot sync worker exits instead of
waiting indefinitely. With an appropriately tuned timeout, this could
help avoid the promotion issue by ensuring the worker does not remain
stuck when the connection to the primary is lost.

Yes, TCP timeout settings like tcp_user_timeout, keepalives,
and net.ipv4.tcp_retries2 can help in this situation. However,
they involve a trade-off: very small timeouts reduce
failover time but increase the risk of false network failure detection,
while larger timeouts (e.g., 10s) avoid false positives but can
delay failover by that amount.

Because of this, I think it's better to address the issue without
relying on such TCP timeout parameters.

Also, tcp_user_timeout is not available on platforms that don't
support TCP_USER_TIMEOUT (e.g., Windows).

Regards,

--
Fujii Masao

#12 Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#5)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <masao.fujii@gmail.com> wrote:

Why do we want the backend running pg_sync_replication_slots() to throw
an error here, rather than just exit? If emitting an error is really required,
another option would be to store the process type in SlotSyncCtx and send
different signals accordingly, for example, SIGTERM for the slotsync worker
and another signal for a backend. But it seems simpler and sufficient to have
the backend exit in this case as well.

Since we want to retain the existing behavior for the API, instead of
using two signals we can achieve what you intend with one
signal (SIGUSR1) only. We can use the SendProcSignal mechanism, as is done in
ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to the
slotsync worker/backend via SendProcSignal. Then
procsignal_sigusr1_handler() will call HandleSlotSyncInterrupt().
HandleSlotSyncInterrupt() will set the InterruptPending and
SlotSyncPending flags. Then ProcessInterrupts() will call a slotsync-specific
function based on the flag and do what we currently do in
ProcessSlotSyncInterrupts(). I think this should address the issue you
are worried about.
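
A rough, standalone C sketch of the flow described above may help make it concrete. It does not call any real PostgreSQL function: SimSlotSyncState, sim_handle_slotsync_interrupt() and sim_process_interrupts() are hypothetical stand-ins for the proposed handler, the InterruptPending/SlotSyncPending flags, and the dispatch inside ProcessInterrupts(). The only point it illustrates is that one signal can lead to two outcomes, depending on the process type.

```c
#include <stdbool.h>

/* Hypothetical model of the two kinds of process that may be syncing slots. */
typedef enum { SIM_SLOTSYNC_WORKER, SIM_SQL_BACKEND } SimProcKind;

/* What the process does once the interrupt has been processed. */
typedef enum { SIM_CONTINUE, SIM_EXIT_PROCESS, SIM_RAISE_ERROR } SimOutcome;

typedef struct SimSlotSyncState
{
    SimProcKind kind;
    bool        interrupt_pending;  /* models InterruptPending */
    bool        slotsync_pending;   /* models the proposed SlotSyncPending */
} SimSlotSyncState;

/*
 * Models the proposed handler invoked from the SIGUSR1 procsignal handler:
 * it only records that work is pending; nothing is done in signal context.
 */
static void
sim_handle_slotsync_interrupt(SimSlotSyncState *s)
{
    s->interrupt_pending = true;
    s->slotsync_pending = true;
}

/*
 * Models the dispatch at the next interrupt check: the worker exits,
 * while the backend running the SQL function reports an error and lives on.
 */
static SimOutcome
sim_process_interrupts(SimSlotSyncState *s)
{
    if (!s->interrupt_pending)
        return SIM_CONTINUE;
    s->interrupt_pending = false;

    if (s->slotsync_pending)
    {
        s->slotsync_pending = false;
        return (s->kind == SIM_SLOTSYNC_WORKER) ? SIM_EXIT_PROCESS
                                                : SIM_RAISE_ERROR;
    }
    return SIM_CONTINUE;
}
```

In this model, calling sim_handle_slotsync_interrupt() and then sim_process_interrupts() yields SIM_EXIT_PROCESS for a worker and SIM_RAISE_ERROR for a backend, which mirrors the existing exit/error split without needing a second signal.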

--
With Regards,
Amit Kapila.

#13 Nisha Moond
nisha.moond412@gmail.com
In reply to: Amit Kapila (#12)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Since we want to retain the existing behavior for the API, instead of
using two signals we can achieve what you intend with one
signal (SIGUSR1) only. We can use the SendProcSignal mechanism, as is done in
ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to the
slotsync worker/backend via SendProcSignal. Then
procsignal_sigusr1_handler() will call HandleSlotSyncInterrupt().
HandleSlotSyncInterrupt() will set the InterruptPending and
SlotSyncPending flags. Then ProcessInterrupts() will call a slotsync-specific
function based on the flag and do what we currently do in
ProcessSlotSyncInterrupts(). I think this should address the issue you
are worried about.

+1
Retaining the current behavior for the API backend keeps it consistent
with other backends that continue after promotion.

In the reproduced case, the worker (or API backend) is waiting in:
libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
When SIGUSR1 is received, its handler only sets the latch and does not mark
any interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
effectively a no-op, and the process goes back to waiting. Control
never returns to the slotsync code path, so we cannot rely on
stopSignaled to handle exit/error separately.
Only SIGTERM works here because its handler makes
INTERRUPTS_PENDING_CONDITION() true, allowing ProcessInterrupts() to run and
break the loop. Other signals such as SIGUSR1 or SIGINT do not do
this, so simply using another signal might not solve the API error
handling case.
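
The behavior described above can be mimicked with a small standalone C model. This is only a sketch under stated assumptions: SimWaiter, sim_deliver() and sim_wait_iteration() are hypothetical names, and the model simply encodes what this message reports (a plain SIGUSR1 sets the latch only, SIGTERM also marks an interrupt as pending). It is not PostgreSQL code.

```c
#include <stdbool.h>

/* Hypothetical per-process state for the model. */
typedef struct SimWaiter
{
    bool latch_set;          /* wakes the wait, like SetLatch() */
    bool interrupt_pending;  /* what the interrupt check looks at */
} SimWaiter;

typedef enum { SIM_SIGUSR1_PLAIN, SIM_SIGTERM } SimSignal;

/* Models signal delivery as described in the thread. */
static void
sim_deliver(SimWaiter *w, SimSignal sig)
{
    w->latch_set = true;              /* both signals wake the waiter */
    if (sig == SIM_SIGTERM)
        w->interrupt_pending = true;  /* only SIGTERM marks an interrupt */
}

/*
 * Models one pass of the wait loop while no data arrives from the primary.
 * Returns true if the loop is broken by interrupt processing, false if the
 * process just goes back to waiting.
 */
static bool
sim_wait_iteration(SimWaiter *w)
{
    if (!w->latch_set)
        return false;        /* nothing woke us; still waiting */

    w->latch_set = false;    /* like ResetLatch() */

    if (w->interrupt_pending)
        return true;         /* the interrupt check has work to do */

    return false;            /* check was a no-op; wait again */
}
```

With a plain SIGUSR1 the waiter wakes, finds nothing pending, and returns to the wait; with SIGTERM the same iteration breaks out. That is the gap the ProcSignal-based approach is meant to close for SIGUSR1.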

I’ve implemented the approach suggested by Amit above in the attached
patch and verified it for both the worker and API scenarios. With this,
the API can now error out without exiting the backend.

--
Thanks,
Nisha

Attachments:

v2-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (application/octet-stream, +81 -4)

#14 shveta malik
shveta.malik@gmail.com
In reply to: Nisha Moond (#13)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Thu, Mar 26, 2026 at 4:08 PM Nisha Moond <nisha.moond412@gmail.com> wrote:

I’ve implemented the approach suggested by Amit above in the attached
patch and verified it for both the worker and API scenarios. With this,
the API can now error out without exiting the backend.

+1 on the idea. A few comments:

1)
It was not clear initially why SetLatch is not done in
HandleSlotSyncShutdownInterrupt(); digging further revealed that
procsignal_sigusr1_handler() does the SetLatch itself. Perhaps you can
add the comment below at the end of HandleSlotSyncShutdownInterrupt(),
similar to how other functions (HandleProcSignalBarrierInterrupt,
HandleRecoveryConflictInterrupt, etc.) do.

/* latch will be set by procsignal_sigusr1_handler */

2)
In ProcessSlotSyncInterrupts(), we no longer need the logic below, right?

if (SlotSyncCtx->stopSignaled)
{
    if (AmLogicalSlotSyncWorkerProcess())
    {
        ...
        proc_exit(0);
    }
    else
    {
        /*
         * For the backend executing SQL function
         * pg_sync_replication_slots().
         */
        ereport(ERROR,
                errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                errmsg("replication slot synchronization will stop because promotion is triggered"));
    }
}

thanks
Shveta

#15 Nisha Moond
nisha.moond412@gmail.com
In reply to: shveta malik (#14)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Fri, Mar 27, 2026 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:

+1 on the idea. Few comments:

Thanks for the review.

1)
It was not initially clear why SetLatch is not done in
HandleSlotSyncShutdownInterrupt(). Digging further revealed that
procsignal_sigusr1_handler() does the SetLatch itself, after calling the
individual handlers. Perhaps you can add the comment below at the end of
HandleSlotSyncShutdownInterrupt(), similar to how other functions
(HandleProcSignalBarrierInterrupt, HandleRecoveryConflictInterrupt, etc.) do.

/* latch will be set by procsignal_sigusr1_handler */

Fixed.

2)
In ProcessSlotSyncInterrupts(), now we don't need the below logic right?

if (SlotSyncCtx->stopSignaled)
{
    if (AmLogicalSlotSyncWorkerProcess())
    {
        ...
        proc_exit(0);
    }
    else
    {
        /*
         * For the backend executing SQL function
         * pg_sync_replication_slots().
         */
        ereport(ERROR,
                errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                errmsg("replication slot synchronization will stop because promotion is triggered"));
    }
}

Right. Attached patch with the suggested changes.

--
Thanks,
Nisha

Attachments:

v3-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patchapplication/octet-stream; name=v3-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patchDownload+93-25
#16Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#15)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Fri, Mar 27, 2026 at 1:57 PM Nisha Moond <nisha.moond412@gmail.com> wrote:

Right. Attached patch with the suggested changes.

Thanks for making the patch!

From a quick look, this approach seems fine to me. I'll review it in
more detail later.

Thanks for working on this issue!

Regards,

--
Fujii Masao

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Nisha Moond (#15)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Fri, Mar 27, 2026 at 10:27 AM Nisha Moond <nisha.moond412@gmail.com> wrote:

On Fri, Mar 27, 2026 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:

In ProcessSlotSyncInterrupts(), now we don't need the below logic right?

if (SlotSyncCtx->stopSignaled)
{
    if (AmLogicalSlotSyncWorkerProcess())
    {
        ...
        proc_exit(0);
    }
    else
    {
        /*
         * For the backend executing SQL function
         * pg_sync_replication_slots().
         */
        ereport(ERROR,
                errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                errmsg("replication slot synchronization will stop because promotion is triggered"));
    }
}

Right. Attached patch with the suggested changes.

After this change, why do we need to invoke
ProcessSlotSyncInterrupts() twice in SyncReplicationSlots?

--
With Regards,
Amit Kapila.

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#17)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Fri, Mar 27, 2026 at 1:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 27, 2026 at 10:27 AM Nisha Moond <nisha.moond412@gmail.com> wrote:

On Fri, Mar 27, 2026 at 9:28 AM shveta malik <shveta.malik@gmail.com> wrote:

In ProcessSlotSyncInterrupts(), now we don't need the below logic right?

if (SlotSyncCtx->stopSignaled)
{
    if (AmLogicalSlotSyncWorkerProcess())
    {
        ...
        proc_exit(0);
    }
    else
    {
        /*
         * For the backend executing SQL function
         * pg_sync_replication_slots().
         */
        ereport(ERROR,
                errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                errmsg("replication slot synchronization will stop because promotion is triggered"));
    }
}

Right. Attached patch with the suggested changes.

After this change, why do we need to invoke
ProcessSlotSyncInterrupts() twice in SyncReplicationSlots?

Also, I am not sure it is a good idea to keep the name
ProcessSlotSyncInterrupts() for the current function, because we remove
most of its interrupt handling. Shall we copy its remaining code into the
two call sites instead, as we do for similar handling in other places?

Another comment:
*
+
+ if (SlotSyncShutdown)
+ HandleSlotSyncShutdown();
...
...
+ if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+ HandleSlotSyncShutdownInterrupt();

Would it be better to name these functions HandleSlotSyncMessage()
and HandleSlotSyncMessageInterrupt()? For the API, these simply
lead to an ERROR, and the names would then match the ProcSignalReason
name PROCSIG_SLOTSYNC_MESSAGE.

--
With Regards,
Amit Kapila.

#19Nisha Moond
nisha.moond412@gmail.com
In reply to: Amit Kapila (#18)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Fri, Mar 27, 2026 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 27, 2026 at 1:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

After this change, why do we need to invoke
ProcessSlotSyncInterrupts() twice in SyncReplicationSlots?

Fixed.

Also, I am not sure it is a good idea to keep the name
ProcessSlotSyncInterrupts() for the current function, because we remove
most of its interrupt handling. Shall we copy its remaining code into the
two call sites instead, as we do for similar handling in other places?

Done.

Another comment:
*
+
+ if (SlotSyncShutdown)
+ HandleSlotSyncShutdown();
...
...
+ if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+ HandleSlotSyncShutdownInterrupt();

Would it be better to name these functions HandleSlotSyncMessage()
and HandleSlotSyncMessageInterrupt()? For the API, these simply
lead to an ERROR, and the names would then match the ProcSignalReason
name PROCSIG_SLOTSYNC_MESSAGE.

Done.

Attached the updated patch.

--
Thanks,
Nisha

Attachments:

v4-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patchapplication/octet-stream; name=v4-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patchDownload+93-34
#20Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#19)
Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?

On Fri, Mar 27, 2026 at 9:38 PM Nisha Moond <nisha.moond412@gmail.com> wrote:

Attached the updated patch.

Thanks for updating the patch! It looks good overall.

Regarding the comments in SlotSyncCtxStruct: since the role of the
stopSignaled field has changed, shouldn't those comments be updated
accordingly? For example,

-------------------------
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
+ * the SQL function pg_sync_replication_slots(). On promotion,
+ * the startup process sets 'stopSignaled' and uses this 'pid' to wake up
+ * the currently synchronizing process so that the process can
+ * immediately stop its synchronizing work.
+ * Setting 'stopSignaled' is used to handle the race condition when the
-------------------------
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdown = false;

For the interrupt flag set in procsignal_sigusr1_handler(), other flags
use a *Pending suffix (e.g., ProcSignalBarrierPending,
ParallelApplyMessagePending), so SlotSyncShutdownPending would
be more consistent.

+void
+HandleSlotSyncMessage(void)

Functions called from ProcessInterrupts() typically use the Process* prefix
(e.g., ProcessProcSignalBarrier(), ProcessParallelApplyMessages()),
so ProcessSlotSyncMessage would be more consistent than HandleSlotSyncMessage.

+        ereport(LOG,
+                errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+
+        proc_exit(0);
+    }
+    else
+    {
+        /*
+         * For the backend executing SQL function
+         * pg_sync_replication_slots().
+         */
+        ereport(ERROR,
+                errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("replication slot synchronization will stop because promotion is triggered"));

The log messages say "will stop", but since sync hasn't started yet,
"will not start" seems clearer here. For example, "replication slot
synchronization worker will not start because promotion was triggered"
and "replication slot synchronization will not start because promotion was
triggered". Thoughts?
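
As an illustration of the "will not start" case, the startup-time check can
be sketched as below. This is a single-threaded model with hypothetical
demo_* names, not the SlotSyncCtx code from the patch. In particular it
omits the locking that real shared-memory state would need, and the
"already syncing" branch is an assumption added to make the model complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_INVALID_PID (-1)

/* Hypothetical stand-in for the shared slot-sync control struct. */
typedef struct DemoSlotSyncCtx
{
    bool stop_signaled; /* set once promotion has been triggered */
    bool syncing;       /* true while some process is synchronizing */
    int  pid;           /* pid of the synchronizing process, if any */
} DemoSlotSyncCtx;

typedef enum
{
    DEMO_START_OK,
    DEMO_START_REFUSED_PROMOTION, /* "will not start because promotion ..." */
    DEMO_START_REFUSED_BUSY
} DemoStartResult;

/* Called by a worker or API backend before it begins synchronizing. */
static DemoStartResult demo_try_start_sync(DemoSlotSyncCtx *ctx, int my_pid)
{
    assert(ctx != NULL);

    /* Promotion already requested: refuse before doing any sync work. */
    if (ctx->stop_signaled)
        return DEMO_START_REFUSED_PROMOTION;

    if (ctx->syncing)
        return DEMO_START_REFUSED_BUSY;

    ctx->syncing = true;
    ctx->pid = my_pid;
    return DEMO_START_OK;
}

/*
 * Called on promotion: mark the stop request and report which process,
 * if any, must be signaled. Returns DEMO_INVALID_PID when nobody is syncing.
 */
static int demo_signal_stop(DemoSlotSyncCtx *ctx)
{
    assert(ctx != NULL);

    ctx->stop_signaled = true;
    return ctx->syncing ? ctx->pid : DEMO_INVALID_PID;
}
```

Because the flag is checked before any synchronization begins, a process
that starts after promotion was requested never reaches the "stop" path,
which is why a "will not start" wording fits that message.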

Regards,

--
Fujii Masao

#21Nisha Moond
nisha.moond412@gmail.com
In reply to: Fujii Masao (#20)
#22shveta malik
shveta.malik@gmail.com
In reply to: Nisha Moond (#21)
#23Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#21)
#24Nisha Moond
nisha.moond412@gmail.com
In reply to: shveta malik (#22)
#25Nisha Moond
nisha.moond412@gmail.com
In reply to: Fujii Masao (#23)
#26Zhijie Hou (Fujitsu)
houzj.fnst@fujitsu.com
In reply to: Nisha Moond (#24)
#27shveta malik
shveta.malik@gmail.com
In reply to: Nisha Moond (#25)
#28Fujii Masao
masao.fujii@gmail.com
In reply to: shveta malik (#27)
#29Nisha Moond
nisha.moond412@gmail.com
In reply to: Fujii Masao (#28)
#30Nisha Moond
nisha.moond412@gmail.com
In reply to: shveta malik (#27)
#31Nisha Moond
nisha.moond412@gmail.com
In reply to: Zhijie Hou (Fujitsu) (#26)
#32Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#31)
#33Nisha Moond
nisha.moond412@gmail.com
In reply to: Fujii Masao (#32)
#34Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#33)
#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#34)
#36Fujii Masao
masao.fujii@gmail.com
In reply to: Amit Kapila (#35)
#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#36)
#38Fujii Masao
masao.fujii@gmail.com
In reply to: Amit Kapila (#37)
#39Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#38)
#40Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#39)
#41Nisha Moond
nisha.moond412@gmail.com
In reply to: Fujii Masao (#40)
#42Fujii Masao
masao.fujii@gmail.com
In reply to: Nisha Moond (#41)