Startup process deadlock: WaitForProcSignalBarriers vs aux process
Hi,
Over in the Hackers Discord, Melany pointed out [0] a random failure
of tests on the master branch, which seemed to have nothing to do with
the commit they failed on.
The logs [1] indicate that the startup process was waiting for another
process to process a signal barrier. While there isn't enough
information available to conclusively pin the blame on any specific
component, I think I have a good understanding of what happened:
2026-04-21 15:10:50.065 UTC startup[19246] LOG: still waiting for backend with PID 19244 to accept ProcSignalBarrier
Here, the startup process is waiting for process with PID 19244 to
handle a signal barrier. It is not entirely clear which process it's
waiting on, but we can deduce this:
In the startup sequence, the postmaster creates these child processes,
in short order:
1. checkpointer
2. bgwriter
3. startup
It is therefore likely that the startup process' PID is just two
larger than that of the checkpointer; and therefore, it's likely the
startup process is waiting for the checkpointer process.
# Which code in the Startup process is waiting?
I think it's this: The startup process logged that it started with a
clean shutdown, so no recovery code should be executed. This excludes
most possible call sites of WaitForProcSignalBarriers, except this
one: The startup process calls StartupXLOG ->
UpdateLogicalDecodingStatusEndOfRecovery(), which then calls
    if (IsUnderPostmaster)
        WaitForProcSignalBarrier(
            EmitProcSignalBarrier(
                PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO
            ));
# Why doesn't the Checkpointer process acknowledge the ProcSignalBarrier?
If the PSB is emitted (and signaled to checkpointer) before the
checkpointer has registered its SIGUSR1 handler, then the checkpointer
won't receive the notice to check its procsignal slots, it won't
notice the updated procsignal flags, and it won't process the PSB; not
until it receives a new SIGUSR1.
Signals are sent to all processes that have their procsignal pss_pid
set, which is true for every process which has called ProcSignalInit,
which for the checkpointer (like other aux processes) happens in
AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
processes) calls AuxiliaryProcessMainCommon before registering its
signal handlers, creating a small window in time where signals are
sent, but not handled.
# Is this new?
The issue of registering signal handlers only after opening the
process up to receiving signals has existed for a long time (unchanged
since at least 2022); only the ProcSignalBarrier in the startup
process is new: UpdateLogicalDecodingStatusEndOfRecovery was added
with Sawada-san's 67c20979.
# A solution?
I don't have one right now.
I was thinking in the direction of having a compile-time aux process
signal handlers array per process type, which is read by
AuxiliaryProcessMainCommon() to register the signal handlers ahead of
ProcSignalInit(), but I've not yet looked at the exact implications,
nor analyzed whether that's actually safe. It would move some
duplicative code patterns into compile-time structs, but that's not
necessarily a universal good.
Kind regards,
Matthias van de Meent
[0]: https://discord.com/channels/1258108670710124574/1346208113132568646/1496179622591598592
[1]: https://api.cirrus-ci.com/v1/artifact/task/6239099197063168/log/contrib/auto_explain/log/postmaster.log
Hi,
On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
If the PSB is emitted (and signaled to checkpointer) before the
checkpointer has registered its SIGUSR1 handler, then the checkpointer
won't receive the notice to check its procsignal slots, it won't
notice the updated procsignal flags, and it won't process the PSB; not
until it receives a new SIGUSR1.
Signals are sent to all processes that have their procsignal pss_pid
set, which is true for every process which has called ProcSignalInit,
which for the checkpointer (like other aux processes) happens in
AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
processes) calls AuxiliaryProcessMainCommon before registering its
signal handlers, creating a small window in time where signals are
sent, but not handled.
Hm. Have we confirmed this happens?
CheckpointerMain() is called with all signals masked, so it should be ok for
the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
long as it happens before
/*
* Unblock signals (they were blocked when the postmaster forked us)
*/
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
as the signal delivery should be held until after unblocking signals.
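The kernel behaviour this argument relies on can be checked outside PostgreSQL. What follows is a minimal standalone POSIX sketch, not PostgreSQL code: the function name pending_signal_survives_late_handler() and the handler are made up for illustration, and it assumes a single-threaded process. It blocks SIGUSR1, raises it before any handler exists, installs the handler afterwards, and only then unblocks.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Set from the handler; sig_atomic_t is safe to write in a signal handler. */
static volatile sig_atomic_t handler_ran = 0;

static void
on_sigusr1(int signo)
{
    (void) signo;
    handler_ran = 1;
}

/*
 * Hypothetical demo: returns true if a SIGUSR1 raised while blocked, and
 * before any handler was installed, still reaches the late handler.
 */
bool
pending_signal_survives_late_handler(void)
{
    sigset_t    usr1;
    struct sigaction sa;

    handler_ran = 0;

    /* Like a freshly forked postmaster child: the signal starts out blocked. */
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &usr1, NULL) != 0)
        return false;

    /* The signal arrives before any handler exists; it only becomes pending. */
    if (raise(SIGUSR1) != 0 || handler_ran != 0)
        return false;

    /* The handler is installed "late", but still before unblocking. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return false;

    /* Unblocking delivers the pending signal to the handler installed above. */
    if (sigprocmask(SIG_UNBLOCK, &usr1, NULL) != 0)
        return false;

    return handler_ran == 1;
}
```

If this returns true, a signal sent during the window between ProcSignalInit() and handler registration is not lost merely because the handler was installed later, provided it was blocked the whole time.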
# A solution?
I don't have one right now.
I was thinking in the direction of having a compile-time aux process
signal handlers array per process type, which is read by
AuxiliaryProcessMainCommon() to register the signal handlers ahead of
ProcSignalInit(), but I've not yet looked at the exact implications,
nor analyzed whether that's actually safe. It would move some
duplicative code patterns into compile-time structs, but that's not
necessarily a universal good.
We really should move setup of most signal handlers into
AuxiliaryProcessMainCommon(). While there are some special cases (like
checkpointer not wanting to handle SIGTERM), that can be configured after
AuxiliaryProcessMainCommon(), as signals will still be blocked.
Greetings,
Andres Freund
On Wed, Apr 22, 2026 at 12:05 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
If the PSB is emitted (and signaled to checkpointer) before the
checkpointer has registered its SIGUSR1 handler, then the checkpointer
won't receive the notice to check its procsignal slots, it won't
notice the updated procsignal flags, and it won't process the PSB; not
until it receives a new SIGUSR1.
Signals are sent to all processes that have their procsignal pss_pid
set, which is true for every process which has called ProcSignalInit,
which for the checkpointer (like other aux processes) happens in
AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
processes) calls AuxiliaryProcessMainCommon before registering its
signal handlers, creating a small window in time where signals are
sent, but not handled.
Hm. Have we confirmed this happens?
CheckpointerMain() is called with all signals masked, so it should be ok for
the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
long as it happens before
/*
* Unblock signals (they were blocked when the postmaster forked us)
*/
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
as the signal delivery should be held until after unblocking signals.
Right. The postmaster blocks all signals before starting child processes,
as the following comment explains:
/*
* We start postmaster children with signals blocked. This allows them to
* install their own handlers before unblocking, to avoid races where they
* might run the postmaster's handler and miss an important control
* signal. With more analysis this could potentially be relaxed.
*/
sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
Investigating the issue, I found a race condition between procsignal
initialization and the emission of a signal barrier that could be
the cause of this issue. Imagine the following scenario:
1. In ProcSignalInit(), the checkpointer initializes its
slot->pss_barrierGeneration with the global generation.
2. In EmitProcSignalBarrier(), the startup process checks the checkpointer's
procsignal slot but skips emitting the signal because slot->pss_pid is
still 0. This can happen even though the checkpointer holds a spinlock
on its slot during initialization, because the first pid check is
done without acquiring the spinlock.
3. The checkpointer stores its PID in slot->pss_pid and releases the spinlock.
4. In WaitForProcSignalBarrier(), the startup process checks the
checkpointer's procsignal slot, which already has its
pss_barrierGeneration initialized, and waits for it to be updated. However, the
checkpointer never updates its barrier generation because it never gets
the signal.
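The four steps can be replayed deterministically in a toy model. This is only a sketch under stated assumptions, not the real procsignal.c: all names (SimSlot, SimState, the sim_* functions) are hypothetical, there is a single slot, no locking or atomics, and an unused slot is modelled with a generation of UINT64_MAX so that it never blocks a waiter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-slot model of the shared procsignal state. */
typedef struct SimSlot
{
    uint32_t    pid;            /* 0 = slot not yet published */
    uint64_t    generation;     /* barrier generation the process has absorbed */
    bool        signaled;       /* stand-in for "SIGUSR1 was sent" */
} SimSlot;

typedef struct SimState
{
    uint64_t    global_generation;
    SimSlot     slot;
} SimState;

static void
sim_reset(SimState *st)
{
    st->global_generation = 5;          /* arbitrary starting value */
    st->slot.pid = 0;
    st->slot.generation = UINT64_MAX;   /* assumption: unused slots never block */
    st->slot.signaled = false;
}

/* Step 1: the initializing process copies the global generation. */
static void
sim_copy_generation(SimState *st)
{
    st->slot.generation = st->global_generation;
}

/* Step 3: the initializing process publishes its PID. */
static void
sim_publish_pid(SimState *st, uint32_t pid)
{
    st->slot.pid = pid;
}

/* Step 2: the emitter bumps the generation and signals published slots only. */
static uint64_t
sim_emit_barrier(SimState *st)
{
    uint64_t    gen = ++st->global_generation;

    if (st->slot.pid != 0)
        st->slot.signaled = true;
    return gen;
}

/* The target absorbs the barrier only if it was actually signaled. */
static void
sim_absorb_if_signaled(SimState *st)
{
    if (st->slot.signaled)
    {
        st->slot.generation = st->global_generation;
        st->slot.signaled = false;
    }
}

/* Step 4: the waiter blocks while the slot's generation is behind. */
static bool
sim_waiter_would_block(const SimState *st, uint64_t gen)
{
    return st->slot.generation < gen;
}

/* Interleaving 1, 2, 3, 4 from the scenario: the waiter is stuck. */
bool
sim_racy_interleaving_blocks(void)
{
    SimState    st;
    uint64_t    gen;

    sim_reset(&st);
    sim_copy_generation(&st);           /* step 1 */
    gen = sim_emit_barrier(&st);        /* step 2: pid still 0, no signal */
    sim_publish_pid(&st, 19244);        /* step 3 */
    sim_absorb_if_signaled(&st);        /* nothing to absorb */
    return sim_waiter_would_block(&st, gen);    /* step 4 */
}

/* Initialization finishes before the barrier is emitted: no hang. */
bool
sim_serialized_init_completes(void)
{
    SimState    st;
    uint64_t    gen;

    sim_reset(&st);
    sim_copy_generation(&st);
    sim_publish_pid(&st, 19244);
    gen = sim_emit_barrier(&st);        /* pid is visible, signal is sent */
    sim_absorb_if_signaled(&st);
    return !sim_waiter_would_block(&st, gen);
}
```

In this model the hang needs nothing more than the emitter observing pid == 0 after the generation has already been copied; no signal is ever lost by the kernel.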
Another, similar issue I found is that child processes could miss
the PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO signal during
initialization and end up in an inconsistent state, because
InitializeProcessXLogLogicalInfo() is called (in BaseInit()) before
ProcSignalInit(). If the startup process emits the signal to a process that is
between these two steps, that process will not reflect the latest
XLogLogicalInfo state. I think we should move
InitializeProcessXLogLogicalInfo() after ProcSignalInit(), as we
do for InitLocalDataChecksumState().
I've attached a patch fixing the latter problem, as the fix is
straightforward.
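To illustrate why the order of the two initialization steps matters, here is a toy model of a process-local cache of shared state. It is a sketch only and does not reflect the attached patch or the real functions: ToyShared, ToyLocal and the toy_* names are hypothetical, and a plain flag stands in for signal delivery.

```c
#include <stdbool.h>

/* Hypothetical shared state plus the registration status of one process. */
typedef struct ToyShared
{
    bool        logical_info;   /* shared value other processes may change */
    bool        registered;     /* has the process published its slot? */
    bool        signaled;       /* stand-in for a barrier signal being sent */
} ToyShared;

/* Hypothetical process-local cache of the shared value. */
typedef struct ToyLocal
{
    bool        logical_info;
} ToyLocal;

/* Copy the shared value into the local cache (the "initialize" step). */
static void
toy_init_local(ToyLocal *local, const ToyShared *shared)
{
    local->logical_info = shared->logical_info;
}

/* Publish the slot so that later barriers reach this process. */
static void
toy_register(ToyShared *shared)
{
    shared->registered = true;
}

/* Another process changes the shared value and signals registered processes. */
static void
toy_emit_update(ToyShared *shared, bool newval)
{
    shared->logical_info = newval;
    if (shared->registered)
        shared->signaled = true;
}

/* The process refreshes its cache only when it was signaled. */
static void
toy_absorb(ToyLocal *local, ToyShared *shared)
{
    if (shared->signaled)
    {
        local->logical_info = shared->logical_info;
        shared->signaled = false;
    }
}

/* Cache initialized before registration; update lands in between: stale. */
bool
toy_init_before_register_is_stale(void)
{
    ToyShared   shared = {false, false, false};
    ToyLocal    local;

    toy_init_local(&local, &shared);
    toy_emit_update(&shared, true);     /* not registered yet: no signal */
    toy_register(&shared);
    toy_absorb(&local, &shared);
    return local.logical_info != shared.logical_info;
}

/* Registration first; the same update is either read directly or signaled. */
bool
toy_init_after_register_is_consistent(void)
{
    ToyShared   shared = {false, false, false};
    ToyLocal    local;

    toy_register(&shared);
    toy_emit_update(&shared, true);     /* registered: signal is sent */
    toy_init_local(&local, &shared);
    toy_absorb(&local, &shared);
    return local.logical_info == shared.logical_info;
}
```

With registration first, an update that arrives before the cache is filled is picked up by the initial read, and one that arrives afterwards is picked up through the signal, so the cache cannot stay stale in this model.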
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-Fix-race-condition-in-XLogLogicalInfo-and-ProcSignal.patch (text/x-patch, +24 -14)
Hello Sawada-san,
24.04.2026 20:52, Masahiko Sawada wrote:
Right. The postmaster blocks all signals before starting child process
as the following comment explains:
/*
* We start postmaster children with signals blocked. This allows them to
* install their own handlers before unblocking, to avoid races where they
* might run the postmaster's handler and miss an important control
* signal. With more analysis this could potentially be relaxed.
*/
sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
Investigating the issue, I found there is a race condition between the
procsignal initialization and emitting signal barrier that could be
the cause of this issue. Imagine the following scenario:
1. In ProcSignalInit(), the checkpointer initializes its
slot->pss_barrierGeneration with the global generation.
2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
procsignal slot but it skips emitting the signal as slot->pss_pid is
still 0. It can happen even though the checkpointer holds a spinlock
on its slot during the initialization because the first pid check is
done without a spinlock acquisition.
3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
4. In WaitForProcSignalBarrier(), the startup checks the
checkpointer's procsignal slot that has already initialized the
pss_barrierGeneration, and waits for it to be updated. However, the
checkpointer never updates its barrier generation as it doesn't get
the signal.
Thank you for the investigation and explanation of the issue!
I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
at that time. Now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
     if (cancel_key_len > 0)
         memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
     slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
     pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
just running `meson test test_oat_hooks_*/regress` with the test multiplied x30:
26/30 test_oat_hooks_28 - postgresql:test_oat_hooks_28/regress OK 1.28s 2 subtests passed
27/30 test_oat_hooks_30 - postgresql:test_oat_hooks_30/regress OK 1.25s 2 subtests passed
28/30 test_oat_hooks_2 - postgresql:test_oat_hooks_2/regress ERROR 62.49s exit status 2
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-16.0.1, 64-bit
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG: listening on Unix socket "/tmp/pg_regress-QdhMPt/.s.PGSQL.40086"
2026-04-27 17:34:44.302 UTC startup[1578114] LOG: database system was shut down at 2026-04-27 17:34:44 UTC
2026-04-27 17:34:44.325 UTC dead-end client backend[1578133] [unknown] FATAL: the database system is starting up
...
2026-04-27 17:34:49.274 UTC dead-end client backend[1578643] [unknown] FATAL: the database system is starting up
2026-04-27 17:34:49.308 UTC startup[1578114] LOG: still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:34:49.325 UTC dead-end client backend[1578645] [unknown] FATAL: the database system is starting up
...
2026-04-27 17:35:44.332 UTC dead-end client backend[1582376] [unknown] FATAL: the database system is starting up
2026-04-27 17:35:44.351 UTC startup[1578114] LOG: still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:35:44.383 UTC dead-end client backend[1582379] [unknown] FATAL: the database system is starting up
Best regards,
Alexander
On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Sawada-san,
24.04.2026 20:52, Masahiko Sawada wrote:
Right. The postmaster blocks all signals before starting child process
as the following comment explains:
/*
* We start postmaster children with signals blocked. This allows them to
* install their own handlers before unblocking, to avoid races where they
* might run the postmaster's handler and miss an important control
* signal. With more analysis this could potentially be relaxed.
*/
sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
Investigating the issue, I found there is a race condition between the
procsignal initialization and emitting signal barrier that could be
the cause of this issue. Imagine the following scenario:
1. In ProcSignalInit(), the checkpointer initializes its
slot->pss_barrierGeneration with the global generation.
2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
procsignal slot but it skips emitting the signal as slot->pss_pid is
still 0. It can happen even though the checkpointer holds a spinlock
on its slot during the initialization because the first pid check is
done without a spinlock acquisition.
3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
4. In WaitForProcSignalBarrier(), the startup checks the
checkpointer's procsignal slot that has already initialized the
pss_barrierGeneration, and waits for it to be updated. However, the
checkpointer never updates its barrier generation as it doesn't get
the signal.
Thank you for the investigation and explanation of the issue!
I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
that time. But now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
Thank you for testing this.
I've attached a patch to address the issue. I haven't verified it
across all versions yet, but I suspect it exists in the stable
branches as well. Previously, the issue rarely occurred because
EmitProcSignalBarrier() was only used for smgr invalidation. However,
now that we use signal barriers for online wal_level changes and
checksum status updates, this race condition is likely to be
encountered more frequently.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (text/x-patch, +10 -2)
On Wed, 22 Apr 2026 at 21:05, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
If the PSB is emitted (and signaled to checkpointer) before the
checkpointer has registered its SIGUSR1 handler, then the checkpointer
won't receive the notice to check its procsignal slots, it won't
notice the updated procsignal flags, and it won't process the PSB; not
until it receives a new SIGUSR1.
Signals are sent to all processes that have their procsignal pss_pid
set, which is true for every process which has called ProcSignalInit,
which for the checkpointer (like other aux processes) happens in
AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
processes) calls AuxiliaryProcessMainCommon before registering its
signal handlers, creating a small window in time where signals are
sent, but not handled.
Hm. Have we confirmed this happens?
CheckpointerMain() is called with all signals masked, so it should be ok for
the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
long as it happens before [...]
Yeah, that was a misidentification of the exact race that caused the issue.
On Tue, 28 Apr 2026 at 21:28, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Sawada-san,
24.04.2026 20:52, Masahiko Sawada wrote:
Right. The postmaster blocks all signals before starting child process
as the following comment explains:
/*
* We start postmaster children with signals blocked. This allows them to
* install their own handlers before unblocking, to avoid races where they
* might run the postmaster's handler and miss an important control
* signal. With more analysis this could potentially be relaxed.
*/
sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
Investigating the issue, I found there is a race condition between the
procsignal initialization and emitting signal barrier that could be
the cause of this issue. Imagine the following scenario:
Ah, that'd be it indeed. Thanks!
I've attached a patch to address the issue. I haven't verified it
across all versions yet, but I suspect it exists in the stable
branches as well. Previously, the issue rarely occurred because
EmitProcSignalBarrier() was only used for smgr invalidation. However,
now that we use signal barriers for online wal_level changes and
checksum status updates, this race condition is likely to be
encountered more frequently.
Yes, I think the boot process with the xlog_logical_info barrier is
more likely to hit this issue; as indicated by two known detected
cases in various CI jobs; though it could also be that the lockup of
the new barrier is just exceptionally bad for system stability.
As for the patches:
v1-0001 -- LGTM.
0001 (upthread): LGTM, but I'd also suggest adding some code to make
sure that we're actually receiving procsignals by the time we
initialize the Logical/Checksum subsystems that need to process shared
state changes by responding to procsignals; see the attached patch. smgr's
procsignal doesn't really depend on shared memory state, so I've kept
that out of my patch.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
Attachments:
v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch (application/octet-stream, +15 -1)
Dear Sawada-san,
28.04.2026 22:27, Masahiko Sawada wrote:
On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
that time. But now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
Thank you for testing this.
I've attached a patch to address the issue. I haven't verified it
across all versions yet, but I suspect it exists in the stable
branches as well...
Thank you for the fix! It works for me too.
I was wondering why that failure is the only one of its kind on the buildfarm
(in the last two years, at least), so I tried to reproduce it on
REL_18_STABLE... and failed.
Then I bisected it on the master branch and found (your) commit that
introduced this behavior: 67c20979c from 2025-12-23.
Best regards,
Alexander
On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Dear Sawada-san,
28.04.2026 22:27, Masahiko Sawada wrote:
On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
that time. But now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
Thank you for testing this.
I've attached a patch to address the issue. I haven't verified it
across all versions yet, but I suspect it exists in the stable
branches as well...
Thank you for the fix! It works for me too.
I was wondering why is that failure the only one of this kind on buildfarm
(in last two years, at least), so I've tried to reproduce it on
REL_18_STABLE... and failed.
Then I've bisected it on the master branch and found (your) commit that
introduced this behavior: 67c20979c from 2025-12-23.
I've confirmed that this race condition is present from v15 through
master. In v14, we have the procsignal barrier code but don't use
it anywhere. In v18 and older, it could happen when executing DROP
DATABASE, DROP TABLESPACE, etc., whereas on master it could happen
in more cases because we use procsignal barriers in more places. In any
case, if a process emits a signal barrier while another process is
between the initialization of slot->pss_barrierGeneration and
the initialization of slot->pss_pid, the subsequent
WaitForProcSignalBarrier() ends up waiting for that process forever.
So I think the patch should be backpatched to v15. Please review these
patches.
FYI, I found that we had a similar report [1] last year; I'm not sure
it hit the exact same issue, though.
Regards,
[1]: /messages/by-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2_15-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (text/x-patch, +10 -4)
v2_17-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (text/x-patch, +10 -4)
v2_18-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (text/x-patch, +10 -2)
v2_16-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (text/x-patch, +10 -4)
v2_master-0001-Fix-race-between-ProcSignalInit-and-EmitPr.patch (text/x-patch, +10 -2)
Dear Sawada-san,
01.05.2026 01:08, Masahiko Sawada wrote:
On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin<exclusion@gmail.com> wrote:
I was wondering why is that failure the only one of this kind on buildfarm
(in last two years, at least), so I've tried to reproduce it on
REL_18_STABLE... and failed.
Then I've bisected it on the master branch and found (your) commit that
introduced this behavior: 67c20979c from 2025-12-23.
I've confirmed that this race condition issue is present from v15 to
the master. In v14, we have the procsignal barrier code but don't use
it anywhere. In v18 or older, it could happen when executing DROP
DATABASE, DROP TABLESPACE etc, whereas in the master, it could happen
in more cases as we're using procsignal barrier more places. In any
case, if a process emits a signal barrier when another process is
between the initialization of slot->pss_barrierGeneration and
slot->pss_pid initialization, the subsequent
WaitForProcSignalBarrier() ends up waiting for that process forever.
So I think the patch should be backpatched to v15. Please review these
patches.
Yes, you're right -- it's not reproduced on REL_18_STABLE with
test_oat_hooks, which simply starts a postgres node (as many other tests do),
but when I ran the full test suite with the sleep inserted before
setting pss_pid, I discovered the following vulnerable tests:
030_stats_cleanup_replica_standby.log
2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
033_replay_tsp_drops_standby2_FILE_COPY.log
2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
040_standby_failover_slots_sync_publisher.log
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
002_compare_backups_pitr1.log
2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
I've tried my repro with 033_replay_tsp_drops and it really fails on
REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
FYI I found that we had a similar report[1] last year, I'm not sure
it hit the exact same issue, though.
Regards,
[1] /messages/by-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
Yeah, and probably this one:
/messages/by-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB@yandex-team.ru
By the way, mamba produced the same failure just yesterday:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
# Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
waiting for server to
start...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
stopped waiting
pg_ctl: server did not start in time
004_restart_primary.log
2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
...
2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
The proposed patches make the test pass reliably for me in all affected
branches. Thank you for working on this!
Best regards,
Alexander
On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Dear Sawada-san,
01.05.2026 01:08, Masahiko Sawada wrote:
On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I was wondering why is that failure the only one of this kind on buildfarm
(in last two years, at least), so I've tried to reproduce it on
REL_18_STABLE... and failed.
Then I've bisected it on the master branch and found (your) commit that
introduced this behavior: 67c20979c from 2025-12-23.
I've confirmed that this race condition issue is present from v15 to
the master. In v14, we have the procsignal barrier code but don't use
it anywhere. In v18 or older, it could happen when executing DROP
DATABASE, DROP TABLESPACE etc, whereas in the master, it could happen
in more cases as we're using procsignal barrier more places. In any
case, if a process emits a signal barrier when another process is
between the initialization of slot->pss_barrierGeneration and
slot->pss_pid initialization, the subsequent
WaitForProcSignalBarrier() ends up waiting for that process forever.
So I think the patch should be backpatched to v15. Please review these
patches.
Yes, you're right -- it's not reproduced on REL_18_STABLE with
test_oat_hooks, which simply starts postgres node (as many other tests),
but when I tried the full test suite with the sleep inserted before
setting pss_pid, I discovered the following vulnerable tests:
030_stats_cleanup_replica_standby.log
2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
033_replay_tsp_drops_standby2_FILE_COPY.log
2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
040_standby_failover_slots_sync_publisher.log
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
002_compare_backups_pitr1.log
2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
I've tried my repro with 033_replay_tsp_drops and it really fails on
REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
FYI I found that we had a similar report[1] last year, I'm not sure
it hit the exact same issue, though.
Regards,
[1] /messages/by-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com
Yeah, and probably this one:
/messages/by-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB@yandex-team.ru
By the way, mamba produced the same failure just yesterday:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
# Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
pg_ctl: server did not start in time
004_restart_primary.log
2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
...
2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
The proposed patches make the test pass reliably for me in all affected
branches. Thank you for working on this!
Thank you for checking this issue on stable branches too!
Considering that this issue is not very visible in practice and that we're
going to release new minor versions next week, I'm planning to push
these fixes to master and the back branches after the minor releases. That
way, we can fix the issue on master relatively soon and have
enough time to verify that the fix works well on the back branches.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com