Disable startup progress timeout during standby WAL replay
Hi,
While working on a patch discussed in [1], I looked into how
log_startup_progress_interval behaves during recovery. During that
investigation, I noticed the following comment in
EnableStandbyMode():
/*
* To avoid server log bloat, we don't report recovery progress in a
* standby as it will always be in recovery unless promoted. We disable
* startup progress timeout in standby mode to avoid calling
* startup_progress_timeout_handler() unnecessarily.
*/
So in standby mode, we intentionally suppress recovery progress
logging during WAL replay, because otherwise a standby could emit
progress messages indefinitely until promotion.
However, some startup operations executed afterward, such as
ResetUnloggedRelations(), can re-enable the timeout. As a result, the
startup progress timeout can remain active during standby WAL replay,
which contradicts the intent described in the comment above.
This does not seem to cause any user-visible issue, because standby
WAL replay does not emit recovery progress messages anyway. However,
the timeout still causes unnecessary periodic wakeups during
standby WAL replay.
The attached patch disables the startup progress timeout again just
before entering WAL replay in standby mode. This preserves progress
reporting for earlier startup operations while avoiding unnecessary
wakeups during standby replay. The patch also slightly clarifies the
documentation for log_startup_progress_interval.
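To make the sequence concrete, here is a toy model in C. It is not PostgreSQL source: a single flag stands in for the startup progress timeout, the lower-case functions are simplified stand-ins for their real namesakes, and timeout_armed_at_standby_replay() and its apply_fix parameter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one flag stands in for the startup progress timeout. */
static bool timeout_armed = false;
static bool standby_mode = false;

/* Simplified stand-ins; the real functions do more than flip a flag. */
static void begin_startup_progress_phase(void) { timeout_armed = true; }
static void disable_startup_progress_timeout(void) { timeout_armed = false; }

/* Models EnableStandbyMode(): switch to standby and disable the timeout. */
static void enable_standby_mode(void)
{
    standby_mode = true;
    disable_startup_progress_timeout();
}

/* Models a later startup operation that reports its own progress. */
static void reset_unlogged_relations(void)
{
    begin_startup_progress_phase();
}

/* Returns whether the timeout is still armed when standby replay begins. */
bool timeout_armed_at_standby_replay(bool apply_fix)
{
    timeout_armed = false;
    standby_mode = false;

    enable_standby_mode();
    assert(!timeout_armed);        /* disabled, as the comment intends */

    reset_unlogged_relations();    /* re-arms the timeout */

    /* Proposed change: disable again just before entering WAL replay. */
    if (apply_fix && standby_mode)
        disable_startup_progress_timeout();

    return timeout_armed;
}
```

In this model, timeout_armed_at_standby_replay(false) returns true (the timeout leaks into replay), while timeout_armed_at_standby_replay(true) returns false.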
I see this as a small cleanup/improvement for master rather than a bug
fix requiring backpatching, since there is no visible behavioral issue
for users.
Thoughts?
Regards,
[1]: /messages/by-id/44c24dcf-5710-410f-b1b6-d10b315f3d51@postgrespro.ru
--
Fujii Masao
Attachments:
v1-0001-Disable-startup-progress-timeout-during-standby-W.patch (application/octet-stream, +15 -2)
Hello.
At Thu, 11 Jun 2026 09:13:45 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
However, some startup operations executed afterward, such as
ResetUnloggedRelations(), can re-enable the timeout. As a result, the
startup progress timeout can remain active during standby WAL replay,
which contradicts the intent described in the comment above.
Good catch.
The attached patch disables the startup progress timeout again just
before entering WAL replay in standby mode. This preserves progress
reporting for earlier startup operations while avoiding unnecessary
wakeups during standby replay. The patch also slightly clarifies the
documentation for log_startup_progress_interval.
I see this as a small cleanup/improvement for master rather than a bug
fix requiring backpatching, since there is no visible behavioral issue
for users.
I agree with that assessment.
I wonder whether we need to disable the startup progress timeout in
EnableStandbyMode() in the first place, if the intention is only to
suppress progress reporting during WAL replay on a standby.
It seems to me that the decision is more closely related to entering
the WAL replay phase than to enabling standby mode itself. Also, the
patch adds another StandbyMode check shortly before an existing one.
Wouldn't it be simpler to handle the standby case at the existing
check, like this?
     if (!StandbyMode)
         begin_startup_progress_phase();
+    else
+        disable_startup_progress_timeout();
If the timeout is no longer disabled in EnableStandbyMode(), this
would keep progress reporting for earlier startup phases, while making
it clearer in the code that the timeout is disabled when entering WAL
replay in standby mode.
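A rough sketch of this alternative, again as a toy model rather than PostgreSQL code. It assumes EnableStandbyMode() no longer touches the timeout and that the existing StandbyMode check gains the else branch. enter_wal_replay(), simulate_startup() and ReplayTimeoutState are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one flag stands in for the startup progress timeout. */
static bool timeout_armed = false;
static bool standby_mode = false;

static void begin_startup_progress_phase(void) { timeout_armed = true; }
static void disable_startup_progress_timeout(void) { timeout_armed = false; }

/* Variant under discussion: only switches mode, leaves the timeout alone. */
static void enable_standby_mode(void) { standby_mode = true; }

/* Models the existing check at the start of replay, plus the else branch. */
static void enter_wal_replay(void)
{
    if (!standby_mode)
        begin_startup_progress_phase();
    else
        disable_startup_progress_timeout();
}

typedef struct
{
    bool armed_before_replay;   /* during earlier startup operations */
    bool armed_during_replay;   /* once WAL replay has been entered */
} ReplayTimeoutState;

ReplayTimeoutState simulate_startup(bool standby)
{
    ReplayTimeoutState state;

    timeout_armed = false;
    standby_mode = false;

    if (standby)
        enable_standby_mode();
    assert(standby_mode == standby);

    begin_startup_progress_phase();     /* some earlier startup operation */
    state.armed_before_replay = timeout_armed;

    enter_wal_replay();
    state.armed_during_replay = timeout_armed;
    return state;
}
```

With standby set to true, the timeout stays armed for the earlier operation and is off during replay. With standby set to false, it stays armed throughout.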
Regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jun 11, 2026 at 11:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
I wonder whether we need to disable the startup progress timeout in
EnableStandbyMode() in the first place, if the intention is only to
suppress progress reporting during WAL replay on a standby.
There can be cases where WAL replay starts with StandbyMode == false
and EnableStandbyMode() is called later during WAL replay. For example,
*if I remember correctly*, that might happen when starting with
standby.signal but without backup_label. In such cases, we would still want
EnableStandbyMode() to disable the timeout.
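The late-switch case can be sketched as follows. This is a toy model under the assumption described above (replay begins with StandbyMode == false and standby mode is enabled part-way through). timeout_armed_after_late_switch() and its parameter are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns whether the timeout is still armed after standby mode is enabled
 * in the middle of WAL replay. The parameter selects whether the modelled
 * EnableStandbyMode() disables the timeout.
 */
bool timeout_armed_after_late_switch(bool enable_standby_mode_disables)
{
    bool timeout_armed = false;
    bool standby_mode = false;

    /* Replay begins outside standby mode, so progress reporting is armed. */
    if (!standby_mode)
        timeout_armed = true;

    /* ... some WAL is replayed, then standby mode is enabled ... */
    standby_mode = true;
    if (enable_standby_mode_disables)
        timeout_armed = false;

    assert(standby_mode);
    return timeout_armed;
}
```

If the modelled EnableStandbyMode() does not disable the timeout, nothing else in this path does, so it stays armed for the rest of standby replay.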
Wouldn't it be simpler to handle the standby case at the existing
check, like this?
     if (!StandbyMode)
         begin_startup_progress_phase();
+    else
+        disable_startup_progress_timeout();
I thought the same at first, too. But I just thought the timeout should not
be enabled even while reading the first WAL record, so I placed
the check before reading the first record.
Regards,
--
Fujii Masao
Hi,
On Wed, Jun 10, 2026 at 5:14 PM Fujii Masao <masao.fujii@gmail.com> wrote:
Hi,
While working on a patch discussed in [1], I looked into how
log_startup_progress_interval behaves during recovery. During that
investigation, I noticed the following comment in
EnableStandbyMode():
/*
* To avoid server log bloat, we don't report recovery progress in a
* standby as it will always be in recovery unless promoted. We disable
* startup progress timeout in standby mode to avoid calling
* startup_progress_timeout_handler() unnecessarily.
*/
So in standby mode, we intentionally suppress recovery progress
logging during WAL replay, because otherwise a standby could emit
progress messages indefinitely until promotion.
However, some startup operations executed afterward, such as
ResetUnloggedRelations(), can re-enable the timeout. As a result, the
startup progress timeout can remain active during standby WAL replay,
which contradicts the intent described in the comment above.
Nice catch!
Discussion for another thread:
I wish we had these progress reports emitted even for the startup
process, to help with post-hoc analysis of customer issues about
failover, slow replay, WAL growth on the primary, etc. However, I
agree the volume of log records could grow unboundedly on the replica
over its lifetime, because the same timeout parameter is being used at
different scales. On the primary, assuming startup times are slower,
one wants to know the recovery rate. On replicas, we surely want to
have this info, say, every 5 min (288 logs per day), 10 min (144 logs
per day), or 20 min (72 logs per day). I prefer to NOT add a new GUC
for this, but perhaps the same GUC log_startup_progress_interval could
log at different scales on the primary versus the replica.
--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com
Hello.
At Thu, 11 Jun 2026 13:39:22 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
On Thu, Jun 11, 2026 at 11:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
I wonder whether we need to disable the startup progress timeout in
EnableStandbyMode() in the first place, if the intention is only to
suppress progress reporting during WAL replay on a standby.
There can be cases where WAL replay starts with StandbyMode == false
and EnableStandbyMode() is called later during WAL replay. For example,
*if I remember correctly*, that might happen when starting with
standby.signal but without backup_label. In such cases, we would still want
EnableStandbyMode() to disable the timeout.
I see the point about the standby.signal case, where WAL replay may
already be in progress when EnableStandbyMode() is reached. In fact,
that example makes me think the opposite: it seems that
EnableStandbyMode() only needs to disable the timeout in this
particular case because replay has already started before it is
called.
The other call sites do not seem to share that property, and it looks
to me as though the timeout could still be legitimately needed there
until standby replay actually begins. That is the reason behind my
earlier comment questioning whether EnableStandbyMode() is the right
place to disable it.
Wouldn't it be simpler to handle the standby case at the existing
check, like this?
     if (!StandbyMode)
         begin_startup_progress_phase();
+    else
+        disable_startup_progress_timeout();
I thought the same at first, too. But I just thought the timeout should not
be enabled even while reading the first WAL record, so I placed
the check before reading the first record.
Similarly, if the timeout should be disabled before reading the first
WAL record, I wonder whether begin_startup_progress_phase() should
also be called at the same point.
That said, I agree that the proposed change should work as it is, and
this is ultimately a trade-off against the size of the fix. So I do
not object to this direction if others think it is the more practical
approach.
Regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jun 16, 2026 at 5:28 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
I see the point about the standby.signal case, where WAL replay may
already be in progress when EnableStandbyMode() is reached. In fact,
that example makes me think the opposite: it seems that
EnableStandbyMode() only needs to disable the timeout in this
particular case because replay has already started before it is
called.
The other call sites do not seem to share that property, and it looks
to me as though the timeout could still be legitimately needed there
until standby replay actually begins. That is the reason behind my
earlier comment questioning whether EnableStandbyMode() is the right
place to disable it.
Maybe I'm missing your point... Could you clarify what change you are
suggesting for EnableStandbyMode() itself?
Similarly, if the timeout should be disabled before reading the first
WAL record, I wonder whether begin_startup_progress_phase() should
also be called at the same point.
begin_startup_progress_phase() is currently called at the same point
where we emit the "redo starts at ..." log message, which seems like a
natural and consistent boundary for WAL replay progress reporting. So
it seems better not to change where begin_startup_progress_phase() is
called.
On second thought, for disable_startup_progress_timeout(), calling it
at the same place as begin_startup_progress_phase(), or before reading
the first WAL record, seems effectively equivalent from the
perspective of user-visible WAL replay progress logging in standby
mode.
The practical difference, as far as I can see, is only whether the
startup progress timeout can still fire while reading the first WAL
record. Reading that first record can potentially take some time, for
example while waiting for WAL to arrive or restoring it from archive.
If we want to suppress timeout activity during that period as well, it
seems better to call disable_startup_progress_timeout() before reading
the first WAL record.
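As a back-of-the-envelope illustration of that difference, the sketch below counts how often a periodic timeout would fire while the startup process waits for the first record. It is a simplified model, not PostgreSQL code: wakeups_while_waiting() is a hypothetical helper, time advances in whole seconds, and the timeout is assumed to have been left armed by an earlier startup operation.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Counts timeout expirations during a wait of wait_secs seconds for the
 * first WAL record, given a timeout that fires every interval_secs seconds.
 */
int wakeups_while_waiting(bool disable_before_read, int wait_secs,
                          int interval_secs)
{
    assert(interval_secs > 0 && wait_secs >= 0);

    /* Assumed armed by an earlier step unless disabled before the read. */
    bool armed = !disable_before_read;
    int wakeups = 0;

    for (int t = 1; t <= wait_secs; t++)
    {
        if (armed && t % interval_secs == 0)
            wakeups++;
    }
    return wakeups;
}
```

For example, with a 35-second wait and a 10-second interval (values chosen only for illustration), the model gives 3 wakeups when the timeout is disabled at the later point and 0 when it is disabled before the read.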
Regards,
--
Fujii Masao