recovery.signal not cleaned up when both signal files are present

Started by Nikolay Samokhvalovabout 1 month ago13 messages
Jump to latest
#1Nikolay Samokhvalov
samokhvalov@gmail.com

Hi hackers,

I observed a case when users who used "pgbackrest restore", not using
"--type=standby", which means that pgBackRest placed recover.signal,
and since they wanted this node to be a standby, then manually placed
standby.signal too, and configured primary_conninfo.

Postgres allows both recovery.signal and standby.signal to coexist –
no complaints, it starts, and gives standby.signal a precedence.

However, this might lead to a latent problem: imagine a standby gots
promoted and then goes through a subsequent failover cycle. In this
case, the orphaned recovery.signal causes the node to perform an
unexpected PITR recovery and self-promote to a new timeline instead of
remaining a standby. Which surprised the user a lot.

Exact sequence that leads to trouble (Reproduced on PostgreSQL 17.7
with pgBackRest 2.58.0):

1. Restore a backup (pgBackRest default creates `recovery.signal`)
2. Add `standby.signal` and `primary_conninfo` for streaming replication
3. Start as standby — works fine (`standby.signal` takes precedence)
4. Promote this standby to primary (e.g., switchover) —
`standby.signal` is removed, `recovery.signal` is NOT
5. Node runs as primary with `recovery.signal` still on disk
6. Node crashes or is stopped
7. pg_rewind + add `standby.signal` to rejoin as standby
8. Start — works as standby again, `recovery.signal` still present
9. Promote again (e.g., failback) — `standby.signal` removed,
`recovery.signal` still NOT removed
10. If the node later needs to rejoin as standby via pg_rewind
(without `standby.signal` yet), it finds `recovery.signal`,
performs PITR recovery, and self-promotes to a new timeline

I spent some time to understand this, and found in xlogrecovery.c:

if (stat(STANDBY_SIGNAL_FILE, &stat_buf) == 0)
{
/* ... */
standby_signal_file_found = true;
}
else if (stat(RECOVERY_SIGNAL_FILE, &stat_buf) == 0)
{
/* ... */
recovery_signal_file_found = true;
}

-- so the recovery.signal is not registered, Postgres doesn't know it exists.

Cleanup logic for both files in xlog.c looks independent:

if (endOfRecoveryInfo->standby_signal_file_found)
durable_unlink(STANDBY_SIGNAL_FILE, FATAL);

if (endOfRecoveryInfo->recovery_signal_file_found)
durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);

-- but it cleans up only what it knows. So, recovery.signal is not cleaned.

Concerns/questions:

1. I don't like the fact that recovery_signal_file_found is set to
false although the file is present -- this is hard to read and
troubleshoot...
2. The comment in xlog.c says "The comment there even says "Remove the
signal files out of the way, so that we don't accidentally re-enter
archive recovery mode in a subsequent crash" -- but `recovery.signal`
escapes this cleanup. Looks like what's happening was not expected by
design, is it correct conclusion?
3. It seems to me that having both files coexist is always a
misconfiguration -- there
is no use case where a node should be in both PITR and standby mode.
If it is so, maybe we should:
- at minimum, remove the orphaned recovery.signal when
standby.signal takes precedence (or at end of recovery)
- do not start if both files are present: consider it abnormal and
ask for explicit cleanup, so user (or tooling) could decide which file
needs to stay

thoughts?

Nik

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Nikolay Samokhvalov (#1)
Re: recovery.signal not cleaned up when both signal files are present

On Sat, Feb 7, 2026 at 5:41 AM Nikolay Samokhvalov <nik@postgres.ai> wrote:

Hi hackers,

I observed a case when users who used "pgbackrest restore", not using
"--type=standby", which means that pgBackRest placed recover.signal,
and since they wanted this node to be a standby, then manually placed
standby.signal too, and configured primary_conninfo.

Postgres allows both recovery.signal and standby.signal to coexist –
no complaints, it starts, and gives standby.signal a precedence.

However, this might lead to a latent problem: imagine a standby gots
promoted and then goes through a subsequent failover cycle. In this
case, the orphaned recovery.signal causes the node to perform an
unexpected PITR recovery and self-promote to a new timeline instead of
remaining a standby. Which surprised the user a lot.

Exact sequence that leads to trouble (Reproduced on PostgreSQL 17.7
with pgBackRest 2.58.0):

1. Restore a backup (pgBackRest default creates `recovery.signal`)
2. Add `standby.signal` and `primary_conninfo` for streaming replication
3. Start as standby — works fine (`standby.signal` takes precedence)
4. Promote this standby to primary (e.g., switchover) —
`standby.signal` is removed, `recovery.signal` is NOT
5. Node runs as primary with `recovery.signal` still on disk
6. Node crashes or is stopped
7. pg_rewind + add `standby.signal` to rejoin as standby
8. Start — works as standby again, `recovery.signal` still present
9. Promote again (e.g., failback) — `standby.signal` removed,
`recovery.signal` still NOT removed
10. If the node later needs to rejoin as standby via pg_rewind
(without `standby.signal` yet), it finds `recovery.signal`,
performs PITR recovery, and self-promotes to a new timeline

I spent some time to understand this, and found in xlogrecovery.c:

if (stat(STANDBY_SIGNAL_FILE, &stat_buf) == 0)
{
/* ... */
standby_signal_file_found = true;
}
else if (stat(RECOVERY_SIGNAL_FILE, &stat_buf) == 0)
{
/* ... */
recovery_signal_file_found = true;
}

-- so the recovery.signal is not registered, Postgres doesn't know it exists.

Cleanup logic for both files in xlog.c looks independent:

if (endOfRecoveryInfo->standby_signal_file_found)
durable_unlink(STANDBY_SIGNAL_FILE, FATAL);

if (endOfRecoveryInfo->recovery_signal_file_found)
durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);

-- but it cleans up only what it knows. So, recovery.signal is not cleaned.

Concerns/questions:

1. I don't like the fact that recovery_signal_file_found is set to
false although the file is present -- this is hard to read and
troubleshoot...
2. The comment in xlog.c says "The comment there even says "Remove the
signal files out of the way, so that we don't accidentally re-enter
archive recovery mode in a subsequent crash" -- but `recovery.signal`
escapes this cleanup. Looks like what's happening was not expected by
design, is it correct conclusion?
3. It seems to me that having both files coexist is always a
misconfiguration -- there
is no use case where a node should be in both PITR and standby mode.
If it is so, maybe we should:
- at minimum, remove the orphaned recovery.signal when
standby.signal takes precedence (or at end of recovery)
- do not start if both files are present: consider it abnormal and
ask for explicit cleanup, so user (or tooling) could decide which file
needs to stay

thoughts?

+1 on also cleaning up recovery.signal when both signal files are present.

The documentation states that standby.signal takes precedence if both
files exist,
and this configuration is not described as unacceptable. So, it doesn't seem ok
to prevent the server from starting in this case.

Regards,

--
Fujii Masao

#3Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#2)
Re: recovery.signal not cleaned up when both signal files are present

On Tue, Feb 10, 2026 at 12:41:48AM +0900, Fujii Masao wrote:

+1 on also cleaning up recovery.signal when both signal files are present.

The documentation states that standby.signal takes precedence if both
files exist,
and this configuration is not described as unacceptable. So, it doesn't seem ok
to prevent the server from starting in this case.

If both are present, startup should be OK and we should be in standby
mode. Like reported, it really sounds like a problem to me to enforce
unnecessary TLI jumps because a recovery.signal is still around after
a standby promotion. So, yes, removing it would be a good thing.
However I would argue against a backpatch as there is a risk of
slightly breaking existing recovery flows as well. Doing such a
change like that on HEAD is OK. This area of the code has always been
really sensitive to deal with in stable branches, particularly slight
changes in recovery behavior that could damage deployments (aka
monitoring) after a minor version upgrade.
--
Michael

#4Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#3)
Re: recovery.signal not cleaned up when both signal files are present

On Tue, Feb 10, 2026 at 8:37 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Feb 10, 2026 at 12:41:48AM +0900, Fujii Masao wrote:

+1 on also cleaning up recovery.signal when both signal files are present.

The documentation states that standby.signal takes precedence if both
files exist,
and this configuration is not described as unacceptable. So, it doesn't seem ok
to prevent the server from starting in this case.

If both are present, startup should be OK and we should be in standby
mode. Like reported, it really sounds like a problem to me to enforce
unnecessary TLI jumps because a recovery.signal is still around after
a standby promotion. So, yes, removing it would be a good thing.
However I would argue against a backpatch as there is a risk of
slightly breaking existing recovery flows as well. Doing such a
change like that on HEAD is OK. This area of the code has always been
really sensitive to deal with in stable branches, particularly slight
changes in recovery behavior that could damage deployments (aka
monitoring) after a minor version upgrade.

+1 to apply this change only to the master branch. Patch attached.

Regards,

--
Fujii Masao

Attachments:

v1-0001-Remove-recovery.signal-at-recovery-end-when-both-.patchapplication/octet-stream; name=v1-0001-Remove-recovery.signal-at-recovery-end-when-both-.patchDownload+7-5
#5Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#4)
Re: recovery.signal not cleaned up when both signal files are present

On Tue, Feb 10, 2026 at 12:26:36PM +0900, Fujii Masao wrote:

+1 to apply this change only to the master branch. Patch attached.

It looks like something we should have a test for, at least..

+    /*
+     *
+     * If both signal files are present, standby signal file takes precedence.
+     * If neither is present then we won't enter archive recovery.
+     */

This comment's format is incorrect.
--
Michael

#6David Steele
david@pgmasters.net
In reply to: Michael Paquier (#5)
Re: recovery.signal not cleaned up when both signal files are present

On 2/10/26 11:52, Michael Paquier wrote:

On Tue, Feb 10, 2026 at 12:26:36PM +0900, Fujii Masao wrote:

+1 to apply this change only to the master branch. Patch attached.

It looks like something we should have a test for, at least..

+1 for a test.

+    /*
+     *
+     * If both signal files are present, standby signal file takes precedence.
+     * If neither is present then we won't enter archive recovery.
+     */

This comment's format is incorrect.

Other than the comment the patch looks good overall.

Reluctantly I have to agree to not back patch this. I'm not sure how
this change would break existing recovery processes but experience tells
me that it probably would.

Instead -- I wonder if we could add a warning in the back branches?

Regards,
-David

#7Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#6)
Re: recovery.signal not cleaned up when both signal files are present

On Fri, Feb 13, 2026 at 12:55:15AM +0000, David Steele wrote:

Reluctantly I have to agree to not back patch this. I'm not sure how this
change would break existing recovery processes but experience tells me that
it probably would.

Instead -- I wonder if we could add a warning in the back branches?

I am not convinced that going down to that is really necessary based
on the lack of complaints, and it could even be qualified as
disturbing for existing cases as well? Let's leave that as a
HEAD-only change.
--
Michael

#8Fujii Masao
masao.fujii@gmail.com
In reply to: David Steele (#6)
Re: recovery.signal not cleaned up when both signal files are present

On Fri, Feb 13, 2026 at 9:55 AM David Steele <david@pgbackrest.org> wrote:

On 2/10/26 11:52, Michael Paquier wrote:

On Tue, Feb 10, 2026 at 12:26:36PM +0900, Fujii Masao wrote:

+1 to apply this change only to the master branch. Patch attached.

It looks like something we should have a test for, at least..

+1 for a test.

Yeah, so I've added the test as suggested. The updated patch is attached.

+    /*
+     *
+     * If both signal files are present, standby signal file takes precedence.
+     * If neither is present then we won't enter archive recovery.
+     */

This comment's format is incorrect.

Other than the comment the patch looks good overall.

Fixed. Thanks all for the review!

Regards,

--
Fujii Masao

Attachments:

v2-0001-Remove-recovery.signal-at-recovery-end-when-both-.patchapplication/octet-stream; name=v2-0001-Remove-recovery.signal-at-recovery-end-when-both-.patchDownload+29-6
#9Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#8)
Re: recovery.signal not cleaned up when both signal files are present

On Fri, Feb 13, 2026 at 03:05:45PM +0900, Fujii Masao wrote:

Yeah, so I've added the test as suggested. The updated patch is attached.

What's the point in having the check for the files in data_dir? The
second one for standby2 should be enough as this is to test only
readRecoverySignalFile().

Except for this nit, that's OK by me.
--
Michael

#10Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#9)
Re: recovery.signal not cleaned up when both signal files are present

On Fri, Feb 13, 2026 at 3:18 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Feb 13, 2026 at 03:05:45PM +0900, Fujii Masao wrote:

Yeah, so I've added the test as suggested. The updated patch is attached.

What's the point in having the check for the files in data_dir? The
second one for standby2 should be enough as this is to test only
readRecoverySignalFile().

I added that test to verify that both files are removed even in the normal
standby case (i.e., when only standby.signal is present). However, if testing
only the case where both signal files are present is sufficient, I'm fine with
removing the data_dir check. Attached is an updated patch that checks only
the latter case for standby2.

I will commit this patch.

Regards,

--
Fujii Masao

Attachments:

v3-0001-Remove-recovery.signal-at-recovery-end-when-both-.patchapplication/octet-stream; name=v3-0001-Remove-recovery.signal-at-recovery-end-when-both-.patchDownload+23-6
#11David Steele
david@pgmasters.net
In reply to: Fujii Masao (#10)
Re: recovery.signal not cleaned up when both signal files are present

On 2/13/26 20:27, Fujii Masao wrote:

On Fri, Feb 13, 2026 at 3:18 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Feb 13, 2026 at 03:05:45PM +0900, Fujii Masao wrote:

Yeah, so I've added the test as suggested. The updated patch is attached.

What's the point in having the check for the files in data_dir? The
second one for standby2 should be enough as this is to test only
readRecoverySignalFile().

I added that test to verify that both files are removed even in the normal
standby case (i.e., when only standby.signal is present). However, if testing
only the case where both signal files are present is sufficient, I'm fine with
removing the data_dir check. Attached is an updated patch that checks only
the latter case for standby2.

I will commit this patch.

I'm fine with the additional checks in v2. They are inexpensive and show
that the changes (probably) don't have side effects.

But I don't feel strongly about it so either v2 or v3 is OK with me.

Regards,
-David

#12Fujii Masao
masao.fujii@gmail.com
In reply to: David Steele (#11)
Re: recovery.signal not cleaned up when both signal files are present

On Sat, Feb 14, 2026 at 12:45 AM David Steele <david@pgbackrest.org> wrote:

On 2/13/26 20:27, Fujii Masao wrote:

On Fri, Feb 13, 2026 at 3:18 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Feb 13, 2026 at 03:05:45PM +0900, Fujii Masao wrote:

Yeah, so I've added the test as suggested. The updated patch is attached.

What's the point in having the check for the files in data_dir? The
second one for standby2 should be enough as this is to test only
readRecoverySignalFile().

I added that test to verify that both files are removed even in the normal
standby case (i.e., when only standby.signal is present). However, if testing
only the case where both signal files are present is sufficient, I'm fine with
removing the data_dir check. Attached is an updated patch that checks only
the latter case for standby2.

I will commit this patch.

I'm fine with the additional checks in v2. They are inexpensive and show
that the changes (probably) don't have side effects.

But I don't feel strongly about it so either v2 or v3 is OK with me.

I've pushed the v3 patch. Thanks!

Regards,

--
Fujii Masao

#13Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#12)
Re: recovery.signal not cleaned up when both signal files are present

On Mon, Feb 16, 2026 at 02:11:47PM +0900, Fujii Masao wrote:

I've pushed the v3 patch. Thanks!

Thanks.
--
Michael