Requiring recovery.signal or standby.signal when recovering with a backup_label

Started by Michael Paquierabout 3 years ago37 messageshackers
Jump to latest
#1Michael Paquier
michael@paquier.xyz

Hi all,

This is a follow-up of the point I have made a few weeks ago on this
thread of pgsql-bugs about $subject:
/messages/by-id/Y/Q/17rpYS7YGbIt@paquier.xyz
/messages/by-id/Y/v0c+3W89NBT/if@paquier.xyz

Here is a short summary of what I think is incorrect, and what I'd
like to do to improve things moving forward, this pointing to a
simple solution..

While looking at the so-said thread, I have dug into the recovery code
to see what looks like an incorrect assumption behind the two boolean
flags named ArchiveRecoveryRequested and InArchiveRecovery that we
have in xlogrecovery.c to control the behavior of archive recovery in
the startup process. For information, as of HEAD, these two are
described as follows:
/*
* When ArchiveRecoveryRequested is set, archive recovery was requested,
* ie. signal files were present. When InArchiveRecovery is set, we are
* currently recovering using offline XLOG archives. These variables are only
* valid in the startup process.
*
* When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
* currently performing crash recovery using only XLOG files in pg_wal, but
* will switch to using offline XLOG archives as soon as we reach the end of
* WAL in pg_wal.
*/
bool ArchiveRecoveryRequested = false;
bool InArchiveRecovery = false;

When you read this text alone, its assumptions are simple. When the
startup process finds a recovery.signal or a standby.signal, we switch
ArchiveRecoveryRequested to true. If there is a standby.signal,
InArchiveRecovery would be immediately set to true before beginning
the redo loop. If we begin redo with a recovery.signal,
ArchiveRecoveryRequested = true and InArchiveRecovery = false, crash
recovery happens first, consuming all the WAL in pg_wal/, then we'd
move on with archive recovery.

Now comes the problem of the other thread, which is what happens when
you use a backup_label *without* a recovery.signal or a
standby.signal. In this case, as currently coded, it is possible to
enforce ArchiveRecoveryRequested = false and later InArchiveRecovery =
true. Not setting ArchiveRecoveryRequested has a couple of undesired
effect. First, this skips some initialization steps that may be
needed at a later point in recovery. The thread quoted above has
reported one aspect of that: we miss some hot-standby related
intialization that can reflect if replaying enough WAL that a restart
point could happen. Depending on the amount of data copied into
pg_wal/ before starting a node with only a backup_label it may also be
possible that a consistent point has been reached, where restart
points would be legit. A second Kiss Cool effect (understands who
can), is that we miss the initialization of the recoveryWakeupLatch.
A third effect is that some code paths can use GUC values related to
recovery without ArchiveRecoveryRequested being around, one example
seems to be hot_standby whose default is true.

It is worth noting the end of FinishWalRecovery(), that includes this
part:
if (ArchiveRecoveryRequested)
{
/*
* We are no longer in archive recovery state.
*
* We are now done reading the old WAL. Turn off archive fetching if
* it was active.
*/
Assert(InArchiveRecovery);
InArchiveRecovery = false;

I have been pondering for a few weeks now about what kind of
definition would suit to a cluster having a backup_label file without
a signal file added, which is documented as required by the docs in
the HA section as well as pg_rewind. It is true that there could be a
point to allow such a configuration so as a node recovers without a
TLI jump, but I cannot find appealing this case, as well, as a node
could just happily overwrite WAL segments in the archives on an
existing timeline, potentially corruption other nodes writing on the
same TLI. There are a few other recovery scenarios where one copies
directly WAL segments into pg_wal/ that can lead to a lot of weird
inconsistencies as well, one being the report of the thread of
pgsql-hackers.

At the end, I'd like to think that we should just require
a recovery.signal or a standby.signal if we have a backup_label file,
and even enforce this rule at the end of recovery for some sanity
checks. I don't think that we can just enforce
ArchiveRecoveryRequested in this path, either, as a backup_label would
be renamed to .old once the control file knows up to which LSN it
needs to replay to reach consistency and if an end-of-backup record is
required. That's not something that can be reasonably backpatched, as
it could disturb some recovery workflows, even if these are kind of in
a dangerous spot, IMO, so I would like to propose that only on HEAD
for 16~ because the recovery code has never really considered this
combination of ArchiveRecoveryRequested and InArchiveRecovery.

While digging into that, I have found one TAP test of pg_basebackup
that was doing recovery with just a backup_label file, with a
restore_command already set. A second case was in pg_rewind, were we
have a node without standby.signal, still it uses a primary_conninfo.

Attached is a patch on the lines of what I am thinking about. This
reworks a bit some of the actions at the beginning of the startup
process:
- Report the set of LOGs showing the state of the node after reading
the backup_label.
- Enforce a rule in ShutdownWalRecovery() and document the
restriction.
- Add a check with the signal files after finding a backup_label
file.
- The validation checks on the recovery parameters are applied (aka
restore_command required with recovery.signal, or a primary_conninfo
required on standby for streaming, etc.).

My apologies for the long message, but this deserves some attention,
IMHO.

So, any thoughts?
--
Michael

Attachments:

v1-0001-Strengthen-use-of-ArchiveRecoveryRequested-and-In.patchtext/x-diff; charset=us-asciiDownload+75-42
#2Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#1)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Fri, Mar 10, 2023 at 03:59:04PM +0900, Michael Paquier wrote:

My apologies for the long message, but this deserves some attention,
IMHO.

Note: A CF entry has been added as of [1]https://commitfest.postgresql.org/43/4244/, and I have added an item in
the list of live issues on the open item page for 16.

[1]: https://commitfest.postgresql.org/43/4244/
[2]: https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items#Live_issues -- Michael
--
Michael

#3David Zhang
david.zhang@highgo.ca
In reply to: Michael Paquier (#2)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

I believe before users can make a backup using pg_basebackup and then
start the backup as an independent Primary server for whatever reasons.
Now, if this is still allowed, then users need to be aware that the
backup_label must be manually deleted, otherwise, the backup won't be
able to start as a Primary.

The current message below doesn't provide such a hint.

+		if (!ArchiveRecoveryRequested)
+			ereport(FATAL,
+					(errmsg("could not find recovery.signal or standby.signal when recovering with backup_label"),
+					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+							 DataDir, DataDir)));

On 2023-03-12 6:06 p.m., Michael Paquier wrote:

On Fri, Mar 10, 2023 at 03:59:04PM +0900, Michael Paquier wrote:

My apologies for the long message, but this deserves some attention,
IMHO.

Note: A CF entry has been added as of [1], and I have added an item in
the list of live issues on the open item page for 16.

[1]:https://commitfest.postgresql.org/43/4244/
[2]:https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items#Live_issues
--
Michael

Best regards,

David

#4Michael Paquier
michael@paquier.xyz
In reply to: David Zhang (#3)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Fri, Jul 14, 2023 at 01:32:49PM -0700, David Zhang wrote:

I believe before users can make a backup using pg_basebackup and then start
the backup as an independent Primary server for whatever reasons. Now, if
this is still allowed, then users need to be aware that the backup_label
must be manually deleted, otherwise, the backup won't be able to start as a
Primary.

Delete a backup_label from a fresh base backup can easily lead to data
corruption, as the startup process would pick up as LSN to start
recovery from the control file rather than the backup_label file.
This would happen if a checkpoint updates the redo LSN in the control
file while a backup happens and the control file is copied after the
checkpoint, for instance. If one wishes to deploy a new primary from
a base backup, recovery.signal is the way to go, making sure that the
new primary is bumped into a new timeline once recovery finishes, on
top of making sure that the startup process starts recovery from a
position where the cluster would be able to achieve a consistent
state.

The current message below doesn't provide such a hint.

+		if (!ArchiveRecoveryRequested)
+			ereport(FATAL,
+					(errmsg("could not find
recovery.signal or standby.signal when recovering with
backup_label"), 
+					 errhint("If you are restoring
from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\"
and add required recovery options.",
+							 DataDir,
DataDir)));

How would you rewrite that? I am not sure how many details we want to
put here in terms of differences between recovery.signal and
standby.signal, still we surely should mention these are the two
possible choices.
--
Michael

#5David Zhang
david.zhang@highgo.ca
In reply to: Michael Paquier (#4)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On 2023-07-16 6:27 p.m., Michael Paquier wrote:

Delete a backup_label from a fresh base backup can easily lead to data
corruption, as the startup process would pick up as LSN to start
recovery from the control file rather than the backup_label file.
This would happen if a checkpoint updates the redo LSN in the control
file while a backup happens and the control file is copied after the
checkpoint, for instance. If one wishes to deploy a new primary from
a base backup, recovery.signal is the way to go, making sure that the
new primary is bumped into a new timeline once recovery finishes, on
top of making sure that the startup process starts recovery from a
position where the cluster would be able to achieve a consistent
state.

Thanks a lot for sharing this information.

How would you rewrite that? I am not sure how many details we want to
put here in terms of differences between recovery.signal and
standby.signal, still we surely should mention these are the two
possible choices.

Honestly, I can't convince myself to mention the backup_label here too.
But, I can share some information regarding my testing of the patch and
the corresponding results.

To assess the impact of the patch, I executed the following commands for
before and after,

pg_basebackup -h localhost -p 5432 -U david -D pg_backup1

pg_ctl -D pg_backup1 -l /tmp/logfile start

Before the patch, there were no issues encountered when starting an
independent Primary server.

However, after applying the patch, I observed the following behavior
when starting from the base backup:

1) simply start server from a base backup

FATAL:  could not find recovery.signal or standby.signal when recovering
with backup_label

HINT:  If you are restoring from a backup, touch
"/media/david/disk1/pg_backup1/recovery.signal" or
"/media/david/disk1/pg_backup1/standby.signal" and add required recovery
options.

2) touch a recovery.signal file and then try to start the server, the
following error was encountered:

FATAL:  must specify restore_command when standby mode is not enabled

3) touch a standby.signal file, then the server successfully started,
however, it operates in standby mode, whereas the intended behavior was
for it to function as a primary server.

Best regards,

David

#6Michael Paquier
michael@paquier.xyz
In reply to: David Zhang (#5)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Wed, Jul 19, 2023 at 11:21:17AM -0700, David Zhang wrote:

1) simply start server from a base backup

FATAL:  could not find recovery.signal or standby.signal when recovering
with backup_label

HINT:  If you are restoring from a backup, touch
"/media/david/disk1/pg_backup1/recovery.signal" or
"/media/david/disk1/pg_backup1/standby.signal" and add required recovery
options.

Note the difference when --write-recovery-conf is specified, where a
standby.conf is created with a primary_conninfo in
postgresql.auto.conf. So, yes, that's expected by default with the
patch.

2) touch a recovery.signal file and then try to start the server, the
following error was encountered:

FATAL:  must specify restore_command when standby mode is not enabled

Yes, that's also something expected in the scope of the v1 posted.
The idea behind this restriction is that specifying recovery.signal is
equivalent to asking for archive recovery, but not specifying
restore_command is equivalent to not provide any options to be able to
recover. See validateRecoveryParameters() and note that this
restriction exists since the beginning of times, introduced in commit
66ec2db. I tend to agree that there is something to be said about
self-contained backups taken from pg_basebackup, though, as these
would fail if no restore_command is specified, and this restriction is
in place before Postgres has introduced replication and easier ways to
have base backups. As a whole, I think that there is a good argument
in favor of removing this restriction for the case where archive
recovery is requested if users have all their WAL in pg_wal/ to be
able to recover up to a consistent point, keeping these GUC
restrictions if requesting a standby (not recovery.signal, only
standby.signal).

3) touch a standby.signal file, then the server successfully started,
however, it operates in standby mode, whereas the intended behavior was for
it to function as a primary server.

standby.signal implies that the server will start in standby mode. If
one wants to deploy a new primary, that would imply a timeline jump at
the end of recovery, you would need to specify recovery.signal
instead.

We need more discussions and more opinions, but the discussion has
stalled for a few months now. In case, I am adding Thomas Munro in CC
who has mentioned to me at PGcon that he was interested in this patch
(this thread's problem is not directly related to the fact that the
checkpointer now runs in crash recovery, though).

For now, I am attaching a rebased v2.
--
Michael

Attachments:

v2-0001-Strengthen-use-of-ArchiveRecoveryRequested-and-In.patchtext/x-diff; charset=us-asciiDownload+75-42
#7Bowen Shi
zxwsbg12138@gmail.com
In reply to: Michael Paquier (#6)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

Thanks for the patch.

I rerun the test in
/messages/by-id/ZQtzcH2lvo8leXEr@paquier.xyz
. We can discuss all the problems in this thread.

First I encountered the problem " FATAL: could not find
recovery.signal or standby.signal when recovering with backup_label ",
then I deleted the backup_label file and started the instance
successfully.

Delete a backup_label from a fresh base backup can easily lead to data
corruption, as the startup process would pick up as LSN to start
recovery from the control file rather than the backup_label file.
This would happen if a checkpoint updates the redo LSN in the control
file while a backup happens and the control file is copied after the
checkpoint, for instance. If one wishes to deploy a new primary from
a base backup, recovery.signal is the way to go, making sure that the
new primary is bumped into a new timeline once recovery finishes, on
top of making sure that the startup process starts recovery from a
position where the cluster would be able to achieve a consistent
state.

ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are restoring from a backup, touch
\"%s/recovery.signal\" and add required recovery options.\n"
"If you are not restoring from a backup, try removing the file
\"%s/backup_label\".\n"
"Be careful: removing \"%s/backup_label\" will result in a corrupt
cluster if restoring from a backup.",
DataDir, DataDir, DataDir)));

There are two similar error messages in xlogrecovery.c. Maybe we can
modify the error messages to be similar.

--
Bowen Shi

Show quoted text

On Thu, 21 Sept 2023 at 11:01, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jul 19, 2023 at 11:21:17AM -0700, David Zhang wrote:

1) simply start server from a base backup

FATAL: could not find recovery.signal or standby.signal when recovering
with backup_label

HINT: If you are restoring from a backup, touch
"/media/david/disk1/pg_backup1/recovery.signal" or
"/media/david/disk1/pg_backup1/standby.signal" and add required recovery
options.

Note the difference when --write-recovery-conf is specified, where a
standby.conf is created with a primary_conninfo in
postgresql.auto.conf. So, yes, that's expected by default with the
patch.

2) touch a recovery.signal file and then try to start the server, the
following error was encountered:

FATAL: must specify restore_command when standby mode is not enabled

Yes, that's also something expected in the scope of the v1 posted.
The idea behind this restriction is that specifying recovery.signal is
equivalent to asking for archive recovery, but not specifying
restore_command is equivalent to not provide any options to be able to
recover. See validateRecoveryParameters() and note that this
restriction exists since the beginning of times, introduced in commit
66ec2db. I tend to agree that there is something to be said about
self-contained backups taken from pg_basebackup, though, as these
would fail if no restore_command is specified, and this restriction is
in place before Postgres has introduced replication and easier ways to
have base backups. As a whole, I think that there is a good argument
in favor of removing this restriction for the case where archive
recovery is requested if users have all their WAL in pg_wal/ to be
able to recover up to a consistent point, keeping these GUC
restrictions if requesting a standby (not recovery.signal, only
standby.signal).

3) touch a standby.signal file, then the server successfully started,
however, it operates in standby mode, whereas the intended behavior was for
it to function as a primary server.

standby.signal implies that the server will start in standby mode. If
one wants to deploy a new primary, that would imply a timeline jump at
the end of recovery, you would need to specify recovery.signal
instead.

We need more discussions and more opinions, but the discussion has
stalled for a few months now. In case, I am adding Thomas Munro in CC
who has mentioned to me at PGcon that he was interested in this patch
(this thread's problem is not directly related to the fact that the
checkpointer now runs in crash recovery, though).

For now, I am attaching a rebased v2.
--
Michael

#8Michael Paquier
michael@paquier.xyz
In reply to: Bowen Shi (#7)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Thu, Sep 21, 2023 at 11:45:06AM +0800, Bowen Shi wrote:

First I encountered the problem " FATAL: could not find
recovery.signal or standby.signal when recovering with backup_label ",
then I deleted the backup_label file and started the instance
successfully.

Doing that is equal to corrupting your instance as recovery would
begin from the latest redo LSN stored in the control file, but the
physical relation files included in the backup may include blocks that
require records that are needed before this redo LSN and the LSN
stored in the backup_label.

Delete a backup_label from a fresh base backup can easily lead to data
corruption, as the startup process would pick up as LSN to start
recovery from the control file rather than the backup_label file.
This would happen if a checkpoint updates the redo LSN in the control
file while a backup happens and the control file is copied after the
checkpoint, for instance. If one wishes to deploy a new primary from
a base backup, recovery.signal is the way to go, making sure that the
new primary is bumped into a new timeline once recovery finishes, on
top of making sure that the startup process starts recovery from a
position where the cluster would be able to achieve a consistent
state.

And that's what I mean here. In more details. So, really, don't do
that.

ereport(FATAL,
(errmsg("could not find redo location referenced by checkpoint record"),
errhint("If you are restoring from a backup, touch
\"%s/recovery.signal\" and add required recovery options.\n"
"If you are not restoring from a backup, try removing the file
\"%s/backup_label\".\n"
"Be careful: removing \"%s/backup_label\" will result in a corrupt
cluster if restoring from a backup.",
DataDir, DataDir, DataDir)));

There are two similar error messages in xlogrecovery.c. Maybe we can
modify the error messages to be similar.

The patch adds the following message, which is written this way to be
consistent with the two others, already:

+    ereport(FATAL,
+            (errmsg("could not find recovery.signal or standby.signal when recovering with backup_label"),
+             errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+                     DataDir, DataDir)));

But you have an interesting point here, why isn't standby.signal also
mentioned in the two existing comments? Depending on what's wanted by
the user this can be equally useful to report back.

Attached is a slightly updated patch, where I have also removed the
check on ArchiveRecoveryRequested because the FATAL generated for
!ArchiveRecoveryRequested makes sure that it is useless after reading
the backup_label file.

This patch has been around for a few months now. Do others have
opinions about the direction taken here to make the presence of
recovery.signal or standby.signal a hard requirement when a
backup_label file is found (HEAD only)?
--
Michael

Attachments:

v3-0001-Strengthen-use-of-ArchiveRecoveryRequested-and-In.patchtext/x-diff; charset=us-asciiDownload+78-46
#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#1)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

At Fri, 10 Mar 2023 15:59:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in

My apologies for the long message, but this deserves some attention,
IMHO.

So, any thoughts?

Sorry for being late. However, I agree with David's concern regarding
the unnecessary inconvenience it introduces. I'd like to maintain the
functionality.

While I agree that InArchiveRecovery should be activated only if
ArchiveReArchiveRecoveryRequested is true, I oppose to the notion that
the mere presence of backup_label should be interpreted as a request
for archive recovery (even if it is implied in a comment in
InitWalRecovery()). Instead, I propose that we separate backup_label
and archive recovery, in other words, we should not turn on
InArchiveRecovery if !ArchiveRecoveryRequested, regardless of the
presence of backup_label. We can know the minimum required recovery
LSN by the XLOG_BACKUP_END record.

The attached is a quick mock-up, but providing an approximation of my
thoughts. (For example, end_of_backup_reached could potentially be
extended to the ArchiveRecoveryRequested case and we could simplify
the condition..)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

xlogrecov_not_enter_arch_if_not_requested.txttext/plain; charset=us-asciiDownload+37-24
#10Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#6)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

Sorry, it seems that I posted at the wrong position..

At Thu, 28 Sep 2023 12:58:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 10 Mar 2023 15:59:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in

My apologies for the long message, but this deserves some attention,
IMHO.

So, any thoughts?

Sorry for being late. However, I agree with David's concern regarding
the unnecessary inconvenience it introduces. I'd like to maintain the
functionality.

While I agree that InArchiveRecovery should be activated only if
ArchiveReArchiveRecoveryRequested is true, I oppose to the notion that
the mere presence of backup_label should be interpreted as a request
for archive recovery (even if it is implied in a comment in
InitWalRecovery()). Instead, I propose that we separate backup_label
and archive recovery, in other words, we should not turn on
InArchiveRecovery if !ArchiveRecoveryRequested, regardless of the
presence of backup_label. We can know the minimum required recovery
LSN by the XLOG_BACKUP_END record.

The attached is a quick mock-up, but providing an approximation of my
thoughts. (For example, end_of_backup_reached could potentially be
extended to the ArchiveRecoveryRequested case and we could simplify
the condition..)

regards

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

xlogrecov_not_enter_arch_if_not_requested.txttext/plain; charset=us-asciiDownload+37-24
#11Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#9)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Thu, Sep 28, 2023 at 12:58:51PM +0900, Kyotaro Horiguchi wrote:

The attached is a quick mock-up, but providing an approximation of my
thoughts. (For example, end_of_backup_reached could potentially be
extended to the ArchiveRecoveryRequested case and we could simplify
the condition..)

I am not sure why this is related to this thread..

static XLogRecPtr backupStartPoint;
static XLogRecPtr backupEndPoint;
static bool backupEndRequired = false;
+static bool backupEndReached = false;

Anyway, sneaking at your suggestion, this is actually outlining the
main issue I have with this code currently. We have so many static
booleans to control one behavior over the other that we always try to
make this code more complicated, while we should try to make it
simpler instead.
--
Michael

#12David Steele
david@pgmasters.net
In reply to: Kyotaro Horiguchi (#9)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On 9/27/23 23:58, Kyotaro Horiguchi wrote:

At Fri, 10 Mar 2023 15:59:04 +0900, Michael Paquier <michael@paquier.xyz> wrote in

My apologies for the long message, but this deserves some attention,
IMHO.

So, any thoughts?

Sorry for being late. However, I agree with David's concern regarding
the unnecessary inconvenience it introduces. I'd like to maintain the
functionality.

After some playing around, I find I agree with Michael on this, i.e.
require at least standby.signal when a backup_label is present.

According to my testing, you can preserve the "independent server"
functionality by setting archive_command = /bin/false. In this case the
timeline is not advanced and recovery proceeds from whatever is
available in pg_wal.

I think this type of recovery from a backup label without a timeline
change should absolutely be the exception, not the default as it seems
to be now. If the server is truly independent, then the timeline change
is not important. If the server is not independent, then the timeline
change is critical.

So overall, +1 for Michael's patch, though I have only read through it
and not tested it yet.

One comment, though, if we are going to require recovery.signal when
backup_label is present, should it just be implied? Why error and force
the user to create it?

Regards,
-David

#13Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#12)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Thu, Sep 28, 2023 at 04:23:42PM -0400, David Steele wrote:

After some playing around, I find I agree with Michael on this, i.e. require
at least standby.signal when a backup_label is present.

According to my testing, you can preserve the "independent server"
functionality by setting archive_command = /bin/false. In this case the
timeline is not advanced and recovery proceeds from whatever is available in
pg_wal.

I've seen folks depend on such setups in the past, actually, letting a
process outside Postgres just "push" WAL segments to pg_wal instead of
Postgres pulling it with a restore_command or a primary_conninfo for a
standby.

I think this type of recovery from a backup label without a timeline change
should absolutely be the exception, not the default as it seems to be now.

This can mess up archives pretty easily, additionally, so it's not
something to encourage..

If the server is truly independent, then the timeline change is not
important. If the server is not independent, then the timeline change is
critical.

So overall, +1 for Michael's patch, though I have only read through it and
not tested it yet.

Reviews, thoughts and opinions are welcome.

One comment, though, if we are going to require recovery.signal when
backup_label is present, should it just be implied? Why error and force the
user to create it?

That's one thing I was considering, but I also cannot convince myself
that this is the best option because the presence of recovery.signal
or standby.standby (if both, standby.signal takes priority) makes it
clear what type of recovery is wanted at disk level. I'd be OK if
folks think that this is a sensible consensus, as well, even if I
don't really agree with it.

Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used. All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.

A last thing, that had better be covered in a separate thread and
patch, is about validateRecoveryParameters(). These days, I'd like to
think that it may be OK to lift at least the restriction on
restore_command being required if we are doing recovery to ease the
case of self-contained backups (aka the case where all the WAL needed
to reach a consistent point is in pg_wal/ or its tarball)
--
Michael

#14David Steele
david@pgmasters.net
In reply to: Michael Paquier (#13)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On 9/28/23 19:59, Michael Paquier wrote:

On Thu, Sep 28, 2023 at 04:23:42PM -0400, David Steele wrote:

So overall, +1 for Michael's patch, though I have only read through it and
not tested it yet.

Reviews, thoughts and opinions are welcome.

OK, I have now reviewed and tested the patch and it looks good to me. I
stopped short of marking this RfC since there are other reviewers in the
mix.

I dislike that we need to repeat:

OwnLatch(&XLogRecoveryCtl->recoveryWakeupLatch);

But I see the logic behind why you did it and there's no better way to
do it as far as I can see.

One comment, though, if we are going to require recovery.signal when
backup_label is present, should it just be implied? Why error and force the
user to create it?

That's one thing I was considering, but I also cannot convince myself
that this is the best option because the presence of recovery.signal
or standby.standby (if both, standby.signal takes priority) makes it
clear what type of recovery is wanted at disk level. I'd be OK if
folks think that this is a sensible consensus, as well, even if I
don't really agree with it.

I'm OK with keeping it as required for now.

Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used. All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.

This change makes it more obvious if configuration is missing (since
you'll get an error), however +1 for adding this to pg_basebackup.

A last thing, that had better be covered in a separate thread and
patch, is about validateRecoveryParameters(). These days, I'd like to
think that it may be OK to lift at least the restriction on
restore_command being required if we are doing recovery to ease the
case of self-contained backups (aka the case where all the WAL needed
to reach a consistent point is in pg_wal/ or its tarball)

Hmmm, I'm not sure about this. I'd prefer users set
restore_command=/bin/false explicitly to fetch WAL from pg_wal by
default if that's what they really intend.

Regards,
-David

#15Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#14)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Sat, Oct 14, 2023 at 03:45:33PM -0400, David Steele wrote:

On 9/28/23 19:59, Michael Paquier wrote:
OK, I have now reviewed and tested the patch and it looks good to me. I
stopped short of marking this RfC since there are other reviewers in the
mix.

Thanks for the review. Yes, I am wondering if other people would
chime in here. It doesn't feel like this has gathered enough
opinions. Now this thread has been around for many months, and we've
done quite a few changes in the backup APIs in the last few years with
few users complaining back about them..

I dislike that we need to repeat:

OwnLatch(&XLogRecoveryCtl->recoveryWakeupLatch);

But I see the logic behind why you did it and there's no better way to do it
as far as I can see.

The main point is that there is no meaning in setting the latch until
the backup_label file is read because if ArchiveRecoveryRequested is
*not* set the startup process would outright fail as of the lack of
[recovery|standby].signal.

Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used. All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.

This change makes it more obvious if configuration is missing (since you'll
get an error), however +1 for adding this to pg_basebackup.

Looking at the streaming APIs of pg_basebackup, it looks like this
would be a matter of using bbstreamer_inject_file() to inject an empty
file into the stream. Still something seems to be off once
compression methods are involved.. Hmm. I am not sure. Well, this
could always be done as a patch independant of this one, under a
separate discussion. There are extra arguments about whether it would
be a good idea to add a recovery.signal even when taking a backup from
a standby, and do that only in 17~.

A last thing, that had better be covered in a separate thread and
patch, is about validateRecoveryParameters(). These days, I'd like to
think that it may be OK to lift at least the restriction on
restore_command being required if we are doing recovery to ease the
case of self-contained backups (aka the case where all the WAL needed
to reach a consistent point is in pg_wal/ or its tarball)

Hmmm, I'm not sure about this. I'd prefer users set
restore_command=/bin/false explicitly to fetch WAL from pg_wal by default if
that's what they really intend.

It wouldn't be the first time we break compatibility in this area, so
perhaps you are right and keeping this requirement is fine, even if it
requires one extra step when recovering a self-contained backup
generated by pg_basebackup. At least this forces users to look at
their setup and check if something is wrong. We'd likely finish with
a few "bug" reports, as well :D
--
Michael

#16Bowen Shi
zxwsbg12138@gmail.com
In reply to: Michael Paquier (#15)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

It looks good to me.

I have reviewed the code and tested the patch with basic check-world test an pgbench test (metioned in /messages/by-id/ZQtzcH2lvo8leXEr@paquier.xyz

Another reviewer has also approved it, so I change the status to RFC.

The new status of this patch is: Ready for Committer

#17Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Michael Paquier (#15)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Mon, 2023-10-16 at 14:54 +0900, Michael Paquier wrote:

Thanks for the review.  Yes, I am wondering if other people would
chime in here.  It doesn't feel like this has gathered enough
opinions.

I don't have strong feelings either way. If you have backup_label
but no signal file, starting PostgreSQL may succeed (if the WAL
with the checkpoint happens to be in pg_wal) or it may fail with
an error message. There is no danger of causing damage unless you
remove backup_label, right?

I cannot think of a use case where you use such a configuration on
purpose, and the current error message is more crypric than a plain
"you must have a signal file to start from a backup", so perhaps
your patch is a good idea.

Yours,
Laurenz Albe

#18Michael Paquier
michael@paquier.xyz
In reply to: Laurenz Albe (#17)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Mon, Oct 16, 2023 at 05:48:43PM +0200, Laurenz Albe wrote:

I don't have strong feelings either way. If you have backup_label
but no signal file, starting PostgreSQL may succeed (if the WAL
with the checkpoint happens to be in pg_wal) or it may fail with
an error message. There is no danger of causing damage unless you
remove backup_label, right?

A bit more happens currently if you have a backup_label with no signal
files, unfortunately, because this causes some startup states to not
be initialized. See around here:
/messages/by-id/Y/Q/17rpYS7YGbIt@paquier.xyz
/messages/by-id/Y/v0c+3W89NBT/if@paquier.xyz

I cannot think of a use case where you use such a configuration on
purpose, and the current error message is more crypric than a plain
"you must have a signal file to start from a backup", so perhaps
your patch is a good idea.

I hope so.
--
Michael

#19Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#15)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On Mon, Oct 16, 2023 at 02:54:35PM +0900, Michael Paquier wrote:

On Sat, Oct 14, 2023 at 03:45:33PM -0400, David Steele wrote:

On 9/28/23 19:59, Michael Paquier wrote:

Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used. All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.

This change makes it more obvious if configuration is missing (since you'll
get an error), however +1 for adding this to pg_basebackup.

Looking at the streaming APIs of pg_basebackup, it looks like this
would be a matter of using bbstreamer_inject_file() to inject an empty
file into the stream. Still something seems to be off once
compression methods are involved.. Hmm. I am not sure. Well, this
could always be done as a patch independant of this one, under a
separate discussion. There are extra arguments about whether it would
be a good idea to add a recovery.signal even when taking a backup from
a standby, and do that only in 17~.

Hmm. On this specific point, it would actually be much simpler to
force recovery.signal to be in the contents streamed to a BASE_BACKUP.
This does not step on your proposal at [1]/messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net, though, because you'd
still require a .signal file for recovery as far as I understand :/

[1]: /messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net

Would folks be OK to move on with the patch of this thread at the end?
I am attempting a last-call kind of thing.
--
Michael

#20David Steele
david@pgmasters.net
In reply to: Michael Paquier (#19)
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

On 10/27/23 03:22, Michael Paquier wrote:

On Mon, Oct 16, 2023 at 02:54:35PM +0900, Michael Paquier wrote:

On Sat, Oct 14, 2023 at 03:45:33PM -0400, David Steele wrote:

On 9/28/23 19:59, Michael Paquier wrote:

Another idea I had was to force the creation of recovery.signal by
pg_basebackup even if -R is not used. All the reports we've seen with
people getting confused came from pg_basebackup that enforces no
configuration.

This change makes it more obvious if configuration is missing (since you'll
get an error), however +1 for adding this to pg_basebackup.

Looking at the streaming APIs of pg_basebackup, it looks like this
would be a matter of using bbstreamer_inject_file() to inject an empty
file into the stream. Still something seems to be off once
compression methods are involved.. Hmm. I am not sure. Well, this
could always be done as a patch independant of this one, under a
separate discussion. There are extra arguments about whether it would
be a good idea to add a recovery.signal even when taking a backup from
a standby, and do that only in 17~.

Hmm. On this specific point, it would actually be much simpler to
force recovery.signal to be in the contents streamed to a BASE_BACKUP.

That sounds like the right plan to me. Nice and simple.

This does not step on your proposal at [1], though, because you'd
still require a .signal file for recovery as far as I understand :/

[1]: /messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net

Yes.

Would folks be OK to move on with the patch of this thread at the end?
I am attempting a last-call kind of thing.

I'm still +1 for the patch as it stands.

Regards,
-David

#21Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#20)
#22Roberto Mello
rmello@cc.usu.edu
In reply to: Michael Paquier (#21)
#23Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#21)
#24Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#21)
#25Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#23)
#26Michael Paquier
michael@paquier.xyz
In reply to: Roberto Mello (#22)
#27Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#24)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#25)
#29Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#28)
#30Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#30)
#32Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#29)
#33Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#32)
#34Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#33)
#35Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#34)
#36Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#36)