Return pg_control from pg_backup_stop().
Hackers,
This is greatly simplified implementation of the patch proposed in [1]/messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net
and hopefully it addresses the concerns expressed there. Since the
implementation is quite different it seemed like a new thread was
appropriate, especially since the old thread title would be very
misleading regarding the new functionality.
The basic idea is to harden recovery by returning a copy of pg_control
from pg_backup_stop() that has a flag set to prevent recovery if the
backup_label file is missing. Instead of backup software copying
pg_control from PGDATA, it stores an updated version that is returned
from pg_backup_stop(). This is better for the following reasons:
* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If
backup_label is removed the cluster will not start. The user may try
pg_resetwal, but that tool makes it pretty clear that corruption will
result from its use.
* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide
a valid copy via pg_backup_stop(). This solves torn reads without
needing to write pg_control via a temp file, which may affect
performance on a standby.
* For backup from standby, we no longer need to instruct the backup
software to copy pg_control last. In fact the backup software should not
copy pg_control from PGDATA at all.
These changes have no impact on current backup software and they are
free to use the pg_control available from pg_stop_backup() or continue
to use pg_control from PGDATA. Of course they will miss the benefits of
getting a consistent copy of pg_control and the backup_label checking,
but will be no worse off than before.
I'll register this in the July CF.
Regards,
-David
[1]: /messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net
/messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net
Attachments:
pgcontrol-from-backupstop-v1.patchtext/plain; charset=UTF-8; name=pgcontrol-from-backupstop-v1.patchDownload+182-34
On Fri, May 17, 2024 at 12:46:49PM +1000, David Steele wrote:
This is greatly simplified implementation of the patch proposed in [1] and
hopefully it addresses the concerns expressed there. Since the
implementation is quite different it seemed like a new thread was
appropriate, especially since the old thread title would be very misleading
regarding the new functionality.
- /* No backup_label file has been found if we are here. */
+ /*
+ * No backup_label file has been found if we are here. Error if the
+ * control file requires backup_label.
+ */
+ if (ControlFile->backupLabelRequired)
+ ereport(FATAL,
+ (errmsg("could not find backup_label required for recovery"),
+ errhint("backup_label must be present for recovery to succeed")));
I thought that this had some similarities with my last fight in this
area, where xlogrecovery.c would fail hard if there was a backup_label
file but no signal files, but nope that's not the case:
/messages/by-id/ZArVOMifjzE7f8W7@paquier.xyz
The basic idea is to harden recovery by returning a copy of pg_control from
pg_backup_stop() that has a flag set to prevent recovery if the backup_label
file is missing. Instead of backup software copying pg_control from PGDATA,
it stores an updated version that is returned from pg_backup_stop().
Hmm, okay. There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces. As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces. That seems slightly
cleaner to me, still I agree that both are the same things.
Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.
* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide a
valid copy via pg_backup_stop(). This solves torn reads without needing to
write pg_control via a temp file, which may affect performance on a standby.
We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?
* For backup from standby, we no longer need to instruct the backup software
to copy pg_control last. In fact the backup software should not copy
pg_control from PGDATA at all.
Yep. Not from PGDATA, but from the function.
These changes have no impact on current backup software and they are free to
use the pg_control available from pg_stop_backup() or continue to use
pg_control from PGDATA. Of course they will miss the benefits of getting a
consistent copy of pg_control and the backup_label checking, but will be no
worse off than before.
There is a large comment block in do_pg_backup_stop() around
backup_stopped_in_recovery. Perhaps it should be improved based on
this patch.
The main concern that I have over this patch is: who is actually going
to use this extension of the SQL stop function? Perhaps existing
backup solutions are good enough risk vs reward is not worth it? The
label_file and the tablespace map are text, this is a series of bytes
that has no visibility for the end-user unless checked on the
client-side. This adds a new step where backups would need to copy
the control file to the data folder.
--
Michael
On 10/2/24 10:11, Michael Paquier wrote:
On Fri, May 17, 2024 at 12:46:49PM +1000, David Steele wrote:
The basic idea is to harden recovery by returning a copy of pg_control from
pg_backup_stop() that has a flag set to prevent recovery if the backup_label
file is missing. Instead of backup software copying pg_control from PGDATA,
it stores an updated version that is returned from pg_backup_stop().Hmm, okay. There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces. As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces. That seems slightly
cleaner to me, still I agree that both are the same things.
Sending pg_control later results in even more code churn because of how
tar files are terminated. I've updated it that way in v2 so you can see
what I mean. I don't have a strong preference, though, so if you prefer
the implementation in v2 then that's fine with me.
Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.
I can definitely see us making other updates to pg_control so I would
rather keep this logic centralized, even though it is not too
complicated at this point. Still, even 8 lines of code (as it is now)
seems better not to duplicate.
* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide a
valid copy via pg_backup_stop(). This solves torn reads without needing to
write pg_control via a temp file, which may affect performance on a standby.We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?
Even at 512B it is possible to see tears in pg_control and they happen
in the build farm right now. In fact, this thread [1]/messages/by-id/CA+hUKG+jig+QdBETj_ab++VWSoJjbwT3sLNCnk0wFsY_6tRqoA@mail.gmail.com trying to fix the
problem was what got me thinking about alternate solutions to preventing
tears in pg_control. Thomas' proposed fixes have not been committed to
my knowledge so the problem remains, but would be fixed by this commit.
There is a large comment block in do_pg_backup_stop() around
backup_stopped_in_recovery. Perhaps it should be improved based on
this patch.
I added a sentence to this comment block in v2.
The main concern that I have over this patch is: who is actually going
to use this extension of the SQL stop function?
Primarily existing backup software, I would imagine. The idea is that it
would give them feature parity with pg_basebackup, rather than the new
protections being exclusive to pg_basebackup.
Perhaps existing
backup solutions are good enough risk vs reward is not worth it?
I'm not sure I see the risk here. Saving out pg_control is optional so
no changes to current software is required. Of course they miss the
benefit of the protection against tears and missing backup_label, but
that is a choice.
Also, no matter what current backup solutions do, they cannot prevent a
user from removing backup_label after restore. This patch prevents
invalid recovery when that happens.
The
label_file and the tablespace map are text, this is a series of bytes
that has no visibility for the end-user unless checked on the
client-side. This adds a new step where backups would need to copy
the control file to the data folder.
Again, optional, but if I was able to manage these saves using the psql
interface in the TAP tests then I'd say it would be pretty easy for
anyone with a normal connection to Postgres. Also, we require users to
treat tabelspace_map and backup_label as binary so not too big a change
here.
[1]: /messages/by-id/CA+hUKG+jig+QdBETj_ab++VWSoJjbwT3sLNCnk0wFsY_6tRqoA@mail.gmail.com
/messages/by-id/CA+hUKG+jig+QdBETj_ab++VWSoJjbwT3sLNCnk0wFsY_6tRqoA@mail.gmail.com
Attachments:
pgcontrol-from-backupstop-v2.patchtext/plain; charset=UTF-8; name=pgcontrol-from-backupstop-v2.patchDownload+192-43
Import Notes
Reply to msg id not found: ZvzyL0xPUa06_O9B@paquier.xyz
On Wed, Oct 02, 2024 at 09:03:27AM +0000, David Steele wrote:
On 10/2/24 10:11, Michael Paquier wrote:
Hmm, okay. There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces. As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces. That seems slightly
cleaner to me, still I agree that both are the same things.Sending pg_control later results in even more code churn because of how tar
files are terminated. I've updated it that way in v2 so you can see what I
mean. I don't have a strong preference, though, so if you prefer the
implementation in v2 then that's fine with me.
It does not make much of a difference, indeed.
Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.I can definitely see us making other updates to pg_control so I would rather
keep this logic centralized, even though it is not too complicated at this
point. Still, even 8 lines of code (as it is now) seems better not to
duplicate.
I was wondering if the field update should be hidden behind a macro
that uses an offsetof() on ControlFileData, with the name of the field
and a pointer to the value to update to. If you include the CRC32
calculation in that, that makes for less chunks of code when updating
one field of the control file.
The full CRC calculation could also be hidden inside a macro, as there
are a couple of places where we do the same things, like pg_rewind.c,
etc.
We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?Even at 512B it is possible to see tears in pg_control and they happen in
the build farm right now. In fact, this thread [1] trying to fix the problem
was what got me thinking about alternate solutions to preventing tears in
pg_control. Thomas' proposed fixes have not been committed to my knowledge
so the problem remains, but would be fixed by this commit.
Ah, right. That rings a bell. Thomas has done some work with
c558e6fd92ff and 63a582222c6b. And we're still not taking the
ControlFileLock while copying it over.. It looks like we should do it
separately, and backpatch. That's not something for this thread to
worry about.
Perhaps existing
backup solutions are good enough risk vs reward is not worth it?I'm not sure I see the risk here. Saving out pg_control is optional so no
changes to current software is required. Of course they miss the benefit of
the protection against tears and missing backup_label, but that is a choice.Again, optional, but if I was able to manage these saves using the psql
interface in the TAP tests then I'd say it would be pretty easy for anyone
with a normal connection to Postgres. Also, we require users to treat
tabelspace_map and backup_label as binary so not too big a change here.
Maintenance cost for a limited user impact overall. With incremental
backups being a thing in v18 only available through the replication
protocol, the SQL functions have less advantages these days. My point
would be to see this thread as a two-step process:
1) Update the flag in the control file when sending it across in
replication stream.
2) Do the SQL function thing with the bytea for the control file, if
necessary.
1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol. Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from. The SQL part is optional IMO. It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.
--
Michael
On 10/3/24 07:45, Michael Paquier wrote:
On Wed, Oct 02, 2024 at 09:03:27AM +0000, David Steele wrote:
On 10/2/24 10:11, Michael Paquier wrote:
I can definitely see us making other updates to pg_control so I would rather
keep this logic centralized, even though it is not too complicated at this
point. Still, even 8 lines of code (as it is now) seems better not to
duplicate.I was wondering if the field update should be hidden behind a macro
that uses an offsetof() on ControlFileData, with the name of the field
and a pointer to the value to update to. If you include the CRC32
calculation in that, that makes for less chunks of code when updating
one field of the control file.The full CRC calculation could also be hidden inside a macro, as there
are a couple of places where we do the same things, like pg_rewind.c,
etc.
This seems to be a different case than pg_rewind, especially since we
need a ControlFileLock. I think that is a bit much to do in a macro, so
I split the functionality out into a function instead. This simplifies
the logic in basebackup.c but has little impact elsewhere.
We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?Even at 512B it is possible to see tears in pg_control and they happen in
the build farm right now. In fact, this thread [1] trying to fix the problem
was what got me thinking about alternate solutions to preventing tears in
pg_control. Thomas' proposed fixes have not been committed to my knowledge
so the problem remains, but would be fixed by this commit.Ah, right. That rings a bell. Thomas has done some work with
c558e6fd92ff and 63a582222c6b. And we're still not taking the
ControlFileLock while copying it over.. It looks like we should do it
separately, and backpatch. That's not something for this thread to
worry about.
I'd be happy to adapt patch 01 to be back-patched (without the new flag)
if we decide it is a good idea. Just locking and making a copy of
pg_control is easy enough, but if we accept the backup_control_file()
function for new versions then we could keep that for the back patch to
reduce churn between versions.
Perhaps existing
backup solutions are good enough risk vs reward is not worth it?I'm not sure I see the risk here. Saving out pg_control is optional so no
changes to current software is required. Of course they miss the benefit of
the protection against tears and missing backup_label, but that is a choice.Again, optional, but if I was able to manage these saves using the psql
interface in the TAP tests then I'd say it would be pretty easy for anyone
with a normal connection to Postgres. Also, we require users to treat
tabelspace_map and backup_label as binary so not too big a change here.Maintenance cost for a limited user impact overall. With incremental
backups being a thing in v18 only available through the replication
protocol, the SQL functions have less advantages these days. My point
would be to see this thread as a two-step process:
1) Update the flag in the control file when sending it across in
replication stream.
2) Do the SQL function thing with the bytea for the control file, if
necessary.
OK, I have split the patch into two parts along these lines.
1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol. Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from. The SQL part is optional IMO. It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.
I don't think having incremental backup in pg_basebackup means alternate
backup solutions are going away or that we should deprecate the SQL
interface. If nothing else, third-party solutions need a way to get an
untorn copy of pg_control and in general I think the new flag will be
universally useful.
Regards,
-David
On 10/3/24 05:11, David Steele wrote:
On 10/3/24 07:45, Michael Paquier wrote:
1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol. Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from. The SQL part is optional IMO. It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.I don't think having incremental backup in pg_basebackup means alternate
backup solutions are going away or that we should deprecate the SQL
interface. If nothing else, third-party solutions need a way to get an
untorn copy of pg_control and in general I think the new flag will be
universally useful.
I updated this patch to fix an issue with -fsanitize=alignment. I'm not
entirely happy with copying twice but not sure of another way to do it.
As far as I can see VARDATA() will not return aligned data on 64-bit
architectures.
Regards,
-David
On 11/20/24 17:44, David Steele wrote:
On 10/3/24 05:11, David Steele wrote:
On 10/3/24 07:45, Michael Paquier wrote:
1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol. Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from. The SQL part is optional IMO. It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.I don't think having incremental backup in pg_basebackup means
alternate backup solutions are going away or that we should deprecate
the SQL interface. If nothing else, third-party solutions need a way
to get an untorn copy of pg_control and in general I think the new
flag will be universally useful.I updated this patch to fix an issue with -fsanitize=alignment. I'm not
entirely happy with copying twice but not sure of another way to do it.
As far as I can see VARDATA() will not return aligned data on 64-bit
architectures.
Rebased and improved a comment and an error.
Regards,
-David
On 1/24/25 13:43, David Steele wrote:
Rebased and improved a comment and an error.
Rebased to fix breakage caused by the split of func.sgml in 4e23c9e.
Regards,
-David
On 8/7/25 05:30, David Steele wrote:
On 1/24/25 13:43, David Steele wrote:
Rebased and improved a comment and an error.
Rebased to fix breakage caused by the split of func.sgml in 4e23c9e.
Rebased to implement simplification added by "Simplify creation of
built-in functions with default arguments" (759b03b2).
Regards,
-David
On 2/20/26 10:10, David Steele wrote:
On 8/7/25 05:30, David Steele wrote:
On 1/24/25 13:43, David Steele wrote:
Rebased and improved a comment and an error.
Rebased to fix breakage caused by the split of func.sgml in 4e23c9e.
Rebased to implement simplification added by "Simplify creation of
built-in functions with default arguments" (759b03b2).
With the patches this time!
Regards,
-David
On 2/20/26 12:47, David Steele wrote:
On 2/20/26 10:10, David Steele wrote:
On 8/7/25 05:30, David Steele wrote:
On 1/24/25 13:43, David Steele wrote:
Rebased and improved a comment and an error.
Rebased to fix breakage caused by the split of func.sgml in 4e23c9e.
Rebased to implement simplification added by "Simplify creation of
built-in functions with default arguments" (759b03b2).
Rebased on "Simplify creation of built-in functions with non-default
ACLs." (f95d73ed).
Regards,
-David
Hi David
I have not read the code yet, so this may already be answered there, but I had a question about the proposal itself. This patch protects against a missing backup_label, but what about a wrong one? If a user restores a backup_label file from a different backup, the existence check alone would not detect that. Do we need some consistency check between the returned pg_control copy and the backup_label contents, or is the intended scope here limited to the “missing file” case only?
Regards
Haibo
Show quoted text
On Mar 5, 2026, at 5:27 PM, David Steele <david@pgbackrest.org> wrote:
On 2/20/26 12:47, David Steele wrote:
On 2/20/26 10:10, David Steele wrote:
On 8/7/25 05:30, David Steele wrote:
On 1/24/25 13:43, David Steele wrote:
Rebased and improved a comment and an error.
Rebased to fix breakage caused by the split of func.sgml in 4e23c9e.
Rebased to implement simplification added by "Simplify creation of built-in functions with default arguments" (759b03b2).
Rebased on "Simplify creation of built-in functions with non-default ACLs." (f95d73ed).
Regards,
-David<pgcontrol-flag-v8-01-basebackup.patch><pgcontrol-flag-v8-02-sql.patch>
On Mon, Mar 16, 2026 at 10:16:58PM -0700, Haibo Yan wrote:
I have not read the code yet, so this may already be answered there,
but I had a question about the proposal itself. This patch protects
against a missing backup_label, but what about a wrong one? If a
user restores a backup_label file from a different backup, the
existence check alone would not detect that. Do we need some
consistency check between the returned pg_control copy and the
backup_label contents, or is the intended scope here limited to the
“missing file” case only?
Please note that we use bottom-posting on the lists, as of:
https://en.wikipedia.org/wiki/Posting_style#Bottom-posting
--
Michael
On 3/17/26 12:16, Haibo Yan wrote:
I have not read the code yet, so this may already be answered there, but
I had a question about the proposal itself. This patch protects against
a missing backup_label, but what about a wrong one? If a user restores a
backup_label file from a different backup, the existence check alone
would not detect that. Do we need some consistency check between the
returned pg_control copy and the backup_label contents, or is the
intended scope here limited to the “missing file” case only?
Thank you for having a look!
The goal here is only to check for a missing backup_label. The general
problem is that PostgreSQL suggests that removing backup_label might be
a good idea so the user does it:
If you are not restoring from a backup, try removing the file
\"%s/backup_label\"
The user *could* copy a backup_label from another backup and there are
ways we could detect that but I feel that should be material for a
separate patch.
Regards,
-David
Hi David,
Thank you for the clarification. I have now read the code, and overall it looks good to me. I only had one very small comment.
You currently have:
```
memset(controlFile + sizeof(ControlFileData), 0,
PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
memcpy(controlFile, ControlFile, sizeof(ControlFileData));
```
This is correct, since only the trailing bytes need to be zeroed before the copy.
I was just wondering whether the following might be slightly clearer:
```
memset(controlFile, 0, PG_CONTROL_FILE_SIZE);
memcpy(controlFile, ControlFile, sizeof(ControlFileData));
```
I do not think this is a real issue, though.
Thanks
Haibo
Show quoted text
On Mar 17, 2026, at 12:05 AM, David Steele <david@pgbackrest.org> wrote:
On 3/17/26 12:16, Haibo Yan wrote:
I have not read the code yet, so this may already be answered there, but I had a question about the proposal itself. This patch protects against a missing backup_label, but what about a wrong one? If a user restores a backup_label file from a different backup, the existence check alone would not detect that. Do we need some consistency check between the returned pg_control copy and the backup_label contents, or is the intended scope here limited to the “missing file” case only?
Thank you for having a look!
The goal here is only to check for a missing backup_label. The general problem is that PostgreSQL suggests that removing backup_label might be a good idea so the user does it:
If you are not restoring from a backup, try removing the file \"%s/backup_label\"
The user *could* copy a backup_label from another backup and there are ways we could detect that but I feel that should be material for a separate patch.
Regards,
-David
On Tue, Mar 17, 2026 at 11:50:29AM -0700, Haibo Yan wrote:
Thank you for the clarification. I have now read the code, and
overall it looks good to me. I only had one very small comment.
(Bottom-posting note from above, please be careful.)
I was just wondering whether the following might be slightly clearer:
```
memset(controlFile, 0, PG_CONTROL_FILE_SIZE);
memcpy(controlFile, ControlFile, sizeof(ControlFileData));
```I do not think this is a real issue, though.
{
ControlFile->backupStartPoint = checkPoint.redo;
ControlFile->backupEndRequired = backupEndRequired;
+ ControlFile->backupLabelRequired = false;
It sounds like it is going to be important to document the reason why
the flag is reset here (aka we don't need the backup_label file
anymore).
+backup_control_file(uint8_t *controlFile)
+{
+ ControlFileData *controlData = ((ControlFileData *)controlFile);
+
+ memset(controlFile + sizeof(ControlFileData), 0,
+ PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+ LWLockRelease(ControlFileLock);
+
+ controlData->backupLabelRequired = true;
+
+ INIT_CRC32C(controlData->crc);
+ COMP_CRC32C(controlData->crc, controlFile, offsetof(ControlFileData, crc));
+ FIN_CRC32C(controlData->crc);
I was wondering if we should have an assertion at least to cross-check
that the contents we store in shared memory never go out-of-sync with
the on-disk contents, in the shape of a USE_ASSERT_CHECKING block that
calls get_controlfile() and memcmp()'s the contents between shmem and
the on-disk file, while the LWLock is taken. We ship the control file
last on purpose, one reason being backups taken from standbys, so that
may be sensible to do.
Another property of the new control file flag that is implied in the
implementation but not documented is that we should never check for
backupLabelRequired when a backup_label is gone. Actually, the flag
is reset in InitWalRecovery() in the initial steps of recovery, and
the backup_label file is removed much later in StartupXLOG() just
*after* UpdateControlFile() to minimize the window where the contents
of the control file and the backup_label file is removed are
out-of-sync. This window means that if we crash between the
completion of UpdateControlFile() and the durable_rename() we could
have a flag reset with a backup_label still around. On restart,
recovery would fail, requiring a manual modification of the control
file, at least. It sounds to me that this implementation detail
should be documented clearly.
Finally, here is a general opinion. I like this patch, and it is
basically risk-free for base backups taken with the replication
protocol as we update the control file with the new flag set
on-the-fly.
Now, I am worried about backups that use pg_stop_backup(). Changing
backup APIs has always been a very sensitive area, and this is going
to require operators to update backup tools so as the control file
received as a result of pg_stop_backup() is copied, at the cost of
getting a failure if they don't do so. I will *not* proceed with this
change without a clear approval from some more committers or senior
hackers that they like this change (approach previously suggested by
Andres, actually, for what I can see). I am adding in CC a few
committers who have commented on this set of proposals and who have
touched the recovery code in the last few years, for awareness.
The timing is what it is, and we are at the end of a release cycle.
Let's see if we can reach a consensus of some kind.
--
Michael
On 3/18/26 08:43, Michael Paquier wrote:
On Tue, Mar 17, 2026 at 11:50:29AM -0700, Haibo Yan wrote:
Thank you for the clarification. I have now read the code, and
overall it looks good to me. I only had one very small comment.(Bottom-posting note from above, please be careful.)
I was just wondering whether the following might be slightly clearer:
```
memset(controlFile, 0, PG_CONTROL_FILE_SIZE);
memcpy(controlFile, ControlFile, sizeof(ControlFileData));
```
Yeah, perhaps I am being too clever here. The reason why I like this
pattern:
memset(controlFile + sizeof(ControlFileData), 0,
PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
...is that valgrind will complain if the ControlFileData part is not
completely initialized later on.
But your version is what is generally used in the code so I'm fine with
doing it that way.
{
ControlFile->backupStartPoint = checkPoint.redo;
ControlFile->backupEndRequired = backupEndRequired;
+ ControlFile->backupLabelRequired = false;It sounds like it is going to be important to document the reason why
the flag is reset here (aka we don't need the backup_label file
anymore).
Good point -- I'll add that in the next revision.
+backup_control_file(uint8_t *controlFile) +{ + ControlFileData *controlData = ((ControlFileData *)controlFile); + + memset(controlFile + sizeof(ControlFileData), 0, + PG_CONTROL_FILE_SIZE - sizeof(ControlFileData)); + + LWLockAcquire(ControlFileLock, LW_SHARED); + memcpy(controlFile, ControlFile, sizeof(ControlFileData)); + LWLockRelease(ControlFileLock); + + controlData->backupLabelRequired = true; + + INIT_CRC32C(controlData->crc); + COMP_CRC32C(controlData->crc, controlFile, offsetof(ControlFileData, crc)); + FIN_CRC32C(controlData->crc);I was wondering if we should have an assertion at least to cross-check
that the contents we store in shared memory never go out-of-sync with
the on-disk contents, in the shape of a USE_ASSERT_CHECKING block that
calls get_controlfile() and memcmp()'s the contents between shmem and
the on-disk file, while the LWLock is taken. We ship the control file
last on purpose, one reason being backups taken from standbys, so that
may be sensible to do.
As far as I can see this should always be true -- I audited all the
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE)
sections and the file is always saved once if is updated. Let me see if
I can add this check without too much pain, e.g. an additional parameter.
Another property of the new control file flag that is implied in the
implementation but not documented is that we should never check for
backupLabelRequired when a backup_label is gone.
I'm not sure what you mean here? That's exactly when we do want to check
as below:
/*
* No backup_label file has been found if we are here. Error if the
* control file requires backup_label.
*/
if (ControlFile->backupLabelRequired)
ereport(FATAL,
(errmsg("could not find backup_label required for recovery"),
errhint("backup_label must be present for recovery to
proceed")));
Actually, the flag> is reset in InitWalRecovery() in the initial
steps of recovery, and
the backup_label file is removed much later in StartupXLOG() just
*after* UpdateControlFile() to minimize the window where the contents
of the control file and the backup_label file is removed are
out-of-sync. This window means that if we crash between the
completion of UpdateControlFile() and the durable_rename() we could
have a flag reset with a backup_label still around. On restart,
recovery would fail, requiring a manual modification of the control
file, at least. It sounds to me that this implementation detail
should be documented clearly.
I'll test it but I don't think this is the case. If backup_label is
present then we'll just update pg_control again as we do now.
When backup_label is present the value of backupLabelRequired does not
matter so it just follows the prior logic.
Finally, here is a general opinion. I like this patch, and it is
basically risk-free for base backups taken with the replication
protocol as we update the control file with the new flag set
on-the-fly.
Glad to hear it!
Now, I am worried about backups that use pg_stop_backup(). Changing
backup APIs has always been a very sensitive area, and this is going
to require operators to update backup tools so as the control file
received as a result of pg_stop_backup() is copied, at the cost of
getting a failure if they don't do so.
Using the pg_control copy from pg_backup_stop() is entirely optional and
nothing is broken if vendors decide not to use it. This can be
demonstrated by applying the 01 patch without 02. In that case the tests
in 042_low_level_backup continue to run. You can also apply 01 and 02
and revert the test changes in 042_low_level_backup and that works, too.
Vendors can use the new feature if they want the protection of
backupLabelRequired and a guaranteed non-torn copy of pg_control but if
they do nothing then nothing breaks.
I will *not* proceed with this
change without a clear approval from some more committers or senior
hackers that they like this change (approach previously suggested by
Andres, actually, for what I can see).
You are correct and this was an omission on my part. If this gets to
commit we'll definitely want to mention that this flag was Andres'
suggestion.
I am adding in CC a few
committers who have commented on this set of proposals and who have
touched the recovery code in the last few years, for awareness.
The timing is what it is, and we are at the end of a release cycle.
Let's see if we can reach a consensus of some kind.
Perfect. I'm looking forward to their input.
I'll hold off on a new patch version until we get some feedback since I
don't think any of the requested changes are critical to the functionality.
Regards,
-David
On Wed, Mar 18, 2026 at 04:05:24AM +0000, David Steele wrote:
On 3/18/26 08:43, Michael Paquier wrote:
I was wondering if we should have an assertion at least to cross-check
that the contents we store in shared memory never go out-of-sync with
the on-disk contents, in the shape of a USE_ASSERT_CHECKING block that
calls get_controlfile() and memcmp()'s the contents between shmem and
the on-disk file, while the LWLock is taken. We ship the control file
last on purpose, one reason being backups taken from standbys, so that
may be sensible to do.As far as I can see this should always be true -- I audited all the
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE)
sections and the file is always saved once if is updated. Let me see if I
can add this check without too much pain, e.g. an additional parameter.
This matches with my reads of the code. The attached check, that can
be applied on top of your patches, passes under check-world.
Another property of the new control file flag that is implied in the
implementation but not documented is that we should never check for
backupLabelRequired when a backup_label is gone.I'm not sure what you mean here? That's exactly when we do want to check as
below:
Sorry for the confusion, I meant that "we should never check for
backupLabelRequired when we have a backup_label".
Using the pg_control copy from pg_backup_stop() is entirely optional and
nothing is broken if vendors decide not to use it. This can be demonstrated
by applying the 01 patch without 02. In that case the tests in
042_low_level_backup continue to run. You can also apply 01 and 02 and
revert the test changes in 042_low_level_backup and that works, too.
FWIW, after a second look I am actually wondering if 0002 is safe at
all. The contents of the control file are fetched after we are done
with do_pg_backup_stop(), and there could be a bunch of activity that
happens between the end of do_pg_backup_stop() and the
backup_control_file() call, where the control file could be updated
more and interfere with the recovery startup for some of its fields?
GUC parameter updates that may touch the control file are one thing
popping into mind.
--
Michael
Attachments:
0001-Add-consistency-check-for-shmem-and-on-disk-control-.patchtext/plain; charset=us-asciiDownload+11-1
On 3/18/26 11:53, Michael Paquier wrote:
On Wed, Mar 18, 2026 at 04:05:24AM +0000, David Steele wrote:
On 3/18/26 08:43, Michael Paquier wrote:
I was wondering if we should have an assertion at least to cross-check
that the contents we store in shared memory never go out-of-sync with
the on-disk contents, in the shape of a USE_ASSERT_CHECKING block that
calls get_controlfile() and memcmp()'s the contents between shmem and
the on-disk file, while the LWLock is taken. We ship the control file
last on purpose, one reason being backups taken from standbys, so that
may be sensible to do.As far as I can see this should always be true -- I audited all the
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE)
sections and the file is always saved once if is updated. Let me see if I
can add this check without too much pain, e.g. an additional parameter.This matches with my reads of the code. The attached check, that can
be applied on top of your patches, passes under check-world.
Looks good to me. I have integrated it into the attached patches. I just
changed it to look at our copy instead of the global ControlData since
that is the one we are going to save.
Using the pg_control copy from pg_backup_stop() is entirely optional and
nothing is broken if vendors decide not to use it. This can be demonstrated
by applying the 01 patch without 02. In that case the tests in
042_low_level_backup continue to run. You can also apply 01 and 02 and
revert the test changes in 042_low_level_backup and that works, too.FWIW, after a second look I am actually wondering if 0002 is safe at
all. The contents of the control file are fetched after we are done
with do_pg_backup_stop(), and there could be a bunch of activity that
happens between the end of do_pg_backup_stop() and the
backup_control_file() call, where the control file could be updated
more and interfere with the recovery startup for some of its fields?
GUC parameter updates that may touch the control file are one thing
popping into mind.
You are correct -- the copy of pg_control needs to happen before
do_pg_backup_stop(). An older version of this patch saved pg_control in
backup_state which made the prior location safe. However, I missed
moving this code when I moved pg_control out of backup_state. Code
review to the rescue.
I believe I have addressed all current review comments in the attached
patches. I tested Postgres crashing right after pg_control is updated
but before backup_label is renamed. It worked as expected on restart. I
also manually removed backup_label before restarting and that worked as
well. I wrote a comment documenting all that.
One thing I could do is note in the documentation that it is not
strictly necessary to get pg_control from pg_backup_stop(). Right now it
sounds like copying from disk is no longer an option -- but it is if you
are willing to accept the possibility of pg_control being torn. But I'd
hope most well-maintained backup software is taking care of that by now.
The problem with caveats in the docs is it can lead to confusion and
getting pg_control from pg_backup_stop() is just a better idea in general.
Thoughts?
Regards,
-David
On Wed, Mar 18, 2026 at 07:35:47AM +0000, David Steele wrote:
You are correct -- the copy of pg_control needs to happen before
do_pg_backup_stop(). An older version of this patch saved pg_control in
backup_state which made the prior location safe. However, I missed moving
this code when I moved pg_control out of backup_state. Code review to the
rescue.
Right. I am wondering also if the final result would not be better
without 0002, actually, focusing only on the "simpler" base backup
case through the replication protocol, and you are making a good case
in mentioning it as not absolutely mandatory for base backups that are
taken through the SQL functions. One could always tweak the flag
manually in the control file based on the contents taken from the data
folder. That's more hairy than writing the entire file, for sure,
still possible.
--
Michael