Add system identifier to backup manifest
Hi All,
With the attached patch, the backup manifest will have a new key item as
"System-Identifier" 64-bit integer whose value is derived from pg_control
while
generating it, and the manifest version bumps to 2.
This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifest system
identifier against the backup control file and fails if they don’t match.
Similarly, pg_basebackup increment backup will fail if the manifest system
identifier does not match with the server system identifier. The
pg_combinebackup is already a bit smarter -- checks the system identifier
from
the pg_control of all the backups, with this patch the manifest system
identifier also validated.
For backward compatibility, the manifest system identifier validation will
be
skipped for version 1.
--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com
Attachments:
v1-0001-Add-system-identifier-to-backup-manifest.patchapplication/x-patch; name=v1-0001-Add-system-identifier-to-backup-manifest.patchDownload+286-39
On 2024-Jan-17, Amul Sul wrote:
This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifest system
identifier against the backup control file and fails if they don’t match.
Similarly, pg_basebackup increment backup will fail if the manifest system
identifier does not match with the server system identifier. The
pg_combinebackup is already a bit smarter -- checks the system identifier
from
the pg_control of all the backups, with this patch the manifest system
identifier also validated.
Hmm, okay, but what if I take a full backup from a primary server and
later I want an incremental from a standby, or the other way around?
Will this prevent me from using such a combination?
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick." (Andrew Sullivan)
/messages/by-id/20050809113420.GD2768@phlogiston.dyndns.org
On Wed, Jan 17, 2024 at 5:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:
On 2024-Jan-17, Amul Sul wrote:
This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifestsystem
identifier against the backup control file and fails if they don’t match.
Similarly, pg_basebackup increment backup will fail if the manifestsystem
identifier does not match with the server system identifier. The
pg_combinebackup is already a bit smarter -- checks the system identifier
from
the pg_control of all the backups, with this patch the manifest system
identifier also validated.Hmm, okay, but what if I take a full backup from a primary server and
later I want an incremental from a standby, or the other way around?
Will this prevent me from using such a combination?
Yes, that worked for me where the system identifier was the same on
master as well standby.
Regards,
Amul
On Wed, Jan 17, 2024 at 6:45 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Hmm, okay, but what if I take a full backup from a primary server and
later I want an incremental from a standby, or the other way around?
Will this prevent me from using such a combination?
The system identifier had BETTER match in such cases. If it doesn't,
somebody's run pg_resetwal on your standby since it was created... and
in that case, no incremental backup for you!
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jan 17, 2024 at 6:31 AM Amul Sul <sulamul@gmail.com> wrote:
With the attached patch, the backup manifest will have a new key item as
"System-Identifier" 64-bit integer whose value is derived from pg_control while
generating it, and the manifest version bumps to 2.This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifest system
identifier against the backup control file and fails if they don’t match.
Similarly, pg_basebackup increment backup will fail if the manifest system
identifier does not match with the server system identifier. The
pg_combinebackup is already a bit smarter -- checks the system identifier from
the pg_control of all the backups, with this patch the manifest system
identifier also validated.
Thanks for working on this. Without this, I think what happens is that
you can potentially take an incremental backup from the "wrong"
server, if the states of the systems are such that all of the other
sanity checks pass. When you run pg_combinebackup, it'll discover the
problem and tell you, but you ideally want to discover such errors at
backup time rather than at restore time. This addresses that. And,
overall, I think it's a pretty good patch. But I nonetheless have a
bunch of comments.
- The associated value is always the integer 1.
+ The associated value is the integer, either 1 or 2.
is an integer. Beginning in <productname>PostgreSQL</productname> 17,
it is 2; in older versions, it is 1.
+ context.identity_cb = manifest_process_identity;
I'm not really on board with calling the system identifier "identity"
throughout the patch. I think it should just say system_identifier. If
we were going to abbreviate, I'd prefer something like "sysident" that
looks like it's derived from "system identifier" rather than
"identity" which is a different word altogether. But I don't think we
should abbreviate unless doing so creates *ridiculously* long
identifier names.
+static void
+manifest_process_identity(JsonManifestParseContext *context,
+ int manifest_version,
+ uint64 manifest_system_identifier)
+{
+ uint64 system_identifier;
+
+ /* Manifest system identifier available in version 2 or later */
+ if (manifest_version == 1)
+ return;
I think you've got the wrong idea here. I think this function would
only get called if System-Identifier is present in the manifest, so if
it's a v1 manifest, this would never get called, so this if-statement
would not ever do anything useful. I think what you should do is (1)
if the client supplies a v1 manifest, reject it, because surely that's
from an older server version that doesn't support incremental backup;
but do that when the version is parsed rather than here; and (2) also
detect and reject the case when it's supposedly a v2 manifest but this
is absent.
(1) should really be done when the version number is parsed, so I
suspect you may need to add manifest_version_cb.
+static void
+combinebackup_identity_cb(JsonManifestParseContext *context,
+ int manifest_version,
+ uint64 manifest_system_identifier)
+{
+ parser_context *private_context = context->private_data;
+ uint64 system_identifier = private_context->system_identifier;
+
+ /* Manifest system identifier available in version 2 or later */
+ if (manifest_version == 1)
+ return;
Very similar to the above case. Just reject a version 1 manifest as
soon as we see the version number. In this function, set a flag
indicating we saw the system identifier; if at the end of parsing that
flag is not set, kaboom.
- parse_manifest_file(manifest_path, &context.ht, &first_wal_range);
+ parse_manifest_file(manifest_path, &context.ht, &first_wal_range,
+ context.backup_directory);
Don't do this! parse_manifest_file() should just record everything
found in the manifest in the context object. Syntax validation should
happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN
and we should reject that at this stage) but semantic validation
should happen later (e.g. "0/0" can't be a the correct backup end LSN
but we don't figure that out while parsing, but rather later). I think
you should actually move validation of the system identifier to the
point where the directory walk encounters the control file (and update
the docs and tests to match that decision). Imagine if you wanted to
validate a tar-format backup; then you wouldn't have random access to
the directory. You'd see the manifest file first, and then all the
files in a random order, with one chance to look at each one.
(This is, in fact, a feature I think we should implement.)
- if (strcmp(token, "1") != 0)
+ parse->manifest_version = atoi(token);
+ if (parse->manifest_version != 1 && parse->manifest_version != 2)
json_manifest_parse_failure(parse->context,
"unexpected manifest version");
Please either (a) don't do a string-to-integer conversion and just
strcmp() twice or (b) use strtol so that you can check that it
succeeded. I don't want to accept manifest version 1a as 1.
+/*
+ * Validate manifest system identifier against the database server system
+ * identifier.
+ */
This comment assumes you know what the callback is going to do, but
you don't. This should be more like the comment for
json_manifest_finalize_file or json_manifest_finalize_wal_range.
--
Robert Haas
EDB: http://www.enterprisedb.com
I have also done a review of the patch and some testing. The patch looks
good, and I agree with Robert's comments.
On Wed, Jan 17, 2024 at 8:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 17, 2024 at 6:31 AM Amul Sul <sulamul@gmail.com> wrote:
With the attached patch, the backup manifest will have a new key item as
"System-Identifier" 64-bit integer whose value is derived from
pg_control while
generating it, and the manifest version bumps to 2.
This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifest
system
identifier against the backup control file and fails if they don’t
match.
Similarly, pg_basebackup increment backup will fail if the manifest
system
identifier does not match with the server system identifier. The
pg_combinebackup is already a bit smarter -- checks the system
identifier from
the pg_control of all the backups, with this patch the manifest system
identifier also validated.Thanks for working on this. Without this, I think what happens is that
you can potentially take an incremental backup from the "wrong"
server, if the states of the systems are such that all of the other
sanity checks pass. When you run pg_combinebackup, it'll discover the
problem and tell you, but you ideally want to discover such errors at
backup time rather than at restore time. This addresses that. And,
overall, I think it's a pretty good patch. But I nonetheless have a
bunch of comments.- The associated value is always the integer 1. + The associated value is the integer, either 1 or 2.is an integer. Beginning in <productname>PostgreSQL</productname> 17,
it is 2; in older versions, it is 1.+ context.identity_cb = manifest_process_identity;
I'm not really on board with calling the system identifier "identity"
throughout the patch. I think it should just say system_identifier. If
we were going to abbreviate, I'd prefer something like "sysident" that
looks like it's derived from "system identifier" rather than
"identity" which is a different word altogether. But I don't think we
should abbreviate unless doing so creates *ridiculously* long
identifier names.+static void +manifest_process_identity(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + uint64 system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return;I think you've got the wrong idea here. I think this function would
only get called if System-Identifier is present in the manifest, so if
it's a v1 manifest, this would never get called, so this if-statement
would not ever do anything useful. I think what you should do is (1)
if the client supplies a v1 manifest, reject it, because surely that's
from an older server version that doesn't support incremental backup;
but do that when the version is parsed rather than here; and (2) also
detect and reject the case when it's supposedly a v2 manifest but this
is absent.(1) should really be done when the version number is parsed, so I
suspect you may need to add manifest_version_cb.+static void +combinebackup_identity_cb(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + parser_context *private_context = context->private_data; + uint64 system_identifier = private_context->system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return;Very similar to the above case. Just reject a version 1 manifest as
soon as we see the version number. In this function, set a flag
indicating we saw the system identifier; if at the end of parsing that
flag is not set, kaboom.- parse_manifest_file(manifest_path, &context.ht, &first_wal_range); + parse_manifest_file(manifest_path, &context.ht, &first_wal_range, + context.backup_directory);Don't do this! parse_manifest_file() should just record everything
found in the manifest in the context object. Syntax validation should
happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN
and we should reject that at this stage) but semantic validation
should happen later (e.g. "0/0" can't be a the correct backup end LSN
but we don't figure that out while parsing, but rather later). I think
you should actually move validation of the system identifier to the
point where the directory walk encounters the control file (and update
the docs and tests to match that decision). Imagine if you wanted to
validate a tar-format backup; then you wouldn't have random access to
the directory. You'd see the manifest file first, and then all the
files in a random order, with one chance to look at each one.(This is, in fact, a feature I think we should implement.)
- if (strcmp(token, "1") != 0) + parse->manifest_version = atoi(token); + if (parse->manifest_version != 1 && parse->manifest_version != 2) json_manifest_parse_failure(parse->context, "unexpected manifest version");Please either (a) don't do a string-to-integer conversion and just
strcmp() twice or (b) use strtol so that you can check that it
succeeded. I don't want to accept manifest version 1a as 1.
+/* + * Validate manifest system identifier against the database serversystem
+ * identifier.
+ */This comment assumes you know what the callback is going to do, but
you don't. This should be more like the comment for
json_manifest_finalize_file or json_manifest_finalize_wal_range.
This comment caught me off-guard too. After some testing and detailed
review I found that this is
called by pg_verifybackup and pg_combinebackup both of which do not
validate against any
running database system.
--
Robert Haas
EDB: http://www.enterprisedb.com
--
Thanks & Regards,
Sravan Velagandula
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 17, 2024 at 08:46:09AM -0500, Robert Haas wrote:
On Wed, Jan 17, 2024 at 6:45 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Hmm, okay, but what if I take a full backup from a primary server and
later I want an incremental from a standby, or the other way around?
Will this prevent me from using such a combination?The system identifier had BETTER match in such cases. If it doesn't,
somebody's run pg_resetwal on your standby since it was created... and
in that case, no incremental backup for you!
There is an even stronger check than that at replay as we also store
the system identifier in XLogLongPageHeaderData and cross-check it
with the contents of the control file. Having a field in the backup
manifest makes for a much faster detection, even if that's not the
same as replaying things, it can avoid a lot of problems when
combining backup pieces. I'm +1 for Amul's patch concept.
--
Michael
On Wed, Jan 17, 2024 at 8:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 17, 2024 at 6:31 AM Amul Sul <sulamul@gmail.com> wrote:
With the attached patch, the backup manifest will have a new key item as
"System-Identifier" 64-bit integer whose value is derived frompg_control while
generating it, and the manifest version bumps to 2.
This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifestsystem
identifier against the backup control file and fails if they don’t match.
Similarly, pg_basebackup increment backup will fail if the manifestsystem
identifier does not match with the server system identifier. The
pg_combinebackup is already a bit smarter -- checks the systemidentifier from
the pg_control of all the backups, with this patch the manifest system
identifier also validated.Thanks for working on this. Without this, I think what happens is that
you can potentially take an incremental backup from the "wrong"
server, if the states of the systems are such that all of the other
sanity checks pass. When you run pg_combinebackup, it'll discover the
problem and tell you, but you ideally want to discover such errors at
backup time rather than at restore time. This addresses that. And,
overall, I think it's a pretty good patch. But I nonetheless have a
bunch of comments.
Thank you for the review.
- The associated value is always the integer 1. + The associated value is the integer, either 1 or 2.is an integer. Beginning in <productname>PostgreSQL</productname> 17,
it is 2; in older versions, it is 1.
Ok,
+ context.identity_cb = manifest_process_identity;
I'm not really on board with calling the system identifier "identity"
throughout the patch. I think it should just say system_identifier. If
we were going to abbreviate, I'd prefer something like "sysident" that
looks like it's derived from "system identifier" rather than
"identity" which is a different word altogether. But I don't think we
should abbreviate unless doing so creates *ridiculously* long
identifier names.
Ok, used "system identifier" at all the places.
+static void +manifest_process_identity(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + uint64 system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return;I think you've got the wrong idea here. I think this function would
only get called if System-Identifier is present in the manifest, so if
it's a v1 manifest, this would never get called, so this if-statement
would not ever do anything useful. I think what you should do is (1)
if the client supplies a v1 manifest, reject it, because surely that's
from an older server version that doesn't support incremental backup;
but do that when the version is parsed rather than here; and (2) also
detect and reject the case when it's supposedly a v2 manifest but this
is absent.(1) should really be done when the version number is parsed, so I
suspect you may need to add manifest_version_cb.+static void +combinebackup_identity_cb(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + parser_context *private_context = context->private_data; + uint64 system_identifier = private_context->system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return;Very similar to the above case. Just reject a version 1 manifest as
soon as we see the version number. In this function, set a flag
indicating we saw the system identifier; if at the end of parsing that
flag is not set, kaboom.
Ok, I added a version_cb callback. Using this pg_combinebackup &
pg_basebackup
will report an error for manifest version 1, whereas pg_verifybackup
doesn't (not needed IIUC).
- parse_manifest_file(manifest_path, &context.ht, &first_wal_range); + parse_manifest_file(manifest_path, &context.ht, &first_wal_range, + context.backup_directory);Don't do this! parse_manifest_file() should just record everything
found in the manifest in the context object. Syntax validation should
happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN
and we should reject that at this stage) but semantic validation
should happen later (e.g. "0/0" can't be a the correct backup end LSN
but we don't figure that out while parsing, but rather later). I think
you should actually move validation of the system identifier to the
point where the directory walk encounters the control file (and update
the docs and tests to match that decision). Imagine if you wanted to
validate a tar-format backup; then you wouldn't have random access to
the directory. You'd see the manifest file first, and then all the
files in a random order, with one chance to look at each one.
Agree. I have moved the system identifier validation after manifest
parsing.
But, not in the directory walkthrough since in pg_combinebackup, we don't
really needed to open the pg_control file to get the system identifier,
which we
have from the check_control_files().
(This is, in fact, a feature I think we should implement.)
- if (strcmp(token, "1") != 0) + parse->manifest_version = atoi(token); + if (parse->manifest_version != 1 && parse->manifest_version != 2) json_manifest_parse_failure(parse->context, "unexpected manifest version");Please either (a) don't do a string-to-integer conversion and just
strcmp() twice or (b) use strtol so that you can check that it
succeeded. I don't want to accept manifest version 1a as 1.
Understood, corrected in the attached version.
+/* + * Validate manifest system identifier against the database server system + * identifier. + */This comment assumes you know what the callback is going to do, but
you don't. This should be more like the comment for
json_manifest_finalize_file or json_manifest_finalize_wal_range.
Ok.
Updated version is attached.
Regards,
Amul
Attachments:
v2-0001-Add-system-identifier-to-backup-manifest.patchapplication/x-patch; name=v2-0001-Add-system-identifier-to-backup-manifest.patchDownload+404-33
On Thu, Jan 18, 2024 at 6:39 AM Sravan Kumar <sravanvcybage@gmail.com>
wrote:
I have also done a review of the patch and some testing. The patch looks
good, and I agree with Robert's comments.
Thank you for your review, testing and the offline discussion.
Regards,
Amul
On Fri, Jan 19, 2024 at 10:36 PM Amul Sul <sulamul@gmail.com> wrote:
On Wed, Jan 17, 2024 at 8:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
Updated version is attached.
Another updated version attached -- fix missing manifest version check in
pg_verifybackup before system identifier validation.
Regards,
Amul
Attachments:
v3-0001-Add-system-identifier-to-backup-manifest.patchapplication/x-patch; name=v3-0001-Add-system-identifier-to-backup-manifest.patchDownload+413-33
On Mon, Jan 22, 2024 at 10:08 AM Amul Sul <sulamul@gmail.com> wrote:
On Fri, Jan 19, 2024 at 10:36 PM Amul Sul <sulamul@gmail.com> wrote:
On Wed, Jan 17, 2024 at 8:40 PM Robert Haas <robertmhaas@gmail.com>
wrote:Updated version is attached.
Another updated version attached -- fix missing manifest version check in
pg_verifybackup before system identifier validation.
Thinking a bit more on this, I realized parse_manifest_file() has many out
parameters. Instead parse_manifest_file() should simply return manifest data
like load_backup_manifest(). Attached 0001 patch doing the same, and
removed
parser_context structure, and added manifest_data, and did the required
adjustments to pg_verifybackup code.
Regards,
Amul
On Mon, Jan 22, 2024 at 2:22 AM Amul Sul <sulamul@gmail.com> wrote:
Thinking a bit more on this, I realized parse_manifest_file() has many out
parameters. Instead parse_manifest_file() should simply return manifest data
like load_backup_manifest(). Attached 0001 patch doing the same, and removed
parser_context structure, and added manifest_data, and did the required
adjustments to pg_verifybackup code.
InitializeBackupManifest(&manifest, opt->manifest,
-
opt->manifest_checksum_type);
+
opt->manifest_checksum_type,
+ GetSystemIdentifier());
InitializeBackupManifest() can just call GetSystemIdentifier() itself,
instead of passing another parameter, I think.
+ if (manifest_version == 1)
+ context->error_cb(context,
+ "%s: backup manifest
doesn't support incremental changes",
+
private_context->backup_directory);
I think this is weird. First, there doesn't seem to be any reason to
bounce through error_cb() here. You could just call pg_fatal(), as we
do elsewhere in this file. Second, there doesn't seem to be any need
to include the backup directory in this error message. We include the
file name (not the directory name) in errors that pertain to the file
itself, like if we can't open or read it. But we don't do that for
semantic errors about the manifest contents (cf.
combinebackup_per_file_cb). This file would need a lot fewer charges
if you didn't feed the backup directory name through here. Third, the
error message is not well-chosen, because a user who looks at it won't
understand WHY the manifest doesn't support incremental changes. I
suggest "backup manifest version 1 does not support incremental
backup".
+ /* Incremental backups supported on manifest version 2 or later */
+ if (manifest_version == 1)
+ context->error_cb(context,
+ "incremental backups
cannot be taken for this backup");
Let's use the same error message as in the previous case here also.
+ for (i = 0; i < n_backups; i++)
+ {
+ if (manifests[i]->system_identifier != system_identifier)
+ {
+ char *controlpath;
+
+ controlpath = psprintf("%s/%s",
prior_backup_dirs[i], "global/pg_control");
+
+ pg_fatal("manifest is from different database
system: manifest database system identifier is %llu, %s system
identifier is %llu",
+ (unsigned long long)
manifests[i]->system_identifier,
+ controlpath,
+ (unsigned long long)
system_identifier);
+ }
+ }
check_control_files() already verifies that all of the control files
contain the same system identifier as each other, so what we're really
checking here is that the backup manifest in each directory has the
same system identifier as the control file in that same directory. One
problem is that backup manifests are optional here, as per the comment
in load_backup_manifests(), so you need to skip over NULL entries
cleanly to avoid seg faulting if one is missing. I also think the
error message should be changed. How about "%s: manifest system
identifier is %llu, but control file has %llu"?
+ context->error_cb(context,
+ "manifest is from
different database system: manifest database system identifier is
%llu, pg_control database system identifier is %llu",
+ (unsigned long long)
manifest_system_identifier,
+ (unsigned long long)
system_identifier);
And here, while I'm kibitzing, how about "manifest system identifier
is %llu, but this system's identifier is %llu"?
- qr/could not open directory/,
+ qr/could not open file/,
I don't think that the expected error message here should be changing.
Does it really, with the latest patch version? Why? Can we fix that?
+ else if (!parse->saw_system_identifier_field &&
+
strcmp(parse->manifest_version, "1") != 0)
I don't think this has any business testing the manifest version.
That's something to sort out at some later stage.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Jan 24, 2024 at 10:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jan 22, 2024 at 2:22 AM Amul Sul <sulamul@gmail.com> wrote:
Thinking a bit more on this, I realized parse_manifest_file() has many
out
parameters. Instead parse_manifest_file() should simply return manifest
data
like load_backup_manifest(). Attached 0001 patch doing the same, and
removed
parser_context structure, and added manifest_data, and did the required
adjustments to pg_verifybackup code.InitializeBackupManifest(&manifest, opt->manifest,
-
opt->manifest_checksum_type);
+
opt->manifest_checksum_type,
+
GetSystemIdentifier());InitializeBackupManifest() can just call GetSystemIdentifier() itself,
instead of passing another parameter, I think.
Ok.
+ if (manifest_version == 1) + context->error_cb(context, + "%s: backup manifest doesn't support incremental changes", + private_context->backup_directory);I think this is weird. First, there doesn't seem to be any reason to
bounce through error_cb() here. You could just call pg_fatal(), as we
do elsewhere in this file. Second, there doesn't seem to be any need
to include the backup directory in this error message. We include the
file name (not the directory name) in errors that pertain to the file
itself, like if we can't open or read it. But we don't do that for
semantic errors about the manifest contents (cf.
combinebackup_per_file_cb). This file would need a lot fewer charges
if you didn't feed the backup directory name through here. Third, the
error message is not well-chosen, because a user who looks at it won't
understand WHY the manifest doesn't support incremental changes. I
suggest "backup manifest version 1 does not support incremental
backup".+ /* Incremental backups supported on manifest version 2 or later */ + if (manifest_version == 1) + context->error_cb(context, + "incremental backups cannot be taken for this backup");Let's use the same error message as in the previous case here also.
Ok.
+ for (i = 0; i < n_backups; i++) + { + if (manifests[i]->system_identifier != system_identifier) + { + char *controlpath; + + controlpath = psprintf("%s/%s", prior_backup_dirs[i], "global/pg_control"); + + pg_fatal("manifest is from different database system: manifest database system identifier is %llu, %s system identifier is %llu", + (unsigned long long) manifests[i]->system_identifier, + controlpath, + (unsigned long long) system_identifier); + } + }check_control_files() already verifies that all of the control files
contain the same system identifier as each other, so what we're really
checking here is that the backup manifest in each directory has the
same system identifier as the control file in that same directory. One
problem is that backup manifests are optional here, as per the comment
in load_backup_manifests(), so you need to skip over NULL entries
cleanly to avoid seg faulting if one is missing. I also think the
error message should be changed. How about "%s: manifest system
identifier is %llu, but control file has %llu"?
Ok.
+ context->error_cb(context, + "manifest is from different database system: manifest database system identifier is %llu, pg_control database system identifier is %llu", + (unsigned long long) manifest_system_identifier, + (unsigned long long) system_identifier);And here, while I'm kibitzing, how about "manifest system identifier
is %llu, but this system's identifier is %llu"?
I used "database system identifier" instead of "this system's identifier "
like
we are using in WalReceiverMain() and libpqrcv_identify_system().
- qr/could not open directory/, + qr/could not open file/,I don't think that the expected error message here should be changing.
Does it really, with the latest patch version? Why? Can we fix that?
Because, we were trying to access pg_control to check the system identifier
before any other backup directory/file validation.
+ else if (!parse->saw_system_identifier_field && + strcmp(parse->manifest_version, "1") != 0)I don't think this has any business testing the manifest version.
That's something to sort out at some later stage.
That is for backward compatibility, otherwise, we would have an "expected
system identifier" error for manifest version 1.
Thank you for the review-comments, updated version attached.
Regards,
Amul
On Thu, Jan 25, 2024 at 2:52 AM Amul Sul <sulamul@gmail.com> wrote:
Thank you for the review-comments, updated version attached.
I generally agree with 0001. I spent a long time thinking about your
decision to make verifier_context contain a pointer to manifest_data
instead of, as it does currently, a pointer to manifest_files_hash. I
don't think that's a horrible idea, but it also doesn't seem to be
used anywhere currently. One advantage of the current approach is that
we know that none of the code downstream of verify_backup_directory()
or verify_backup_checksums() actually cares about anything other than
the manifest_files_hash. That's kind of nice. If we didn't change this
as you have done here, then we would need to continue passing the WAL
ranges to parse_required_walI() and the system identifier would have
to be passed explicitly to the code that checks the system identifier,
but that's not such a bad thing, either. It makes it clear which
functions are using which information.
But before you go change anything there, exactly when should 0002 be
checking the system identifier in the control file? What happens now
is that we first walk over the directory tree and make sure we have
the files (verify_backup_directory) and then go through and verify
checksums in a second pass (verify_backup_checksums). We do this
because it lets us report problems that can be detected cheaply --
like missing files -- relatively quickly, and problems that are more
expensive to detect -- like mismatching checksums -- only after we've
reported all the cheap-to-detect problems. At what stage should we
verify the control file? I don't really like verifying it first, as
you've done, because I think the error message change in
004_options.pl is a clear regression. When the whole directory is
missing, it's much more pleasant to complain about the directory being
missing than some file inside the directory being missing.
What I'd be inclined to suggest is that you have verify_backup_file()
notice when the file it's being asked to verify is the control file,
and have it check the system identifier at that stage. I think if you
do that, then the error message change in 004_options.pl goes away.
Now, to do that, you'd need to have the whole manifest_data available
from the context, not just the manifest_files_hash, so that you can
see the expected system identifier. And, interestingly, if you take
this approach, then it appears to me that 0001 is correct as-is and
doesn't need any changes.
Thoughts?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Feb 1, 2024 at 3:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jan 25, 2024 at 2:52 AM Amul Sul <sulamul@gmail.com> wrote:
Thank you for the review-comments, updated version attached.
I generally agree with 0001. I spent a long time thinking about your
decision to make verifier_context contain a pointer to manifest_data
instead of, as it does currently, a pointer to manifest_files_hash. I
don't think that's a horrible idea, but it also doesn't seem to be
used anywhere currently. One advantage of the current approach is that
we know that none of the code downstream of verify_backup_directory()
or verify_backup_checksums() actually cares about anything other than
the manifest_files_hash. That's kind of nice. If we didn't change this
as you have done here, then we would need to continue passing the WAL
ranges to parse_required_walI() and the system identifier would have
to be passed explicitly to the code that checks the system identifier,
but that's not such a bad thing, either. It makes it clear which
functions are using which information.
I intended to minimize the out param of parse_manifest_file(), which
currently
returns manifest_files_hash and manifest_wal_range, and I need two more --
manifest versions and the system identifier.
But before you go change anything there, exactly when should 0002 be
checking the system identifier in the control file? What happens now
is that we first walk over the directory tree and make sure we have
the files (verify_backup_directory) and then go through and verify
checksums in a second pass (verify_backup_checksums). We do this
because it lets us report problems that can be detected cheaply --
like missing files -- relatively quickly, and problems that are more
expensive to detect -- like mismatching checksums -- only after we've
reported all the cheap-to-detect problems. At what stage should we
verify the control file? I don't really like verifying it first, as
you've done, because I think the error message change in
004_options.pl is a clear regression. When the whole directory is
missing, it's much more pleasant to complain about the directory being
missing than some file inside the directory being missing.What I'd be inclined to suggest is that you have verify_backup_file()
notice when the file it's being asked to verify is the control file,
and have it check the system identifier at that stage. I think if you
do that, then the error message change in 004_options.pl goes away.
Now, to do that, you'd need to have the whole manifest_data available
from the context, not just the manifest_files_hash, so that you can
see the expected system identifier. And, interestingly, if you take
this approach, then it appears to me that 0001 is correct as-is and
doesn't need any changes.
Yeah, we can do that, but I think it is a bit inefficient to have strcmp()
check for the pg_control file on each verify_backup_file() call, despite, we
know that path. Also, I think, we need additional handling to ensure that
the
system identifier has been verified in verify_backup_file(), what if the
pg_control file itself missing from the backup -- might be a rare case, but
possible.
For now, we can do the system identifier validation after
verify_backup_directory().
Regards,
Amul
On Thu, Feb 1, 2024 at 2:18 AM Amul Sul <sulamul@gmail.com> wrote:
I intended to minimize the out param of parse_manifest_file(), which currently
returns manifest_files_hash and manifest_wal_range, and I need two more --
manifest versions and the system identifier.
Sure, but you could do context.ht = manifest_data->files instead of
context.manifest = manifest_data. The question isn't whether you
should return the whole manifest_data from parse_manifest_file -- I
agree with that decision -- but rather whether you should feed the
whole thing through into the context, or just the file hash.
Yeah, we can do that, but I think it is a bit inefficient to have strcmp()
check for the pg_control file on each verify_backup_file() call, despite, we
know that path. Also, I think, we need additional handling to ensure that the
system identifier has been verified in verify_backup_file(), what if the
pg_control file itself missing from the backup -- might be a rare case, but
possible.For now, we can do the system identifier validation after
verify_backup_directory().
Yes, that's another option, but I don't think it's as good.
Suppose you do it that way. Then what will happen when the file is
altogether missing or inaccessible? I think verify_backup_file() will
complain, and then you'll have to do something ugly to avoid having
verify_system_identifier() emit the same complaint all over again.
Remember, unless you find some complicated way of passing data around,
it won't know whether verify_backup_file() emitted a warning or not --
it can redo the stat() and see what happens, but it's not absolutely
guaranteed to be the same as what happened before. Making sure that
you always emit any given complaint once rather than twice or zero
times is going to be tricky.
It seems more natural to me to just accept the (small) cost of a
strcmp() in verify_backup_file(). If the initial stat() fails, it
emits whatever complaint is appropriate and returns and the logic to
check the system identifier is never reached. If it succeeds, you can
proceed to try to open the file and do what you need to do.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 2, 2024 at 12:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Feb 1, 2024 at 2:18 AM Amul Sul <sulamul@gmail.com> wrote:
I intended to minimize the out param of parse_manifest_file(), which
currently
returns manifest_files_hash and manifest_wal_range, and I need two more
--
manifest versions and the system identifier.
Sure, but you could do context.ht = manifest_data->files instead of
context.manifest = manifest_data. The question isn't whether you
should return the whole manifest_data from parse_manifest_file -- I
agree with that decision -- but rather whether you should feed the
whole thing through into the context, or just the file hash.Yeah, we can do that, but I think it is a bit inefficient to have
strcmp()
check for the pg_control file on each verify_backup_file() call,
despite, we
know that path. Also, I think, we need additional handling to ensure
that the
system identifier has been verified in verify_backup_file(), what if the
pg_control file itself missing from the backup -- might be a rare case,but
possible.
For now, we can do the system identifier validation after
verify_backup_directory().Yes, that's another option, but I don't think it's as good.
Suppose you do it that way. Then what will happen when the file is
altogether missing or inaccessible? I think verify_backup_file() will
complain, and then you'll have to do something ugly to avoid having
verify_system_identifier() emit the same complaint all over again.
Remember, unless you find some complicated way of passing data around,
it won't know whether verify_backup_file() emitted a warning or not --
it can redo the stat() and see what happens, but it's not absolutely
guaranteed to be the same as what happened before. Making sure that
you always emit any given complaint once rather than twice or zero
times is going to be tricky.It seems more natural to me to just accept the (small) cost of a
strcmp() in verify_backup_file(). If the initial stat() fails, it
emits whatever complaint is appropriate and returns and the logic to
check the system identifier is never reached. If it succeeds, you can
proceed to try to open the file and do what you need to do.
Ok, I did that way in the attached version, I have passed the control file's
full path as a second argument to verify_system_identifier() what we gets in
verify_backup_file(), but that is not doing any useful things with it,
since we
were using get_controlfile() to open the control file, which takes the
directory as an input and computes the full path on its own.
Regards,
Amul
On Wed, Feb 14, 2024 at 12:29:07PM +0530, Amul Sul wrote:
Ok, I did that way in the attached version, I have passed the control file's
full path as a second argument to verify_system_identifier() what we gets in
verify_backup_file(), but that is not doing any useful things with it,
since we
were using get_controlfile() to open the control file, which takes the
directory as an input and computes the full path on its own.
I've read through the patch, and that's pretty cool.
-static void
-parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p,
- manifest_wal_range **first_wal_range_p)
+static manifest_data *
+parse_manifest_file(char *manifest_path)
In 0001, should the comment describing this routine be updated as
well?
+ identifier with pg_control of the backup directory or fails verification
This is missing a <filename> markup here.
+ <productname>PostgreSQL</productname> 17, it is 2; in older versions,
+ it is 1.
Perhaps a couple of <literal>s here.
+ if (strcmp(parse->manifest_version, "1") != 0 &&
+ strcmp(parse->manifest_version, "2") != 0)
+ json_manifest_parse_failure(parse->context,
+ "unexpected manifest version");
+
+ /* Parse version. */
+ version = strtoi64(parse->manifest_version, &ep, 10);
+ if (*ep)
+ json_manifest_parse_failure(parse->context,
+ "manifest version not an integer");
+
+ /* Invoke the callback for version */
+ context->version_cb(context, version);
Shouldn't these two checks be reversed? And is there actually a need
for the first check at all knowing that the version callback should be
in charge of performing the validation vased on the version received?
+my $node2;
+{
+ local $ENV{'INITDB_TEMPLATE'} = undef;
Not sure that it is a good idea to duplicate this pattern twice.
Shouldn't this be something we'd want to control with an option in the
init() method instead?
+static void
+verify_system_identifier(verifier_context *context, char *controlpath)
Relying both on controlpath, being a full path to the control file
including the data directory, and context->backup_directory to read
the contents of the control file looks a bit weird. Wouldn't it be
cleaner to just use one of them?
--
Michael
On Thu, Feb 15, 2024 at 7:18 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Feb 14, 2024 at 12:29:07PM +0530, Amul Sul wrote:
Ok, I did that way in the attached version, I have passed the control
file's
full path as a second argument to verify_system_identifier() what we
gets in
verify_backup_file(), but that is not doing any useful things with it,
since we
were using get_controlfile() to open the control file, which takes the
directory as an input and computes the full path on its own.I've read through the patch, and that's pretty cool.
Thank you for looking into this.
-static void -parse_manifest_file(char *manifest_path, manifest_files_hash **ht_p, - manifest_wal_range **first_wal_range_p) +static manifest_data * +parse_manifest_file(char *manifest_path)In 0001, should the comment describing this routine be updated as
well?
Ok, updated in the attached version.
+ identifier with pg_control of the backup directory or fails
verificationThis is missing a <filename> markup here.
Done, in the attached version.
+ <productname>PostgreSQL</productname> 17, it is 2; in older versions, + it is 1.Perhaps a couple of <literal>s here.
Done.
+ if (strcmp(parse->manifest_version, "1") != 0 && + strcmp(parse->manifest_version, "2") != 0) + json_manifest_parse_failure(parse->context, + "unexpected manifest version"); + + /* Parse version. */ + version = strtoi64(parse->manifest_version, &ep, 10); + if (*ep) + json_manifest_parse_failure(parse->context, + "manifest version not an integer"); + + /* Invoke the callback for version */ + context->version_cb(context, version);Shouldn't these two checks be reversed? And is there actually a need
for the first check at all knowing that the version callback should be
in charge of performing the validation vased on the version received?
Make sense, reversed the order.
I think, particular allowed versions should be placed at the central place,
and
the callback can check and react on the versions suitable to them, IMHO.
+my $node2; +{ + local $ENV{'INITDB_TEMPLATE'} = undef;Not sure that it is a good idea to duplicate this pattern twice.
Shouldn't this be something we'd want to control with an option in the
init() method instead?
Yes, I did that in a separate patch, see 0001 patch.
+static void +verify_system_identifier(verifier_context *context, char *controlpath)Relying both on controlpath, being a full path to the control file
including the data directory, and context->backup_directory to read
the contents of the control file looks a bit weird. Wouldn't it be
cleaner to just use one of them?
Well, yes, I had to have the same feeling, how about having another function
that can accept a full path of pg_control?
I tried in the 0002 patch, where the original function is renamed to
get_dir_controlfile(), which accepts the data directory path as before, and
get_controlfile() now accepts the full path of the pg_control file.
Kindly have a look at the attached version.
Regards,
Amul
Attachments:
v7-0004-Add-system-identifier-to-backup-manifest.patchapplication/octet-stream; name=v7-0004-Add-system-identifier-to-backup-manifest.patchDownload+368-26
v7-0003-pg_verifybackup-code-refactor.patchapplication/octet-stream; name=v7-0003-pg_verifybackup-code-refactor.patchDownload+34-42
v7-0001-Add-option-to-force-initdb-in-PostgreSQL-Test-Clu.patchapplication/octet-stream; name=v7-0001-Add-option-to-force-initdb-in-PostgreSQL-Test-Clu.patchDownload+14-18
v7-0002-Code-refactor-get_controlfile-to-accept-full-path.patchapplication/octet-stream; name=v7-0002-Code-refactor-get_controlfile-to-accept-full-path.patchDownload+25-12
On Thu, Feb 15, 2024 at 3:05 PM Amul Sul <sulamul@gmail.com> wrote:
Kindly have a look at the attached version.
IMHO, 0001 looks fine, except probably the comment could be phrased a
bit more nicely. That can be left for whoever commits this to
wordsmith. Michael, what are your plans?
0002 seems like a reasonable approach, but there's a hunk in the wrong
patch: 0004 modifies pg_combinebackup's check_control_files to use
get_dir_controlfile() rather than git_controlfile(), but it looks to
me like that should be part of 0002.
--
Robert Haas
EDB: http://www.enterprisedb.com