a very minor bug and a couple of comment changes for basebackup.c

Started by Robert Haas, almost 3 years ago, 4 messages
#1 Robert Haas
robertmhaas@gmail.com
3 attachment(s)

Here are a few small patches for basebackup.c:

0001 fixes what I believe to be a slight logical error in sendFile(),
introduced by me during the v15 development cycle when I introduced
the bbsink abstraction. I believe that it is theoretically possible
for this to cause an assertion failure, although the chances of that
actually happening seem extremely remote in practice. I don't believe
there are any consequences worse than that; for instance, I don't
think this can result in your backup getting corrupted. See the
proposed commit message for full details. Because I believe that this
is formally a bug, I am inclined to think that this should be
back-patched, but I also think it's fairly likely that no one would
ever notice if we didn't. However, patch authors have been known to be
wrong about the consequences of their own bugs from time to time, so
please do let me know if this seems more serious to you than what I'm
indicating, or conversely if you think it's not a problem at all for
some reason.

0002 removes an old comment from the file that I find useless and
slightly misleading.

0003 rewrites a comment about the way that we verify checksums during
backups. If we get a checksum mismatch, we reread the block and see if
the perceived problem goes away. If it doesn't, then we report it.
This is intended as protection against the backup reading a block
while some other process is in the midst of writing it, but there's no
guarantee that any concurrent write will finish quickly enough for our
second read attempt to see the updated contents. The comment claims
otherwise, and that's false, and I'm getting tired of reading this
false claim every time I read this code, so I rewrote the comment to
say what I believe to be true, namely, that our algorithm is flaky and
we have no good way to fix that right now. I'm pretty sure that Andres
pointed this problem out when this feature was under discussion, but
somehow it's still like this. There's another nearby comment which is
also false, or at least misleading, for basically the same reasons and
probably should be rewritten too, but I was a bit less certain how to
rewrite it and it wasn't making me as annoyed as this one, so for now
I only rewrote the one.
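
To make the race concrete, here is a rough sketch of the scheme as I
understand it -- hand-written for this email, not the actual sendFile()
code, with read_block(), page_lsn(), page_checksum_ok(), and
report_checksum_failure() standing in for the real I/O, LSN-extraction,
and reporting routines:

    static bool
    verify_block_with_one_retry(int fd, int blkno, XLogRecPtr startptr)
    {
        char    buf[BLCKSZ];
        bool    retried = false;

        for (;;)
        {
            if (!read_block(fd, blkno, buf))
                return true;    /* concurrent truncation, not an error */

            /* Pages stamped after the backup started get fixed by WAL replay. */
            if (page_lsn(buf) >= startptr)
                return true;

            if (page_checksum_ok(buf, blkno))
                return true;

            if (retried)
            {
                report_checksum_failure(blkno); /* still bad, report it */
                return false;
            }

            /*
             * Retry once, hoping that any concurrent write has finished by
             * now, so that the second read sees either a valid checksum or
             * a new LSN. Nothing guarantees that it has, which is exactly
             * the problem the rewritten comment describes.
             */
            retried = true;
        }
    }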

Comments?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

0002-Remove-an-old-comment-that-doesn-t-seem-especially-u.patch
From 946fe92ab57f939ec603fc9dcd3328143163b00d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 2 Feb 2023 14:51:56 -0500
Subject: [PATCH 2/3] Remove an old comment that doesn't seem especially
 useful.

The functions that follow are concerned with various things, of
which the tar format is only one, so this comment doesn't really
seem helpful. The file isn't really divided into sections in the
way that this comment seems to contemplate -- or at least, not
any more.
---
 src/backend/backup/basebackup.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 1be55e485f..6547e37d12 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1467,13 +1467,6 @@ is_checksummed_file(const char *fullpath, const char *filename)
 		return false;
 }
 
-/*****
- * Functions for handling tar file format
- *
- * Copied from pg_dump, but modified to work with libpq for sending
- */
-
-
 /*
  * Given the member, write the TAR header & send the file.
  *
-- 
2.37.1 (Apple Git-137.1)

0001-In-basebackup.c-perform-end-of-file-test-after-check.patch
From 03991970cda30706710e346b31c6aeb691326221 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 2 Feb 2023 12:04:16 -0500
Subject: [PATCH 1/3] In basebackup.c, perform end-of-file test after checksum
 validation.

We read blocks of data from files that we're backing up in chunks,
some multiple of BLCKSZ for each read. If checksum verification fails,
we then try rereading just the one block for which validation failed.
If that block happened to be the first block of the chunk, and if
the file was concurrently truncated to remove that block, then we'd
reach a call to bbsink_archive_contents() with a buffer length of 0.
That causes an assertion failure.

As far as I can see, there are no particularly bad consequences if
this happens in a non-assert build, and it's pretty unlikely to happen
in the first place because it requires a series of somewhat unlikely
things to happen in very quick succession. However, assertion failures
are bad, so rearrange the code to avoid that possibility.
---
 src/backend/backup/basebackup.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 3fb9451643..1be55e485f 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1567,14 +1567,6 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 								   Min(sink->bbs_buffer_length, remaining),
 								   len, readfilename, true);
 
-		/*
-		 * If we hit end-of-file, a concurrent truncation must have occurred.
-		 * That's not an error condition, because WAL replay will fix things
-		 * up.
-		 */
-		if (cnt == 0)
-			break;
-
 		/*
 		 * The checksums are verified at block level, so we iterate over the
 		 * buffer in chunks of BLCKSZ, after making sure that
@@ -1677,6 +1669,15 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 			}
 		}
 
+		/*
+		 * If we hit end-of-file, a concurrent truncation must have occurred.
+		 * That's not an error condition, because WAL replay will fix things
+		 * up.
+		 */
+		if (cnt == 0)
+			break;
+
+		/* Archive the data we just read. */
 		bbsink_archive_contents(sink, cnt);
 
 		/* Also feed it to the checksum machinery. */
-- 
2.37.1 (Apple Git-137.1)

0003-Reword-overly-optimistic-comment-about-backup-checks.patch
From a037bb36625e47618621d5c588f477fb0efa6e3a Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 2 Feb 2023 15:00:38 -0500
Subject: [PATCH 3/3] Reword overly-optimistic comment about backup checksum
 verification.

The comment implies that a single retry is sufficient to avoid
spurious checksum failures, but in fact no number of retries is
sufficient for that purpose. Update the comment accordingly.
---
 src/backend/backup/basebackup.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 6547e37d12..6efdefb591 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1602,11 +1602,21 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 						 * Retry the block on the first failure.  It's
 						 * possible that we read the first 4K page of the
 						 * block just before postgres updated the entire block
-						 * so it ends up looking torn to us.  We only need to
-						 * retry once because the LSN should be updated to
-						 * something we can ignore on the next pass.  If the
-						 * error happens again then it is a true validation
-						 * failure.
+						 * so it ends up looking torn to us. If, before we
+						 * retry the read, the concurrent write of the block
+						 * finishes, the page LSN will be updated and we'll
+						 * realize that we should ignore this block.
+						 *
+						 * There's no guarantee that this will actually
+						 * happen, though: the torn write could take an
+						 * arbitrarily long time to complete. Retrying multiple
+						 * times wouldn't fix this problem, either, though
+						 * it would reduce the chances of it happening in
+						 * practice. The only real fix here seems to be to
+						 * have some kind of interlock that allows us to wait
+						 * until we can be certain that no write to the block
+						 * is in progress. Since we don't have any such thing
+						 * right now, we just do this and hope for the best.
 						 */
 						if (block_retry == false)
 						{
-- 
2.37.1 (Apple Git-137.1)

#2 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#1)
Re: a very minor bug and a couple of comment changes for basebackup.c

On Thu, Feb 02, 2023 at 03:23:51PM -0500, Robert Haas wrote:

0001 fixes what I believe to be a slight logical error in sendFile(),
introduced by me during the v15 development cycle when I introduced
the bbsink abstraction. I believe that it is theoretically possible
for this to cause an assertion failure, although the chances of that
actually happening seem extremely remote in practice. I don't believe
there are any consequences worse than that; for instance, I don't
think this can result in your backup getting corrupted. See the
proposed commit message for full details. Because I believe that this
is formally a bug, I am inclined to think that this should be
back-patched, but I also think it's fairly likely that no one would
ever notice if we didn't. However, patch authors have been known to be
wrong about the consequences of their own bugs from time to time, so
please do let me know if this seems more serious to you than what I'm
indicating, or conversely if you think it's not a problem at all for
some reason.

Seems right, I think that you should backpatch that as
VERIFY_CHECKSUMS is the default.

0002 removes an old comment from the file that I find useless and
slightly misleading.

Okay.

0003 rewrites a comment about the way that we verify checksums during
backups. If we get a checksum mismatch, we reread the block and see if
the perceived problem goes away. If it doesn't, then we report it.
This is intended as protection against the backup reading a block
while some other process is in the midst of writing it, but there's no
guarantee that any concurrent write will finish quickly enough for our
second read attempt to see the updated contents.

There is more to it: the page LSN is checked before its checksum.
Hence, if the page's LSN is corrupted in a way that it is higher than
sink->bbs_state->startptr, the checksum verification is just skipped
while the page is broken but not reported as such. (Spoiler: this has
been mentioned in the past, and maybe we'd better remove this stuff in
its entirety.)
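
To illustrate the shape of it (hand-written, not the real sendFile()
code; page_lsn() and page_checksum_ok() are just stand-ins):

    if (page_lsn(page) >= sink->bbs_state->startptr)
    {
        /*
         * Treated as "modified during the backup", so the checksum is
         * never looked at.  If the LSN bytes are themselves the corrupted
         * part, the damage is silently skipped.
         */
    }
    else if (!page_checksum_ok(page, blkno))
        report_checksum_failure(blkno);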

The comment claims
otherwise, and that's false, and I'm getting tired of reading this
false claim every time I read this code, so I rewrote the comment to
say what I believe to be true, namely, that our algorithm is flaky and
we have no good way to fix that right now. I'm pretty sure that Andres
pointed this problem out when this feature was under discussion, but
somehow it's still like this. There's another nearby comment which is
also false or at least misleading for basically the same reasons which
probably should be rewritten too, but I was a bit less certain how to
rewrite it and it wasn't making me as annoyed as this one, so for now
I only rewrote the one.

Indeed, that's an improvement ;)
--
Michael

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#2)
Re: a very minor bug and a couple of comment changes for basebackup.c

Thanks for the review. I have committed the patches.

On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier <michael@paquier.xyz> wrote:

Seems right, I think that you should backpatch that as
VERIFY_CHECKSUMS is the default.

Done.

There is more to it: the page LSN is checked before its checksum.
Hence, if the page's LSN is corrupted in a way that it is higher than
sink->bbs_state->startptr, the checksum verification is just skipped
while the page is broken but not reported as such. (Spoiler: this has
been mentioned in the past, and maybe we'd better remove this stuff in
its entirety.)

Yep. It's pretty embarrassing that we have a checksum verification
feature that has both known ways of producing false positives and
known ways of producing false negatives and we have no plan to ever
fix that, we're just going to keep shipping what we've got. I think
it's pretty clear that the feature shouldn't have been committed like
this; valid criticisms of the design were offered and simply not
addressed, not even by updating the comments or documentation with
disclaimers. I find the code in sendFile() pretty ugly, too. For all
of that, I'm a bit uncertain whether ripping it out is the right thing
to do. It might be (I'm not sure) that it tends to work decently well
in practice. Like, yes, it could produce false checksum warnings, but
does that actually happen to people? It's probably not too likely that
a read() or write() of 8kB gets descheduled after doing only part of the
I/O, so the concurrent read or write is fairly likely to be on-CPU, I
would guess, and therefore this algorithm might kind of work OK in
practice despite being garbage on a theoretical level. Similarly, the
problems with how the LSN is vetted make it likely that a page
replaced with random garbage will go undetected, but a page where a
few bytes get flipped in a random position within the page is likely
to get caught, and maybe the latter is actually a bigger risk than the
former. I don't really know. I'd be interested to hear from anyone
with a lot of practical experience using the feature. A few anecdotes
of the form "this routine fails to tell us about problems" or "this
often complains about problems that are not real" or "this has really
helped us out on a number of occasions and we have had no problems
with it" would be really helpful.

On the other hand, we could just say that the code is nonsense and
therefore, regardless of practical experience, it ought to be removed.
I'm somewhat open to that idea, too.

--
Robert Haas
EDB: http://www.enterprisedb.com

#4 Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#3)
Re: a very minor bug and a couple of comment changes for basebackup.c

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

Thanks for the review. I have committed the patches.

No objections to what was committed.

On Thu, Mar 2, 2023 at 2:59 AM Michael Paquier <michael@paquier.xyz> wrote:

There is more to it: the page LSN is checked before its checksum.
Hence, if the page's LSN is corrupted in a way that it is higher than
sink->bbs_state->startptr, the checksum verification is just skipped
while the page is broken but not reported as such. (Spoiler: this has
been mentioned in the past, and maybe we'd better remove this stuff in
its entirety.)

Yep. It's pretty embarrassing that we have a checksum verification
feature that has both known ways of producing false positives and
known ways of producing false negatives and we have no plan to ever
fix that, we're just going to keep shipping what we've got. I think
it's pretty clear that the feature shouldn't have been committed like
this; valid criticisms of the design were offered and simply not
addressed, not even by updating the comments or documentation with
disclaimers. I find the code in sendFile() pretty ugly, too. For all
of that, I'm a bit uncertain whether ripping it out is the right thing
to do. It might be (I'm not sure) that it tends to work decently well
in practice. Like, yes, it could produce false checksum warnings, but
does that actually happen to people? It's probably not too likely that
a read() or write() of 8kB gets descheduled after doing only part of the
I/O, so the concurrent read or write is fairly likely to be on-CPU, I
would guess, and therefore this algorithm might kind of work OK in
practice despite being garbage on a theoretical level. Similarly, the
problems with how the LSN is vetted make it likely that a page
replaced with random garbage will go undetected, but a page where a
few bytes get flipped in a random position within the page is likely
to get caught, and maybe the latter is actually a bigger risk than the
former. I don't really know. I'd be interested to hear from anyone
with a lot of practical experience using the feature. A few anecdotes
of the form "this routine fails to tell us about problems" or "this
often complains about problems that are not real" or "this has really
helped us out on a number of occasions and we have had no problems
with it" would be really helpful.

The concern about the LSN is certainly a valid one and we ended up
ripping that check out of pgbackrest in favor of taking a different
approach: simply re-read the page and see if it changed. If it changed,
then we punt and figure that it was a hot page that PG was actively
writing to and so it'll be in the WAL and we don't have to worry about
it. We're generally more concerned about latent on-disk corruption that
is missed than about some kind of in-memory corruption, and the page
changing in the filesystem cache without some other process writing to
it just seems awfully unlikely.

https://github.com/pgbackrest/pgbackrest/commit/9eec98c61302121134d2067326dbd2cd0f2f0b9c
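
In rough outline (hypothetical helper names, not the actual pgbackrest
code -- the commit above has the real thing):

    read_block(fd, blkno, buf1);
    if (!page_checksum_ok(buf1, blkno))
    {
        read_block(fd, blkno, buf2);

        if (memcmp(buf1, buf2, BLCKSZ) != 0)
        {
            /* Page changed under us: treat as hot, WAL replay covers it. */
        }
        else
            report_checksum_failure(blkno);    /* stable and still bad */
    }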

From a practical perspective, while this has the aforementioned risk
of the two reads happening quickly enough that we re-read the same
page without the I/O on the page progressing at all, I'm not aware of
even one report of that actually happening. We absolutely have seen
cases where the first read picks up a torn page and that's not even
uncommon in busy environments.

One of the ideas I've had around how to address all of this is to not
depend on inferring things from what's been written out but instead to
just ask PG, and that's one of the (relatively few...) benefits that I
see to an archive_library: the ability to check if a given page is in
shared buffers and, if so, what its LSN is, to see if it's past the start
of the backup. Given that pg_basebackup is already working with a lot
of server-side code, perhaps this could be a direction to go in for it.

Another idea we played around with was keeping track of the LSNs of the
pages with invalid checksums and checking that they fall within the
range of start LSN-end LSN of the backup; while that wouldn't
completely prevent corrupted LSNs from skipping detection using the LSN
approach, it'd sure make it a whole lot less likely.
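
Roughly (again with hypothetical helper names, just to show the shape):

    /* During the file copy: don't trust a "new" LSN outright, remember it. */
    if (!page_checksum_ok(page, blkno))
    {
        if (page_lsn(page) >= backup_start_lsn)
            remember_suspect_page(relpath, blkno, page_lsn(page));
        else
            report_checksum_failure(relpath, blkno);
    }

    /* After the stop LSN is known: anything beyond it is implausible. */
    for (i = 0; i < n_suspects; i++)
    {
        if (suspects[i].lsn > backup_stop_lsn)
            report_checksum_failure(suspects[i].relpath, suspects[i].blkno);
    }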

Lastly, I've also wondered about trying to pull (clean) pages from
shared_buffers directly instead of reading them off of disk at all,
perhaps using a ring buffer in shared_buffers to read them in if
they're not already there, and letting all the usual checksum validation
and such happen with PG handling it. Dirty pages would have to be
handled though, presumably by locking them and then reading the page
from disk as I don't think folks would be pleased if we decided to
forcibly write them out. For an incremental backup, if you have
reliable knowledge that the page is in the WAL from PG, perhaps you
could even skip the page entirely.

Independently of verifying PG checksums, we've also considered an
approach where we take the file-level checksums we have in the
pgbackrest manifest and check if the file has changed on the filesystem
without the timestamp changing, as that would certainly be indicative of
something having gone wrong.
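
Very roughly (hypothetical helpers around the manifest lookups):

    struct stat st;

    if (stat(path, &st) == 0 &&
        st.st_mtime == manifest_mtime(path) &&
        !file_matches_checksum(path, manifest_checksum(path)))
    {
        /* Content changed without the timestamp moving: very suspicious. */
        warn_possible_corruption(path);
    }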

Certainly interested in other ideas about how to improve things in this
area. I don't think we should rip it out though.

Thanks,

Stephen