basebackup checksum verification

Started by Andres Freund, almost 7 years ago. 13 messages.
#1 Andres Freund
andres@anarazel.de

Hi,

As detailed in
/messages/by-id/20190319200050.ncuxejradurjakdc@alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I think we need to fix this before the next set of backbranch releases,
or at the very least add a big fat warning that the feature isn't doing
much.

The more I think about it, the less convinced I am of the method to
avoid the torn page problem using LSNs. To make e.g. the PageIsNew()
check correct, we need to verify that the whole page is zeroes - but we
can't use the LSN for that, as it's not on the page. But there very well
could be a torn page issue with only the second half of the page being
written back (with the default 8kb pages that can trivially happen as
the kernel's pagecache commonly uses 4kb granularity).
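
To make that concrete, here is a minimal sketch of the kind of check that
would actually be needed to treat a page as "new" safely; PageIsNew()
itself only inspects pd_upper, and the helper name here is made up:

static bool
page_is_entirely_zero(const char *page)
{
    /*
     * A page can only safely be treated as new if every byte is zero;
     * looking at the header alone leaves the second half of a torn 8kb
     * write undetected.
     */
    for (int i = 0; i < BLCKSZ; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}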

I basically think it needs to work like this:

1) Perform the entire set of PageIsVerified() checks *without*
previously checking the page's LSN, but don't error out.

2) If 1) errored out, ensure that that's because the backend is
currently writing out the page. That basically requires doing what
BufferAlloc() does. So I think we'd need to end up with a function
like:

LockoutBackendWrites():
buf = ReadBufferWithoutRelcache(relfilenode);
LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
/*
* Reread page from OS, and recheck. This needs to happen while
* the IO lock prevents rereading from the OS. Note that we do
* not want to rely on the buffer contents here, as that could
* be very old cache contents.
*/
perform_checksum_check(relfilenode, ERROR);

LWLockRelease(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
ReleaseBuffer(buf);

3) If 2) also failed, then we can be sure that the page is truly
corrupted.
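
Roughly, as a sketch of the overall flow (the helper names here are
hypothetical, this is not actual basebackup.c code):

char        page[BLCKSZ];
BlockNumber blkno;

for (blkno = 0; blkno < nblocks; blkno++)
{
    read_block(fd, blkno, page);

    /* 1) full PageIsVerified()-style checks, without looking at the LSN */
    if (verify_page_quietly(page, blkno))
        continue;           /* page checks out */

    /*
     * 2) Possibly just a torn read racing against a backend write:
     * re-read the block while holding the buffer's IO lock, so no
     * backend can be halfway through writing it, and check again.
     */
    if (LockoutBackendWritesAndRecheck(rnode, forknum, blkno))
        continue;           /* benign race, the re-read copy was fine */

    /* 3) still failing with writes locked out: genuinely corrupted */
    report_corrupt_page(rnode, forknum, blkno);
}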

Greetings,

Andres Freund

#2 Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#1)
Re: basebackup checksum verification

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

As detailed in
/messages/by-id/20190319200050.ncuxejradurjakdc@alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I disagree that it's 'very dubious', even with your analysis. I thought
Robert's response was generally good, pointing out that we're talking
about this being an issue if the corruption happens in a certain set of
bytes. That said, I'm happy to see improvements in this area but I'm
flat out upset about the notion that we must be perfect here- our
checksums themselves aren't perfect for catching corruption either.

I think we need to fix this before the next set of backbranch releases,
or at the very least add a big fat warning that the feature isn't doing
much.

I disagree about this level of urgency, but if you have a decent idea
about how to improve the situation, I'm fully in support of it.

The more I think about it, the less convinced I am of the method to
avoid the torn page problem using LSNs. To make e.g. the PageIsNew()
check correct, we need to verify that the whole page is zeroes - but we
can't use the LSN for that, as it's not on the page. But there very well
could be a torn page issue with only the second half of the page being
written back (with the default 8kb pages that can trivially happen as
the kernel's pagecache commonly uses 4kb granularity).

I basically think it needs to work like this:

1) Perform the entire set of PageIsVerified() checks *without*
previously checking the page's LSN, but don't error out.

Performing the PageIsVerified() checks seems reasonable; I don't see any
downside to doing that, so if you'd like to add that, sure, go for it.

2) If 1) errored out, ensure that that's because the backend is
currently writing out the page. That basically requires doing what
BufferAlloc() does. So I think we'd need to end up with a function
like:

LockoutBackendWrites():
buf = ReadBufferWithoutRelcache(relfilenode);

This is going to cause it to be pulled into shared buffers, if it isn't
already there, isn't it? That seems less than ideal and isn't it going
to end up just doing exactly the same PageIsVerified() call, and what
happens when that fails? You're going to end up getting an
ereport(ERROR) and long-jump out of here... Depending on the other
code, maybe that's something you can manage, but it seems rather tricky
to me. I do think, as was discussed extensively previously, that the
backup *should* continue even in the face of corruption, but there
should be warnings issued to notify the user of the issues.

LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
/*
* Reread page from OS, and recheck. This needs to happen while
* the IO lock prevents rereading from the OS. Note that we do
* not want to rely on the buffer contents here, as that could
* be very old cache contents.
*/
perform_checksum_check(relfilenode, ERROR);

LWLockRelease(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
ReleaseBuffer(buf);

I don't particularly like having to lock pages in this way while
performing this check, especially with potentially having to read the
page into shared buffers.

This also isn't the only approach to dealing with the issue that the
LSN might be corrupted. There are at least two other ways we can improve
the situation here - we can keep track of the highest LSN seen, perhaps
on a per-file basis, and then compare those to the end-of-backup LSN,
and issue a warning or perform a re-check or do something else if we
discover that the LSN found was later than the end-of-backup LSN.
That's not perfect, but it's certainly a good improvement over what we
have today. The other approach would be to track all of the pages
which were skipped and then compare them to the pages in the WAL that
was archived during the backup, making sure that all pages which failed
checksum exist in the WAL. That should allow us to confirm that the
page was actually being modified and won't ever be used in the state
that we saw it in, since it'll be replayed over by WAL, and therefore we
don't have to worry about the LSN or the page itself being corrupt. Of
course, that requires tracking all the pages which are modified by the
WAL for the duration of the backup, and tracking all the pages which
failed checksum and/or other validation, and then performing the
cross-check. That seems like a fair bit of work for this, but I'm not
sure that it's avoidable, ultimately.
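
As a rough sketch of the first of those ideas - tracking the highest LSN
among skipped pages, per file, and checking it against the end-of-backup
LSN afterwards (simplified stand-in types, not actual code):

typedef struct FileChecksumState
{
    XLogRecPtr  max_skipped_lsn;    /* highest pd_lsn among skipped pages */
    int         skipped_pages;
} FileChecksumState;

static void
note_skipped_page(FileChecksumState *state, XLogRecPtr page_lsn)
{
    if (page_lsn > state->max_skipped_lsn)
        state->max_skipped_lsn = page_lsn;
    state->skipped_pages++;
}

/*
 * At end of backup: a skipped page whose LSN is beyond the end-of-backup
 * LSN can't have been a legitimate concurrent write, so we'd warn or
 * re-check.
 */
static bool
skipped_lsns_look_sane(const FileChecksumState *state, XLogRecPtr end_lsn)
{
    return state->max_skipped_lsn <= end_lsn;
}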

I'm happy with incremental improvements in this area though, and just
checking that the LSN of pages skipped isn't completely insane would
definitely be a good improvement to begin with and might be simple
enough to back-patch. I don't think back-patching changes like those
proposed here is a good idea. I don't have any problem adding
additional documentation to explain what's being done though, with
appropriate caveats at how this might not catch all types of corruption
(we should do the same for the checksum feature itself, if we don't
already have such caveats...).

Thanks!

Stephen

#3 Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#2)
Re: basebackup checksum verification

Hi,

On 2019-03-26 19:22:03 -0400, Stephen Frost wrote:

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

As detailed in
/messages/by-id/20190319200050.ncuxejradurjakdc@alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I disagree that it's 'very dubious', even with your analysis.

I really don't know what to say. The current algorithm is flat out
bogus.

I thought Robert's response was generally good, pointing out that
we're talking about this being an issue if the corruption happens in a
certain set of bytes. That said, I'm happy to see improvements in
this area but I'm flat out upset about the notion that we must be
perfect here- our checksums themselves aren't perfect for catching
corruption either.

The point is that we're not detecting errors that we can detect when
read outside of basebackup. I really entirely completely fail to see how
that can be defended.

I think we're making promises with this basebackup feature we're not
even remotely keeping. I don't understand how you can defend that, given
the current state, you can have a basebackup that you took with
checksums enabled, and then when you actually use that basebackup you
get checksum failures. Like it's one thing not to detect all storage
issues, but if we do detect them after using the basebackup, that's
really not ok.

2) If 1) errored out, ensure that that's because the backend is
currently writing out the page. That basically requires doing what
BufferAlloc() does. So I think we'd need to end up with a function
like:

LockoutBackendWrites():
buf = ReadBufferWithoutRelcache(relfilenode);

This is going to cause it to be pulled into shared buffers, if it isn't
already there, isn't it?

I can't see that being a problem. We're only going to enter this path if
we encountered a buffer where the checksum was wrong. And either that's
a data corruption event, in which case we don't care about a small
performance penalty, or it's a race around writing out the page because
basebackup read it while half written - in which case it's pretty
likely that the page is still in shared buffers.

That seems less than ideal and isn't it going
to end up just doing exactly the same PageIsVerified() call, and what
happens when that fails?

That seems quite easily handled with a different RBM_ mode.
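
For reference, these are the existing modes in bufmgr.h; RBM_ZERO_ON_ERROR
already avoids the ERROR by zero-filling a page that fails verification,
and the idea here would presumably be a similar mode (the last entry below
is purely hypothetical) that keeps the contents and just reports the
failure back to the caller:

typedef enum ReadBufferMode
{
    RBM_NORMAL,                 /* read and verify, ERROR on a bad page */
    RBM_ZERO_AND_LOCK,
    RBM_ZERO_AND_CLEANUP_LOCK,
    RBM_ZERO_ON_ERROR,          /* zero-fill a bad page instead of ERROR */
    RBM_NORMAL_NO_LOG,
    RBM_NORMAL_REPORT_ERROR     /* hypothetical: keep contents, flag caller */
} ReadBufferMode;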

LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
/*
* Reread page from OS, and recheck. This needs to happen while
* the IO lock prevents rereading from the OS. Note that we do
* not want to rely on the buffer contents here, as that could
* be very old cache contents.
*/
perform_checksum_check(relfilenode, ERROR);

LWLockRelease(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
ReleaseBuffer(buf);

This should be the IO lock, not content lock, sorry. Copy & pasto.
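
With that correction, a sketch of the lockout might look like this (PG
12-era accessors as far as I can tell; perform_checksum_check() remains a
hypothetical helper, and the initial read would itself need the relaxed
RBM_ mode mentioned above):

static void
LockoutBackendWrites(RelFileNode rnode, ForkNumber forknum, BlockNumber blkno)
{
    /* pin the buffer so its descriptor (and IO lock) stays valid */
    Buffer      buf = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                                RBM_NORMAL, NULL);
    BufferDesc *bufHdr = GetBufferDescriptor(buf - 1);

    /* no backend can be halfway through writing this block while we hold it */
    LWLockAcquire(BufferDescriptorGetIOLock(bufHdr), LW_EXCLUSIVE);

    /* reread the block from the OS and recheck it */
    perform_checksum_check(rnode, forknum, blkno, ERROR);

    LWLockRelease(BufferDescriptorGetIOLock(bufHdr));
    ReleaseBuffer(buf);
}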

I don't particularly like having to lock pages in this way while
performing this check, especially with potentially having to read the
page into shared buffers.

Given it's only the IO lock (see above correction), and only if we can't
verify the checksum during the race, I fail to see how that can be a
problem?

Greetings,

Andres Freund

#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#3)
Re: basebackup checksum verification

On Tue, Mar 26, 2019 at 04:49:21PM -0700, Andres Freund wrote:

Hi,

On 2019-03-26 19:22:03 -0400, Stephen Frost wrote:

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

As detailed in
/messages/by-id/20190319200050.ncuxejradurjakdc@alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I disagree that it's 'very dubious', even with your analysis.

I really don't know what to say. The current algorithm is flat out
bogus.

Bogus might be a bit too harsh, but yeah - failure to reliably detect obviously
invalid checksums when the LSN just happens to be high due to randomness is not
a good thing. We'll still detect pages corrupted in other places, but this is
rather unfortunate.

I thought Robert's response was generally good, pointing out that
we're talking about this being an issue if the corruption happens in a
certain set of bytes. That said, I'm happy to see improvements in
this area but I'm flat out upset about the notion that we must be
perfect here- our checksums themselves aren't perfect for catching
corruption either.

The point is that we're not detecting errors that we can detect when
read outside of basebackup. I really entirely completely fail to see how
that can be defended.

I think we're making promises with this basebackup feature we're not
even remotely keeping. I don't understand how you can defend that, given
the current state, you can have a basebackup that you took with
checksums enabled, and then when you actually use that basebackup you
get checksum failures. Like it's one thing not to detect all storage
issues, but if we do detect them after using the basebackup, that's
really not ok.

Yeah, if basebackup completes without reporting any invalid checksums, but
running pg_verify_checksums on the same backups detects those, that probably
should raise some eyebrows.

We already have such a blind spot, but it's expected to be pretty small
(essentially pages modified since the start of the backup).
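
For context, the released behaviour being discussed here is roughly the
following (a simplified sketch, not verbatim basebackup.c;
verify_page_checksum() is a stand-in):

if (!PageIsNew(page) && PageGetLSN(page) < startptr)
{
    /* page predates the backup start LSN, so the checksum must match */
    verify_page_checksum(page, blkno);
}
else
{
    /*
     * The page looks new or has an LSN past the backup start pointer, so
     * it is assumed to be a concurrent write and the checksum is not
     * checked - which is exactly the blind spot if pd_lsn itself is the
     * corrupted part.
     */
}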

2) If 1) errored out, ensure that that's because the backend is
currently writing out the page. That basically requires doing what
BufferAlloc() does. So I think we'd need to end up with a function
like:

LockoutBackendWrites():
buf = ReadBufferWithoutRelcache(relfilenode);

This is going to cause it to be pulled into shared buffers, if it isn't
already there, isn't it?

I can't see that being a problem. We're only going to enter this path if
we encountered a buffer where the checksum was wrong. And either that's
a data corruption event, in which case we don't care about a small
performance penalty, or it's a race around writing out the page because
basebackup read it while half written - in which case it's pretty
likely that the page is still in shared buffers.

Yep, I think this is fine. Although, in the other thread where this
failure mode was discussed, I think we only talked about taking the I/O
lock on the buffer, no? But as you say, this should be a rare code path.

That seems less than ideal and isn't it going
to end up just doing exactly the same PageIsVerified() call, and what
happens when that fails?

That seems quite easily handled with a different RBM_ mode.

LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
/*
* Reread page from OS, and recheck. This needs to happen while
* the IO lock prevents rereading from the OS. Note that we do
* not want to rely on the buffer contents here, as that could
* be very old cache contents.
*/
perform_checksum_check(relfilenode, ERROR);

LWLockRelease(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
ReleaseBuffer(buf);

This should be the IO lock, not content lock, sorry. Copy & pasto.

I don't particularly like having to lock pages in this way while
performing this check, especially with potentially having to read the
page into shared buffers.

Given it's only the IO lock (see above correction), and only if we can't
verify the checksum during the race, I fail to see how that can be a
problem?

IMHO not a problem.

cheers

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#4)
Re: basebackup checksum verification

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Tue, Mar 26, 2019 at 04:49:21PM -0700, Andres Freund wrote:

On 2019-03-26 19:22:03 -0400, Stephen Frost wrote:

* Andres Freund (andres@anarazel.de) wrote:

As detailed in
/messages/by-id/20190319200050.ncuxejradurjakdc@alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I disagree that it's 'very dubious', even with your analysis.

I really don't know what to say. The current algorithm is flat out
bogus.

Bogus might be a bit too harsh, but yeah - failure to reliably detect obviously
invalid checksums when the LSN just happens to be high due to randomness is not
a good thing. We'll still detect pages corrupted in other places, but this is
rather unfortunate.

I'm all for improving it, as I said originally.

I thought Robert's response was generally good, pointing out that
we're talking about this being an issue if the corruption happens in a
certain set of bytes. That said, I'm happy to see improvements in
this area but I'm flat out upset about the notion that we must be
perfect here- our checksums themselves aren't perfect for catching
corruption either.

The point is that we're not detecting errors that we can detect when
read outside of basebackup. I really entirely completely fail to see how
that can be defended.

I think we're making promises with this basebackup feature we're not
even remotely keeping. I don't understand how you can defend that, given
the current state, you can have a basebackup that you took with
checksums enabled, and then when you actually use that basebackup you
get checksum failures. Like it's one thing not to detect all storage
issues, but if we do detect them after using the basebackup, that's
really not ok.

Yeah, if basebackup completes without reporting any invalid checksums, but
running pg_verify_checksums on the same backups detects those, that probably
should raise some eyebrows.

That isn't actually what would happen at this point, just so we're
clear. What Andres is talking about is a solution which would only
actually work for pg_basebackup, and not for pg_verify_checksums
(without some serious changes which make it connect to the running
server and run various functions to perform the locking that he's
proposing pg_basebackup do...).

We already have such a blind spot, but it's expected to be pretty small
(essentially pages modified since the start of the backup).

eh..? This is.. more-or-less entirely what's being discussed here:
exactly how we detect and determine which pages were modified since the
start of the backup, and which might have been partially written out
when we tried to read them and therefore fail a checksum check, but it
doesn't matter because we don't actually end up using those pages.

I outlined a couple of other approaches to improving that situation,
which would be able to be used with pg_verify_checksums without having
to connect to the backend, but I'll note that those were completely
ignored, leading me to believe that there's really not much more to
discuss here since other ideas are just not open to being considered.

Thanks,

Stephen

#6 Peter Geoghegan
In reply to: Tomas Vondra (#4)
Re: basebackup checksum verification

On Tue, Mar 26, 2019 at 5:10 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Bogus might be a bit too harsh, but yeah - failure to reliably detect obviously
invalid checksums when the LSN just happens to be high due to randomness is not
a good thing. We'll still detect pages corrupted in other places, but this is
rather unfortunate.

I have personally seen real world corruption that involved a page
image consisting of random noise. Several times. Failing to detect
blatant corruption is unacceptable IMV.

Can't we do better here without great difficulty? There are plenty of
generic things that we could do that can verify that almost any
type of initialized page is at least somewhat sane. For example, you
can verify that line pointers indicate that tuples are
non-overlapping.
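
For instance, a self-contained sketch of such a non-overlap check
(simplified line pointer struct rather than the real ItemIdData):

typedef struct SimpleItemId
{
    uint16      lp_off;         /* byte offset of tuple start on the page */
    uint16      lp_len;         /* tuple length in bytes */
} SimpleItemId;

static bool
tuples_do_not_overlap(const SimpleItemId *items, int nitems)
{
    bool        used[BLCKSZ] = {false};

    for (int i = 0; i < nitems; i++)
    {
        if ((uint32) items[i].lp_off + items[i].lp_len > BLCKSZ)
            return false;       /* tuple runs off the end of the page */

        for (uint32 b = items[i].lp_off;
             b < (uint32) items[i].lp_off + items[i].lp_len; b++)
        {
            if (used[b])
                return false;   /* two tuples claim the same byte */
            used[b] = true;
        }
    }
    return true;
}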

That said, Andres' approach sounds like the way to go to me.

--
Peter Geoghegan

#7 Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#5)
Re: basebackup checksum verification

Hi,

On 2019-03-26 20:18:31 -0400, Stephen Frost wrote:

I thought Robert's response was generally good, pointing out that
we're talking about this being an issue if the corruption happens in a
certain set of bytes. That said, I'm happy to see improvements in
this area but I'm flat out upset about the notion that we must be
perfect here- our checksums themselves aren't perfect for catching
corruption either.

The point is that we're not detecting errors that we can detect when
read outside of basebackup. I really entirely completely fail to see how
that can be defended.

I think we're making promises with this basebackup feature we're not
even remotely keeping. I don't understand how you can defend that, given
the current state, you can have a basebackup that you took with
checksums enabled, and then when you actually use that basebackup you
get checksum failures. Like it's one thing not to detect all storage
issues, but if we do detect them after using the basebackup, that's
really not ok.

Yeah, if basebackup completes without reporting any invalid checksums, but
running pg_verify_checksums on the same backups detects those, that probably
should raise some eyebrows.

That isn't actually what would happen at this point, just so we're
clear. What Andres is talking about is a solution which would only
actually work for pg_basebackup, and not for pg_verify_checksums
(without some serious changes which make it connect to the running
server and run various functions to perform the locking that he's
proposing pg_basebackup do...).

Well, I still think it's just plain wrong to do online checksum
verification outside of the server, and we should just reject adding
that as a feature.

Besides the fact that I think any such tool needs error detection
capabilities at least equal to the backend's, I think all the LSN based
approaches also have the issue that they'll prevent us from using them
on non WAL logged data. There's ongoing work to move SLRUs into the
buffer manager, allowing them to be checksummed (Shawn Debnath is IIRC
planning to propose a patch for v13), and we really should also offer to
checksum unlogged tables (and temp tables?) - just because they'd be
gone after a crash, imo doesn't make it OK to not detect corrupted
on-disk data outside of a crash. For those things we won't necessarily
have LSNs that we can conveniently associate with those buffers -
making LSN based logic harder.

I outlined a couple of other approaches to improving that situation,
which would be able to be used with pg_verify_checksums without having
to connect to the backend, but I'll note that those were completely
ignored, leading me to believe that there's really not much more to
discuss here since other ideas are just not open to being considered.

Well, given that we can do an accurate determination without too much
code in the basebackup case, I don't see what your proposals gain over
that? That's why I didn't comment on them. I'm focusing on the
basebackup case, over the online checksum case, because it's released
code.

Greetings,

Andres Freund

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#5)
Re: basebackup checksum verification

On Tue, Mar 26, 2019 at 08:18:31PM -0400, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Tue, Mar 26, 2019 at 04:49:21PM -0700, Andres Freund wrote:

On 2019-03-26 19:22:03 -0400, Stephen Frost wrote:

* Andres Freund (andres@anarazel.de) wrote:

As detailed in
/messages/by-id/20190319200050.ncuxejradurjakdc@alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I disagree that it's 'very dubious', even with your analysis.

I really don't know what to say. The current algorithm is flat out
bogus.

Bogus might be a bit too harsh, but yeah - failure to reliably detect obviously
invalid checksums when the LSN just happens to be high due to randomness is not
a good thing. We'll still detect pages corrupted in other places, but this is
rather unfortunate.

I'm all for improving it, as I said originally.

I thought Robert's response was generally good, pointing out that
we're talking about this being an issue if the corruption happens in a
certain set of bytes. That said, I'm happy to see improvements in
this area but I'm flat out upset about the notion that we must be
perfect here- our checksums themselves aren't perfect for catching
corruption either.

The point is that we're not detecting errors that we can detect when
read outside of basebackup. I really entirely completely fail to see how
that can be defended.

I think we're making promises with this basebackup feature we're not
even remotely keeping. I don't understand how you can defend that, given
the current state, you can have a basebackup that you took with
checksums enabled, and then when you actually use that basebackup you
get checksum failures. Like it's one thing not to detect all storage
issues, but if we do detect them after using the basebackup, that's
really not ok.

Yeah, if basebackup completes without reporting any invalid checksums, but
running pg_verify_checksums on the same backups detects those, that probably
should raise some eyebrows.

That isn't actually what would happen at this point, just so we're
clear. What Andres is talking about is a solution which would only
actually work for pg_basebackup, and not for pg_verify_checksums
(without some serious changes which make it connect to the running
server and run various functions to perform the locking that he's
proposing pg_basebackup do...).

I was talking about pg_verify_checksums in offline mode, i.e. when you
take a backup and then run pg_verify_checksums on it. I'm pretty sure
that does not need to talk to the cluster. Sorry if that was not clear.

We already have such a blind spot, but it's expected to be pretty small
(essentially pages modified since the start of the backup).

eh..? This is.. more-or-less entirely what's being discussed here:
exactly how we detect and determine which pages were modified since the
start of the backup, and which might have been partially written out
when we tried to read them and therefore fail a checksum check, but it
doesn't matter because we don't actually end up using those pages.

And? All I'm saying is that we knew there's a gap that we don't check,
but that the understanding was that it's rather small and limited to
recently modified pages. If we can further reduce that window, that's
great and we should do it, of course.

I outlined a couple of other approaches to improving that situation,
which would be able to be used with pg_verify_checksums without having
to connect to the backend, but I'll note that those were completely
ignored, leading me to believe that there's really not much more to
discuss here since other ideas are just not open to being considered.

Uh.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#8)
Re: basebackup checksum verification

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Tue, Mar 26, 2019 at 08:18:31PM -0400, Stephen Frost wrote:

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On Tue, Mar 26, 2019 at 04:49:21PM -0700, Andres Freund wrote:

On 2019-03-26 19:22:03 -0400, Stephen Frost wrote:

* Andres Freund (andres@anarazel.de) wrote:

As detailed in
/messages/by-id/20190319200050.ncuxejradurjakdc@alap3.anarazel.de
the way the backend's basebackup checksum verification works makes its
error detection capabilities very dubious.

I disagree that it's 'very dubious', even with your analysis.

I really don't know what to say. The current algorithm is flat out
bogus.

Bogus might be a bit too harsh, but yeah - failure to reliably detect obviously
invalid checksums when the LSN just happens to be high due to randomness is not
a good thing. We'll still detect pages corrupted in other places, but this is
rather unfortunate.

I'm all for improving it, as I said originally.

I thought Robert's response was generally good, pointing out that
we're talking about this being an issue if the corruption happens in a
certain set of bytes. That said, I'm happy to see improvements in
this area but I'm flat out upset about the notion that we must be
perfect here- our checksums themselves aren't perfect for catching
corruption either.

The point is that we're not detecting errors that we can detect when
read outside of basebackup. I really entirely completely fail to see how
that can be defended.

I think we're making promises with this basebackup feature we're not
even remotely keeping. I don't understand how you can defend that, given
the current state, you can have a basebackup that you took with
checksums enabled, and then when you actually use that basebackup you
get checksum failures. Like it's one thing not to detect all storage
issues, but if we do detect them after using the basebackup, that's
really not ok.

Yeah, if basebackup completes without reporting any invalid checksums, but
running pg_verify_checksums on the same backups detects those, that probably
should raise some eyebrows.

That isn't actually what would happen at this point, just so we're
clear. What Andres is talking about is a solution which would only
actually work for pg_basebackup, and not for pg_verify_checksums
(without some serious changes which make it connect to the running
server and run various functions to perform the locking that he's
proposing pg_basebackup do...).

I was talking about pg_verify_checksums in offline mode, i.e. when you
take a backup and then run pg_verify_checksums on it. I'm pretty sure
that does not need to talk to the cluster. Sorry if that was not clear.

To make that work, you'd have to take a backup, then restore it, then
bring PG up and have it replay all of the outstanding WAL, then shut
down PG cleanly, and *then* you could run pg_verify_checksums on it.

We already have such a blind spot, but it's expected to be pretty small
(essentially pages modified since the start of the backup).

eh..? This is.. more-or-less entirely what's being discussed here:
exactly how we detect and determine which pages were modified since the
start of the backup, and which might have been partially written out
when we tried to read them and therefore fail a checksum check, but it
doesn't matter because we don't actually end up using those pages.

And? All I'm saying is that we knew there's a gap that we don't check,
but that the understanding was that it's rather small and limited to
recently modified pages. If we can further reduce that window, that's
great and we should do it, of course.

The point that Andres is making is that, in the face of corruption of
certain particular bytes, the set of pages that we check can be much
smaller than the overall size of the DB (or than what was recently
modified): we end up skipping a lot of other pages due to the corruption
rather than because they've been recently modified, because the
determination of what's been recently modified is based on those very
bytes. With the changes being made to pg_checksums, we'd at least report
those pages back as having been 'skipped', but it would be better if we
could accurately determine that the pages were recently modified and
that what we saw was therefore a torn page.

Thanks,

Stephen

#10 Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#7)
Re: basebackup checksum verification

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2019-03-26 20:18:31 -0400, Stephen Frost wrote:

I thought Robert's response was generally good, pointing out that
we're talking about this being an issue if the corruption happens in a
certain set of bytes. That said, I'm happy to see improvements in
this area but I'm flat out upset about the notion that we must be
perfect here- our checksums themselves aren't perfect for catching
corruption either.

The point is that we're not detecting errors that we can detect when
read outside of basebackup. I really entirely completely fail to see how
that can be defended.

I think we're making promises with this basebackup feature we're not
even remotely keeping. I don't understand how you can defend that, given
the current state, you can have a basebackup that you took with
checksums enabled, and then when you actually use that basebackup you
get checksum failures. Like it's one thing not to detect all storage
issues, but if we do detect them after using the basebackup, that's
really not ok.

Yeah, if basebackup completes without reporting any invalid checksums, but
running pg_verify_checksums on the same backups detects those, that probably
should raise some eyebrows.

That isn't actually what would happen at this point, just so we're
clear. What Andres is talking about is a solution which would only
actually work for pg_basebackup, and not for pg_verify_checksums
(without some serious changes which make it connect to the running
server and run various functions to perform the locking that he's
proposing pg_basebackup do...).

Well, I still think it's just plain wrong to do online checksum
verification outside of the server, and we should just reject adding
that as a feature.

I get that, and I disagree with it.

Besides the fact that I think any such tool needs error detection
capabilities at least equal to the backend's, I think all the LSN based
approaches also have the issue that they'll prevent us from using them
on non WAL logged data. There's ongoing work to move SLRUs into the
buffer manager, allowing them to be checksummed (Shawn Debnath is IIRC
planning to propose a patch for v13), and we really should also offer to
checksum unlogged tables (and temp tables?) - just because they'd be
gone after a crash, imo doesn't make it OK to not detect corrupted
on-disk data outside of a crash. For those things we won't necessarily
have LSNs that we can conveniently associate with those buffers -
making LSN based logic harder.

I'm kinda guessing that the SLRUs are still going to be WAL'd. If
that's changing, I'd be very curious to hear the details.

As for unlogged tables and temp tables, it's an interesting idea to
checksum them, but far less valuable by the very nature of what those
are used for. Further, it's utterly useless to checksum them as part of
a backup, which is what pg_basebackup is actually doing; it's actually
valuable to skip over them rather than back them up, since otherwise
we'd back them up, then restore them ... and then remove them
immediately during recovery from the backup state. Having a way to
verify checksums on unlogged tables or temp tables using some *other*
tool or background process could be valuable, provided it doesn't cause
any issues for ongoing operations.

I outlined a couple of other approaches to improving that situation,
which would be able to be used with pg_verify_checksums without having
to connect to the backend, but I'll note that those were completely
ignored, leading me to believe that there's really not much more to
discuss here since other ideas are just not open to being considered.

Well, given that we can do an accurate determination without too much
code in the basebackup case, I don't see what your proposals gain over
that? That's why I didn't comment on them. I'm focusing on the
basebackup case, over the online checksum case, because it's released
code.

I'm fine with improving the basebackup case, but I don't agree at all
with the idea that all checksum validation must be exclusively done in a
backend process.

I'm also not convinced that these changes to pg_basebackup will be free
of issues that may impact users in a negative way, making me concerned
that we're going to end up doing more harm than good with such a change
being back-patched. Simply comparing the skipped LSNs to the
end-of-backup LSN seems much less invasive when it comes to this core
code, and certainly increases the chances quite a bit that we'll detect
an issue with corruption in the LSN.

Thanks!

Stephen

#11 Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#10)
Re: basebackup checksum verification

Hi,

On 2019-03-26 21:01:27 -0400, Stephen Frost wrote:

I'm also not convinced that these changes to pg_basebackup will be free
of issues that may impact users in a negative way, making me concerned
that we're going to end up doing more harm than good with such a change
being back-patched. Simply comparing the skipped LSNs to the
end-of-backup LSN seems much less invasive when it comes to this core
code, and certainly increases the chances quite a bit that we'll detect
an issue with corruption in the LSN.

Yea, in the other thread we'd discussed that that might be the correct
course for backpatch, at least initially. But I think the insert/replay
LSN would be the correct LSN to compare to in the basebackup.c case?

Greetings,

Andres Freund

#12 Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#11)
Re: basebackup checksum verification

Greetings,

* Andres Freund (andres@anarazel.de) wrote:

On 2019-03-26 21:01:27 -0400, Stephen Frost wrote:

I'm also not convinced that these changes to pg_basebackup will be free
of issues that may impact users in a negative way, making me concerned
that we're going to end up doing more harm than good with such a change
being back-patched. Simply comparing the skipped LSNs to the
end-of-backup LSN seems much less invasive when it comes to this core
code, and certainly increases the chances quite a bit that we'll detect
an issue with corruption in the LSN.

Yea, in the other thread we'd discussed that that might be the correct
course for backpatch, at least initially. But I think the insert/replay
LSN would be the correct LSN to compare to in the basebackup.c case?

Yes, it seems like that could be done in the basebackup case and would
avoid the need to track the skipped LSNs, since you could just look up
the insert/replay LSN at the time and do the comparison right away.
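
Something along these lines, presumably (a sketch only -
GetXLogInsertRecPtr() and GetXLogReplayRecPtr() are the existing
functions, the rest is illustrative):

static void
check_skipped_page_lsn(Page page, BlockNumber blkno)
{
    XLogRecPtr  page_lsn = PageGetLSN(page);
    XLogRecPtr  limit;

    if (RecoveryInProgress())
        limit = GetXLogReplayRecPtr(NULL);  /* standby: last replayed LSN */
    else
        limit = GetXLogInsertRecPtr();      /* primary: current insert LSN */

    if (page_lsn > limit)
        ereport(WARNING,
                (errmsg("block %u was skipped, but its LSN %X/%X is ahead of the current LSN %X/%X",
                        blkno,
                        (uint32) (page_lsn >> 32), (uint32) page_lsn,
                        (uint32) (limit >> 32), (uint32) limit)));
}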

Thanks,

Stephen

#13 Michael Paquier
michael@paquier.xyz
In reply to: Peter Geoghegan (#6)
Re: basebackup checksum verification

On Tue, Mar 26, 2019 at 05:23:01PM -0700, Peter Geoghegan wrote:

I have personally seen real world corruption that involved a page
image consisting of random noise. Several times. Failing to detect
blatant corruption is unacceptable IMV.

Yeah, I have seen that as well. If we have a tool that is not able to
detect checksum failures in any reliable and robust way, then we don't
have something that qualifies as a checksum verification tool.
--
Michael