16-bit page checksums for 9.2
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.
page_checksums = on | off (default)
There are no required block changes; checksums are optional and some
blocks may have a checksum, others not. This means that the patch will
allow pg_upgrade.
That capability also limits us to 16-bit checksums. Fletcher's 16 is
used in this patch and seems rather quick, though that is easily
replaceable/tuneable if desired, perhaps even as a parameter enum.
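For readers unfamiliar with it, a minimal Fletcher-16 sketch looks roughly like this (illustrative only, not the patch's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fletcher-16: two running sums reduced mod 255, packed into one
 * 16-bit value.  One addition and one reduction per byte, so it is
 * cheap compared to a CRC.
 */
static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;
    size_t   i;

    for (i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t) ((sum2 << 8) | sum1);
}
```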
This patch is a step on the way to 32-bit checksums in a future
redesign of the page layout, though that is not a required future
change, nor does this prevent that.
Checksum is set whenever the buffer is flushed to disk, and checked
when the page is read in from disk. It is not set at other times, and
for much of the time may not be accurate. This follows earlier
discussions from 2010-12-22, and is discussed in detail in patch
comments.
Note it works with buffer manager pages, which includes shared and
local data buffers, but not SLRU pages (yet? an easy addition but
needs other discussion around contention).
Note that all this does is detect bit errors on the page; it doesn't
identify where the error is or how bad, and definitely not what caused
it or when it happened.
The main body of the patch involves changes to bufpage.c/.h so this
differs completely from the VMware patch, for technical reasons. Also
included are facilities to LockBufferForHints() with usage in various
AMs, to avoid the case where hints are set during calculation of the
checksum.
In my view this is a fully working, committable patch, but I'm not in a
hurry to commit given the holiday season.
Hopefully it's a gift, not a turkey, and therefore a challenge for some
to prove that wrong. Enjoy either way,
Merry Christmas,
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16.v1.patch (text/x-patch; charset=US-ASCII) +376 -64
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.
I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. Do you really think this is
the last page layout change we'll ever make?
regards, tom lane
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.

I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.
Andres
On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.

I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint.
Anyone choosing page_checksums = on has already made a performance
reducing decision in favour of reliability. So they understand and
accept the impact. There is no locking when the parameter is off.
A safe alternative is to use LockBuffer, which has a much greater
performance impact.
I did think about optimistically checking after the write, but if we
crash at that point we will then see a block that has an invalid
checksum. It's faster but you may get a checksum failure if you crash
- but then one important aspect of this is to spot problems in case of
a crash, so that seems unacceptable.
I also wonder whether it might not result in
deadlocks.
If you can see how, please say. I can't see any ways for that myself.
Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. Do you really think this is
the last page layout change we'll ever make?
No, I don't. I hope and expect the next page layout change to
reintroduce such a field.
But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.
pg_filedump will need to be changed more than normal, but the version
isn't used anywhere else in the server code.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.

I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.
We would copy every time we write, yet lock only every time we set hint bits.
If that option is favoured, I'll write another version after Christmas.
ISTM we can't write and copy at the same time because the checksum is
not a trailer field.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
Not an expert here, but after reading through the patch quickly, I
don't see anything that changes the torn-page problem though, right?

Hint bits aren't WAL-logged, and FPW isn't forced on the hint-bit-only
dirty, right?
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.
So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.
Checksums work even when there is no crash, so if your disk goes bad
and corrupts data then you'll know about it as soon as it happens.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote:
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.

I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.

Why don't you use the same tricks as the former patch and copy the
buffer, compute the checksum on that, and then write out that copy (you
can even do both at the same time). I have a hard time believing that
the additional copy is more expensive than the locking.

We would copy every time we write, yet lock only every time we set hint
bits.
Isn't setting hint bits also a rather frequent operation? At least in a
well-cached workload where most writeout happens due to checkpoints.
If that option is favoured, I'll write another version after Christmas.
Seems less complicated (wrt deadlocking et al) to me. But I haven't read your
patch, so I will shut up now ;)
Andres
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.

So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.
This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.
The problem is that there is no WAL indicating the hint bit change.
And if the torn page includes the new checksum but not the new hint
bit or vice versa it will be a checksum mismatch.
The strategy discussed in the past was moving all the hint bits to a
common area and skipping them in the checksum. No amount of double
writing or buffering or locking will avoid this problem.
--
greg
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote:
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations. So this does nothing to remove
the need for FPWs, though checksum detection could be used for double
write buffers also.

This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.
It will detect a problem that *is* there, but one you are classifying
as a non-problem because it is a correctable or acceptable bit
error. Given that acceptable bit errors on hints cover no more than 1%
of a block, the great likelihood is that the bit error is unacceptable
in any case, so false positive page errors are in fact very rare.
Any bit error is an indicator of problems on the external device, so
many would regard any bit error as unacceptable.
The problem is that there is no WAL indicating the hint bit change.
And if the torn page includes the new checksum but not the new hint
bit or vice versa it will be a checksum mismatch.

The strategy discussed in the past was moving all the hint bits to a
common area and skipping them in the checksum. No amount of double
writing or buffering or locking will avoid this problem.
I completely agree we should do this, but we are unable to do it now,
so this patch is a stop-gap and provides a much requested feature
*now*.
In the future, we will be able to tell the difference between an
acceptable and an unacceptable bit error. Right now, all we have is
the ability to detect a bit error, and as I point out above that is 99%
of the problem solved, at least.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark wrote:
The problem is that there is no WAL indicating the hint bit
change. And if the torn page includes the new checksum but not the
new hint bit or vice versa it will be a checksum mismatch.
With *just* this patch, true. An OS crash or hardware failure could
sometimes create an invalid page.
The strategy discussed in the past was moving all the hint bits to
a common area and skipping them in the checksum. No amount of
double writing or buffering or locking will avoid this problem.
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.
I completely agree we should do this, but we are unable to do it
now, so this patch is a stop-gap and provides a much requested
feature *now*.
Yes, for people who trust their environment to prevent torn pages, or
who are willing to tolerate one bad page per OS crash in return for
quick reporting of data corruption from unreliable file systems, this
is a good feature even without double-writes.
In the future, we will be able to tell the difference between an
acceptable and an unacceptable bit error.
A double-write patch would provide that, and it sounds like VMware
has a working patch for that which is being polished for submission.
It would need to wait until we have some consensus on the checksum
patch before it can be finalized. I'll try to review the patch from
this thread today, to do what I can to move that along.
-Kevin
On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.

ISTM we can't write and copy at the same time because the checksum is
not a trailer field.
Of course you can. If the checksum is in the trailer field you get the
nice property that the whole block has a constant checksum. However, if
you store the checksum elsewhere you just need to change the checking
algorithm to copy the checksum out, zero those bytes, run the
checksum, and compare with the extracted checksum.
Not pretty, but I don't think it makes a difference in performance.
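A minimal sketch of that check (hypothetical names and field offset, with a trivial additive sum standing in for the real checksum):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  8192
#define CKSUM_OFF  8            /* hypothetical offset of the checksum */

/* trivial stand-in for the real 16-bit checksum */
static uint16_t
page_sum(const uint8_t *buf)
{
    uint32_t s = 0;
    int      i;

    for (i = 0; i < PAGE_SIZE; i++)
        s += buf[i];
    return (uint16_t) s;
}

/* set: zero the checksum field, checksum the page, store the result */
static void
page_set_checksum(uint8_t *page)
{
    uint16_t c;

    memset(page + CKSUM_OFF, 0, sizeof(c));
    c = page_sum(page);
    memcpy(page + CKSUM_OFF, &c, sizeof(c));
}

/* check: extract the stored value, zero the field, recompute, compare */
static int
page_checksum_ok(const uint8_t *page)
{
    uint8_t  copy[PAGE_SIZE];
    uint16_t stored;

    memcpy(&stored, page + CKSUM_OFF, sizeof(stored));
    memcpy(copy, page, PAGE_SIZE);
    memset(copy + CKSUM_OFF, 0, sizeof(stored));
    return page_sum(copy) == stored;
}
```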
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote:
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations. So this does nothing to remove
the need for FPWs, though checksum detection could be used for double
write buffers also.

This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.

It will detect a problem that *is* there, but one you are classifying
as a non-problem because it is a correctable or acceptable bit
error.
I don't agree with this. We don't WAL-log hint bit changes precisely
because it's OK if they make it to disk and it's OK if they don't.
Given that, I don't see how we can say that writing out only half of a
page that has had hint bit changes is a problem. It's not.
(And if it is, then we ought to WAL-log all such changes regardless of
whether CRCs are in use.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.
I don't think double-writes require checksums on the data pages
themselves, just on the copies in the double-write buffers. In the
double-write buffer, you'll need some extra information per-page anyway,
like a relfilenode and block number that indicates which page it is in
the buffer.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.

I don't think double-writes require checksums on the data pages themselves,
just on the copies in the double-write buffers. In the double-write buffer,
you'll need some extra information per-page anyway, like a relfilenode and
block number that indicates which page it is in the buffer.
How would you know when to look in the double write buffer?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Dec 25, 2011 at 1:01 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
This chicken-and-egg
problem requires the checksum to be implemented first.
v2 of checksum patch, using a conditional copy if checksumming is
enabled, so locking is removed.
Thanks to Andres for thwacking me with the cluestick, though I have
used a simple copy rather than a copy & calc.
Tested using make installcheck with parameter on/off, then restart and
vacuumdb to validate all pages.
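Roughly, the conditional-copy write path looks like this (names, field offset, and the stand-in checksum are illustrative, not the patch's actual code):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  8192
#define CKSUM_OFF  8            /* hypothetical checksum location */

/* trivial stand-in for the real 16-bit checksum */
static uint16_t
page_sum(const uint8_t *buf)
{
    uint32_t s = 0;
    int      i;

    for (i = 0; i < PAGE_SIZE; i++)
        s += buf[i];
    return (uint16_t) s;
}

/*
 * Take a private copy of the shared buffer, set the checksum on the
 * copy, and let the caller write the copy out.  Concurrent hint-bit
 * setters can keep scribbling on the shared buffer without
 * invalidating the checksum that actually reaches disk, so no extra
 * locking is needed.
 */
static void
prepare_buffer_for_write(const uint8_t *shared_buf, uint8_t *copy,
                         int checksums_enabled)
{
    memcpy(copy, shared_buf, PAGE_SIZE);
    if (checksums_enabled)
    {
        uint16_t c;

        memset(copy + CKSUM_OFF, 0, sizeof(c));
        c = page_sum(copy);
        memcpy(copy + CKSUM_OFF, &c, sizeof(c));
    }
}
```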
Reviews, objections, user interface tweaks all welcome.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
checksum16.v2.patch (text/x-patch; charset=US-ASCII) +383 -71
On 28.12.2011 01:39, Simon Riggs wrote:
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid torn
pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.

I don't think double-writes require checksums on the data pages themselves,
just on the copies in the double-write buffers. In the double-write buffer,
you'll need some extra information per-page anyway, like a relfilenode and
block number that indicates which page it is in the buffer.

How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double write
buffer that has a valid checksum, you copy to the main storage. There's
no need to check validity of pages in the main storage.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double write buffer
that has a valid checksum, you copy to the main storage. There's no need to
check validity of pages in the main storage.
OK, then we are talking at cross purposes. Double write buffers, in
the way you explain them allow us to remove full page writes. They
clearly don't do anything to check page validity on read. Torn pages
are not the only fault we wish to correct against... and the double
writes idea is orthogonal to the idea of checksums.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28.12.2011 11:22, Simon Riggs wrote:
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double write buffer
that has a valid checksum, you copy to the main storage. There's no need to
check validity of pages in the main storage.

OK, then we are talking at cross purposes. Double write buffers, in
the way you explain them, allow us to remove full page writes. They
clearly don't do anything to check page validity on read. Torn pages
are not the only fault we wish to protect against... and the double
writes idea is orthogonal to the idea of checksums.
The reason we're talking about double write buffers in this thread is
that double write buffers can be used to solve the problem with hint
bits and checksums.
You're right, though, that it's academic whether double write buffers
can be used without checksums on data pages, if the whole point of the
exercise is to make it possible to have checksums on data pages.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Dec 28, 2011 at 5:45 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 28.12.2011 11:22, Simon Riggs wrote:
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
How would you know when to look in the double write buffer?

You scan the double-write buffer, and every page in the double write
buffer that has a valid checksum, you copy to the main storage. There's
no need to check validity of pages in the main storage.

OK, then we are talking at cross purposes. Double write buffers, in
the way you explain them, allow us to remove full page writes. They
clearly don't do anything to check page validity on read. Torn pages
are not the only fault we wish to protect against... and the double
writes idea is orthogonal to the idea of checksums.

The reason we're talking about double write buffers in this thread is that
double write buffers can be used to solve the problem with hint bits and
checksums.
Torn pages are not the only problem we need to detect.
You said "You scan the double write buffer...". When exactly would you do that?
Please explain how a double write buffer detects problems that do not
occur as the result of a crash.
We don't have much time, so please be clear and lucid.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas wrote:
On 28.12.2011 01:39, Simon Riggs wrote:
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas wrote:
On 25.12.2011 15:01, Kevin Grittner wrote:
I don't believe that. Double-writing is a technique to avoid
torn pages, but it requires a checksum to work. This chicken-and-egg
problem requires the checksum to be implemented first.

I don't think double-writes require checksums on the data pages
themselves, just on the copies in the double-write buffers. In
the double-write buffer, you'll need some extra information per-page
anyway, like a relfilenode and block number that indicates
which page it is in the buffer.
You are clearly right -- if there is no checksum in the page itself,
you can put one in the double-write metadata. I've never seen that
discussed before, but I'm embarrassed that it never occurred to me.
How would you know when to look in the double write buffer?
You scan the double-write buffer, and every page in the double
write buffer that has a valid checksum, you copy to the main
storage. There's no need to check validity of pages in the main
storage.
Right. I'll recap my understanding of double-write (from memory --
if there's a material error or omission, I hope someone will correct
me).
The write-ups I've seen on double-write techniques have all the
writes go to the double-write buffer (a single, sequential file that
stays around). This is done as sequential writing to a file which is
overwritten pretty frequently, making the writes to a controller very
fast, and a BBU write-back cache unlikely to actually write to disk
very often. On good server-quality hardware, it should be blasting
RAM-to-RAM very efficiently. The file is fsync'd (like I said,
hopefully to BBU cache), then each page in the double-write buffer is
written to the normal page location, and that is fsync'd. Once that
is done, the database writes have no risk of being torn, and the
double-write buffer is marked as empty. This all happens at the
point when you would be writing the page to the database, after the
WAL-logging.
On crash recovery you read through the double-write buffer from the
start and write the pages which look good (including a good checksum)
to the database before replaying WAL. If you find a checksum error
in processing the double-write buffer, you assume that you never got
as far as the fsync of the double-write buffer, which means you never
started writing the buffer contents to the database, which means
there can't be any torn pages there. If you get to the end and
fsync, you can be sure any torn pages from a previous attempt to
write to the database itself have been overwritten with the good copy
in the double-write buffer. Either way, you move on to WAL
processing.
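The recovery scan described above could be sketched as follows (the double-write file layout and all names are hypothetical, with a trivial additive sum standing in for the real checksum):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

/* hypothetical entry in the double-write file: metadata plus image */
typedef struct
{
    uint32_t relfilenode;       /* which relation */
    uint32_t blockno;           /* which block within it */
    uint16_t checksum;          /* checksum of the page image below */
    uint8_t  page[PAGE_SIZE];
} DWEntry;

/* trivial stand-in for the real 16-bit checksum */
static uint16_t
page_sum(const uint8_t *buf)
{
    uint32_t s = 0;
    int      i;

    for (i = 0; i < PAGE_SIZE; i++)
        s += buf[i];
    return (uint16_t) s;
}

/*
 * Recovery scan: every entry whose checksum validates would be copied
 * back to its main-storage location; an entry that fails means the
 * double-write file itself was torn, so the corresponding main-storage
 * write never began and nothing needs fixing.  Returns the number of
 * restorable pages found.
 */
static int
recover_double_write(const DWEntry *entries, int n)
{
    int i, restored = 0;

    for (i = 0; i < n; i++)
        if (page_sum(entries[i].page) == entries[i].checksum)
            restored++;         /* real code: rewrite the block here */
    return restored;
}
```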
You wind up with a database free of torn pages before you apply WAL.
full_page_writes to the WAL are not needed as long as double-write is
used for any pages which would have been written to the WAL. If
checksums were written to the double-buffer metadata instead of
adding them to the page itself, this could be implemented alone. It
would probably allow a modest speed improvement over using
full_page_writes and would eliminate those full-page images from the
WAL files, making them smaller.
If we do add a checksum to the page header, that could be used for
testing for torn pages in the double-write buffer without needing a
redundant calculation for double-write. With no torn pages in the
actual database, checksum failures there would never be false
positives. To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page after
a checkpoint), and for all unlogged writes (e.g., hint-bit-only
writes). There would be no correctness problem for always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid.
-Kevin