Block-level CRC checks
A customer of ours has been having trouble with corrupted data for some
time. Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.
So we've been tasked with adding CRCs to data files.
The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them. They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.
This code would be run-time or compile-time configurable. I'm not
absolutely sure which yet; the problem with run-time is what to do if
the user restarts the server with the setting flipped. It would have
almost no impact on users who don't enable it.
The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum
for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
A buffer's io_in_progress lock protects the buffer's CRC. We read and
pin the CRC page before acquiring the lock, to avoid having two buffer
IO operations in flight.
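The proposed flow is easy to sketch. Below is a minimal Python model of it: `flush_buffer`/`read_buffer` and the two dicts are illustrative stand-ins for FlushBuffer, ReadBuffer_common, the data file and the proposed CRC fork, not actual PostgreSQL code.

```python
import zlib

BLOCK_SIZE = 8192
storage = {}    # stands in for the data file, keyed by block number
crc_fork = {}   # stands in for the proposed per-block CRC fork

def flush_buffer(block_no, page):
    """Checksum the page just before it is written out (cf. FlushBuffer)."""
    crc_fork[block_no] = zlib.crc32(page)
    storage[block_no] = page

def read_buffer(block_no):
    """Verify the checksum just after the page is read (cf. ReadBuffer_common)."""
    page = storage[block_no]
    if zlib.crc32(page) != crc_fork[block_no]:
        raise IOError("CRC mismatch on block %d: storage-layer corruption" % block_no)
    return page

flush_buffer(0, b"\x00" * BLOCK_SIZE)
assert read_buffer(0) == b"\x00" * BLOCK_SIZE
storage[0] = b"\x01" + storage[0][1:]   # simulate a flipped byte on disk
try:
    read_buffer(0)
except IOError:
    pass                                # the corruption is detected on read
```

Note this only detects damage introduced between write and read; a page corrupted in RAM before FlushBuffer runs gets a matching checksum, consistent with the stated scope.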
I'd like to submit this for 8.4, but I want to ensure that -hackers at
large approve of this feature before starting serious coding.
Opinions?
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Sep 30, 2008 at 2:02 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
A customer of ours has been having trouble with corrupted data for some
time. Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.
Agreed.
So we've been tasked with adding CRCs to data files.
Awesome.
The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them. They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.
This is the common case.
This code would be run-time or compile-time configurable. I'm not
absolutely sure which yet; the problem with run-time is what to do if
the user restarts the server with the setting flipped. It would have
almost no impact on users who don't enable it.
I've supported this forever!
The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum
for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
A buffer's io_in_progress lock protects the buffer's CRC. We read and
pin the CRC page before acquiring the lock, to avoid having two buffer
IO operations in flight.
If the CRC gets written before the block, how is recovery going to
handle it? I'm not too familiar with the new forks stuff, but
recovery will pull the old block, compare it against the checksum, and
consider the block invalid, correct?
I'd like to submit this for 8.4, but I want to ensure that -hackers at
large approve of this feature before starting serious coding.
IMHO, this is a functionality that should be enabled by default (as it
is on most other RDBMS). It would've prevented severe corruption in
the 20 or so databases I've had to fix, and other than making it
optional, I don't see the reasoning for a separate relation fork
rather than storing it directly on the block (as everyone else does).
Similarly, I think Greg Stark was playing with a patch for it
(http://archives.postgresql.org/pgsql-hackers/2007-02/msg01850.php).
--
Jonah H. Harris, Senior DBA
myYearbook.com
Alvaro Herrera <alvherre@commandprompt.com> writes:
The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs.
That seems bizarre, and expensive, and if you lose one block of the CRC
fork you lose confidence in a LOT of data. Why not keep the CRCs in the
page headers?
A buffer's io_in_progress lock protects the buffer's CRC.
Unfortunately, it doesn't. See hint bits.
regards, tom lane
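Tom's alternative, keeping the CRC in the page header itself, could be sketched like this; the field offset and the use of zlib's CRC32 are hypothetical, but the key trick is real: the checksum field must be zeroed (or excluded) while computing the checksum, so the stored value doesn't feed back into itself.

```python
import struct
import zlib

BLOCK_SIZE = 8192
CRC_OFFSET = 8   # hypothetical offset of a 4-byte checksum field in the page header

def set_page_crc(page: bytearray) -> None:
    # zero the checksum field so the CRC covers a deterministic page image
    page[CRC_OFFSET:CRC_OFFSET + 4] = b"\x00\x00\x00\x00"
    crc = zlib.crc32(bytes(page))
    page[CRC_OFFSET:CRC_OFFSET + 4] = struct.pack("<I", crc)

def verify_page_crc(page: bytes) -> bool:
    stored = struct.unpack_from("<I", page, CRC_OFFSET)[0]
    scratch = bytearray(page)
    scratch[CRC_OFFSET:CRC_OFFSET + 4] = b"\x00\x00\x00\x00"
    return zlib.crc32(bytes(scratch)) == stored

page = bytearray(BLOCK_SIZE)
set_page_crc(page)
assert verify_page_crc(bytes(page))
page[100] ^= 0xFF                 # simulate on-disk corruption of one byte
assert not verify_page_crc(bytes(page))
```

With this layout a bad block is self-evidently bad; there is no second relation whose loss would cast doubt on many blocks at once.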
Alvaro Herrera wrote:
A customer of ours has been having trouble with corrupted data for some
time. Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.
So we've been tasked with adding CRCs to data files.
The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them. They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.
This has been suggested before, and the usual objection is precisely
that it only protects from errors in the storage layer, giving a false
sense of security.
Don't some filesystems include a per-block CRC, which would achieve
the same thing? ZFS?
This code would be run-time or compile-time configurable. I'm not
absolutely sure which yet; the problem with run-time is what to do if
the user restarts the server with the setting flipped. It would have
almost no impact on users who don't enable it.
Yeah, seems like it would need to be compile-time or initdb-time
configurable.
The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum
for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
Surely it would be much simpler to just add a field to the page header.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, 30 Sep 2008 14:33:04 -0400
"Jonah H. Harris" <jonah.harris@gmail.com> wrote:
I'd like to submit this for 8.4, but I want to ensure that -hackers
at large approve of this feature before starting serious coding.
IMHO, this is a functionality that should be enabled by default (as it
is on most other RDBMS). It would've prevented severe corruption in
What other RDBMS have it enabled by default?
Sincerely,
Joshua D. Drake
--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
On Tue, Sep 30, 2008 at 2:49 PM, Joshua Drake <jd@commandprompt.com> wrote:
On Tue, 30 Sep 2008 14:33:04 -0400
"Jonah H. Harris" <jonah.harris@gmail.com> wrote:
I'd like to submit this for 8.4, but I want to ensure that -hackers
at large approve of this feature before starting serious coding.
IMHO, this is a functionality that should be enabled by default (as it
is on most other RDBMS). It would've prevented severe corruption in
What other RDBMS have it enabled by default?
Oracle and (I believe) SQL Server >= 2005
--
Jonah H. Harris, Senior DBA
myYearbook.com
Hello Alvaro,
some random thoughts while reading your proposal follow...
Alvaro Herrera wrote:
So we've been tasked with adding CRCs to data files.
Disks get larger and relative reliability shrinks, it seems. So I agree
that this is a worthwhile thing to have. But shouldn't that be the job
of the filesystem? Think of ZFS or the upcoming BTRFS.
The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them. They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.
That sounds reasonable if we do it from Postgres.
This code would be run-time or compile-time configurable. I'm not
absolutely sure which yet; the problem with run-time is what to do if
the user restarts the server with the setting flipped. It would have
almost no impact on users who don't enable it.
I'd say calculating a CRC is cheap enough to be considered "no impact".
A single core of a modern CPU easily reaches way above 200 MiB/s
throughput for CRC32 today. See [1].
Maybe consider Adler-32, which is 3-4x faster [2], also part of zlib and
AFAIK about equally safe for 8k blocks and above.
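Since both algorithms ship in zlib, the relative speed is easy to check locally. A rough micro-benchmark over an 8 KiB block (numbers vary by CPU and zlib build, so treat this as a sketch rather than a definitive measurement):

```python
import time
import zlib

def throughput(fn, data: bytes, rounds: int = 2000) -> float:
    """Return checksum throughput in MiB/s for one function over one block."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn(data)
    elapsed = time.perf_counter() - start
    return rounds * len(data) / elapsed / (1024 * 1024)

block = bytes(range(256)) * 32   # one 8 KiB page of non-trivial data
print("crc32:   %8.1f MiB/s" % throughput(zlib.crc32, block))
print("adler32: %8.1f MiB/s" % throughput(zlib.adler32, block))
```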
The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum
for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
Huh? Aren't CRCs normally stored as part of the block they are supposed
to protect? Or how do you expect to ensure the data from the CRC
relation fork is correct? How about crash safety (a data block written,
but not its CRC block or vice versa)?
Wouldn't that double the amount of seeking required for writes?
I'd like to submit this for 8.4, but I want to ensure that -hackers at
large approve of this feature before starting serious coding.
Very cool!
Regards
Markus Wanner
[1]: Crypto++ benchmarks: http://www.cryptopp.com/benchmarks.html
[2]: Wikipedia about hash functions: http://en.wikipedia.org/wiki/List_of_hash_functions#Computational_costs_of_CRCs_vs_Hashes
Alvaro Herrera wrote:
Initially I'm aiming at a CRC32 sum
for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
There's one fundamental problem with that, related to the way our hint
bits are written.
Currently, hint bit updates are not WAL-logged, and thus no full page
write is done when only hint bits are changed. Imagine what happens if
hint bits are updated on a page, but there's no other changes, and we
crash so that only one half of the new page version makes it to disk (=
torn page). The CRC would not match, even though the page is actually valid.
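The torn-page scenario can be illustrated directly. Assume (illustratively) 4 KiB atomic sectors and a hint-bit update landing in the second sector: after a crash mid-write, the on-disk page is byte-for-byte the old, perfectly valid page, yet it fails against the checksum stored for the new image.

```python
import zlib

BLOCK, SECTOR = 8192, 4096   # disks guarantee atomicity per sector at best

old_page = bytearray(BLOCK)
new_page = bytearray(old_page)
new_page[5000] |= 0x01       # a hint-bit update in the second sector: not WAL-logged

crc_stored = zlib.crc32(bytes(new_page))   # checksum of the new image, written at flush

# crash mid-write: only the first sector of the new image reaches disk
torn = bytes(new_page[:SECTOR]) + bytes(old_page[SECTOR:])

assert torn == bytes(old_page)             # logically fine: only a hint bit is missing
assert zlib.crc32(torn) != crc_stored      # yet the page now fails its checksum
```

There is no full-page image in WAL to restore from, so this false alarm cannot be repaired at recovery, which is the crux of Heikki's objection.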
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
A customer of ours has been having trouble with corrupted data for some
time. Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.
That is an important statement: the goal is to know when corruption
happens, not necessarily to be able to recover the block or pinpoint
where in the block it is corrupt. Is that correct?
So we've been tasked with adding CRCs to data files.
CRC or checksum? If the objective is merely general "detection" there
should be some latitude in choosing the methodology for performance.
The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them. They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.
It will actually find faults in all of it. If the CPU can't add and/or a
RAM location lost a bit, this will blow up just as easily as a bad block.
It may cause "false identification" of an error, but it will keep a bad
system from hiding.
This code would be run-time or compile-time configurable. I'm not
absolutely sure which yet; the problem with run-time is what to do if
the user restarts the server with the setting flipped. It would have
almost no impact on users who don't enable it.
CPU capacity on modern hardware within a small area of RAM is practically
infinite when compared to any sort of I/O.
The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum
for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
Hell, all that is needed is a long or a short checksum value in the block.
I mean, if you just want a sanity test, it doesn't take much. Using a
second relation creates confusion. If there is a CRC discrepancy between
two different blocks, who's wrong? You need a third "control" to know. If
the block knows its CRC or checksum and that is in error, the block is
bad.
A buffer's io_in_progress lock protects the buffer's CRC. We read and
pin the CRC page before acquiring the lock, to avoid having two buffer
IO operations in flight.
I'd like to submit this for 8.4, but I want to ensure that -hackers at
large approve of this feature before starting serious coding.
Opinions?
If it's fast enough, it's a good idea. It could be very helpful in
protecting users' data.
--
Alvaro Herrera
http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, 30 Sep 2008, Heikki Linnakangas wrote:
Don't some filesystems include a per-block CRC, which would achieve the
same thing? ZFS?
Yes, there is a popular advocacy piece for ZFS with a high-level view of
why and how they implement that at
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
The guarantees are
stronger than what you can get if you just put a CRC in the block itself.
I'd never really thought too hard about putting this in the database
knowing that ZFS is available for environments where this is a concern,
but it certainly would be a nice addition.
The best analysis I've ever seen that makes a case for OS or higher level
disk checksums of some sort, by looking at the myriad ways that disks and
disk arrays fail in the real world, is in
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
(there is a shorter version that hits the high points of that at
http://www.usenix.org/publications/login/2008-06/openpdfs/bairavasundaram.pdf
)
One really interesting bit in there I'd never seen before is that they
find real data that supports the stand that enterprise drives are
significantly more reliable than consumer ones. While general failure
rates aren't that different, "SATA disks have an order of magnitude higher
probability of developing checksum mismatches than Fibre Channel disks. We
find that 0.66% of SATA disks develop at least one mismatch during the
first 17 months in the field, whereas only 0.06% of Fibre Channel disks
develop a mismatch during that time."
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Alvaro Herrera wrote:
A customer of ours has been having trouble with corrupted data for some
time. Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.
So we've been tasked with adding CRCs to data files.
Maybe a stupid question, but what I/O subsystems corrupt data and fail
to report it?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Tue, Sep 30, 2008 at 1:41 PM, Bruce Momjian <bruce@momjian.us> wrote:
Alvaro Herrera wrote:
A customer of ours has been having trouble with corrupted data for some
time. Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.
So we've been tasked with adding CRCs to data files.
Maybe a stupid question, but what I/O subsystems corrupt data and fail
to report it?
Practically all of them. Here is a good paper on various checksums, their
failure rates, and practical applications.
"Parity Lost and Parity Regained"
http://www.usenix.org/event/fast08/tech/full_papers/krioukov/krioukov_html/index.html
-jwb
On Sep 30, 2008, at 2:17 PM, pgsql@mohawksoft.com wrote:
A customer of ours has been having trouble with corrupted data for some
time. Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.
That is an important statement, to know when it happens not necessarily
to be able to recover the block or where in the block it is corrupt. Is
that correct?
Oh, correcting the corruption would be AWESOME beyond belief! But at
this point I'd settle for just knowing it had happened.
So we've been tasked with adding CRCs to data files.
CRC or checksum? If the objective is merely general "detection" there
should be some latitude in choosing the methodology for performance.
See above. Perhaps the best win would be a case where you could
choose which method you wanted. We generally have extra CPU on the
servers, so we could afford to burn some cycles with more complex
algorithms.
The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them. They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.
It will actually find faults in all of it. If the CPU can't add and/or a
RAM location lost a bit, this will blow up just as easily as a bad block.
It may cause "false identification" of an error, but it will keep a bad
system from hiding.
Well, very likely not, since the intention is to only compute the CRC
when we write the block out, at least for now. In the future I would
like to be able to detect when a CPU or memory goes bonkers and poops
on something, because that's actually happened to us as well.
The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs. Initially I'm aiming at a CRC32 sum
for each block. FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.
Hell, all that is needed is a long or a short checksum value in the
block. I mean, if you just want a sanity test, it doesn't take much.
Using a second relation creates confusion. If there is a CRC discrepancy
between two different blocks, who's wrong? You need a third "control" to
know. If the block knows its CRC or checksum and that is in error, the
block is bad.
I believe the idea was to make this as non-invasive as possible. And
it would be really nice if this could be enabled without a dump/reload
(maybe the upgrade stuff would make this possible?)
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
On Tue, 30 Sep 2008 13:48:52 -0700
"Jeffrey Baker" <jwbaker@gmail.com> wrote:
Practically all of them. Here is a good paper on various checksums,
their failure rates, and practical applications.
"Parity Lost and Parity Regained"
http://www.usenix.org/event/fast08/tech/full_papers/krioukov/krioukov_html/index.html
In a related article published in Login, "Data Corruption in the
Storage Stack: A Closer Look" [1], they say:
During a 41-month period we observed more than 400,000 instances of
checksum mismatches, 8% of which were discovered during RAID
reconstruction, creating the possibility of real data loss.
They also have a wonderful term they mention, "Silent Data corruptions".
Joshua D. Drake
[1]: Login June 2008
--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
I believe the idea was to make this as non-invasive as possible. And
it would be really nice if this could be enabled without a dump/
reload (maybe the upgrade stuff would make this possible?)
--
It's all about the probability of a duplicate check being generated. If
you use a 32 bit checksum, then you have a theoretical probability of 1 in
4 billion that a corrupt block will be missed (probably much lower
depending on your algorithm). If you use a short, then a 1 in 65 thousand
probability. If you use an 8 bit number, then 1 in 256.
Why am I going on? Well, if there are any spare bits in a block header,
they could be used for the check value.
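The arithmetic behind those odds, assuming corruption yields an effectively random check value:

```python
# chance that random corruption happens to reproduce the stored check value
for bits in (32, 16, 8):
    space = 1 << bits
    print("%2d-bit check value: 1 in %d (p = %.2e)" % (bits, space, 1.0 / space))
```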
On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote:
This has been suggested before, and the usual objection is
precisely that it only protects from errors in the storage layer,
giving a false sense of security.
If you can come up with a mechanism for detecting non-storage errors
as well, I'm all ears. :)
In the meantime, you're way, way more likely to experience corruption
at the storage layer than anywhere else. We've had several corruption
events, only one of which was memory related... and we *know* it was
memory related because we actually got logs saying so. But with a SAN
environment there's a lot of moving parts, all waiting to screw up
your data:
filesystem
SAN device driver
SAN network
SAN BIOS
drive BIOS
drive
That's above things that could hose your data outside of storage:
kernel
CPU
memory
motherboard
Don't some filesystems include a per-block CRC, which would
achieve the same thing? ZFS?
Sure, some do. We're on linux and can't run ZFS. And I'll argue that
no linux FS is anywhere near as tested as ext3 is, which means that
going to some other FS that offers you CRC means you're now exposing
yourself to the possibility of issues with the FS itself. Not to
mention that changing filesystems on a large production system is
very painful.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
On 30 Sep 2008, at 10:17 PM, Decibel! <decibel@decibel.org> wrote:
On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote:
This has been suggested before, and the usual objection is
precisely that it only protects from errors in the storage layer,
giving a false sense of security.
If you can come up with a mechanism for detecting non-storage errors
as well, I'm all ears. :)
In the meantime, you're way, way more likely to experience
corruption at the storage layer than anywhere else.
Fwiw this hasn't been my experience. Bad memory is extremely common
and even the storage failures I've seen (excluding the drive crashes)
turned out to actually be caused by bad memory.
That said I've always been interested in doing this. The main use case
in my mind has actually been for data that's been restored from old
backups which have been lying round and floating between machines for
a while with many opportunities for bit errors to show up.
The main stumbling block I ran into was how to deal with turning the
option off and on. I wanted turning the option off to make the database
ignore any errors and to avoid the overhead.
But that means including an escape hatch value which is always
considered to be correct. But that dramatically reduces the
effectiveness of the scheme.
Another issue is it will make the space available on each page smaller,
making it harder to do in-place upgrades.
If you can deal with those issues and carefully deal with the
contingencies so it's clear to people what to do when errors occur or
they want to turn the feature on or off then I'm all for it. That
despite my experience of memory errors being a lot more common than
undetected storage errors.
Joshua Drake wrote:
During a 41-month period we observed more than 400,000 instances of
checksum mismatches, 8% of which were discovered during RAID
reconstruction, creating the possibility of real data loss.
They also have a wonderful term they mention, "Silent Data corruptions".
Exactly!
From my experience, the only assumption to be made about storage is that it can
and will fail ... frequently! It is unreliable (not to mention slooow) and
should not be trusted; regardless of the price tag or brand.
This could help detect:
- fs corruption
- vfs bug
- raid controller firmware bug
- bad disk sector
- power crash
- weird martian-like raid rebuilds
Although, this idea won't prevent anything. Everything would still sinisterly
fail on you. The difference is, no more silence.
--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
If you are concerned with data integrity (not caused by bugs in the code
itself), you may be interested in utilizing ZFS. However, be aware that I
found and reported a bug in their implementation of the Fletcher checksum
algorithm they use by default to verify the integrity of the data stored
in their file system. Be aware, too, that checksums/CRCs do not in
general enable the correction of errors, so be prepared to decide "what
should be done in the event of a failure": ZFS effectively locks up in
certain circumstances rather than risk silently using suspect data,
lacking any form of persistent indication that the result may be
corrupted. (Strong CRCs and FECs are relatively inexpensive to compute.)
So in summary, my two cents: a properly implemented 32/64-bit Fletcher
checksum is likely adequate to detect most errors, and even to correct
them if the error is presumed to be a single flipped bit within 128KB or
so (such a Fletcher checksum has a Hamming distance of 3 within blocks of
this size, albeit correction by trial and error is fairly expensive).
Further presuming that this cannot be relied upon, a strategy of
potentially utilizing the suspect data as if it were good likely needs to
be adopted, accompanied somehow by a persistent indication that the query
results (or specific sub-results) are themselves suspect, as that may
often be a lesser evil than the alternative (but not always). Or use a
file system like ZFS, and let it do its thing, and hope for the best.
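For reference, a Fletcher checksum of the kind discussed is only a few lines; this is a sketch of Fletcher-32 over 16-bit words, not the exact variant ZFS uses by default (ZFS's default is a fletcher4 variant over larger words):

```python
def fletcher32(data: bytes) -> int:
    """Fletcher-32 over 16-bit little-endian words (illustrative sketch)."""
    if len(data) % 2:
        data += b"\x00"                 # pad to a whole number of words
    s1 = s2 = 0
    for i in range(0, len(data), 2):
        s1 = (s1 + int.from_bytes(data[i:i + 2], "little")) % 65535
        s2 = (s2 + s1) % 65535          # position-sensitive running sum
    return (s2 << 16) | s1

block = bytes(8192)
assert fletcher32(b"\x01" + block[1:]) != fletcher32(block)  # one flipped bit is caught
```

Unlike a plain sum, the second accumulator makes the result depend on word order, which is what buys the error-detection properties mentioned above.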
Joshua D. Drake wrote:
...
ZFS is not an option; generally speaking.
Then in general, if the corruption occurred within the:
- read path, try again and hope it takes care of itself.
- write path, the best that can be hoped for is a single bit error
within the data itself which can be both detected and corrected
with a sufficiently strong check sum; or worst case if address or
control information was corrupted, god knows what happened to the
data, and what other data may have been destroyed by having the
data written to the wrong blocks and typically unrecoverable.
- drive itself, this is most typically very unlikely, as strong FEC
codes typically prevent the misidentification of unrecoverable
data as being otherwise.
The simplest thing to do would seem to be: upon reading blocks, check
the checksum; if bad, try the read again. If that doesn't fix the
problem, assume a single-bit error and iteratively flip single bits
until the checksum matches (hopefully not making the problem worse, as
may be the case if many bits were actually already in error), write the
data back, and proceed as normal, possibly logging the action. Otherwise
presume the data is unrecoverable and in error, and somehow mark it as
being so, such that subsequent queries which may utilize any portion of
it know it may be corrupt. (I suspect this may be best done not on
file-system blocks, but actually on logical rows or even individual
entries if very large, as my best initial guess; it is likely to
measurably affect performance when enabled, and I haven't a clue how a
resulting query should/could be identified as being potentially corrupt
without confusing the client which requested it.)
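The trial-and-error repair described above can be sketched as follows. It only works under the stated assumption of exactly one flipped bit, and is brute force: an 8 KiB page means up to 65,536 CRC recomputations per bad block (the function name and the use of CRC32 are illustrative).

```python
import zlib

def repair_single_bit(page: bytes, expected_crc: int):
    """Flip each bit in turn until the CRC matches; return the repaired
    page, or None if no single-bit flip explains the mismatch (i.e. the
    damage is worse than the single-bit assumption allows)."""
    if zlib.crc32(page) == expected_crc:
        return page                          # nothing to repair
    buf = bytearray(page)
    for i in range(len(buf)):
        for bit in range(8):
            buf[i] ^= 1 << bit
            if zlib.crc32(bytes(buf)) == expected_crc:
                return bytes(buf)
            buf[i] ^= 1 << bit               # undo and keep scanning
    return None

original = bytes(range(256)) * 4             # small stand-in for a data page
crc = zlib.crc32(original)
corrupt = bytearray(original)
corrupt[17] ^= 0x08                          # flip exactly one bit
assert repair_single_bit(bytes(corrupt), crc) == original
```

CRC32 detects all single-bit errors and gives distinct syndromes for distinct single-bit positions at these block sizes, so when a single-bit repair is found it is the right one; multi-bit damage simply returns None.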