Detecting corrupted pages earlier
Postgres has a bad habit of becoming very confused if the page header of
a page on disk has become corrupted. In particular, bogus values in the
pd_lower field tend to make it look like there are many more tuples than
there really are, and of course these "tuples" contain garbage. That
leads to core dumps, weird complaints about out-of-range transaction
numbers (the latter generally in the form of an abort referencing a
nonexistent pg_clog file), and other un-fun stuff.
I'm thinking of modifying ReadBuffer() so that it errors out if the
page read in does not contain either zeroes or a valid-looking header.
(The exception for zeroes seems to be needed for hash indexes, which
tend to initialize pages out-of-order.) This would make it much easier
for people to recognize situations where a page header has become
corrupted on disk.
Comments? Can anyone think of a scenario where this would be a bad
idea?
regards, tom lane
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Tom> Postgres has a bad habit of becoming very confused if the
Tom> page header of a page on disk has become corrupted. In
Tom> particular, bogus values in the pd_lower field tend to make
I haven't read this piece of pgsql code very carefully so I apologize
if what I suggest is already present.
One "standard" solution to handle disk page corruption is the use of
"consistency" bits.
The idea is that the bit that starts every 256th byte of a page is a
consistency bit. In an 8K page, you'd have 32 consistency bits. If the
page is in a "consistent" state, then all 32 bits will be either 0 or
1. When a page is written to disk, the "actual" bit in each c-bit
position is copied out and placed in the header (end/beginning) of the
page. With an 8K page, there will be one word that contains the
"actual" bit. Then the c-bits are all either set or reset depending on
the state when the page was last read: if the c-bits were set at read
time, then they are reset at write time. So when you read a page, if
some of the consistency bits are set and some others are reset then
you know that there was a corruption.
This is of course based on the assumption that most disk arms manage
to atomically write 256 bytes at a time.
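A minimal C sketch of the scheme, assuming an 8K page and 256-byte atomic
sectors (the names, the trailer-word placement, and the use of each
sector's leading bit are all illustrative, not from any real PostgreSQL
source):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   8192
#define SECTOR_SIZE 256
#define NBITS       (PAGE_SIZE / SECTOR_SIZE)   /* 32 consistency bits */

/* Hypothetical layout: the last 4 bytes of the page hold the saved
 * "actual" bits displaced by the consistency stamps. */
#define SAVED_WORD_OFF (PAGE_SIZE - sizeof(uint32_t))

/* Before writing the page out: copy the real high bit of each 256-byte
 * sector's first byte into the trailer word, then overwrite every one
 * of those bits with a uniform stamp (0 or 1, toggled between writes). */
static void
cbits_stamp(uint8_t *page, int stamp)
{
    uint32_t saved = 0;
    int i;

    for (i = 0; i < NBITS; i++)
    {
        uint8_t *b = &page[i * SECTOR_SIZE];

        if (*b & 0x80)
            saved |= (1u << i);
        if (stamp)
            *b |= 0x80;
        else
            *b &= 0x7f;
    }
    memcpy(page + SAVED_WORD_OFF, &saved, sizeof(saved));
}

/* After reading the page back: verify that all consistency bits agree
 * (a disagreement means a torn write), then restore the saved bits.
 * Returns 1 if the page looks consistent, 0 if it appears torn. */
static int
cbits_check_and_restore(uint8_t *page)
{
    uint32_t saved;
    int first, i;

    first = (page[0] & 0x80) != 0;
    for (i = 1; i < NBITS; i++)
        if (((page[i * SECTOR_SIZE] & 0x80) != 0) != first)
            return 0;           /* mixed stamps: torn page */

    memcpy(&saved, page + SAVED_WORD_OFF, sizeof(saved));
    for (i = 0; i < NBITS; i++)
    {
        uint8_t *b = &page[i * SECTOR_SIZE];

        if (saved & (1u << i))
            *b |= 0x80;
        else
            *b &= 0x7f;
    }
    return 1;
}
```

The point of the toggle is that a page torn between an old write and a new
one will carry a mix of old and new stamps, which is exactly what the read
check detects.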
--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh
On Mon, 17 Feb 2003, Tom Lane wrote:
Postgres has a bad habit of becoming very confused if the page header of
a page on disk has become corrupted.
What typically causes this corruption?
If it's any kind of a serious problem, maybe it would be worth keeping
a CRC of the header at the end of the page somewhere.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes:
On Mon, 17 Feb 2003, Tom Lane wrote:
Postgres has a bad habit of becoming very confused if the page header of
a page on disk has become corrupted.
What typically causes this corruption?
Well, I'd like to know that too. I have seen some cases that were
identified as hardware problems (disk wrote data to wrong sector, RAM
dropped some bits, etc). I'm not convinced that that's the whole story,
but I have nothing to chew on that could lead to identifying a software
bug.
If it's any kind of a serious problem, maybe it would be worth keeping
a CRC of the header at the end of the page somewhere.
See past discussions about keeping CRCs of page contents. Ultimately
I think it's a significant expenditure of CPU for very marginal returns
--- the layers underneath us are supposed to keep their own CRCs or
other cross-checks, and a very substantial chunk of the problem seems
to be bad RAM, against which occasional software CRC checks aren't
especially useful.
regards, tom lane
Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
On Mon, 17 Feb 2003, Tom Lane wrote:
Postgres has a bad habit of becoming very confused if the page header of
a page on disk has become corrupted.
What typically causes this corruption?
Well, I'd like to know that too. I have seen some cases that were
identified as hardware problems (disk wrote data to wrong sector, RAM
dropped some bits, etc). I'm not convinced that that's the whole story,
but I have nothing to chew on that could lead to identifying a software
bug.
If it's any kind of a serious problem, maybe it would be worth keeping
a CRC of the header at the end of the page somewhere.
See past discussions about keeping CRCs of page contents. Ultimately
I think it's a significant expenditure of CPU for very marginal returns
--- the layers underneath us are supposed to keep their own CRCs or
other cross-checks, and a very substantial chunk of the problem seems
to be bad RAM, against which occasional software CRC checks aren't
especially useful.
I believe the farthest we got was the idea of adding a CRC page
check option in case you suspected bad hardware.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, 17 Feb 2003, Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
If it's any kind of a serious problem, maybe it would be worth keeping
a CRC of the header at the end of the page somewhere.
See past discussions about keeping CRCs of page contents. Ultimately
I think it's a significant expenditure of CPU for very marginal returns
--- the layers underneath us are supposed to keep their own CRCs or
other cross-checks, and a very substantial chunk of the problem seems
to be bad RAM, against which occasional software CRC checks aren't
especially useful.
Well, I wasn't proposing the whole page, just the header. That would be
significantly cheaper (in fact, there's no real need even for a CRC;
probably just xoring all of the words in the header into one word would
be fine) and would tell you if the page was torn during the write, which
was what I was imagining the problem might be.
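A sketch of that XOR idea in C. The field names here echo PostgreSQL's
page header, but the struct layout and the check-word placement are
illustrative only, not the actual on-disk format:

```c
#include <stdint.h>

/* Hypothetical header shape for illustration. */
typedef struct
{
    uint16_t pd_lower;
    uint16_t pd_upper;
    uint16_t pd_special;
    uint16_t pd_pagesize_version;
    uint16_t pd_check;          /* XOR of the other header words */
} DemoPageHeader;

/* XOR together all header words except the check word itself. */
static uint16_t
header_xor(const DemoPageHeader *h)
{
    return h->pd_lower ^ h->pd_upper ^ h->pd_special ^
           h->pd_pagesize_version;
}

/* Call just before writing the page out. */
static void
header_seal(DemoPageHeader *h)
{
    h->pd_check = header_xor(h);
}

/* Call just after reading the page in; 0 means a damaged header. */
static int
header_ok(const DemoPageHeader *h)
{
    return h->pd_check == header_xor(h);
}
```

Like any one-word checksum, this is only a cheap tripwire: it catches a
single clobbered field reliably, but there are corruption patterns that
cancel out under XOR.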
But bad memory, well, not much you can do about that beyond saying, "buy
ECC, dude."
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes:
Well, I wasn't proposing the whole page, just the header. That would be
significantly cheaper (in fact, there's no real need even for a CRC;
probably just xoring all of the words in the header into one word would
be fine) and would tell you if the page was torn during the write, which
was what I was imagining the problem might be.
The header is only a dozen or two bytes long, so torn-page syndrome
won't result in header corruption.
The cases I've been able to study look like the header and a lot of the
following page data have been overwritten with garbage --- when it made
any sense at all, it looked like the contents of non-Postgres files (eg,
plain text), which is why I mentioned the possibility of disks writing
data to the wrong sector. Another recent report suggested that all
bytes of the header had been replaced with 0x55, which sounds more like
RAM or disk-controller malfeasance.
You're right that we don't need a heck of a powerful check to catch
this sort of thing. I was envisioning checks comparable to what's now
in PageAddItem: valid pagesize, valid version, pd_lower and pd_upper and
pd_special sane relative to each other and to the pagesize. I think this
would be nearly as effective as an XOR sum --- and it has the major
advantage of being compatible with the existing page layout. I'd like
to think we're done munging the page layout for awhile.
regards, tom lane
On Tue, 18 Feb 2003, Tom Lane wrote:
The header is only a dozen or two bytes long, so torn-page syndrome
won't result in header corruption.
No. But the checksum would detect both header corruption and torn pages.
Two for the price of one. But I don't think it's worth changing the page
layout for, either. Maybe, if anybody still cares next time the page layout
is changed, pop it in with whatever else is being changed.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Tom Lane wrote:
The cases I've been able to study look like the header and a lot of the
following page data have been overwritten with garbage --- when it made
any sense at all, it looked like the contents of non-Postgres files (eg,
plain text), which is why I mentioned the possibility of disks writing
data to the wrong sector.
That also sounds suspiciously like the behavior of certain filesystems
(Reiserfs, for one) after a crash when the filesystem prior to the
crash was highly active with writes. Had the sites that reported this
experienced OS crashes or power interruptions?
--
Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
Tom Lane wrote:
The cases I've been able to study look like the header and a lot of the
following page data have been overwritten with garbage --- when it made
any sense at all, it looked like the contents of non-Postgres files (eg,
plain text), which is why I mentioned the possibility of disks writing
data to the wrong sector.
That also sounds suspiciously like the behavior of certain filesystems
(Reiserfs, for one) after a crash when the filesystem prior to the
crash was highly active with writes.
Isn't reiserfs supposed to be more crash-resistant than ext2, rather
than less so?
Had the sites that reported this
experienced OS crashes or power interruptions?
Can't recall whether they admitted to such or not.
regards, tom lane
Tom Lane wrote on Tue, 18.02.2003 at 17:21:
Kevin Brown <kevin@sysexperts.com> writes:
Tom Lane wrote:
The cases I've been able to study look like the header and a lot of the
following page data have been overwritten with garbage --- when it made
any sense at all, it looked like the contents of non-Postgres files (eg,
plain text), which is why I mentioned the possibility of disks writing
data to the wrong sector.
That also sounds suspiciously like the behavior of certain filesystems
(Reiserfs, for one) after a crash when the filesystem prior to the
crash was highly active with writes.
I was bitten by it about a year ago as well.
Isn't reiserfs supposed to be more crash-resistant than ext2, rather
than less so?
It's supposed to be, but when it is run in (default?)
metadata-only-logging mode, then you can well get perfectly good
metadata with unallocated (zero-filled) data pages. There have been some
more severe errors as well.
-----------------
Hannu
On Mon, 2003-02-17 at 22:04, Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
On Mon, 17 Feb 2003, Tom Lane wrote:
Postgres has a bad habit of becoming very confused if the page header of
a page on disk has become corrupted.
What typically causes this corruption?
Well, I'd like to know that too. I have seen some cases that were
identified as hardware problems (disk wrote data to wrong sector, RAM
dropped some bits, etc). I'm not convinced that that's the whole story,
but I have nothing to chew on that could lead to identifying a software
bug.
If it's any kind of a serious problem, maybe it would be worth keeping
a CRC of the header at the end of the page somewhere.
See past discussions about keeping CRCs of page contents. Ultimately
I think it's a significant expenditure of CPU for very marginal returns
--- the layers underneath us are supposed to keep their own CRCs or
other cross-checks, and a very substantial chunk of the problem seems
to be bad RAM, against which occasional software CRC checks aren't
especially useful.
This is exactly why "magic numbers" or simple algorithmic bit patterns
are commonly used. If the "magic number" or bit pattern doesn't match
its page number, you know something is wrong. Storage cost
tends to be slight and CPU overhead low.
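A sketch of a page-number-keyed magic in C (the constant and the helper
names are made up for illustration). Mixing the block number into the
stamp means a page written to the wrong sector fails the check even
though its contents are internally self-consistent:

```c
#include <stdint.h>

#define DEMO_PAGE_MAGIC 0xC0DEB10Cu   /* hypothetical constant */

/* Stamp to store in the page header when block 'blockno' is written. */
static uint32_t
page_stamp(uint32_t blockno)
{
    return DEMO_PAGE_MAGIC ^ blockno;
}

/* On read: does the stored stamp match the block we meant to read? */
static int
stamp_ok(uint32_t stored_stamp, uint32_t expected_blockno)
{
    return stored_stamp == (DEMO_PAGE_MAGIC ^ expected_blockno);
}
```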
I agree with you that a CRC seems overkill for little return.
Regards,
--
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting
Tom Lane wrote:
Postgres has a bad habit of becoming very confused if the page header of
a page on disk has become corrupted. In particular, bogus values in the
pd_lower field tend to make it look like there are many more tuples than
there really are, and of course these "tuples" contain garbage. That
leads to core dumps, weird complaints about out-of-range transaction
numbers (the latter generally in the form of an abort referencing a
nonexistent pg_clog file), and other un-fun stuff.
I'm thinking of modifying ReadBuffer() so that it errors out if the
What does the *error out* mean ?
Is there a way to make our way around the pages ?
page read in does not contain either zeroes or a valid-looking header.
(The exception for zeroes seems to be needed for hash indexes, which
tend to initialize pages out-of-order.) This would make it much easier
for people to recognize situations where a page header has become
corrupted on disk.
Comments? Can anyone think of a scenario where this would be a bad
idea?
IIRC there was a similar thread long ago.
IMHO CRC isn't sufficient because CRC could be calculated
even for (silently) corrupted pages.
regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Tom Lane wrote:
I'm thinking of modifying ReadBuffer() so that it errors out if the
What does the *error out* mean ?
Mark the buffer as having an I/O error and then elog(ERROR).
Is there a way to make our way around the pages ?
If the header is corrupt, I don't think so. You'd need at the very
least to fix the bad header fields (particularly pd_lower) before you
could safely try to examine tuples. (In the cases that I've seen,
some or all of the line pointers are clobbered too, making it even less
likely that any useful data can be extracted automatically.)
Basically I'd rather have accesses to the clobbered page fail with
elog(ERROR) than with more drastic errors. Right now, the least
dangerous result you are likely to get is elog(FATAL) out of the clog
code, and you can easily get a PANIC or backend coredump instead.
IMHO CRC isn't sufficient because CRC could be calculated
even for (silently) corrupted pages.
Yeah, it seems a great expense for only marginal additional protection.
regards, tom lane
Tom Lane wrote:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Tom Lane wrote:
I'm thinking of modifying ReadBuffer() so that it errors out if the
What does the *error out* mean ?
Mark the buffer as having an I/O error and then elog(ERROR).
Is there a way to make our way around the pages ?
If the header is corrupt, I don't think so.
What I asked is how to read all other sane pages.
Once pages are corrupted, users would want to copy out the sane data ASAP.
regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Tom Lane wrote:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Is there a way to make our way around the pages ?
If the header is corrupt, I don't think so.
What I asked is how to read all other sane pages.
Oh, I see. You can do "SELECT ... LIMIT n" to get the rows before the
broken page, but there's no way to get the ones after it. My proposal
won't make this worse, but it won't make it any better either. Do you
have an idea how to get the rows after the broken page?
regards, tom lane
Tom Lane wrote:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Tom Lane wrote:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
Is there a way to make our way around the pages ?
If the header is corrupt, I don't think so.
What I asked is how to read all other sane pages.
Oh, I see. You can do "SELECT ... LIMIT n" to get the rows before the
broken page, but there's no way to get the ones after it. My proposal
won't make this worse, but it won't make it any better either. Do you
have an idea how to get the rows after the broken page?
How about adding a new option to skip corrupted pages ?
regards,
Hiroshi Inoue
http://www.geocities.jp/inocchichichi/psqlodbc/
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
How about adding a new option to skip corrupted pages ?
I have committed changes to implement checking for damaged page headers,
along the lines of last month's discussion. It includes a GUC variable
to control the response as suggested by Hiroshi.
Given the number of data-corruption reports we've seen recently, I am
more than half tempted to commit the change into the 7.3.* branch too.
However the GUC variable makes it seem a little like a new feature.
I could backpatch the whole change, or backpatch it without the GUC
variable (so the only possible response is elog(ERROR)), or leave
well enough alone. Any votes?
The patch is pretty small; I include the non-boilerplate parts below.
regards, tom lane
*** /home/tgl/pgsql/src/backend/storage/buffer/bufmgr.c.orig Tue Mar 25 09:06:11 2003
--- /home/tgl/pgsql/src/backend/storage/buffer/bufmgr.c Fri Mar 28 11:54:48 2003
***************
*** 59,64 ****
--- 60,69 ----
(*((XLogRecPtr*) MAKE_PTR((bufHdr)->data)))
+ /* GUC variable */
+ bool zero_damaged_pages = false;
+
+
static void WaitIO(BufferDesc *buf);
static void StartBufferIO(BufferDesc *buf, bool forInput);
static void TerminateBufferIO(BufferDesc *buf);
***************
*** 217,222 ****
--- 222,241 ----
{
status = smgrread(DEFAULT_SMGR, reln, blockNum,
(char *) MAKE_PTR(bufHdr->data));
+ /* check for garbage data */
+ if (status == SM_SUCCESS &&
+ !PageHeaderIsValid((PageHeader) MAKE_PTR(bufHdr->data)))
+ {
+ if (zero_damaged_pages)
+ {
+ elog(WARNING, "Invalid page header in block %u of %s; zeroing out page",
+ blockNum, RelationGetRelationName(reln));
+ MemSet((char *) MAKE_PTR(bufHdr->data), 0, BLCKSZ);
+ }
+ else
+ elog(ERROR, "Invalid page header in block %u of %s",
+ blockNum, RelationGetRelationName(reln));
+ }
}
if (isLocalBuf)
*** /home/tgl/pgsql/src/backend/storage/page/bufpage.c.orig Tue Mar 25 09:06:12 2003
--- /home/tgl/pgsql/src/backend/storage/page/bufpage.c Fri Mar 28 11:38:43 2003
***************
*** 48,53 ****
--- 46,96 ----
}
+ /*
+ * PageHeaderIsValid
+ * Check that the header fields of a page appear valid.
+ *
+ * This is called when a page has just been read in from disk. The idea is
+ * to cheaply detect trashed pages before we go nuts following bogus item
+ * pointers, testing invalid transaction identifiers, etc.
+ *
+ * It turns out to be necessary to allow zeroed pages here too. Even though
+ * this routine is *not* called when deliberately adding a page to a relation,
+ * there are scenarios in which a zeroed page might be found in a table.
+ * (Example: a backend extends a relation, then crashes before it can write
+ * any WAL entry about the new page. The kernel will already have the
+ * zeroed page in the file, and it will stay that way after restart.) So we
+ * allow zeroed pages here, and are careful that the page access macros
+ * treat such a page as empty and without free space. Eventually, VACUUM
+ * will clean up such a page and make it usable.
+ */
+ bool
+ PageHeaderIsValid(PageHeader page)
+ {
+ char *pagebytes;
+ int i;
+
+ /* Check normal case */
+ if (PageGetPageSize(page) == BLCKSZ &&
+ PageGetPageLayoutVersion(page) == PG_PAGE_LAYOUT_VERSION &&
+ page->pd_lower >= SizeOfPageHeaderData &&
+ page->pd_lower <= page->pd_upper &&
+ page->pd_upper <= page->pd_special &&
+ page->pd_special <= BLCKSZ &&
+ page->pd_special == MAXALIGN(page->pd_special))
+ return true;
+
+ /* Check all-zeroes case */
+ pagebytes = (char *) page;
+ for (i = 0; i < BLCKSZ; i++)
+ {
+ if (pagebytes[i] != 0)
+ return false;
+ }
+ return true;
+ }
+
+
/* ----------------
* PageAddItem
*
On Fri, 28 Mar 2003, Tom Lane wrote:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
How about adding a new option to skip corrupted pages ?
I have committed changes to implement checking for damaged page headers,
along the lines of last month's discussion. It includes a GUC variable
to control the response as suggested by Hiroshi.
Is zeroing the pages the only / best option? Hiroshi suggested skipping
the pages as I recall. Is there any chance of recovering data from a
trashed page manually? If so perhaps the GUC variable should allow three
options: error, zero, and skip.
Kris Jurka
Kris Jurka <books@ejurka.com> writes:
Is zeroing the pages the only / best option?
It's the only way to avoid a core dump when the system tries to process
the page. And no, I don't want to propagate the notion that "this page
is broken" beyond the buffer manager, so testing elsewhere isn't an
acceptable answer.
Basically, one should only turn this variable on after giving up on the
possibility of getting any data out of the broken page itself. It would
be folly to run with it turned on as a normal setting.
regards, tom lane