Protecting against unexpected zero-pages: proposal
A customer of ours is quite bothered about finding zero pages in an index
after a system crash. The task now is to improve the diagnosability of such
an issue and be able to definitively point to the source of zero pages.
The proposed solution below has been vetted in-house at EnterpriseDB, and I
am posting it here to see any possible problems we missed, and also to ask
whether the community would be interested in incorporating this capability.
Background:
-----------
SUSE Linux, ATCA board, 4 dual core CPUs => 8 cores, 24 GB RAM, 140 GB disk,
PG 8.3.11. RAID-1 SAS with SCSIinfo reporting that write-caching is
disabled.
The corrupted index's file contents, based on hexdump:
It has a total of 525 pages (cluster block size is 8K, per
pg_controldata).
Blocks 0 to 278 look sane.
Blocks 279 to 518 are full of zeroes.
Blocks 519 to 522 look sane.
Block 523 is filled with zeroes.
Block 524 looks sane.
The tail ends of blocks 278 and 522 have some non-zero data, meaning that
those index pages have some valid 'Special space' contents. Also, the
heads of blocks 519 and 524 look sane. These two findings imply that the
zeroing happened at an 8K page boundary. This is a standard ext3 FS with a
4K block size, which raises the question of how we can ascertain that this
was indeed a hardware/FS malfunction. And if it was a hardware/FS problem,
why didn't we see zeroes at a 1/2K boundary (generally the disk's sector
size) or a 4K boundary (the default ext3 FS block size), neither of which
aligns with an 8K boundary?
The backup from before the crash does not have these zero-pages.
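For postmortems like the hexdump analysis above, a small standalone scanner can classify 8K blocks as all-zero or not. The following is a minimal C sketch, not Postgres code; the in-memory-image interface is purely illustrative:

```c
/* Sketch: classify 8K blocks of a relation file image as all-zero or
 * not, mirroring the hexdump analysis above. Standalone illustration,
 * not Postgres code. */
#include <stddef.h>
#include <string.h>

#define BLCKSZ 8192

/* Return 1 if the 8K block starting at buf is entirely zero bytes. */
int block_is_zero(const unsigned char *buf)
{
    static const unsigned char zeroes[BLCKSZ]; /* zero-initialized */
    return memcmp(buf, zeroes, BLCKSZ) == 0;
}

/* Count all-zero blocks in an in-memory image of nblocks blocks. */
long count_zero_blocks(const unsigned char *image, long nblocks)
{
    long i, n = 0;
    for (i = 0; i < nblocks; i++)
        if (block_is_zero(image + (size_t) i * BLCKSZ))
            n++;
    return n;
}
```

In practice the same loop would read the relation file 8K at a time and print the block numbers of any all-zero blocks found.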
Disk Page Validity Check Using Magic Number
===========================================
Requirement:
------------
We have encountered quite a few zero pages in an index after a machine
crash, rendering the index unusable. Although REINDEX is an option, we have
no way of telling whether these zero pages were caused by the hardware, the
filesystem, or Postgres. Code analysis shows that Postgres being the
culprit is a very low probability; similarly, since our hardware is
considered to be of good quality, with hardware-level RAID-1 over 2 disks,
it is difficult to consider the hardware to be the problem. The ext3
filesystem being used is also quite a time-tested piece of software, so it
becomes very difficult to point fingers at any of these three components
for this corruption.
Postgres is being deployed as a component of a carrier-grade platform, and
it is required to run unattended as much as possible. There is a High
Availability monitoring component tasked with performing switchover to a
standby node in the event of any problem with the primary node. This HA
component needs to perform regular checks on the health of all the other
components, including Postgres, and take corrective action.
With the zero pages comes the difficulty of ascertaining whether these are
legitimate zero pages (Postgres considers zero pages valid, possibly left
over from a previous extend-file followed by a crash), or whether they are
the result of an FS/hardware failure.
We are required to definitively differentiate between zero pages written
by Postgres and zero pages caused by hardware failure. Obviously this is
not possible by the very nature of the problem, so we explored a few
ideas, including per-block checksums (in-block or in a checksum fork),
S.M.A.R.T. monitoring of disk drives, PageInit() before smgrextend() in
ReadBuffer_common(), and an additional member in PageHeader for a magic
number.
Following is an approach which we think is least invasive and does not
threaten code breakage, yet provides definitive detection of
corruption/data loss outside Postgres with the least performance penalty.
Implementation:
---------------
.) The basic idea is to have a magic number in every PageHeader before it is
written to disk, and check for this magic number when performing page
validity
checks.
.) To avoid adding a new field to PageHeader, and any code breakage, we
reuse
an existing member of the structure.
.) We exploit the following facts and assumptions:
-) Relations/files are extended 8 KB (BLCKSZ) at a time.
-) Every I/O unit contains PageHeader structure (table/index/fork files),
which in turn contains pd_lsn as the first member.
-) Every newly written block is considered to be zero filled.
-) PageIsNew() assumes that if pd_upper is 0 then the page is zero.
-) PageHeaderIsValid() allows zero filled pages to be considered valid.
-) Anyone wishing to use a new page has to do PageInit() on the page.
-) PageInit() does a MemSet(0) on the whole page.
-) XLogRecPtr={x,0} is considered invalid
-) XLogRecPtr={x, ~((uint32)0)} is not valid either (i.e. last byte of an
xlog
file (not segment)); we'll use this as the magic number.
... Above is my assumption, since it is not mentioned anywhere in the
code. The XLogFileSize calculation seems to support this assumption.
... If this assumption doesn't hold, then the previous assumption {x,0}
can also be used to implement this magic number (with x > 0).
-) There's only one implementation of Storage Manager, i.e. md.c.
-) smgr_extend() -> mdextend() is the only place where a relation is
extended.
-) Writing beyond EOF in a file causes the intermediate space to become a
hole, and any reads from such a hole return zero-filled pages.
-) Anybody trying to extend a file makes sure that there's no concurrent
extension going on from somewhere else.
... This is ensured either by implicit nature of the calling code, or
by
calling LockRelationForExtension().
.) In mdextend(), if the buffer being written is zero filled, then we write
the
magic number in that page's pd_lsn.
... This check can be optimized to just check sizeof(pd_lsn) worth of
buffer.
.) In mdextend(), if the buffer is being written beyond current EOF, then we
forcibly write the intermediate blocks too, and write the magic number in
each of those.
... This needs an _mdnblocks() call and FileSeek(SEEK_END)+FileWrite()
calls
for every block in the hole.
... Creation of holes is assumed to be a very limited corner case, hence
this performance hit is acceptable in these rare cases. Tests are being
planned using a real application to check how often this occurs.
.) PageHeaderIsValid() needs to be modified to accept
MagicNumber-followed-by-zeroes as a valid page (rather than a completely
zero page).
... If a page is completely filled with zeroes, this confirms that
either the filesystem or the disk storage zeroed it, since Postgres
never wrote zero pages to disk.
.) PageInit() and PageIsNew() require no change.
.) XLByteLT(), XLByteLE() and XLByteEQ() may be changed to contain
AssertMacro( !MagicNumber(a) && !MagicNumber(b) )
.) I haven't analyzed the effects of this change on the recovery code, but I
have a feeling that we might not need to change anything there.
.) We can create a contrib module (standalone binary or a loadable module)
that goes through each disk page, checks it for being zero filled, and
raises an alarm if it finds any.
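The mdextend()/PageHeaderIsValid() changes above could be sketched roughly as follows. This uses simplified stand-ins for PageHeaderData and XLogRecPtr; the struct shapes and the {0, 0xFFFFFFFF} magic value follow the assumptions listed above, not committed Postgres code:

```c
/* Sketch of the proposed scheme with simplified stand-ins for the real
 * structs; not actual Postgres code. */
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192
#define MAGIC_XRECOFF 0xFFFFFFFFu   /* assumed-invalid xrecoff */

typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;
typedef struct { XLogRecPtr pd_lsn; /* ... rest of PageHeaderData ... */ } PageHeader;

/* mdextend() side: if the page being appended is all zeroes, stamp the
 * magic LSN, so Postgres never writes a fully zero page to disk. */
void stamp_if_zero(char *page)
{
    static const char zeroes[BLCKSZ];
    if (memcmp(page, zeroes, BLCKSZ) == 0)
    {
        PageHeader *ph = (PageHeader *) page;
        ph->pd_lsn.xlogid = 0;
        ph->pd_lsn.xrecoff = MAGIC_XRECOFF;
    }
}

/* PageHeaderIsValid() side: classify what a zero-looking page means. */
typedef enum { PAGE_IN_USE, PAGE_NEW_FROM_PG, PAGE_ZEROED_EXTERNALLY } PageVerdict;

PageVerdict classify_page(const char *page)
{
    const PageHeader *ph = (const PageHeader *) page;
    static const char zeroes[BLCKSZ];

    if (memcmp(page, zeroes, BLCKSZ) == 0)
        return PAGE_ZEROED_EXTERNALLY;   /* Postgres never writes this */
    if (ph->pd_lsn.xlogid == 0 && ph->pd_lsn.xrecoff == MAGIC_XRECOFF &&
        memcmp(page + sizeof(PageHeader), zeroes,
               BLCKSZ - sizeof(PageHeader)) == 0)
        return PAGE_NEW_FROM_PG;         /* magic number followed by zeroes */
    return PAGE_IN_USE;
}
```

With something like this in place, an all-zero page on disk can only have been produced outside Postgres, which is the diagnostic distinction the proposal is after.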
Thoughts welcome.
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com
singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet
Mail sent from my BlackLaptop device
Gurjeet Singh <singh.gurjeet@gmail.com> writes:
.) The basic idea is to have a magic number in every PageHeader before it is
written to disk, and check for this magic number when performing page
validity
checks.
Um ... and exactly how does that differ from the existing behavior?
.) To avoid adding a new field to PageHeader, and any code breakage, we
reuse
an existing member of the structure.
The amount of fragility introduced by the assumptions you have to make
for this seems to me to be vastly riskier than the risk you are trying
to respond to.
regards, tom lane
On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Gurjeet Singh <singh.gurjeet@gmail.com> writes:
.) The basic idea is to have a magic number in every PageHeader before it
is
written to disk, and check for this magic number when performing page
validity
checks.
Um ... and exactly how does that differ from the existing behavior?
Right now a zero-filled page is considered valid and is treated as a new
page; see PageHeaderIsValid() -> /* Check all-zeroes case */, and
PageIsNew(). This means that looking at a zero-filled page on disk (say
after a crash) does not give us any clue whether it was indeed left zeroed
by Postgres, or whether the FS/storage failed to do its job.
With the proposed change, if it is a valid page (a page actually written by
Postgres) it will either have a sensible LSN or the magic-LSN; the LSN will
never be zero. OTOH, if we encounter a zero filled page ( => LSN={0,0)} ) it
clearly would implicate elements outside Postgres in making that page zero.
The amount of fragility introduced by the assumptions you have to make
for this seems to me to be vastly riskier than the risk you are trying
to respond to.
I understand that it is a pretty low-level change, but IMHO the change is
minimal and is being applied in well understood places. All the assumptions
listed have been effective for quite a while, and I don't see these
assumptions being affected in the near future. Most crucial assumptions we
have to work with are, that XLogPtr{n, 0xFFFFFFFF} will never be used, and
that mdextend() is the only place that extends a relation (until we
implement an md.c sibling, say flash.c or tape.c; the last change to md.c
regarding mdextend() was in January 2007).
Only mdextend() and PageHeaderIsValid() need to know this change in
behaviour, and all the other APIs work and behave the same as they do now.
This change would increase the diagnosability of zero-page issues, and help
the users point fingers at right places.
Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com
singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet
Mail sent from my BlackLaptop device
On Sun, Nov 7, 2010 at 4:23 AM, Gurjeet Singh <singh.gurjeet@gmail.com> wrote:
I understand that it is a pretty low-level change, but IMHO the change is
minimal and is being applied in well understood places. All the assumptions
listed have been effective for quite a while, and I don't see these
assumptions being affected in the near future. Most crucial assumptions we
have to work with are, that XLogPtr{n, 0xFFFFFFFF} will never be used, and
that mdextend() is the only place that extends a relation (until we
implement an md.c sibling, say flash.c or tape.c; the last change to md.c
regarding mdextend() was in January 2007).
I think the assumption that isn't tested here is what happens if the
server crashes. The logic may work fine as long as nothing goes wrong,
but if something does, it has to be fool-proof.
I think having zero-filled blocks at the end of the file if it has
been extended but hasn't been fsynced is an expected failure mode of a
number of filesystems. The log replay can't assume seeing such a block
is a problem since that may be precisely the result of the crash that
caused the replay. And if you disable checking for this during WAL
replay then you've lost your main chance to actually detect the
problem.
Another issue -- though I think a manageable one -- is that I expect
we'll want to be using posix_fallocate() sometime soon. That will
allow efficient guaranteed pre-allocated space with better contiguous
layout than currently. But ext4 can only pretend to give zero-filled
blocks, not any random bitpattern we request. I can see this being an
optional feature that is just not compatible with using
posix_fallocate() though.
It does seem like this is kind of part and parcel of adding checksums
to blocks. It's arguably kind of silly to add checksums to blocks but
have a commonly produced bitpattern in corruption cases go
undetected.
--
greg
On Sun, Nov 7, 2010 at 1:04 AM, Greg Stark <gsstark@mit.edu> wrote:
It does seem like this is kind of part and parcel of adding checksums
to blocks. It's arguably kind of silly to add checksums to blocks but
have a commonly produced bitpattern in corruption cases go
undetected.
Getting back to the checksum debate (and this seems like a
semi-version of the checksum debate), now that we have forks, could we
easily add block checksumming to a fork? It would mean writing to 2
files but that shouldn't be a problem, because until the checkpoint is
done (and thus both writes), the full-page-write in WAL is going to
take precedence on recovery.
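Assuming one 4-byte checksum per 8K data block, the addressing into such a hypothetical checksum fork is straightforward (the layout below is an illustration, not an existing fork format):

```c
/* Sketch of addressing a hypothetical per-block checksum fork, under
 * the assumption of one 4-byte checksum per 8K data block. */
#include <stdint.h>

#define BLCKSZ 8192
#define CHECKSUM_SIZE 4
#define CHECKSUMS_PER_PAGE (BLCKSZ / CHECKSUM_SIZE)  /* 2048 */

/* Which block of the checksum fork holds data block blkno's checksum? */
uint32_t checksum_fork_block(uint32_t blkno)
{
    return blkno / CHECKSUMS_PER_PAGE;
}

/* Byte offset of that checksum within the fork block. */
uint32_t checksum_fork_offset(uint32_t blkno)
{
    return (blkno % CHECKSUMS_PER_PAGE) * CHECKSUM_SIZE;
}
```

Note the clustering this implies: a single 8K fork page would cover 2048 data blocks, so damage to one fork page casts doubt on a large range of data pages at once.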
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Aidan Van Dyk <aidan@highrise.ca> writes:
Getting back to the checksum debate (and this seems like a
semi-version of the checksum debate), now that we have forks, could we
easily add block checksumming to a fork? It would mean writing to 2
files but that shouldn't be a problem, because until the checkpoint is
done (and thus both writes), the full-page-write in WAL is going to
take precedence on recovery.
Doesn't seem like a terribly good design: damage to a checksum page
would mean that O(1000) data pages are now thought to be bad.
More generally, this re-opens the question of whether data in secondary
forks is authoritative or just hints. Currently, we treat it as just
hints, for both FSM and VM, and thus sidestep the problem of
guaranteeing its correctness. To use a secondary fork for checksums,
you'd need to guarantee correctness of writes to it. This is the same
problem that index-only scans are hung up on, ie making the VM reliable.
I forget whether Heikki had a credible design sketch for making that
happen, but in any case it didn't look easy.
regards, tom lane
Gurjeet Singh <singh.gurjeet@gmail.com> writes:
On Sat, Nov 6, 2010 at 11:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Um ... and exactly how does that differ from the existing behavior?
Right now a zero-filled page is considered valid and is treated as a new
page; see PageHeaderIsValid() -> /* Check all-zeroes case */, and
PageIsNew(). This means that looking at a zero-filled page on disk (say
after a crash) does not give us any clue whether it was indeed left zeroed
by Postgres, or whether the FS/storage failed to do its job.
I think this is really a non-problem. You said earlier that the
underlying filesystem uses 4K blocks. Filesystem misfeasance would
therefore presumably affect 4K at a time. If you see that both halves
of an 8K block are zero, it's far more likely that Postgres left it that
way than that the filesystem messed up. Of course, if only one half of
an 8K page went to zeroes, you know the filesystem or disk did it.
There are also crosschecks that you can apply: if it's a heap page, are
there any index pages with pointers to it? If it's an index page, are
there downlink or sibling links to it from elsewhere in the index?
A page that Postgres left as zeroes would not have any references to it.
IMO there are a lot of methods that can separate filesystem misfeasance
from Postgres errors, probably with greater reliability than this hack.
I would also suggest that you don't really need to prove conclusively
that any particular instance is one or the other --- a pattern across
multiple instances will tell you what you want to know.
This change would increase the diagnosability of zero-page issues, and help
the users point fingers at right places.
[ shrug... ] If there were substantial user clamor for diagnosing
zero-page issues, I might be for this. As is, I think it's a
non-problem. What's more, if I did believe that this was a safe and
reliable technique, I'd be unhappy about the opportunity cost of
reserving it for zero-page testing rather than other purposes.
regards, tom lane
I wrote:
Aidan Van Dyk <aidan@highrise.ca> writes:
Getting back to the checksum debate (and this seems like a
semi-version of the checksum debate), now that we have forks, could we
easily add block checksumming to a fork?
More generally, this re-opens the question of whether data in secondary
forks is authoritative or just hints. Currently, we treat it as just
hints, for both FSM and VM, and thus sidestep the problem of
guaranteeing its correctness. To use a secondary fork for checksums,
you'd need to guarantee correctness of writes to it.
... but wait a minute. What if we treated the checksum as a hint ---
namely, on checksum failure, we just log a warning rather than doing
anything drastic? A warning is probably all you want to happen anyway.
A corrupted page of checksums would then show up as warnings for most or
all of a range of data pages, and it'd be pretty obvious (if the data
seemed OK) where the failure had been.
So maybe Aidan's got a good idea here. It would sure be a lot easier
to shoehorn checksum checking in as an optional feature if the checksums
were kept someplace else.
regards, tom lane
On Mon, Nov 8, 2010 at 5:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So maybe Aidan's got a good idea here. It would sure be a lot easier
to shoehorn checksum checking in as an optional feature if the checksums
were kept someplace else.
Would it? I thought the only problem was the hint bits being set
behind the checksummer's back. That'll still happen even if it's
written to a different place.
--
greg
On Mon, Nov 8, 2010 at 12:53 PM, Greg Stark <gsstark@mit.edu> wrote:
On Mon, Nov 8, 2010 at 5:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So maybe Aidan's got a good idea here. It would sure be a lot easier
to shoehorn checksum checking in as an optional feature if the checksums
were kept someplace else.
Would it? I thought the only problem was the hint bits being set
behind the checksummer's back. That'll still happen even if it's
written to a different place.
The problem that putting checksums in a different place solves is the
page layout (binary upgrade) problem. You're still going to need to
"buffer" the page as you calculate the checksum and write it out.
Buffering that page is absolutely necessary no matter where you put the
checksum, unless you've got an exclusive lock that blocks even hint
updates on the page.
But if we can start using forks to put "other data", that means that
keeping the page layouts is easier, and thus binary upgrades are much
more feasible.
At least, that was my thought WRT checksums being out-of-page.
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Mon, Nov 8, 2010 at 5:59 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
The problem that putting checksums in a different place solves is the
page layout (binary upgrade) problem. You're still going to need to
"buffer" the page as you calculate the checksum and write it out.
Buffering that page is absolutely necessary no matter where you put the
checksum, unless you've got an exclusive lock that blocks even hint
updates on the page.
But buffering the page only means you've got some consistent view of
the page. It doesn't mean the checksum will actually match the data in
the page that gets written out. So when you read it back in the
checksum may be invalid.
I wonder if we could get by by having some counter on the page which you
increment when you set a hint bit. That way when you read the page back
in you could compare the counter on the page with the counter for the
checksum, and if the checksum counter is behind, ignore the checksum? It
would be nice to do better but I'm not sure we can.
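The counter idea could be sketched as follows; the struct fields and verdicts are hypothetical, chosen only to illustrate the compare-and-ignore logic:

```c
/* Sketch of the counter idea: the page carries a counter bumped on
 * every hint-bit set, and the stored checksum records the counter value
 * it was computed against; a stale checksum is ignored rather than
 * treated as corruption. Field names are hypothetical. */
#include <stdint.h>

typedef struct {
    uint32_t hint_counter;   /* bumped whenever a hint bit is set */
} PageMeta;

typedef struct {
    uint32_t crc;            /* checksum of the page */
    uint32_t hint_counter;   /* page counter at checksum time */
} StoredChecksum;

typedef enum { CSUM_OK, CSUM_STALE, CSUM_MISMATCH } CsumVerdict;

CsumVerdict verify(const PageMeta *page, const StoredChecksum *stored,
                   uint32_t actual_crc)
{
    if (page->hint_counter > stored->hint_counter)
        return CSUM_STALE;      /* hint bits set since checksum: ignore */
    if (actual_crc != stored->crc)
        return CSUM_MISMATCH;   /* genuine corruption candidate */
    return CSUM_OK;
}
```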
But if we can start using forks to put "other data", that means that
keeping the page layouts is easier, and thus binary upgrades are much
more feasible.
The difficulty with the page layout didn't come from the checksum
itself. We can add 4 or 8 bytes to the page header easily enough. The
difficulty came from trying to move the hint bits for all the tuples
to a dedicated area. That means three resizable areas so either one of
them would have to be relocatable or some other solution (like not
checksumming the line pointers and putting the hint bits in the line
pointers). If we're willing to have invalid checksums whenever the
hint bits get set then this wouldn't be necessary.
--
greg
On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark <gsstark@mit.edu> wrote:
But buffering the page only means you've got some consistent view of
the page. It doesn't mean the checksum will actually match the data in
the page that gets written out. So when you read it back in the
checksum may be invalid.
I was assuming that if the code went through the trouble of buffering the
shared page to get a "stable, non-changing" copy to use for
checksumming/writing, it would write() the buffered copy it just made,
not the original in shared memory... I'm not sure how that write could be
inconsistent.
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
On Tue, Nov 9, 2010 at 2:28 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark <gsstark@mit.edu> wrote:
But buffering the page only means you've got some consistent view of
the page. It doesn't mean the checksum will actually match the data in
the page that gets written out. So when you read it back in the
checksum may be invalid.
I was assuming that if the code went through the trouble of buffering the
shared page to get a "stable, non-changing" copy to use for
checksumming/writing, it would write() the buffered copy it just made,
not the original in shared memory... I'm not sure how that write could be
inconsistent.
Oh, I'm mistaken. The problem was that buffering the writes was
insufficient to deal with torn pages. Even if you buffer the writes if
the machine crashes while only having written half the buffer out then
the checksum won't match. If the only changes on the page were hint
bit updates then there will be no full page write in the WAL log to
repair the block.
It's possible that *that* situation is rare enough to let the checksum
raise a warning but not an error.
But personally I'm pretty loath to buffer every page write. The state
of the art is zero-copy processing and we should be looking to reduce
copies rather than increase them. Though I suppose if we did a
zero-copy CRC that might actually get us this buffered write for free.
On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark <gsstark@mit.edu> wrote:
Oh, I'm mistaken. The problem was that buffering the writes was
insufficient to deal with torn pages. Even if you buffer the writes if
the machine crashes while only having written half the buffer out then
the checksum won't match. If the only changes on the page were hint
bit updates then there will be no full page write in the WAL log to
repair the block.
Huh, this implies that if we did go through all the work of
segregating the hint bits and could arrange that they all appear on
the same 512-byte sector and if we buffered them so that we were
writing the same bits we checksummed then we actually *could* include
them in the CRC after all since even a torn page will almost certainly
not tear an individual sector.
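Checksumming while skipping a dedicated hint-bit sector could look roughly like this; FNV-1a stands in for whatever real CRC would be used, and the sector placement is an assumption:

```c
/* Sketch of checksumming a page while excluding one 512-byte sector
 * reserved for hint bits. FNV-1a is a stand-in for a real CRC; the
 * sector placement is an assumption. */
#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192
#define SECTOR_SIZE 512

/* FNV-1a over the page, skipping the sector [skip, skip + 512). */
uint32_t page_checksum_skip_sector(const unsigned char *page, size_t skip)
{
    uint32_t h = 2166136261u;
    size_t i;
    for (i = 0; i < BLCKSZ; i++)
    {
        if (i >= skip && i < skip + SECTOR_SIZE)
            continue;               /* hint-bit sector: excluded */
        h = (h ^ page[i]) * 16777619u;
    }
    return h;
}
```

A change inside the skipped sector leaves the checksum unchanged, so a torn write that only affects hint bits would not raise a false alarm.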
--
greg
On Nov 9, 2010, at 9:27 AM, Greg Stark wrote:
On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark <gsstark@mit.edu> wrote:
Oh, I'm mistaken. The problem was that buffering the writes was
insufficient to deal with torn pages. Even if you buffer the writes if
the machine crashes while only having written half the buffer out then
the checksum won't match. If the only changes on the page were hint
bit updates then there will be no full page write in the WAL log to
repair the block.
Huh, this implies that if we did go through all the work of
segregating the hint bits and could arrange that they all appear on
the same 512-byte sector and if we buffered them so that we were
writing the same bits we checksummed then we actually *could* include
them in the CRC after all since even a torn page will almost certainly
not tear an individual sector.
If there's a torn page then we've crashed, which means we go through crash recovery, which puts a valid page (with valid CRC) back in place from the WAL. What am I missing?
BTW, I agree that at minimum we need to leave the option of only raising a warning when we hit a checksum failure. Some people might want Postgres to treat it as an error by default, but most folks will at least want the option to look at their (corrupt) data.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There are also crosschecks that you can apply: if it's a heap page, are
there any index pages with pointers to it? If it's an index page, are
there downlink or sibling links to it from elsewhere in the index?
A page that Postgres left as zeroes would not have any references to it.
IMO there are a lot of methods that can separate filesystem misfeasance
from Postgres errors, probably with greater reliability than this hack.
I would also suggest that you don't really need to prove conclusively
that any particular instance is one or the other --- a pattern across
multiple instances will tell you what you want to know.
Doing this postmortem on a regular deployment and fixing the problem would
not be too difficult. But this platform, which Postgres is a part of, would
be mostly left unattended once deployed (pardon me for not sharing the
details, as I am not sure if I can).
An external HA component is supposed to detect any problems (by querying
Postgres or by external means) and take evasive action. It is this
automation of problem detection that we are seeking.
As Greg pointed out, even with this hack in place, we might still get zero
pages from the FS (say, when ext3 does metadata journaling but not block
journaling). In that case we'd rely on recovery's WAL replay of relation
extension to reintroduce the magic number in pages.
What's more, if I did believe that this was a safe and
reliable technique, I'd be unhappy about the opportunity cost of
reserving it for zero-page testing rather than other purposes.
This is one of those times where you are a bit too terse for me. What
other uses do you have in mind that reserving it for zero-page testing
would preclude?
Regards,
--
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com
singh.gurjeet@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet
Mail sent from my BlackLaptop device
On Tue, Nov 9, 2010 at 4:26 PM, Jim Nasby <jim@nasby.net> wrote:
On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark <gsstark@mit.edu> wrote:
Oh, I'm mistaken. The problem was that buffering the writes was
insufficient to deal with torn pages. Even if you buffer the writes if
the machine crashes while only having written half the buffer out then
the checksum won't match. If the only changes on the page were hint
bit updates then there will be no full page write in the WAL log to
repair the block.
If there's a torn page then we've crashed, which means we go through crash recovery, which puts a valid page (with valid CRC) back in place from the WAL. What am I missing?
"If the only changes on the page were hint bit updates then there will
be no full page write in the WAL to repair the block"
--
greg
On Tue, Nov 9, 2010 at 11:26 AM, Jim Nasby <jim@nasby.net> wrote:
Huh, this implies that if we did go through all the work of
segregating the hint bits and could arrange that they all appear on
the same 512-byte sector and if we buffered them so that we were
writing the same bits we checksummed then we actually *could* include
them in the CRC after all since even a torn page will almost certainly
not tear an individual sector.
If there's a torn page then we've crashed, which means we go through crash recovery, which puts a valid page (with valid CRC) back in place from the WAL. What am I missing?
The problem case is where hint bits have been set. Hint bits have
always been "we don't really care, but we write them".
A torn page on hint-bit-only writes is OK, because with a torn page
(assuming you don't get zeroed pages) you get the old or new chunks
of the complete 8K buffer, but they are identical except for the
hint bits, for which either the old or the new state is sufficient.
But with a checksum, getting a torn page with only hint-bit updates
now gets noticed. Before, it might have happened, but we wouldn't
have noticed or cared.
So, for getting checksums, we have to offer up a few things:
1) zero-copy writes, we need to buffer the write to get a consistent
checksum (or lock the buffer tight)
2) saving hint-bits on an otherwise unchanged page. We either need to
just not write that page, and lose the work the hint-bits did, or do
a full-page WAL of it, so the torn-page checksum is fixed
Both of these are theoretical performance tradeoffs. How badly do we
want to verify on read that it is *exactly* what we thought we wrote?
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Gurjeet Singh <singh.gurjeet@gmail.com> writes:
On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
IMO there are a lot of methods that can separate filesystem misfeasance
from Postgres errors, probably with greater reliability than this hack.
Doing this postmortem on a regular deployment and fixing the problem would
not be too difficult. But this platform, which Postgres is a part of, would
be mostly left unattended once deployed (pardon me for not sharing the
details, as I am not sure if I can).
An external HA component is supposed to detect any problems (by querying
Postgres or by external means) and take an evasive action. It is this
automation of problem detection that we are seeking.
To be blunt, this argument is utter nonsense. The changes you propose
would still require manual analysis of any detected issues in order to
do anything useful about them. Once you know that there is, or isn't,
a filesystem-level error involved, what are you going to do next?
You're going to go try to debug the component you know is at fault,
that's what. And that problem is still AI-complete.
regards, tom lane
On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
So, for getting checksums, we have to offer up a few things:
1) zero-copy writes, we need to buffer the write to get a consistent
checksum (or lock the buffer tight)
2) saving hint-bits on an otherwise unchanged page. We either need to
just not write that page, and lose the work the hint-bits did, or do
a full-page WAL of it, so the torn-page checksum is fixed
Actually the consensus the last go-around on this topic was to
segregate the hint bits into a single area of the page and skip them
in the checksum. That way we don't have to do any of the above. It's
just that that's a lot of work.
--
greg