Re: TODO list

Started by Bruce Momjian almost 25 years ago, 21 messages

pgman@candle.pha.pa.us

almost 25 years ago

Bruce,

Two changes for the TODO list.

1. Under "RELIABILITY/MISC", add:

Write out a CRC with each data block, and verify it on reading.

2. Under SOURCE CODE, I believe Tom has already implemented:

Correct CRC WAL code to be a real CRC64 algorithm

TODO updated. I know we did number 2, but did we agree on #1 and is it
done?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Import Notes

Reply to msg id not found: 20010404135342.A9514@store.zembu.com

Tom Lane

tgl@sss.pgh.pa.us

almost 25 years ago

In reply to: Bruce Momjian (#1)

Re: Re: TODO list

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Two changes for the TODO list.

1. Under "RELIABILITY/MISC", add:

Write out a CRC with each data block, and verify it on reading.

2. Under SOURCE CODE, I believe Tom has already implemented:

Correct CRC WAL code to be a real CRC64 algorithm

TODO updated. I know we did number 2, but did we agree on #1 and is it
done?

#2 is indeed done. #1 is not done, and possibly not agreed to ---
I think Vadim had doubts about its usefulness, though personally I'd
like to see it.

regards, tom lane

Bruce Momjian

pgman@candle.pha.pa.us

almost 25 years ago

In reply to: Tom Lane (#2)

Re: Re: TODO list

TODO updated. I know we did number 2, but did we agree on #1 and is it
done?

#2 is indeed done. #1 is not done, and possibly not agreed to ---
I think Vadim had doubts about its usefulness, though personally I'd
like to see it.

That was my recollection too. This was the discussion about testing the
disk hardware. #1 removed.

Zeugswetter Andreas SB

ZeugswetterA@wien.spardat.at

almost 25 years ago

In reply to: Bruce Momjian (#3)

AW: Re: TODO list

1. Under "RELIABILITY/MISC", add:

Write out a CRC with each data block, and verify it on reading.

TODO updated. I know we did number 2, but did we agree on #1 and is it
done?

Has anybody done performance and reliability tests with CRC64 ?
I think it must be a CPU eater. It looks a lot more complex than a CRC32.

Since we need to guard a maximum of 32k bytes for pg pages I would - if at all -
consider to use a 32bit adler instead of a CRC, since that is a lot cheaper
to calculate.

Andreas
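
For readers who want to see the two candidates side by side, here is a minimal Python sketch (not PostgreSQL code) that computes a 32-bit Adler checksum and a CRC-32 over one 32k page with the standard zlib module. The page contents and the helper name page_checksums are made up for illustration. It only shows that both notice a single flipped bit; it says nothing about relative cost.

```python
import zlib

PAGE_SIZE = 32 * 1024  # the 32k upper bound on page size mentioned above


def page_checksums(page):
    """Return (adler32, crc32) of a page, each as an unsigned 32-bit int."""
    return zlib.adler32(page), zlib.crc32(page)


# Hypothetical page contents, for illustration only.
page = bytes(range(256)) * (PAGE_SIZE // 256)
adler_ok, crc_ok = page_checksums(page)

# Flip a single bit somewhere in the page.
damaged = bytearray(page)
damaged[1000] ^= 0x01
adler_bad, crc_bad = page_checksums(bytes(damaged))

# Both checksums change when one bit of the page changes.
print(adler_ok != adler_bad, crc_ok != crc_bad)
```

Which of the two is cheaper in practice depends on the implementation and the CPU, so any timing should be measured on the target machine rather than assumed.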

Tom Lane

tgl@sss.pgh.pa.us

almost 25 years ago

In reply to: Zeugswetter Andreas SB (#4)

Re: AW: Re: TODO list

Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at> writes:

Has anybody done performance and reliability tests with CRC64 ?
I think it must be a CPU eater. It looks a lot more complex than a CRC32.

On my box (PA-RISC) the inner loop is about 14 cycles/byte, vs. about
7 cycles/byte for CRC32. On almost any machine, either one will be
negligible in comparison to the cost of disk I/O.

Since we need to guard a maximum of 32k bytes for pg pages I would -
if at all - consider to use a 32bit adler instead of a CRC, since that
is a lot cheaper to calculate.

You are several months too late to re-open that argument. It's done and
it's not changing for 7.1.

regards, tom lane
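
To illustrate why a 64-bit CRC need not be much more complex than a 32-bit one, here is a rough table-driven sketch in Python. It is not the 7.1 WAL code: the function names are invented, and the 64-bit polynomial (the reflected ECMA-182 polynomial) is an assumption about which CRC64 is meant. The per-byte loop is identical for both widths; only the width of the running value differs, which is plausibly where the extra cycles go on a machine with 32-bit registers.

```python
import zlib


def make_table(poly):
    """Build a 256-entry lookup table for a reflected (LSB-first) CRC."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


def crc(data, table, width):
    """Table-driven CRC: one lookup, one shift and two XORs per byte."""
    mask = (1 << width) - 1
    value = mask                      # start with all ones
    for b in data:
        value = table[(value ^ b) & 0xFF] ^ (value >> 8)
    return value ^ mask               # final inversion


CRC32_POLY = 0xEDB88320               # reflected CRC-32 polynomial, as used by zlib
CRC64_POLY = 0xC96C5795D7870F42       # reflected ECMA-182 polynomial (assumed)

TABLE32 = make_table(CRC32_POLY)
TABLE64 = make_table(CRC64_POLY)

page = bytes(8192)
# Sanity check: the 32-bit variant agrees with zlib's CRC-32.
print(crc(page, TABLE32, 32) == zlib.crc32(page))
print(hex(crc(page, TABLE64, 64)))
```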

Ken Hirsch

kahirsch@bellsouth.net

almost 25 years ago

In reply to: Bruce Momjian (#3)

Re: Re: TODO list

TODO updated. I know we did number 2, but did we agree on #1 and is it
done?

#2 is indeed done. #1 is not done, and possibly not agreed to ---
I think Vadim had doubts about its usefulness, though personally I'd
like to see it.

That was my recollection too. This was the discussion about testing the
disk hardware. #1 removed.

What is recommended in the bible (Gray and Reuter), especially for larger
disk block sizes that may not be written atomically, is to have a word at
the end of the block that must match a word at the beginning of the block. It
gets changed each time you write the block.

Ken Hirsch
All your database are belong to us.
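
The matching-word scheme described above can be sketched as follows. This is a toy illustration in Python, not code from Gray and Reuter or from PostgreSQL; the block size, the 32-bit word and the names stamp_block and is_torn are assumptions. It models a write that stops halfway through the block.

```python
import struct

BLOCK_SIZE = 8192
WORD = struct.Struct("<I")  # one 32-bit word at each end of the block


def stamp_block(payload, version):
    """Build a block whose first and last words both hold `version`."""
    assert len(payload) == BLOCK_SIZE - 2 * WORD.size
    tag = WORD.pack(version & 0xFFFFFFFF)
    return tag + payload + tag


def is_torn(block):
    """A block is suspect if its first and last words disagree."""
    return block[:WORD.size] != block[-WORD.size:]


payload = bytes(BLOCK_SIZE - 2 * WORD.size)
old = stamp_block(payload, 41)
new = stamp_block(payload, 42)   # the word changes on every write

# Simulate a write that stopped halfway: new first half, old second half.
half = BLOCK_SIZE // 2
torn = new[:half] + old[half:]
print(is_torn(old), is_torn(new), is_torn(torn))  # False False True
```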

Noname

ncm@zembu.com

almost 25 years ago

In reply to: Ken Hirsch (#6)

Re: Re: TODO list

On Thu, Apr 05, 2001 at 04:25:42PM -0400, Ken Hirsch wrote:

TODO updated. I know we did number 2, but did we agree on #1 and is it
done?

#2 is indeed done. #1 is not done, and possibly not agreed to ---
I think Vadim had doubts about its usefulness, though personally I'd
like to see it.

That was my recollection too. This was the discussion about testing the
disk hardware. #1 removed.

What is recommended in the bible (Gray and Reuter), especially for larger
disk block sizes that may not be written atomically, is to have a word at
the end of the block that must match a word at the beginning of the block. It
gets changed each time you write the block.

That only works if your blocks are atomic. Even SCSI disks reorder
sector writes, and they are free to write the first and last sectors
of an 8k-32k block, and not have written the intermediate blocks
before the power goes out. On IDE disks it is of course far worse.

(On many (most?) IDE drives, even when they have been told to report
write completion only after data is physically on the platter, they will
"forget" if they see activity that looks like benchmarking. Others just
ignore the command, and in any case they all default to unsafe mode.)

If the reason that a block CRC isn't on the TODO list is that Vadim
objects, maybe we should hear some reasons why he objects? Maybe
the objections could be dealt with, and everyone satisfied.

Nathan Myers
ncm@zembu.com
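
The reordering case can be modelled in a few lines of Python. This is a sketch under stated assumptions (512-byte sectors, an 8k block, a 4-byte matching word, made-up helper names), not a description of any real drive. The first and last sectors of the new block reach the disk, the middle sectors do not: the end words agree, while a CRC over the whole block does not match the one computed for the new contents.

```python
import zlib

SECTOR = 512
BLOCK_SIZE = 8192
TAG = 4  # bytes used for the matching word at each end


def stamp(payload, version):
    """Put the same version word at the start and end of the block."""
    tag = version.to_bytes(TAG, "little")
    return tag + payload + tag


def ends_match(block):
    return block[:TAG] == block[-TAG:]


body = BLOCK_SIZE - 2 * TAG
old = stamp(b"A" * body, 41)
new = stamp(b"B" * body, 42)

# The drive wrote the first and last sectors of the new block, then lost power.
on_disk = new[:SECTOR] + old[SECTOR:-SECTOR] + new[-SECTOR:]

print(ends_match(on_disk))                     # True: the end-word check is fooled
print(zlib.crc32(on_disk) == zlib.crc32(new))  # False: a whole-block CRC notices
```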

Mikheev, Vadim

vmikheev@SECTORBASE.COM

almost 25 years ago

In reply to: Noname (#7)

RE: Re: TODO list

If the reason that a block CRC isn't on the TODO list is that Vadim
objects, maybe we should hear some reasons why he objects? Maybe
the objections could be dealt with, and everyone satisfied.

Unordered disk writes are covered by backing up modified blocks
in log. It allows not only catch such writes, as would CRC do,
but *avoid* them.

So, for what CRC could be used? To catch disk damages?
Disk has its own CRC for this.

Vadim

Noname

ncm@zembu.com

almost 25 years ago

In reply to: Mikheev, Vadim (#8)

Re: Re: TODO list

On Thu, Apr 05, 2001 at 02:27:48PM -0700, Mikheev, Vadim wrote:

If the reason that a block CRC isn't on the TODO list is that Vadim
objects, maybe we should hear some reasons why he objects? Maybe
the objections could be dealt with, and everyone satisfied.

Unordered disk writes are covered by backing up modified blocks
in log. It allows not only catch such writes, as would CRC do,
but *avoid* them.

So, for what CRC could be used? To catch disk damages?
Disk has its own CRC for this.

OK, this was already discussed, maybe while Vadim was absent.
Should I re-post the previous text?

Nathan Myers
ncm@zembu.com

#10

Mikheev, Vadim

vmikheev@SECTORBASE.COM

almost 25 years ago

In reply to: Noname (#9)

RE: Re: TODO list

So, for what CRC could be used? To catch disk damages?
Disk has its own CRC for this.

OK, this was already discussed, maybe while Vadim was absent.
Should I re-post the previous text?

Let's return to this discussion *after* 7.1 release.
My main objection was (and is) - no time to deal with
this issue for 7.1.

Vadim

#11

Noname

ncm@zembu.com

almost 25 years ago

In reply to: Mikheev, Vadim (#10)

Re: Re: TODO list

On Thu, Apr 05, 2001 at 02:47:41PM -0700, Mikheev, Vadim wrote:

So, for what CRC could be used? To catch disk damages?
Disk has its own CRC for this.

OK, this was already discussed, maybe while Vadim was absent.
Should I re-post the previous text?

Let's return to this discussion *after* 7.1 release.
My main objection was (and is) - no time to deal with
this issue for 7.1.

OK, everybody agreed on that before.

This doesn't read like an objection to having it on the TODO list for
some future release.

Nathan Myers
ncm@zembu.com

#12

Tom Lane

tgl@sss.pgh.pa.us

almost 25 years ago

In reply to: Mikheev, Vadim (#8)

Re: Re: TODO list

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

If the reason that a block CRC isn't on the TODO list is that Vadim
objects, maybe we should hear some reasons why he objects? Maybe
the objections could be dealt with, and everyone satisfied.

Unordered disk writes are covered by backing up modified blocks
in log. It allows not only catch such writes, as would CRC do,
but *avoid* them.

So, for what CRC could be used? To catch disk damages?
Disk has its own CRC for this.

Oh, I see. For anyone else who has trouble reading between the lines:

Blocks that have recently been written, but failed to make it down to
the disk platter intact, should be restorable from the WAL log. So we
do not need a block-level CRC to guard against partial writes.

A block-level CRC might be useful to guard against long-term data
lossage, but Vadim thinks that the disk's own CRCs ought to be
sufficient for that (and I can't say I disagree).

So the only real benefit of a block-level CRC would be to guard against
bits dropped in transit from the disk surface to someplace else, ie,
during read or during a "cp -r" type copy of the database to another
location. That's not a totally negligible risk, but is it worth the
overhead of updating and checking block CRCs? Seems dubious at best.

regards, tom lane
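
For concreteness, the proposal under discussion ("write out a CRC with each data block, and verify it on reading") might look roughly like the Python sketch below. The page layout, the 4-byte CRC slot at the start of the page, and the names seal_page, open_page and PageCorrupt are all hypothetical; this is not how any PostgreSQL release lays out a page. The example simulates one bit being dropped while the page is copied.

```python
import struct
import zlib

PAGE_SIZE = 8192
CRC_FIELD = struct.Struct("<I")   # hypothetical 4-byte CRC slot at page start


class PageCorrupt(Exception):
    pass


def seal_page(body):
    """Prefix the page body with a CRC-32 of that body before writing."""
    assert len(body) == PAGE_SIZE - CRC_FIELD.size
    return CRC_FIELD.pack(zlib.crc32(body)) + body


def open_page(page):
    """Verify the stored CRC on read; return the body or raise."""
    (stored,) = CRC_FIELD.unpack_from(page)
    body = page[CRC_FIELD.size:]
    if zlib.crc32(body) != stored:
        raise PageCorrupt("block CRC mismatch")
    return body


body = b"tuple data".ljust(PAGE_SIZE - CRC_FIELD.size, b"\0")
page = seal_page(body)
assert open_page(page) == body

# One bit dropped in transit, e.g. during a file copy.
copied = bytearray(page)
copied[4000] ^= 0x10
try:
    open_page(bytes(copied))
except PageCorrupt as exc:
    print("detected:", exc)
```

Note that the check only detects the damage. Getting at the data afterwards, and deciding whether the hardware or the software is at fault, is a separate problem.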

#13

Bruce Momjian

pgman@candle.pha.pa.us

almost 25 years ago

In reply to: Tom Lane (#12)

Re: Re: TODO list

So, for what CRC could be used? To catch disk damages?
Disk has its own CRC for this.

Oh, I see. For anyone else who has trouble reading between the lines:

Blocks that have recently been written, but failed to make it down to
the disk platter intact, should be restorable from the WAL log. So we
do not need a block-level CRC to guard against partial writes.

A block-level CRC might be useful to guard against long-term data
lossage, but Vadim thinks that the disk's own CRCs ought to be
sufficient for that (and I can't say I disagree).

So the only real benefit of a block-level CRC would be to guard against
bits dropped in transit from the disk surface to someplace else, ie,
during read or during a "cp -r" type copy of the database to another
location. That's not a totally negligible risk, but is it worth the
overhead of updating and checking block CRCs? Seems dubious at best.

Agreed.

#14

Noname

ncm@zembu.com

almost 25 years ago

In reply to: Tom Lane (#12)

Re: Re: TODO list

On Thu, Apr 05, 2001 at 06:25:17PM -0400, Tom Lane wrote:

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

If the reason that a block CRC isn't on the TODO list is that Vadim
objects, maybe we should hear some reasons why he objects? Maybe
the objections could be dealt with, and everyone satisfied.

Unordered disk writes are covered by backing up modified blocks
in log. It allows not only catch such writes, as would CRC do,
but *avoid* them.

So, for what CRC could be used? To catch disk damages?
Disk has its own CRC for this.

Blocks that have recently been written, but failed to make it down to
the disk platter intact, should be restorable from the WAL log. So we
do not need a block-level CRC to guard against partial writes.

If a block is missing some sectors in the middle, how would you know
to reconstruct it from the WAL, without a block CRC telling you that
the block is corrupt?

A block-level CRC might be useful to guard against long-term data
lossage, but Vadim thinks that the disk's own CRCs ought to be
sufficient for that (and I can't say I disagree).

The people who make the disks don't agree.

They publish the error rate they guarantee, and they meet it, more
or less. They publish a rate that is _just_ low enough to satisfy
noncritical requirements (on the correct assumption that they can't
satisfy critical requirements in any case) and high enough not to
interfere with benchmarks. They assume that if you need better
reliability you can and will provide it yourself, and rely on their
CRC only as a performance optimization.

At the raw sector level, they get (and correct) errors very frequently;
when they are not getting "enough" errors, they pack the bits more
densely until they do, and sell a higher-density drive.

So the only real benefit of a block-level CRC would be to guard against
bits dropped in transit from the disk surface to someplace else, ie,
during read or during a "cp -r" type copy of the database to another
location. That's not a totally negligible risk, but is it worth the
overhead of updating and checking block CRCs? Seems dubious at best.

Vadim didn't want to re-open this discussion until after 7.1 is out
the door, but that "dubious at best" demands an answer. See the archive
posting:

http://www.postgresql.org/mhonarc/pgsql-hackers/2001-01/msg00473.html

...

Incidentally, is the page at

http://www.postgresql.org/mhonarc/pgsql-hackers/2001-01/

the best place to find old messages? It's never worked right for me.

Nathan Myers
ncm@zembu.com

#15

Philip Warner

pjw@rhyme.com.au

almost 25 years ago

In reply to: Tom Lane (#12)

Re: Re: TODO list

At 18:25 5/04/01 -0400, Tom Lane wrote:

A block-level CRC might be useful to guard against long-term data
lossage, but Vadim thinks that the disk's own CRCs ought to be
sufficient for that (and I can't say I disagree).

So the only real benefit of a block-level CRC would be to guard against
bits dropped in transit from the disk surface to someplace else

What about guarding against file system problems, like blocks of one
(non-PG) file erroneously writing to blocks of another (PG table) file?

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd.
(A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81
Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request,
and from pgp5.ai.mit.edu:11371

#16

Mikheev, Vadim

vmikheev@SECTORBASE.COM

almost 25 years ago

In reply to: Philip Warner (#15)

RE: Re: TODO list

Blocks that have recently been written, but failed to make
it down to the disk platter intact, should be restorable from
the WAL log. So we do not need a block-level CRC to guard
against partial writes.

If a block is missing some sectors in the middle, how would you know
to reconstruct it from the WAL, without a block CRC telling you that
the block is corrupt?

On recovery we unconditionally copy *entire* block content from the log
for each block modified since last checkpoint. And we do not write new
checkpoint record (i.e. do not advance recovery start point) until we know
that all data blocks are flushed on disk (including blocks modified before
checkpointer started).

Vadim
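
The recovery rule described above can be modelled with a toy log in Python. This is a deliberate simplification with invented names (Wal, log_block, recover): it keeps only the latest full image per block and ignores ordinary log records. The point it tries to show is that recovery overwrites every block modified since the last checkpoint without first checking whether that block is damaged.

```python
BLOCK_SIZE = 8192


class Wal:
    """Toy log holding a full image of each block changed since the
    last checkpoint (names are illustrative, not PostgreSQL's)."""

    def __init__(self):
        self.images = {}          # block number -> latest full image

    def log_block(self, blkno, image):
        self.images[blkno] = bytes(image)

    def checkpoint(self, disk):
        # Only once every data block is known to be on disk may the
        # recovery start point advance, so the images can be dropped.
        for blkno, image in self.images.items():
            disk[blkno] = image
        self.images.clear()

    def recover(self, disk):
        # Unconditional: no attempt to detect whether a block is damaged.
        for blkno, image in self.images.items():
            disk[blkno] = image


disk = {0: b"a" * BLOCK_SIZE, 1: b"b" * BLOCK_SIZE}
wal = Wal()
wal.checkpoint(disk)

new0 = b"c" * BLOCK_SIZE
wal.log_block(0, new0)                 # the log record is written first ...
disk[0] = new0[:512] + disk[0][512:]   # ... then the data write is torn

wal.recover(disk)
print(disk[0] == new0)  # True
```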

#17

Tom Lane

tgl@sss.pgh.pa.us

almost 25 years ago

In reply to: Philip Warner (#15)

Re: Re: TODO list

Philip Warner <pjw@rhyme.com.au> writes:

So the only real benefit of a block-level CRC would be to guard against
bits dropped in transit from the disk surface to someplace else

What about guarding against file system problems, like blocks of one
(non-PG) file erroneously writing to blocks of another (PG table) file?

Well, what about it? Can you offer numbers demonstrating that this risk
is probable enough to justify the effort and runtime cost of a block
CRC?

If we're in the business of expending cycles to guard against
nil-probability risks, let's checksum our executables every time we
start up, to make sure they're not overwritten. Actually, we'd better
re-checksum program text memory every few seconds, in case RAM dropped
a bit since we looked last. And let's follow every memcpy by a memcmp
to make sure that didn't drop a bit. Heck, let's keep a CRC on every
palloc'd memory block. And so on and so forth. Sooner or later you've
got to draw the line at diminishing returns, both for runtime costs
and for the programming effort you spent on this stuff (instead of on
finding/fixing bugs that might bite you with far greater frequency than
anything a CRC might catch for you).

To be perfectly clear: I have actually seen bug reports trace to
problems that I think a block-level CRC might have detected (not
corrected, of course, but at least the user might have realized he had
flaky hardware a little sooner). So I do not say that the upside to
a block CRC is nil. But I am unconvinced that it exceeds the downside,
in development effort, runtime, false failure reports (is that CRC error
really due to hardware trouble, or a software bug that failed to update
the CRC? and how do you get around the CRC error to get at your data??)
etc etc.

regards, tom lane

#18

Rod Taylor

rod.taylor@inquent.com

almost 25 years ago

In reply to: Mikheev, Vadim (#8)

Re: Re: TODO list

If we're in the business of expending cycles to guard against
nil-probability risks, let's checksum our executables every time we
start up, to make sure they're not overwritten. Actually, we'd better
re-checksum program text memory every few seconds, in case RAM dropped
a bit since we looked last. And let's follow every memcpy by a memcmp
to make sure that didn't drop a bit. Heck, let's keep a CRC on every

Why does it sound like you have problems with radiation eating away at
your live memory for satellite operations?

#19

Philip Warner

pjw@rhyme.com.au

almost 25 years ago

In reply to: Tom Lane (#17)

Re: Re: TODO list

At 22:52 5/04/01 -0400, Tom Lane wrote:

What about guarding against file system problems, like blocks of one
(non-PG) file erroneously writing to blocks of another (PG table) file?

Well, what about it? Can you offer numbers demonstrating that this risk
is probable enough to justify the effort and runtime cost of a block
CRC?

Rhetorical crap aside, I've had more file system failures (including badly
mapped file data) than I have had disk hardware failures. So, if you are
considering 'bits dropped in transit', you should also be considering data
corruption not related to the hardware.

#20

Mikheev, Vadim

vmikheev@SECTORBASE.COM

almost 25 years ago

In reply to: Philip Warner (#19)

RE: Re: TODO list

To be perfectly clear: I have actually seen bug reports trace to
problems that I think a block-level CRC might have detected (not
corrected, of course, but at least the user might have realized he had
flaky hardware a little sooner). So I do not say that the upside to
a block CRC is nil. But I am unconvinced that it exceeds the
downside, in development effort, runtime, false failure reports
(is that CRC error really due to hardware trouble, or a software bug
that failed to update the CRC? and how do you get around the CRC error
to get at your data??) etc etc.

Something to remember: currently we update t_infomask (set
HEAP_XMAX_COMMITTED etc) while holding share lock on buffer -
we have to change this before block CRC implementation.

Vadim
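
A small single-threaded Python sketch of the hazard mentioned above: a CRC is computed over the page, then a hint bit is set in place, as a backend holding only a share lock might do, and the CRC computed earlier no longer matches the page contents. The offset of t_infomask within the page is hypothetical and the bit value is only illustrative; real concurrency is not modelled.

```python
import struct
import zlib

HEAP_XMAX_COMMITTED = 0x0400   # illustrative value for the hint bit
INFOMASK_OFFSET = 20           # hypothetical offset of t_infomask in the page

page = bytearray(8192)

# The writer computes the block CRC just before handing the page to the kernel.
crc_at_write = zlib.crc32(bytes(page))

# Meanwhile a reader holding only a share lock sets a hint bit in place.
(mask,) = struct.unpack_from("<H", page, INFOMASK_OFFSET)
struct.pack_into("<H", page, INFOMASK_OFFSET, mask | HEAP_XMAX_COMMITTED)

# If the bytes that reach disk include the hint bit, the stored CRC is stale.
print(crc_at_write == zlib.crc32(bytes(page)))  # False
```

A mismatch of this kind would be a software artifact rather than a hardware fault, which is why the hint-bit updates would have to be changed before a block CRC could be trusted.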

#21

Tom Lane

tgl@sss.pgh.pa.us

almost 25 years ago

In reply to: Mikheev, Vadim (#20)

Re: Re: TODO list

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Something to remember: currently we update t_infomask (set
HEAP_XMAX_COMMITTED etc) while holding share lock on buffer -
we have to change this before block CRC implementation.

Yeah, we'd lose some concurrency there.

regards, tom lane