Page Checksums + Double Writes

Started by David Fetter over 14 years ago · 49 messages · hackers
#1David Fetter
david@fetter.org

Folks,

One of the things VMware is working on is double writes, per previous
discussions of how, for example, InnoDB does things. I'd initially
thought that introducing just one of the features in $Subject at a
time would help, but I'm starting to see a mutual dependency.

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

If submitting these things together seems like a better idea than
having them arrive separately, I'll work with my team here to make
that happen soonest.

There's a separate issue we'd like to get clear on, which is whether
it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If so, there's less to do, but pg_upgrade as it currently stands is
broken.

If not, we'll have to do some extra work on the patch as described
below. Thanks to Kevin Grittner for coming up with this :)

- Use a header bit to say whether we've got a checksum on the page.
We're using 3/16 of the available bits as described in
src/include/storage/bufpage.h.

- When that bit is set, place the checksum somewhere convenient on the
page. One way to do this would be to have an optional field at the
end of the special space based on the new bit. Rows from pg_upgrade
would have the bit clear, and would have the shorter special
structure without the checksum. (A rough sketch follows below.)
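
Here is that rough sketch; the flag name, the 16-bit width, and reading
from the very end of the page are illustrative assumptions only, not part
of the actual proposal:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bit; the real bit assignments live in
 * src/include/storage/bufpage.h and are not fixed by this mail. */
#define PD_PAGE_CHECKSUMMED 0x0008

/*
 * If the page carries the flag, read the checksum stored in the last two
 * bytes of its special space (the end of the page).  Pages carried over
 * by pg_upgrade keep the flag clear and the shorter special layout, so
 * they simply report "no checksum present".
 */
static bool
page_special_checksum(const uint8_t *page, uint16_t pd_flags,
                      uint16_t page_size, uint16_t *checksum)
{
    if ((pd_flags & PD_PAGE_CHECKSUMMED) == 0)
        return false;
    memcpy(checksum, page + page_size - sizeof(uint16_t), sizeof(uint16_t));
    return true;
}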

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Fetter (#1)
Re: Page Checksums + Double Writes

Excerpts from David Fetter's message of Wed Dec 21 18:59:13 -0300 2011:

If not, we'll have to do some extra work on the patch as described
below. Thanks to Kevin Grittner for coming up with this :)

- Use a header bit to say whether we've got a checksum on the page.
We're using 3/16 of the available bits as described in
src/include/storage/bufpage.h.

- When that bit is set, place the checksum somewhere convenient on the
page. One way to do this would be to have an optional field at the
end of the special space based on the new bit. Rows from pg_upgrade
would have the bit clear, and would have the shorter special
structure without the checksum.

If you get away with a new page format, let's make sure and coordinate
so that we can add more info into the header. One thing I wanted was to
have an ID struct on each file, so that you know what
DB/relation/segment the file corresponds to. So the first page's
special space would be a bit larger than the others.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Alvaro Herrera (#2)
Re: Page Checksums + Double Writes

Alvaro Herrera <alvherre@commandprompt.com> wrote:

If you get away with a new page format, let's make sure and
coordinate so that we can add more info into the header. One
thing I wanted was to have an ID struct on each file, so that you
know what DB/relation/segment the file corresponds to. So the
first page's special space would be a bit larger than the others.

Couldn't that also be done by burning a bit in the page header
flags, without a page layout version bump? If that were done, you
wouldn't have the additional information on tables converted by
pg_upgrade, but you would get them on new tables, including those
created by pg_dump/psql conversions. Adding them could even be made
conditional, although I don't know whether that's a good idea....

-Kevin

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Kevin Grittner (#3)
Re: Page Checksums + Double Writes

On Wed, Dec 21, 2011 at 10:19 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Alvaro Herrera <alvherre@commandprompt.com> wrote:

If you get away with a new page format, let's make sure and
coordinate so that we can add more info into the header.  One
thing I wanted was to have an ID struct on each file, so that you
know what DB/relation/segment the file corresponds to.  So the
first page's special space would be a bit larger than the others.

Couldn't that also be done by burning a bit in the page header
flags, without a page layout version bump?  If that were done, you
wouldn't have the additional information on tables converted by
pg_upgrade, but you would get them on new tables, including those
created by pg_dump/psql conversions.  Adding them could even be made
conditional, although I don't know whether that's a good idea....

These are good thoughts because they overcome the major objection to
doing *anything* here for 9.2.

We don't need to use any flag bits at all. We add
PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
becomes an initdb option. All new pages can be created with
PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
be either the layout version from this release (4) or the next version
(5). Page validity then becomes version dependent.
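
In code terms, "version dependent" validity would look roughly like this
sketch (the helper names are placeholders, not existing routines):

#include <stdbool.h>
#include <stdint.h>

/* Assumed helpers standing in for the existing header checks and a
 * to-be-written CRC check; neither exists under these names. */
extern bool layout_v4_header_valid(const void *page);
extern bool layout_v5_crc_valid(const void *page);

static bool
page_is_valid(uint16_t layout_version, const void *page)
{
    switch (layout_version)
    {
        case 4:     /* current layout, no CRC */
            return layout_v4_header_valid(page);
        case 5:     /* proposed layout: the same checks plus the CRC */
            return layout_v4_header_valid(page) &&
                   layout_v5_crc_valid(page);
        default:
            return false;   /* unknown layout */
    }
}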

pg_upgrade still works.

Layout 5 is where we add CRCs, so it's basically optional.

We can also have a utility that allows you to bump the page version
for all new pages, even after you've upgraded, so we may end up with a
mix of page layout versions in the same relation. That's more
questionable, but I see no problem with it.

Do we need CRCs as a table level option? I hope not. That complicates
many things.

All of this allows us to have another, more efficient page version (6)
in future without problems, so it's good infrastructure.

I'm now personally game on to make something work here for 9.2.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Fetter (#1)
Re: Page Checksums + Double Writes

David Fetter <david@fetter.org> writes:

There's a separate issue we'd like to get clear on, which is whether
it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If you're not going to provide pg_upgrade support, I think there is no
chance of getting a new page layout accepted. The people who might want
CRC support are pretty much exactly the same people who would find lack
of pg_upgrade a showstopper.

Now, given the hint bit issues, I rather doubt that you can make this
work without a page format change anyway. So maybe you ought to just
bite the bullet and start working on the pg_upgrade problem, rather than
imagining you will find an end-run around it.

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

Um. So how is that going to work if checksums are optional?

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#4)
Re: Page Checksums + Double Writes

Simon Riggs <simon@2ndQuadrant.com> writes:

We don't need to use any flag bits at all. We add
PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
becomes an initdb option. All new pages can be created with
PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
be either the layout version from this release (4) or the next version
(5). Page validity then becomes version dependent.

We can also have a utility that allows you to bump the page version
for all new pages, even after you've upgraded, so we may end with a
mix of page layout versions in the same relation. That's more
questionable but I see no problem with it.

It seems like you've forgotten all of the previous discussion of how
we'd manage a page format version change.

Having two different page formats running around in the system at the
same time is far from free; in the worst case it means that every single
piece of code that touches pages has to know about and be prepared to
cope with both versions. That's a rather daunting prospect, from a
coding perspective and even more from a testing perspective. Maybe
the issues can be kept localized, but I've seen no analysis done of
what the impact would be or how we could minimize it. I do know that
we considered the idea and mostly rejected it a year or two back.

A "utility to bump the page version" is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old. What does it do when the old page is
too full to be converted? "Move some data somewhere else" might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that. At the very least it would imply that the utility has full
knowledge about every index type in the system.

I'm now personally game on to make something work here for 9.2.

If we're going to freeze 9.2 in the spring, I think it's a bit late
for this sort of work to be just starting. What you've just described
sounds to me like possibly a year's worth of work.

regards, tom lane

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#6)
Re: Page Checksums + Double Writes

On Wed, Dec 21, 2011 at 11:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

It seems like you've forgotten all of the previous discussion of how
we'd manage a page format version change.

Maybe I've had too much caffeine. It's certainly late here.

Having two different page formats running around in the system at the
same time is far from free; in the worst case it means that every single
piece of code that touches pages has to know about and be prepared to
cope with both versions.  That's a rather daunting prospect, from a
coding perspective and even more from a testing perspective.  Maybe
the issues can be kept localized, but I've seen no analysis done of
what the impact would be or how we could minimize it.  I do know that
we considered the idea and mostly rejected it a year or two back.

I'm looking at that now.

My feeling is it probably depends upon how different the formats are,
so given we are discussing a 4 byte addition to the header, it might
be doable.

I'm investing some time on the required analysis.

A "utility to bump the page version" is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old.  What does it do when the old page is
too full to be converted?  "Move some data somewhere else" might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that.  At the very least it would imply that the utility has full
knowledge about every index type in the system.

I agree, rewriting every page is completely out and I never even considered it.

I'm now personally game on to make something work here for 9.2.

If we're going to freeze 9.2 in the spring, I think it's a bit late
for this sort of work to be just starting.

I agree with that. If this goes adrift it will have to be killed for 9.2.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#8Rob Wultsch
wultsch@gmail.com
In reply to: David Fetter (#1)
Re: Page Checksums + Double Writes

On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david@fetter.org> wrote:

One of the things VMware is working on is double writes, per previous
discussions of how, for example, InnoDB does things.

The world is moving to flash, and the lifetime of flash is measured in
writes. Potentially doubling the number of writes is potentially
halving the life of the flash.

Something to think about...

--
Rob Wultsch
wultsch@gmail.com

#9Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#7)
Re: Page Checksums + Double Writes

On Wed, Dec 21, 2011 at 7:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

My feeling is it probably depends upon how different the formats are,
so given we are discussing a 4 byte addition to the header, it might
be doable.

I agree. When thinking back on Zoltan's patches, it's worth
remembering that he had a number of pretty bad ideas mixed in with the
good stuff - such as taking a bunch of things that are written as
macros for speed, and converting them to function calls. Also, he
didn't make any attempt to isolate the places that needed to know
about both page versions; everybody knew about everything, everywhere,
and so everything needed to branch in places where it had not needed
to do so before. I don't think we should infer from the failure of
those patches that no one can do any better.

On the other hand, I also agree with Tom that the chances of getting
this done in time for 9.2 are virtually zero, assuming that (1) we
wish to ship 9.2 in 2012 and (2) we don't wish to be making
destabilizing changes beyond the end of the last CommitFest. There is
a lot of work here, and I would be astonished if we could wrap it all
up in the next month. Or even the next four months.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10David Fetter
david@fetter.org
In reply to: Rob Wultsch (#8)
Re: Page Checksums + Double Writes

On Wed, Dec 21, 2011 at 04:18:33PM -0800, Rob Wultsch wrote:

On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david@fetter.org> wrote:

One of the things VMware is working on is double writes, per
previous discussions of how, for example, InnoDB does things.

The world is moving to flash, and the lifetime of flash is measured
writes. Potentially doubling the number of writes is potentially
halving the life of the flash.

Something to think about...

Modern flash drives let you have more write cycles than modern
spinning rust, so while yes, there is something happening, it's also
happening to spinning rust, too.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#11Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#7)
Re: Page Checksums + Double Writes

On Thu, Dec 22, 2011 at 12:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Having two different page formats running around in the system at the
same time is far from free; in the worst case it means that every single
piece of code that touches pages has to know about and be prepared to
cope with both versions.  That's a rather daunting prospect, from a
coding perspective and even more from a testing perspective.  Maybe
the issues can be kept localized, but I've seen no analysis done of
what the impact would be or how we could minimize it.  I do know that
we considered the idea and mostly rejected it a year or two back.

I'm looking at that now.

My feeling is it probably depends upon how different the formats are,
so given we are discussing a 4 byte addition to the header, it might
be doable.

I'm investing some time on the required analysis.

We've assumed up to now that adding a CRC to the Page Header would add
4 bytes, meaning that we are assuming a CRC-32 check field. This will
change the size of the header and thus break
pg_upgrade in a straightforward implementation. Breaking pg_upgrade is
not acceptable. We can get around this by making code dependent upon
page version, allowing mixed page versions in one executable. That
causes the PageGetItemId() macro to be page version dependent. After
review, altering the speed of PageGetItemId() is not acceptable either
(show me microbenchmarks if you doubt that). In a large minority of
cases the line pointer and the page header will be in separate cache
lines.

As Kevin points out, we have 13 bits spare in pd_flags of the
PageHeader, so we have a little wiggle room there. In addition to that,
I notice that the version component of pd_pagesize_version is itself 8
bits (the page size occupies the other 8 bits, packed together), yet we
currently use just one bit of that, since the version is 4. Version 3
was last seen in Postgres 8.2, now de-supported.

Since we don't care too much about backwards compatibility with data
from Postgres 8.2 and below, we can just assume that all pages are
version 4 unless marked otherwise with additional flags. We then use
two separate bits in pd_flags to show PD_HAS_CRC (0x0008 and 0x8000),
and completely replace the 16-bit pd_pagesize_version field with a
16-bit CRC value, rather than a 32-bit value. Why two flag bits? If
either CRC bit is set, we assume the page's CRC is supposed to be
valid. This ensures that a single bit error doesn't switch off CRC
checking when it was supposed to be active. I suggest we remove the
page size data completely; if we need to keep it we should mark 8192
bytes as the default and set flag bits for 16kB and 32kB respectively.
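
As a sketch, the flag test might look like this (only the two bit values
0x0008 and 0x8000 come from the proposal above; the macro and function
names are made up):

#include <stdbool.h>
#include <stdint.h>

#define PD_HAS_CRC_A 0x0008     /* proposed low bit */
#define PD_HAS_CRC_B 0x8000     /* proposed high bit */

/*
 * The 16-bit slot that currently holds pd_pagesize_version is reused to
 * hold the checksum itself.  Either flag bit being set means the checksum
 * must verify, so a single flipped flag bit cannot silently disable the
 * check.
 */
static bool
checksum_expected(uint16_t pd_flags)
{
    return (pd_flags & (PD_HAS_CRC_A | PD_HAS_CRC_B)) != 0;
}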

With those changes, we are able to re-organise the page header so that
we can add a 16 bit checksum (CRC), yet retain the same size of
header. Thus, we don't need to change PageGetItemId(). We would
require changes to PageHeaderIsValid() and PageInit() only. Making
these changes means we are reducing the number of bits used to
validate the page header, though we are providing a much better way of
detecting page validity, so the change is of positive benefit.

Adding a CRC was a performance concern because of the hint bit
problem, so making the value 16 bits long gives performance where it
is needed. Note that we do now have a separation of bgwriter and
checkpointer, so we have more CPU bandwidth to address the problem.
Adding multiple bgwriters is also possible.

Notably, this proposal makes CRC checking optional, so if performance
is a concern it can be disabled completely.

Which CRC algorithm to choose?
"A study of error detection capabilities for random independent bit
errors and burst errors reveals that XOR, two's complement addition,
and Adler checksums are suboptimal for typical network use. Instead,
one's complement addition should be used for networks willing to
sacrifice error detection effectiveness to reduce compute cost,
Fletcher checksum for networks looking for a balance of error
detection and compute cost, and CRCs for networks willing to pay a
higher compute cost for significantly improved error detection."
The Effectiveness of Checksums for Embedded Control Networks,
Maxino, T.C. Koopman, P.J.,
Dependable and Secure Computing, IEEE Transactions on
Issue Date: Jan.-March 2009
Available here - http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

Based upon that paper, I suggest we use Fletcher-16. The overall
concept is not sensitive to the choice of checksum algorithm, however,
and the algorithm itself could be another option: F16 or CRC. My poor
understanding of the difference is that F16 is about 20 times cheaper
to calculate, at the expense of about 1000 times worse error detection
(but still pretty good).
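
For reference, a textbook Fletcher-16 over a byte buffer looks like the
sketch below; it is shown only to illustrate the cost argument, since
neither the exact algorithm nor the bytes it would cover are settled here:

#include <stddef.h>
#include <stdint.h>

/* Plain, unoptimised Fletcher-16: two running 8-bit sums modulo 255,
 * combined into one 16-bit check value. */
static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;

    for (size_t i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t) ((sum2 << 8) | sum1);
}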

16 bit CRCs are not the strongest available, but still support
excellent error detection rates - better than 1 failure in a million,
possibly much better depending on which algorithm and block size.
That's easily good enough to detect our kind of errors.

This idea doesn't rule out the possibility of a 4 byte CRC-32 added in
the future, since we still have 11 bits spare for use as future page
version indicators. (If we did that, it is clear that we should add
the checksum as a *trailer* not as part of the header.)

So overall, I do now think it's still possible to add an optional
checksum in the 9.2 release and am willing to pursue it unless there
are technical objections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#6)
Re: Page Checksums + Double Writes

On 22.12.2011 01:43, Tom Lane wrote:

A "utility to bump the page version" is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old. What does it do when the old page is
too full to be converted? "Move some data somewhere else" might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that. At the very least it would imply that the utility has full
knowledge about every index type in the system.

Remembering back to the old discussions, my favorite scheme was to have an
online pre-upgrade utility that runs on the old cluster, moving things
around so that there is enough spare room on every page. It would do
normal heap updates to make room on heap pages (possibly causing
transient serialization failures, like all updates do), and split index
pages to make room on them. Yes, it would need to know about all index
types. And it would set a global variable to indicate that X bytes must
be kept free on all future updates, too.

Once the pre-upgrade utility has scanned through the whole cluster, you
can run pg_upgrade. After the upgrade, old page versions are converted
to the new format as pages are read in. The conversion is
straightforward, as the pre-upgrade utility has ensured that there is
enough spare room on every page.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Florian Weimer
fweimer@bfk.de
In reply to: David Fetter (#1)
Re: Page Checksums + Double Writes

* David Fetter:

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes? Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache?

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#12)
Re: Page Checksums + Double Writes

On Thu, Dec 22, 2011 at 7:44 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 22.12.2011 01:43, Tom Lane wrote:

A "utility to bump the page version" is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old.  What does it do when the old page is
too full to be converted?  "Move some data somewhere else" might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that.  At the very least it would imply that the utility has full
knowledge about every index type in the system.

Remembering back the old discussions, my favorite scheme was to have an
online pre-upgrade utility that runs on the old cluster, moving things
around so that there is enough spare room on every page. It would do normal
heap updates to make room on heap pages (possibly causing transient
serialization failures, like all updates do), and split index pages to make
room on them. Yes, it would need to know about all index types. And it would
set a global variable to indicate that X bytes must be kept free on all
future updates, too.

Once the pre-upgrade utility has scanned through the whole cluster, you can
run pg_upgrade. After the upgrade, old page versions are converted to new
format as pages are read in. The conversion is straightforward, as the
pre-upgrade utility has ensured that there is enough spare room on every page.

That certainly works, but we're still faced with pg_upgrade rewriting
every page, which will take a significant amount of time, with no
backout plan or rollback facility. I don't like that at all, which is
why I think we need an online upgrade facility if we do have to alter
page headers.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Weimer (#13)
Re: Page Checksums + Double Writes

On Thu, Dec 22, 2011 at 8:42 AM, Florian Weimer <fweimer@bfk.de> wrote:

* David Fetter:

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes?  Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache?

It's way too late in the cycle to suggest removing full page writes,
or to start coding that now. We're looking to add protection, not swap
out existing protections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#16Jesper Krogh
jesper@krogh.cc
In reply to: Florian Weimer (#13)
Re: Page Checksums + Double Writes

On 2011-12-22 09:42, Florian Weimer wrote:

* David Fetter:

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes? Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache

What are the downsides of having full_page_writes enabled, apart from
log volume? The manual mentions something about speed, but it is a bit
unclear where that would come from, since the full pages must be
somewhere in memory when being worked on anyway.

Anyway, I have an archive_command that looks like:
archive_command = 'test ! -f /data/wal/%f.gz && gzip --fast < %p > /data/wal/%f.gz'

It brings somewhere between a 50% and 75% reduction in log volume with
"no cost" on the production system (since gzip just occupies one of the
many cores on the system) and can easily keep up even during quite
heavy writes.

Recovery is a bit more tricky, because hooking gunzip into the command
there will cause the system to go through a replay log, gunzip, read
data, replay log cycle, whereas the gunzip could easily be done on the
other logfiles while replay is being done on one.

So a "straightforward" recovery will cost in recovery time, but that
can be dealt with.

Jesper
--
Jesper

#17Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Jesper Krogh (#16)
Re: Page Checksums + Double Writes

Simon Riggs wrote:

So overall, I do now think its still possible to add an optional
checksum in the 9.2 release and am willing to pursue it unless
there are technical objections.

Just to restate Simon's proposal, to make sure I'm understanding it,
we would support a new page header format number and the old one in
9.2, both to be the same size and carefully engineered to minimize
what code would need to be aware of the version. PageHeaderIsValid()
and PageInit() certainly would, and we would need some way to set,
clear (maybe), and validate a CRC. We would need a GUC to indicate
whether to write the CRC, and if present we would always test it on
read and treat it as a damaged page if it didn't match. (Perhaps
other options could be added later, to support recovery attempts, but
let's not complicate a first cut.) This whole idea would depend on
either (1) trusting your storage system never to tear a page on write
or (2) getting the double-write feature added, too.
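
Stated as code, the read-side rule I have in mind is roughly the
following sketch (the helper names are assumptions; none of them are
existing PostgreSQL functions):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed helpers: how the flag, the stored value and the computed value
 * are obtained is exactly what the patch would have to define. */
extern bool     page_checksum_expected(const uint8_t *page);
extern uint16_t page_stored_checksum(const uint8_t *page);
extern uint16_t page_compute_checksum(const uint8_t *page, size_t len);

/*
 * On read: pages that claim a checksum are always verified and treated as
 * damaged on mismatch; pages without one (e.g. those carried over by
 * pg_upgrade and not rewritten since) are accepted unchecked.
 */
static bool
page_read_ok(const uint8_t *page, size_t page_size)
{
    if (!page_checksum_expected(page))
        return true;
    return page_stored_checksum(page) ==
           page_compute_checksum(page, page_size);
}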

I see some big advantages to this over what I suggested to David.
For starters, using a flag bit and putting the CRC somewhere other
than the page header would require that each AM deal with the CRC,
exposing some function(s) for that. Simon's idea doesn't require
that. I was also a bit concerned about shifting tuple images to
convert non-protected pages to protected pages. No need to do that,
either. With the bit flags, I think there might be some cases where
we would be unable to add a CRC to a converted page because space was
too tight; that's not an issue with Simon's proposal.

Heikki was talking about a pre-convert tool. Neither approach really
needs that, although with Simon's approach it would be possible to
have a background *post*-conversion tool to add CRCs, if desired.
Things would continue to function if it wasn't run; you just wouldn't
have CRC protection on pages not updated since pg_upgrade was run.

Simon, does it sound like I understand your proposal?

Now, on to the separate-but-related topic of double-write. That
absolutely requires some form of checksum or CRC to detect torn
pages, in order for the technique to work at all. Adding a CRC
without double-write would work fine if you have a storage stack
which prevents torn pages in the file system or hardware driver. If
you don't have that, it could create a damaged page indication after
a hardware or OS crash, although I suspect that would be the
exception, not the typical case. Given all that, and the fact that
it would be cleaner to deal with these as two separate patches, it
seems the CRC patch should go in first. (And, if this is headed for
9.2, *very soon*, so there is time for the double-write patch to
follow.)

It seems to me that the full_page_writes GUC could become an
enumeration, with "off" having the current meaning, "wal" meaning
what "on" now does, and "double" meaning that the new double-write
technique would be used. (It doesn't seem to make any sense to do
both at the same time.) I don't think we need a separate GUC to tell
us *what* to protect against torn pages -- if not "off" we should
always protect the first write of a page after checkpoint, and if
"double" and write_page_crc (or whatever we call it) is "on", then we
protect hint-bit-only writes. I think. I can see room to argue that
with CRCs on we should do a full-page write to the WAL for a
hint-bit-only change, or that we should add another GUC to control
when we do this.
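
In other words, something like the sketch below, where the symbol names
are made up and PostgreSQL's actual GUC machinery is not reproduced:

/* Three-way replacement for the boolean full_page_writes GUC. */
typedef enum
{
    FPW_OFF,        /* today's "off": no torn-page protection */
    FPW_WAL,        /* today's "on": full page images in WAL */
    FPW_DOUBLE      /* new: protection via the double-write buffer */
} FullPageWritesMode;

static const struct
{
    const char         *name;
    FullPageWritesMode  value;
} fpw_options[] = {
    {"off", FPW_OFF},
    {"wal", FPW_WAL},
    {"double", FPW_DOUBLE},
};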

I'm going to take a shot at writing a patch for background hinting
over the holidays, which I think has benefit alone but also boosts
the value of these patches, since it would reduce double-write
activity otherwise needed to prevent spurious error when using CRCs.

This whole area has some overlap with spreading writes, I think. The
double-write approach seems to count on writing a bunch of pages
(potentially from different disk files) sequentially to the
double-write buffer, fsyncing that, and then writing the actual pages
-- which must be fsynced before the related portion of the
double-write buffer can be reused. The simple implementation would
be to fsync the files just written to if they required a prior
write to the double-write buffer, although fancier techniques could
be used to try to optimize that. Again, setting hint bits before
the write when possible would help reduce the impact of that.
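
The ordering described above, sketched with plain POSIX calls; the
struct and the batching policy are illustrative assumptions, not a
worked-out design:

#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct
{
    int    fd;       /* the page's home data file */
    off_t  offset;   /* its location in that file */
    char  *image;    /* the page image, checksum already computed */
    size_t len;
} dirty_page;

/*
 * 1. Stream every page image sequentially into the double-write file and
 *    fsync it, so a torn write to a data file can always be repaired.
 * 2. Only then write each page to its home location.
 * 3. fsync those files before this slice of the double-write buffer may
 *    be reused.
 */
static int
double_write_batch(int dw_fd, dirty_page *pages, int n)
{
    for (int i = 0; i < n; i++)
        if (write(dw_fd, pages[i].image, pages[i].len) < 0)
            return -1;
    if (fsync(dw_fd) < 0)
        return -1;

    for (int i = 0; i < n; i++)
        if (pwrite(pages[i].fd, pages[i].image, pages[i].len,
                   pages[i].offset) < 0)
            return -1;

    for (int i = 0; i < n; i++)
        if (fsync(pages[i].fd) < 0)
            return -1;
    return 0;
}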

-Kevin

#18Jignesh K. Shah
J.K.Shah@Sun.COM
In reply to: Jesper Krogh (#16)
Re: Page Checksums + Double Writes

On Thu, Dec 22, 2011 at 4:00 AM, Jesper Krogh <jesper@krogh.cc> wrote:

On 2011-12-22 09:42, Florian Weimer wrote:

* David Fetter:

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes?  Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache

What are the downsides of having full_page_writes enabled, apart from
log volume? The manual mentions something about speed, but it is a bit
unclear where that would come from, since the full pages must be
somewhere in memory when being worked on anyway.

I thought I would share some of my perspective on this checksum +
doublewrite work from a performance point of view.

What I currently see in our tests based on dbt2, DVDStore, etc. is
that a checksum does not impact scalability or total measured
throughput. It does increase CPU cycles, depending on the algorithm
used, but not by anything that causes problems. The doublewrite change
will be the big win for performance compared to full_page_write. For
example, compared to other databases our WAL traffic is one of the
highest, and most of it is attributed to full_page_write. The reason
full_page_write is necessary in production (at least without worrying
about the replication impact) is that if a write fails, we can recover
that whole page from the WAL as-is and just put it back out there.
(In fact I believe that's what recovery does.) The net impact, though,
is that during heavy OLTP the runtime load on WAL is high due to the
traffic, and compared to other databases the utilization is high. This
also has a huge impact on transaction response time the first time a
page is changed, which matters in all OLTP environments because the
transactions by nature touch random pages.

When we use Doublewrite with checksums, we can safely disable
full_page_write causing a HUGE reduction to the WAL traffic without
loss of reliability due to a write fault since there are two writes
always. (Implementation detail discussable). Since the double writes
themselves are sequential, bundling multiple such writes further
reduces the write time. The biggest improvement is that these writes
are now done not during TRANSACTION COMMIT but during CHECKPOINT
WRITES, which improves transaction performance drastically for OLTP
applications while you still get the reliability that is needed.

Typically, performance in terms of throughput (tps) looks like

    tps(full_page_write) << tps(no full page write)

With the double write and CRC we see

    tps(full_page_write) << tps(doublewrite) < tps(no full page write)

which is a big win for production systems wanting the reliability of
full_page_write.

Also, the side effect for response times is that they are more level,
unlike with full page writes where the response time varies from, say,
0.5ms to 5ms depending on whether the transaction needs to write a full
page into WAL or not. With doublewrite it can stay around 0.5ms rather
than showing a huge deviation in transaction performance. With this,
folks measuring the 90th-percentile response time will see a huge
relief in trying to meet their SLAs.

Also, from a WAL perspective, I like to put the WAL on its own
LUN/spindle/VMDK etc. The net result is that with the reduced WAL
traffic my utilization drops, which means the same hardware can now
handle higher WAL traffic in terms of IOPS, so WAL itself becomes less
of a bottleneck. Typically this is observed as a reduction in
transaction response times and an increase in tps until some other
bottleneck becomes the gating factor.

So overall this is a big win.

Regards,
Jignesh

#19Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Jignesh K. Shah (#18)
Re: Page Checksums + Double Writes

Jignesh Shah <jkshah@gmail.com> wrote:

When we use Doublewrite with checksums, we can safely disable
full_page_write causing a HUGE reduction to the WAL traffic
without loss of reliability due to a write fault since there are
two writes always. (Implementation detail discussable).

The "always" there surprised me. It seemed to me that we only need
to do the double-write where we currently do full page writes or
unlogged writes. In thinking about your message, it finally struck
me that this might require a WAL record to be written with the
checksum (or CRC; whatever we use). Still, writing a WAL record
with a CRC prior to the page write would be less data than the full
page. Doing double-writes instead for situations without the torn
page risk seems likely to be a net performance loss, although I have
no benchmarks to back that up (not having a double-write
implementation to test). And if we can get correct behavior without
doing either (the checksum WAL record or the double-write), that's
got to be a clear win.

-Kevin

#20Jignesh K. Shah
J.K.Shah@Sun.COM
In reply to: Kevin Grittner (#19)
Re: Page Checksums + Double Writes

On Thu, Dec 22, 2011 at 11:16 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Jignesh Shah <jkshah@gmail.com> wrote:

When we use Doublewrite with checksums, we can safely disable
full_page_write causing a HUGE reduction to the WAL traffic
without loss of reliability due to a write fault since there are
two writes always. (Implementation detail discussable).

The "always" there surprised me.  It seemed to me that we only need
to do the double-write where we currently do full page writes or
unlogged writes.  In thinking about your message, it finally struck

Currently PG only does a full page write for the first change that
makes a page dirty after a checkpoint. This scheme works because all
later changes are relative to that first full-page image, so when a
checkpoint write fails the page can be recreated by using the full page
write plus all the delta changes from WAL.

In the double write implementation, every checkpoint write is double
written, so if the first write to the doublewrite area fails then the
original page is not corrupted, and if the second write to the actual
data page fails, then it can be recovered from the earlier write. Now,
while it is true that there are 2X writes during checkpoint, I can
argue that there are the same 2X writes right now, except that 1X of
the writes goes to WAL DURING TRANSACTION COMMIT. Also, since the
doublewrite area is generally written in its own file, it is
essentially sequential, so it doesn't have the same write latencies as
the actual checkpoint write. So if you look at the net amount of the
writes, it is the same. For unlogged tables, even if you do the
doublewrite it is not much of a penalty, even though they may not have
been logged in the WAL before. By doing the double write for them, they
are still safe and gain resilience, even though it is not required. The
net result is that the underlying page is never "irrecoverable" due to
failed writes.

me that this might require a WAL record to be written with the
checksum (or CRC; whatever we use).  Still, writing a WAL record
with a CRC prior to the page write would be less data than the full
page.  Doing double-writes instead for situations without the torn
page risk seems likely to be a net performance loss, although I have
no benchmarks to back that up (not having a double-write
implementation to test).  And if we can get correct behavior without
doing either (the checksum WAL record or the double-write), that's
got to be a clear win.

I am not sure why one would want to write the checksum to WAL.
As for the double writes, in fact there is no net loss because
(a) the writes to the doublewrite area are sequential, so the write
calls are relatively fast and in fact do not cause any latency
increase to transactions, unlike full_page_write;
(b) it can be moved to a different location so there is no stress on
the default tablespace, if you are worried about that spindle handling
2X writes, which in the full_page_writes case is mitigated by moving
pg_xlog to a different spindle;

and my own tests support that the net result is almost as fast as
full_page_write=off, though not quite the same due to the extra write
(which gives you the desired reliability), and way better than
full_page_write=on.

Regards,
Jignesh


#21Robert Haas
robertmhaas@gmail.com
In reply to: Jignesh K. Shah (#20)
#22Jignesh K. Shah
J.K.Shah@Sun.COM
In reply to: Robert Haas (#21)
#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Kevin Grittner (#17)
#24Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#23)
#25Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#24)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#27)
#29Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#26)
#30Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#27)
#31Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Jeff Janes (#29)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#29)
#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#23)
#34Jeff Davis
pgsql@j-davis.com
In reply to: Kevin Grittner (#17)
#35Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Davis (#34)
#36Jeff Davis
pgsql@j-davis.com
In reply to: Merlin Moncure (#35)
#37Bruce Momjian
bruce@momjian.us
In reply to: Merlin Moncure (#35)
#38Merlin Moncure
mmoncure@gmail.com
In reply to: Bruce Momjian (#37)
#39Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Kevin Grittner (#31)
#40Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Jim Nasby (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#40)
#42Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#41)
#43Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Kevin Grittner (#40)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#42)
#45Florian Pflug
fgp@phlo.org
In reply to: Robert Haas (#41)
#46Merlin Moncure
mmoncure@gmail.com
In reply to: Florian Pflug (#45)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Florian Pflug (#45)
#48Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Kevin Grittner (#31)
#49Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Benedikt Grundmann (#48)