Checkpoint cost, looks like it is WAL/CRC
Tom, All:
Ok, finally managed through the persistent efforts of Mark Wong to get some
tests through. Here are two tests with the CRC and WAL buffer checking
completely cut out of the code, as Tom suggested:
5-min checkpoint:
http://khack.osdl.org/stp/302738/results/0/
http://khack.osdl.org/stp/302706/results/0/
60-min checkpoint:
http://khack.osdl.org/stp/302739/results/0/
(please note that OSDL is having technical difficulties and some links may
not work)
This is the performance profile I'd expect and want to see, and the
frequency for checkpoints doesn't affect the overall performance at all.
Contrast it with these:
5-min checkpoint:
http://khack.osdl.org/stp/302671/results/0/
I don't have a 60-minute checkpoint for comparison because of failures on
the STP :-(
So, now that we know what the performance bottleneck is, how do we fix it?
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Ok, finally managed through the persistent efforts of Mark Wong to get some
tests through. Here are two tests with the CRC and WAL buffer checking
completely cut out of the code, as Tom suggested:
Uh, what exactly did you cut out? I suggested dropping the dumping of
full page images, but not removing CRCs altogether ...
regards, tom lane
Tom,
Uh, what exactly did you cut out? I suggested dropping the dumping of
full page images, but not removing CRCs altogether ...
Attached is the patch I used. (it's a -Urn patch 'cause that's what STP
takes)
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Attachments:
xlogomit.patch (text/x-diff), +0/-38 lines
Josh Berkus <josh@agliodbs.com> writes:
Uh, what exactly did you cut out? I suggested dropping the dumping of
full page images, but not removing CRCs altogether ...
Attached is the patch I used.
OK, thanks for the clarification. So it does seem that dumping full
page images is a pretty big hit these days. (In defense of the original
idea, I believe it was not such a hit at the time --- but as we continue
to improve performance, things that weren't originally at the top of the
profile become significant.)
It seems like we have two basic alternatives:
1. Offer a GUC to turn off full-page-image dumping, which you'd use only
if you really trust your hardware :-(
2. Think of a better defense against partial-page writes.
I like #2, or would if I could think of a better defense. Ideas anyone?
regards, tom lane
Tom,
1. Offer a GUC to turn off full-page-image dumping, which you'd use only
if you really trust your hardware :-(
Are these just WAL pages? Or database pages as well?
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
1. Offer a GUC to turn off full-page-image dumping, which you'd use only
if you really trust your hardware :-(
Are these just WAL pages? Or database pages as well?
Database pages. The current theory is that we can completely
reconstruct from WAL data every page that's been modified since the
last checkpoint. So the first write of any page after a checkpoint
dumps a full image of the page into WAL; subsequent writes only write
differences.
This is nice and secure ... at least when you are using hardware that
guarantees write ordering ... otherwise it's probably mostly useless
overhead. Still, I'd not like to abandon the contract that if the disk
does what it is supposed to do then we will do what we are supposed to.
regards, tom lane
2. Think of a better defense against partial-page writes.
I like #2, or would if I could think of a better defense.
Ideas anyone?
FWIW, MSSQL deals with this using "Torn Page Detection". This is off by
default (no check at all!), but can be enabled on a per-database level.
Note that it only *detects* torn pages. If it finds one, it won't start,
and will tell you to recover from backup. It can't automatically recover. I
would assume this greatly decreases the amount of data you have to
save...
From the BOL:
"
This option allows SQL Server to detect incomplete I/O operations caused
by power failures or other system outages. When true, it causes a bit to
be flipped for each 512-byte sector in an 8-kilobyte (KB) database page
whenever the page is written to disk. If a bit is in the wrong state
when the page is later read by SQL Server, then the page was written
incorrectly; a torn page is detected. Torn pages are usually detected
during recovery because any page that was written incorrectly is likely
to be read by recovery.
Although SQL Server database pages are 8 KB, disks perform I/O
operations using a 512-byte sector. Therefore, 16 sectors are written
per database page. A torn page can occur if the system fails (for
example, due to power failure) between the time the operating system
writes the first 512-byte sector to disk and the completion of the 8 KB
I/O operation. If the first sector of a database page is successfully
written before the failure, the database page on disk will appear as
updated, although it may not have succeeded.
Using battery-backed disk controller caches can ensure that data is
successfully written to disk or not written at all. In this case, do not
set torn page detection to true, for it is not needed.
"
and some FAQs:
"
Q. Does enabling the "torn page" database option add any measurable
performance overhead to a server?
A. The torn page option does not add much CPU cost at all, but it can
increase contention on "hot" pages. With torn page detection off, a page
can be accessed while it is being written to disk. This is not true if
torn page detection is on.
Q. When does SQL Server check for torn pages? At startup? Any time it
reads a page from disk?
A. Torn page detection is done whenever a page is read from disk. In
practice, this is likely to be during recovery, because any page on
which the write did not complete during normal operations is very likely
to be read by recovery (except for non-logged operations, such as index
creation, bcp, and so on).
Q. What happens when SQL Server detects a torn page?
A. When a torn page is detected, a severe I/O error is raised. This
error will close the connection. The database is only marked suspect if
the torn page is detected during recovery.
Q. How can I recover from torn pages?
A. Restoring the database from a backup and rolling the transaction log
forward should correct the problem with no data loss.
Q. What situations are most likely to cause torn pages?
A. Lab tests have shown that torn pages are quite likely to happen when
disk writes are occurring and power is lost. If you do not have a
battery backup or uninterruptible power supply (UPS), you should
consider enabling this option.
"
Perhaps a GUC that would let you chose between "full recovery, detection
or none at all"? Though that might complicate the code...
//Magnus
Tom,
Database pages. The current theory is that we can completely
reconstruct from WAL data every page that's been modified since the
last checkpoint. So the first write of any page after a checkpoint
dumps a full image of the page into WAL; subsequent writes only write
differences.
What I'm confused about is that this shouldn't be anything new for 8.1. Yet
8.1 has *worse* performance on the STP machines than 8.0 does, and it's
pretty much entirely due to this check.
This is nice and secure ... at least when you are using hardware that
guarantees write ordering ... otherwise it's probably mostly useless
overhead. Still, I'd not like to abandon the contract that if the disk
does what it is supposed to do then we will do what we are supposed to.
Given the huge performance difference (30%), I think we have to give an option
to turn it off. So DBAs whose machines are in danger of being shut off a lot
can have it on, and the more performance-sensitive can turn it off.
One thing I am confused about, though: if the whole pages are in the database,
why do we need a full copy in WAL instead of just the diffs?
--
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
What I'm confused about is that this shouldn't be anything new for 8.1. Yet
8.1 has *worse* performance on the STP machines than 8.0 does, and it's
pretty much entirely due to this check.
That's simply not believable --- better recheck your analysis. If 8.1
is worse it's not because of page-dumping, because we are more efficient
on that than before not less so. Perhaps there's another issue?
regards, tom lane
Tom,
What I'm confused about is that this shouldn't be anything new for
8.1. Yet 8.1 has *worse* performance on the STP machines than 8.0
does, and it's pretty much entirely due to this check.
That's simply not believable --- better recheck your analysis. If 8.1
is worse it's not because of page-dumping, because we are more efficient
on that than before not less so. Perhaps there's another issue?
Ach, I want to run head-to-head tests, but the system is down again.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
""Magnus Hagander"" <mha@sollentuna.net> writes
FWIW, MSSQL deals with this using "Torn Page Detection". This is off by
default (no check at all!), but can be abled on a per-database level.
Note that it only *detects* torn pages. If it finds one, it won't start
and tell you to recover from backup. It can't automatically recover. I
would assume this greatly decreases the amount of data you have to
save...
After reading the long discussion on torn page detection and many related
issues in doc/TODO.details/wal, I believe we need to carefully analyze the
sequence of events in a partial write.
case 1: if a partial write just writes some sectors of an 8K page, then it is
ok - assuming we will replay xlog in any case without comparing the page
header LSN and the xlog LSN (currently we do not do so);
case 2: if a partial write destroyed some old data of a page, which is not
in our xlog, we will worry about this;
So the main issue is case 2. If the OS could schedule atomic sector-wise
writes, then case 2 could never happen?
Regards,
Qingqing
Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
Uh, what exactly did you cut out? I suggested dropping the dumping of
full page images, but not removing CRCs altogether ...
Attached is the patch I used.
OK, thanks for the clarification. So it does seem that dumping full
page images is a pretty big hit these days. (In defense of the original
idea, I believe it was not such a hit at the time --- but as we continue
to improve performance, things that weren't originally at the top of the
profile become significant.)
It seems like we have two basic alternatives:
1. Offer a GUC to turn off full-page-image dumping, which you'd use only
if you really trust your hardware :-(
2. Think of a better defense against partial-page writes.
I like #2, or would if I could think of a better defense. Ideas anyone?
I have an idea! Currently we write the backup pages (copies of pages
modified since the last checkpoint) when we write the WAL changes as
part of the commit. See the XLogCheckBuffer() call in XLogInsert().
However, it seems that this is not the most efficient time to be writing
the backup pages. It would be more efficient to write these just before
the page is written to disk (not when it is modified in the buffer
cache). This would allow the background writer to be writing most of
the backup pages, rather than the transaction committer.
A further optimization would be to write the backup pages to the same 8k
file over and over again rather than adding I/O to pg_xlog WAL files.
Comments?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I have an idea! Currently we write the backup pages (copies of pages
modified since the last checkpoint) when we write the WAL changes as
part of the commit. See the XLogCheckBuffer() call in XLogInsert().
Can someone explain exactly what problem is being defeated by writing whole
pages to the WAL log? Like, if a page is half flushed to disk and contains half
the old data and half of the new data, what exactly would go wrong with
recovery? When postgres sees the write, why would it have trouble recovering
the correct contents for the page based on the half-written page and the
regular WAL entry?
A further optimization would be to write the backup pages to the same 8k
file over and over again rather than adding I/O to pg_xlog WAL files.
This is an interesting idea. But it's not necessarily universally a win. For
users with a battery backed cache it would be absolutely ideal and a huge win.
It would avoid pushing other pages out of cache and avoid doing extra i/o
syncing the no longer relevant data to the actual disk.
But for users without a battery backed cache I don't think it would be a win.
It would mean another data file that ideally would be on a segregated set of
spindles to avoid any unnecessary seeks. Currently the ideal postgres setup
has WAL on one set of spindles, data on another set of spindles, and OS
writable partitions like /var on an entirely separate set of spindles. That
takes at least 6 drives if you want mirrored drives. This would raise the
minimum to 8 drives.
--
greg
On Sun, 3 Jul 2005 04:47 pm, Greg Stark wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I have an idea! Currently we write the backup pages (copies of pages
modified since the last checkpoint) when we write the WAL changes as
part of the commit. See the XLogCheckBuffer() call in XLogInsert().
Can someone explain exactly what problem is being defeated by writing whole
pages to the WAL log? Like, if a page is half flushed to disk and contains half
the old data and half of the new data, what exactly would go wrong with
recovery? When postgres sees the write, why would it have trouble recovering
the correct contents for the page based on the half-written page and the
regular WAL entry?
Problem we are solving:
Checkpoint happens. (Ensuring that ALL data makes it to the disk.)
- This means we don't need the information in WAL before the checkpoint.
We write some updates to the db, changing some pages.
The power is pulled and one page is half written. There is no way to know if the page is in a valid state.
There is no way to recover the page from WAL, because unless we wrote the full page into WAL, we don't have a "starting point" for modification.
That's probably very unclear, but that is the idea.
Regards
Russell Smith
Greg Stark <gsstark@mit.edu> writes:
Can someone explain exactly what the problem being defeated by writing whole
pages to the WAL log?
Partial writes. Without the full-page image, we do not have enough
information in WAL to reconstruct the correct page contents.
A further optimization would be to write the backup pages to the same 8k
file over and over again rather than adding I/O to pg_xlog WAL files.
How does that work, and why is it a win compared to doing the same
amount of I/O to WAL?
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
Greg Stark <gsstark@mit.edu> writes:
Can someone explain exactly what the problem being defeated by writing whole
pages to the WAL log?
Partial writes. Without the full-page image, we do not have enough
information in WAL to reconstruct the correct page contents.
Sure, but why not?
If a 8k page contains 16 low level segments on disk and the old data is
AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would
contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page?
If the actual write only got out AAABAAACAAAAAAAA the resulting page is
garbage but why isn't what the WAL has enough information to reconstruct the
new version?
I do worry that a drive could write AAABAAACXYZKWBFH if it loses power in the
middle of the write, though. That would be bad.
--
greg
Greg Stark <gsstark@mit.edu> writes:
Tom Lane <tgl@sss.pgh.pa.us> writes:
Partial writes. Without the full-page image, we do not have enough
information in WAL to reconstruct the correct page contents.
Sure, but why not?
If a 8k page contains 16 low level segments on disk and the old data is
AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would
contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page?
It might contain parts of it ... scattered across a large number of WAL
entries ... but I don't think that's enough to reconstruct the page.
As an example, a btree insert WAL record will say "insert this tuple
at position N, shifting the other entries accordingly"; that does not
give you the ability to reconstruct entries that shifted across sector
boundaries, as they may not be present in the on-disk data of either
sector. You're also going to have fairly serious problems interpreting
the page contents if what's on disk includes the effects of multiple
WAL records beyond the record you are currently looking at.
We could possibly do it if we added more information to the WAL records,
but that strikes me as a net loss: essentially it would pay the penalty
all the time instead of only on the first write after a checkpoint.
Also, you are assuming that the content of each sector is uniquely ---
and determinably --- either "old data" or "new data", not for example
"unreadable because partially written".
regards, tom lane
On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
Uh, what exactly did you cut out? I suggested dropping the dumping of
full page images, but not removing CRCs altogether ...
Attached is the patch I used.
OK, thanks for the clarification. So it does seem that dumping full
page images is a pretty big hit these days.
Yes, the performance results are fairly damning. That's a shame; I had
convinced myself that the CRC32 and block-hole compression was enough.
The 50% performance gain isn't the main thing for me. The 10 sec drop in
response time immediately after checkpoint is the real issue. Most sites
are looking for good response as an imperative, rather than throughput.
Overall, IMHO we must do something about this for 8.1. Possibly
something more for 8.2 also, but definitely *something* now.
(In defense of the original
idea, I believe it was not such a hit at the time --- but as we continue
to improve performance, things that weren't originally at the top of the
profile become significant.)
No defense required. As you say, it was the best idea at the time.
It seems like we have two basic alternatives:
1. Offer a GUC to turn off full-page-image dumping, which you'd use only
if you really trust your hardware :-(
2. Think of a better defense against partial-page writes.
I like #2, or would if I could think of a better defense. Ideas anyone?
Well, I'm all for #2 if we can think of one that will work. I can't.
Option #1 seems like the way forward, but I don't think it is
sufficiently safe just to have the option to turn things off.
With wal_changed_pages = off, *any* crash would possibly require an
archive recovery, or a replication rebuild. It's good that we now have
PITR, but we do also have other options for availability. Users of
replication could well be amongst the first to try out this option.
The problem is that you just wouldn't *know* whether the "possibly" was a
yes or a no. The temptation would be to assume "no" and just continue,
which could lead to data loss. And that would lead to a lack of trust in
PostgreSQL and eventual reputational loss. Would I do an archive
recovery, or would I trust that RAID array had written everything
properly? With an irate Web Site Manager saying "you think? it might?
maybe? You mean you don't know???"
If we pick option #1, it *must* also include a way of deciding whether a
crash recovery has succeeded, or not. Other commentators have mentioned
the torn_pages option for other DBMS. It seems we also need an
indicator. That should be a non-optional feature of 8.1 and then perhaps
an optional feature in later releases when we have more experience to
say that turning it off is OK in some circumstances.
We could implement the torn-pages option, but that seems a lot of work.
Another way of implementing a tell-tale would be to append the LSN again
as a data page trailer as the last 4 bytes of the page. Thus the LSN
would be both the first and last item on the data page. Any partial
write would fail to update the LSN trailer and we would be able to see
that a page was torn. That's considerably easier than trying to write a
torn page tell-tale to each 512 byte sector of a page as SQLServer does.
During recovery, if a full page image is not available, we would read
the page from the database and check that the first and last LSNs match.
If they do, then the page is not torn and recovery can be successful. If
they do not match, then we attempt to continue recovery, but issue a
warning that torn page has been detected and a full archive recovery is
recommended. It is likely that the recovery itself will fail almost
immediately following this, since changes will try to be made to a page
in the wrong state to receive it, but there's no harm in trying....
Like this specific idea or not, I'm saying that we need a tell-tale: a
way of knowing whether we have a torn page, or not. That way we can
safely continue to rely upon crash recovery.
Tom, I think you're the only person that could or would be trusted to
make such a change. Even past the 8.1 freeze, I say we need to do
something now on this issue.
Best Regards, Simon Riggs
Simon Riggs wrote:
On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
Uh, what exactly did you cut out? I suggested dropping the dumping of
full page images, but not removing CRCs altogether ...
Attached is the patch I used.
OK, thanks for the clarification. So it does seem that dumping full
page images is a pretty big hit these days.
Yes, the performance results are fairly damning. That's a shame; I had
convinced myself that the CRC32 and block-hole compression was enough.
The 50% performance gain isn't the main thing for me. The 10 sec drop in
response time immediately after checkpoint is the real issue. Most sites
are looking for good response as an imperative, rather than throughput.
Yep.
No defense required. As you say, it was the best idea at the time.
It seems like we have two basic alternatives:
1. Offer a GUC to turn off full-page-image dumping, which you'd use only
if you really trust your hardware :-(
2. Think of a better defense against partial-page writes.
I like #2, or would if I could think of a better defense. Ideas anyone?
Well, I'm all for #2 if we can think of one that will work. I can't.
Option #1 seems like the way forward, but I don't think it is
sufficiently safe just to have the option to turn things off.
Well, I added #1 yesterday as 'full_page_writes', and it has the same
warnings as fsync (namely, on crash, be prepared to recover or check
your system thoroughly).
As far as #2, my posted proposal was to write the full pages to WAL when
they are written to the file system, and not when they are first
modified in the shared buffers --- the goal being that it will even out
the load, and it will happen in a non-critical path, hopefully by the
background writer or at checkpoint time.
With wal_changed_pages= off *any* crash would possibly require an
archive recovery, or a replication rebuild. It's good that we now have
PITR, but we do also have other options for availability. Users of
replication could well be amongst the first to try out this option.
Seems it is similar to fsync in risk, which is not a new option.
The problem is that you just wouldn't *know* whether the possibly was
yes or no. The temptation would be to assume "no" and just continue,
which could lead to data loss. And that would lead to a lack of trust in
PostgreSQL and eventual reputational loss. Would I do an archive
recovery, or would I trust that RAID array had written everything
properly? With an irate Web Site Manager saying "you think? it might?
maybe? You mean you don't know???"
That is a serious problem, but the same problem we have in turning off
fsync.
During recovery, if a full page image is not available, we would read
the page from the database and check that the first and last LSNs match.
If they do, then the page is not torn and recovery can be successful. If
they do not match, then we attempt to continue recovery, but issue a
warning that torn page has been detected and a full archive recovery is
recommended. It is likely that the recovery itself will fail almost
immediately following this, since changes will try to be made to a page
in the wrong state to receive it, but there's no harm in trying....
I like the idea of checking the page during recovery so we don't have to
check all the pages, just certain pages.
Like this specific idea or not, I'm saying that we need a tell-tale: a
way of knowing whether we have a torn page, or not. That way we can
safely continue to rely upon crash recovery.
Tom, I think you're the only person that could or would be trusted to
make such a change. Even past the 8.1 freeze, I say we need to do
something now on this issue.
I think if we document full_page_writes as similar to fsync in risk, we
are OK for 8.1, but if something can be done easily, it sounds good.
Now that we have a GUC we can experiment with the full page write load
and see how it can be improved.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote:
Well, I added #1 yesterday as 'full_page_writes', and it has the same
warnings as fsync (namely, on crash, be prepared to recover or check
your system thoroughly).
Yes, which is why I comment now that the GUC alone is not enough.
There is no way to "check your system thoroughly". If there is a certain
way of knowing torn pages had *not* occurred, then I would be happy.
As far as #2, my posted proposal was to write the full pages to WAL when
they are written to the file system, and not when they are first
modified in the shared buffers --- the goal being that it will even out
the load, and it will happen in a non-critical path, hopefully by the
background writer or at checkpoint time.
The full page image must be written before the changes to the page, so
that they are available sequentially in the log for replay. The log and
the database are not connected, so we cannot do it that way. If the page
is written out of sequence from the changes to it, how would recovery
know where to get the page from?
ISTM there is mileage in your idea of trying to shift the work to
another time. My thought is "which blocks exactly are the ones being
changed?". Maybe that would lead to a reduction.
With wal_changed_pages= off *any* crash would possibly require an
archive recovery, or a replication rebuild. It's good that we now have
PITR, but we do also have other options for availability. Users of
replication could well be amongst the first to try out this option.
Seems it is similar to fsync in risk, which is not a new option.
Risk is not acceptable. We must have certainty, either way.
Why have two GUCs? Why not just have one GUC that does both at the same
time? When would you want one but not the other?
risk_data_loss_to_gain_performance = true
I think if we document full_page_writes as similar to fsync in risk, we
are OK for 8.1, but if something can be done easily, it sounds good.
Documenting something simply isn't enough. I simply cannot advise
anybody ever to use the new GUC. If their data were low value, they
wouldn't even be using PostgreSQL; they'd use a non-transactional DBMS.
I agree we *must* have the GUC, but we also *must* have a way for crash
recovery to tell us for certain that it has definitely worked, not just
maybe worked.
Best regards, Simon Riggs