Why is checkpoint so costly?
Folks,
Going over some performance test results at OSDL, our single greatest
performance issue seems to be checkpointing. No matter how I fiddle
with it, checkpoints seem to cost us 1/2 of our throughput while they're
taking place. Overall, checkpointing costs us about 25% of our
performance on OLTP workloads.
Example: http://khack.osdl.org/stp/302671/results/0/
Can we break down everything that happens during a checkpoint so that we
can see where this huge cost is coming from? Checkpointing should be
limited to fsyncing to disk and marking WAL files as recyclable, but there
seems to be something more.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
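As a back-of-the-envelope reading of the two figures above (throughput halved while a checkpoint runs, about 25% lost overall), the sketch below solves for the share of run time spent in the degraded state. It assumes throughput drops only during that state and is otherwise constant; the function name is made up for illustration and is not part of any OSDL tooling.

```python
def degraded_time_fraction(overall_loss, loss_while_degraded):
    """Share of wall-clock time spent at reduced throughput.

    Assumes a two-level model: full throughput T normally, and
    T * (1 - loss_while_degraded) while degraded, so that
    overall_loss = fraction * loss_while_degraded.
    """
    if not 0 < loss_while_degraded <= 1:
        raise ValueError("loss_while_degraded must be in (0, 1]")
    return overall_loss / loss_while_degraded


# Figures quoted in the message: 25% overall, 1/2 while checkpointing.
print(degraded_time_fraction(0.25, 0.5))  # 0.5
```

Under that simple model the system would be running at reduced throughput about half the time, which suggests either very frequent checkpoints or a slowdown that outlasts the checkpoint itself.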
Josh Berkus <josh@agliodbs.com> writes:
Can we break down everything that happens during a checkpoint so that we
can see where this huge cost is coming from? Checkpointing should be
limited to fsyncing to disk and marking WAL files as recyclable, but there
seems to be something more.
I already asked you to measure the thing I think is the likely candidate
(to wit, dumping full page images into WAL).
regards, tom lane
On Tue, Jun 21, 2005 at 12:00:56PM -0700, Josh Berkus wrote:
Folks,
Going over some performance test results at OSDL, our single greatest
performance issue seems to be checkpointing. No matter how I fiddle
with it, checkpoints seem to cost us 1/2 of our throughput while they're
taking place. Overall, checkpointing costs us about 25% of our
performance on OLTP workloads.
Example: http://khack.osdl.org/stp/302671/results/0/
Can we break down everything that happens during a checkpoint so that we
can see where this huge cost is coming from? Checkpointing should be
limited to fsyncing to disk and marking WAL files as recyclable, but there
seems to be something more.
Not only do you have to fsync the files; you have to write them first as
well. If the bgwriter is not able to keep up then at checkpoint time
there is a lot of writing to do. One idea is to fiddle with bgwriter
settings, or did you do that already? I see this for the URL above:
bgwriter_delay | 200
bgwriter_maxpages | 100
bgwriter_percent | 1
Maybe it should be more aggressive.
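To put the three settings above in perspective, here is a rough sketch of the upper bound they place on bgwriter output. It assumes the 8.0-era meaning of the parameters as I understand them: each round writes at most bgwriter_maxpages buffers and at most bgwriter_percent percent of the currently dirty buffers (rounded up), with bgwriter_delay milliseconds of sleep between rounds. The function name and the dirty-buffer counts are hypothetical, and the time spent doing the writes is ignored.

```python
def bgwriter_max_pages_per_second(delay_ms, maxpages, percent, dirty_buffers):
    """Upper bound on buffers written per second by the bgwriter.

    Assumes each round writes min(maxpages, ceil(percent% of dirty
    buffers)) and that rounds are separated only by the sleep delay.
    """
    # Integer ceiling of dirty_buffers * percent / 100.
    by_percent = -(-dirty_buffers * percent // 100)
    per_round = min(maxpages, by_percent)
    rounds_per_second = 1000 / delay_ms
    return per_round * rounds_per_second


# Settings from the test run; dirty-buffer counts are made-up examples.
print(bgwriter_max_pages_per_second(200, 100, 1, 5000))   # 250.0
print(bgwriter_max_pages_per_second(200, 100, 1, 20000))  # 500.0
```

With 8 kB pages, 500 pages per second is roughly 4 MB/s at most, so on a busy OLTP load a good share of the dirty buffers could still be left for the checkpoint to write.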
Another thing to blame is the dump-whole-pages-after-checkpoint
business. Maybe the load you are seeing is not completely during
checkpoint, but right after it as well. How do you tell from the
results that the checkpoint is complete?
--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"Early and provident fear is the mother of safety" (E. Burke)
Alvaro, Tom,
bgwriter_delay | 200
bgwriter_maxpages | 100
bgwriter_percent | 1
Maybe it should be more aggressive.
Yeah, a bgwriter progression is running now. I don't expect it to make
much difference. Most of the sync impact is syncing the FS cache, which the
bgwriter doesn't touch.
Another thing to blame is the dump-whole-pages-after-checkpoint
business. Maybe the load you are seeing is not completely during
checkpoint, but right after it as well. How do you tell from the
results that the checkpoint is complete?
I can't relate that to the performance numbers, unfortunately. I think
that the paging is probably the cause, but I don't know what to do about
it.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
On Tue, Jun 21, 2005 at 02:45:32PM -0700, Josh Berkus wrote:
Another thing to blame is the dump-whole-pages-after-checkpoint
business. Maybe the load you are seeing is not completely during
checkpoint, but right after it as well. How do you tell from the
results that the checkpoint is complete?
I can't relate that to the performance numbers, unfortunately. I think
that the paging is probably the cause, but I don't know what to do about
it.
Tom gave instructions in a mail (to you I think) to patch the xlog.c
file so page dumps stop happening. I'm too lazy to search for that mail
(I deleted my local copy) but if you find it in your mailbox, resend it
to me and I'll produce a patch for you to test. (I'd produce the patch
myself but I don't know the xlog code well enough to find the right spot
quickly.)
--
Alvaro Herrera (<alvherre[a]surnet.cl>)
Jason Tesser: You might not have understood me or I am not understanding you.
Paul Thomas: It feels like we're 2 people divided by a common language...
Alvaro,
Tom gave instructions in a mail (to you I think) to patch the xlog.c
file so page dumps stop happening. I'm too lazy to search for that mail
(I deleted my local copy) but if you find it in your mailbox, resend it
to me and I'll produce a patch for you to test. (I'd produce the patch
myself but I don't know the xlog code well enough to find the right spot
quickly.)
Found it. Testing now.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Folks,
Going over some performance test results at OSDL, our single greatest
performance issue seems to be checkpointing. No matter how I fiddle
with it, checkpoints seem to cost us 1/2 of our throughput while they're
taking place. Overall, checkpointing costs us about 25% of our
performance on OLTP workloads.
I think this is a silly statement. *Of course* checkpointing is a big
performance "issue". Checkpointing basically *is* the database's job.
It stores data; checkpointing is the name for the process of storing the data.
Looking at the performance without counting the checkpoint time is cheating:
the database hasn't actually completed processing the data; it's still sitting
in the pipeline of the WAL log.
The question should be why is there any time when a checkpoint *isn't*
happening? For maximum performance the combination of bgwriter (basically
preemptive checkpoint i/o) and the actual checkpoint i/o should be executing
at a more or less even pace throughout the time interval between checkpoints.
I do have one suggestion. Is the WAL log on a separate set of drives from the
data files? If not then the checkpoint (and bgwriter i/o) will hurt WAL log
performance by forcing the drive heads to move away from their sequential
writing of WAL logs.
That said, does checkpointing (and bgwriter i/o) require rereading the WAL
logs? If so, and the buffers aren't found in cache, that alone will cause
some increase in seek latency, even if the WAL does have a dedicated
set of drives.
--
greg
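The "even pace" idea above can be made concrete with a small sketch: given the number of buffers dirtied per checkpoint interval, it computes the steady write rate that would spread them across the whole interval instead of dumping them at checkpoint time. The function name and the example numbers are hypothetical, and the 8 kB page size is the PostgreSQL default rather than something stated in the thread.

```python
def even_write_rate(dirty_pages, checkpoint_interval_s, page_size_kb=8):
    """Steady write rate needed to flush dirty_pages evenly over one
    checkpoint interval. Returns (pages per second, kB per second)."""
    if checkpoint_interval_s <= 0:
        raise ValueError("checkpoint_interval_s must be positive")
    pages_per_second = dirty_pages / checkpoint_interval_s
    return pages_per_second, pages_per_second * page_size_kb


# Made-up example: 60000 buffers dirtied over a 300-second interval.
print(even_write_rate(60000, 300))  # (200.0, 1600.0)
```

If combined bgwriter and checkpoint output falls well below such a rate between checkpoints, the remainder has to be written in a burst when the checkpoint runs.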
Greg Stark <gsstark@mit.edu> writes:
The question should be why is there any time when a checkpoint *isn't*
happening? For maximum performance the combination of bgwriter (basically
preemptive checkpoint i/o) and the actual checkpoint i/o should be executing
at a more or less even pace throughout the time interval between checkpoints.
I think Josh's complaint has to do with the fact that performance
remains visibly affected after the checkpoint is over. (It'd be nice
if those TPM graphs could be marked with the actual checkpoint begin
and end instants, so we could confirm or deny that we are looking at a
post-checkpoint recovery curve and not some very weird behavior inside
the checkpoint.) It's certainly true that tuning the bgwriter ought to
help in reducing the amount of I/O done by a checkpoint, but why is
there a persistent effect?
That said, does checkpointing (and bgwriter i/o) require rereading the WAL
logs?
No. In fact, the WAL is never read at all, except during actual
post-crash recovery.
regards, tom lane
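The suggestion above to mark the TPM graphs with checkpoint begin and end instants could be approximated after the fact if those instants were logged. The sketch below is only an illustration under that assumption: the sample and checkpoint tuples are hypothetical inputs, not a format the OSDL test harness is known to produce.

```python
def label_samples(samples, checkpoints):
    """Tag each throughput sample relative to the checkpoint windows.

    samples: list of (time_s, tpm)
    checkpoints: list of (begin_s, end_s)
    Returns (time_s, tpm, phase, seconds_since_last_checkpoint_end),
    where phase is "during" or "outside" and the last field is None
    while a checkpoint is running or before the first one has ended.
    """
    labeled = []
    for t, tpm in samples:
        during = any(begin <= t <= end for begin, end in checkpoints)
        ended = [end for _, end in checkpoints if end < t]
        since_end = None if during or not ended else t - max(ended)
        labeled.append((t, tpm, "during" if during else "outside", since_end))
    return labeled


# Made-up data: one checkpoint running from t=60s to t=120s.
for row in label_samples([(0, 1000), (70, 500), (130, 700), (400, 1000)],
                         [(60, 120)]):
    print(row)
```

Low throughput on samples tagged "outside" with a small seconds-since-end value would point to a post-checkpoint recovery curve rather than to the checkpoint's own I/O.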