Huge iowait during checkpoint finish

Started by Anton Belyaev · over 16 years ago · 9 messages · general
#1Anton Belyaev
anton.belyaev@gmail.com

Hello dear list members,

I have a strange problem with my new 8.4 deployment that I never
encountered on my previous 8.3 deployment.
IOwait values are extremely high exactly when Postgres finishes a checkpoint,
while during the checkpoint itself (which is quite lengthy) IOwait is very low.
Why does this happen, and how can I fix it?

Take a look at this chart:

http://kpox.s3.amazonaws.com/cpu-day.png

The latest (rightmost) IOwait peak corresponds to this log entry:

2010-01-08 18:40:36 CET LOG: checkpoint complete: wrote 46357 buffers
(35.4%); 0 transaction log file(s) added, 0 removed, 40 recycled;
write=2502.800 s, sync=79.972 s, total=2583.184 s

The IOwait peak starts at about 18:39:30 and finishes at 18:40:30.
I think all the IOwait comes during the sync time, which is 80 s
according to the log entry.
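For scale, the numbers in that log line work out roughly as follows (a quick sketch, assuming the default 8 kB PostgreSQL block size):

```python
# Rough arithmetic on the checkpoint log line above
# (assumes the default 8 kB PostgreSQL block size).
buffers_written = 46357
write_s, sync_s, total_s = 2502.800, 79.972, 2583.184

mb_written = buffers_written * 8 / 1024   # ~362 MB total
write_rate = mb_written / write_s         # ~0.14 MB/s trickle during write
sync_share = sync_s / total_s             # sync is ~3% of checkpoint time

print(f"{mb_written:.0f} MB written, {write_rate:.2f} MB/s during the "
      f"write phase, sync = {sync_share:.0%} of checkpoint time")
```

So the bulk of the data is trickled out over 40+ minutes, yet the pain is concentrated in the short sync phase at the end.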

Config details that might be related to the problem:

bgwriter_lru_maxpages = 0 # BG writer is off
checkpoint_segments = 45
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9

I had mostly the same config with my 8.3 deployment.
But hardware is different:
Disk is software RAID-5 with 3 hard drives.

Operating system is Ubuntu 9.10 Server x64.

Thanks.
Anton.

#2Greg Smith
gsmith@gregsmith.com
In reply to: Anton Belyaev (#1)
Re: Huge iowait during checkpoint finish

Anton Belyaev wrote:

I think all the IOwait comes during sync time, which is 80 s,
according to the log entry.

I believe you are correctly diagnosing the issue. The "sync time" entry
in the log was added there specifically to make it easier to confirm
this problem you're having exists on a given system.

bgwriter_lru_maxpages = 0 # BG writer is off
checkpoint_segments = 45
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9

These are reasonable settings. You can look at pg_stat_bgwriter to get
more statistics about your checkpoints; grab a snapshot of that now,
another one later, and then compute the difference between the two.
I've got an example of that
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

You should be aiming to have a checkpoint no more often than every 5
minutes, and on a write-heavy system shooting for closer to every 10 is
probably more appropriate. Do you know how often they're happening on yours?
Two pg_stat_bgwriter snapshots from a couple of hours apart, with a
timestamp on each, can be used to figure that out.
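For what it's worth, the arithmetic on two timestamped snapshots is just this (a sketch; checkpoints_timed and checkpoints_req are the pg_stat_bgwriter columns involved, and the snapshot numbers below are hypothetical):

```python
# Sketch: estimate the average checkpoint interval from two timestamped
# pg_stat_bgwriter snapshots.
from datetime import datetime

def avg_checkpoint_interval(snap1, snap2):
    """Each snapshot: (timestamp, checkpoints_timed, checkpoints_req)."""
    t1, timed1, req1 = snap1
    t2, timed2, req2 = snap2
    checkpoints = (timed2 - timed1) + (req2 - req1)
    elapsed = (t2 - t1).total_seconds()
    return elapsed / checkpoints if checkpoints else float("inf")

# Hypothetical numbers: 3 checkpoints over two hours -> one every 40 min.
s1 = (datetime(2010, 1, 8, 16, 0), 100, 10)
s2 = (datetime(2010, 1, 8, 18, 0), 102, 11)
print(avg_checkpoint_interval(s1, s2) / 60, "minutes")  # 40.0
```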

I had mostly the same config with my 8.3 deployment.
But hardware is different:
Disk is software RAID-5 with 3 hard drives.
Operating system is Ubuntu 9.10 Server x64.

Does the new server have a lot more RAM than the 8.3 one? Some of the
problems in this area get worse the more RAM you've got.

Does the new server use ext4 while the old one used ext3?

Basically, you have a couple of standard issues here:

1) You're using RAID-5, which is not known for good write performance.
Are you sure the disk array performs well on writes? And if you didn't
benchmark it, you can't be sure.

2) Linux is buffering a lot of writes that are only making it to disk at
checkpoint time. This could be simply because of (1)--maybe the disk is
always overloaded. But it's possible this is just due to excessive
Linux buffering being lazy about the writes. I wrote something about
that topic at
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html
you might find interesting.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#3Scott Marlowe
scott.marlowe@gmail.com
In reply to: Greg Smith (#2)
Re: Huge iowait during checkpoint finish

On Fri, Jan 8, 2010 at 3:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:

Basically, you have a couple of standard issues here:

1) You're using RAID-5, which is not known for good write performance.  Are
you sure the disk array performs well on writes?  And if you didn't
benchmark it, you can't be sure.

This can be doubly bad if he's now moved to a set of disks that are
properly obeying fsync but was previously on disks that were lying
about it.

#4Anton Belyaev
anton.belyaev@gmail.com
In reply to: Greg Smith (#2)
Re: Huge iowait during checkpoint finish

Hello Greg,

Thanks for your extensive reply.

2010/1/9 Greg Smith <greg@2ndquadrant.com>:

Anton Belyaev wrote:

I think all the IOwait comes during sync time, which is 80 s,
according to the log entry.

I believe you are correctly diagnosing the issue.  The "sync time" entry in
the log was added there specifically to make it easier to confirm this
problem you're having exists on a given system.

bgwriter_lru_maxpages = 0 # BG writer is off
checkpoint_segments = 45
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9

These are reasonable settings.  You can look at pg_stat_bgwriter to get more
statistics about your checkpoints; grab a snapshot of that now, another one
later, and then compute the difference between the two.  I've got an example
of that http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

You should be aiming to have a checkpoint no more than every 5 minutes, and
on a write-heavy system shooting for closer to every 10 is probably more
appropriate.  Do you know how often they're happening on yours?  Two
pg_stat_bgwriter snapshots from a couple of hours apart, with a timestamp on
each, can be used to figure that out.

Checkpoints happen about once an hour, sometimes a bit more often
(every 30 minutes) during daily peaks.

I had mostly the same config with my 8.3 deployment.
But hardware is different:
Disk is software RAID-5 with 3 hard drives.
Operating system is Ubuntu 9.10 Server x64.

Does the new server have a lot more RAM than the 8.3 one?  Some of the
problems in this area get worse the more RAM you've got.

Yes, the new server has 12 GB while the old one has only 8 GB.

Does the new server use ext4 while the old one used ext3?

Same ext3 filesystem.

Basically, you have a couple of standard issues here:

1) You're using RAID-5, which is not known for good write performance.  Are
you sure the disk array performs well on writes?  And if you didn't
benchmark it, you can't be sure.

I did some dd benchmarks (according to
http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm):
The old server with its "hardware RAID-1" shows 60 MB/s on write.
The new server with software RAID-5 shows 85 MB/s on write.

2) Linux is buffering a lot of writes that are only making it to disk at
checkpoint time.  This could be simply because of (1)--maybe the disk is
always overloaded.  But it's possible this is just due to excessive Linux
buffering being lazy about the writes.  I wrote something about that topic
at http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you
might find interesting.

The old server has:
dirty_ratio = 10
dirty_background_ratio = 5

The new server had:
dirty_ratio = 20
dirty_background_ratio = 10
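In absolute terms those percentages work out as follows (a rough sketch; the ratios apply against total memory, and the exact kernel accounting varies by version):

```python
# Rough translation of vm.dirty_* percentages into absolute limits.
# Assumes the ratios apply against total RAM; exact kernel accounting
# (e.g. whether only reclaimable memory counts) varies by version.
def dirty_thresholds_gb(ram_gb, dirty_ratio, dirty_background_ratio):
    background = ram_gb * dirty_background_ratio / 100  # writeback starts
    hard_limit = ram_gb * dirty_ratio / 100             # writers block
    return background, hard_limit

print(dirty_thresholds_gb(8, 10, 5))    # old server: (0.4, 0.8) GB
print(dirty_thresholds_gb(12, 20, 10))  # new server: (1.2, 2.4) GB
```

The new server's 2.4 GB ceiling is a lot of data to leave pending for a 3-disk array, which fits the roughly 2 GB of dirty pages observed during the dd test.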

Given all the tests and measurements above:

The new server has more RAM, leaving Linux more room for write cache.
During the dd test, the Dirty value in /proc/meminfo grew up to 2 GB.

RAID-5 is a bit faster (at least on sequential writes). The drives
aren't overloaded: their utilization during the lengthy checkpoint is
low. IOwait problems occur only during the final sync part of the
checkpoint, and during this short period the drives are almost 100%
utilized (according to sar -d 1).

I experimented with setting dirty_background_ratio = 1, but somehow
this had a negative effect, which is strange. I hoped it would spread
the load from the 2-minute sync period across the 1-hour checkpoint
span, but it did not.

As a result, I still don't know where the real problem is. The drives
aren't overloaded, and the Linux cache remains mysterious: modifying
its parameters does not give the desired effect.

Thanks.
Anton.

#5Anton Belyaev
anton.belyaev@gmail.com
In reply to: Scott Marlowe (#3)
Re: Huge iowait during checkpoint finish

2010/1/9 Scott Marlowe <scott.marlowe@gmail.com>:

On Fri, Jan 8, 2010 at 3:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:

Basically, you have a couple of standard issues here:

1) You're using RAID-5, which is not known for good write performance.  Are
you sure the disk array performs well on writes?  And if you didn't
benchmark it, you can't be sure.

This can be doubly bad if he's now moved to a set of disks that are
properly obeying fsync but was on disks that were lying about it
before.

Scott, thanks for the interesting suggestion.

How do I check this?
The old RAID-1 has a "hardware" LSI controller, and I still have
access to the old server.

#6Scott Marlowe
scott.marlowe@gmail.com
In reply to: Anton Belyaev (#5)
Re: Huge iowait during checkpoint finish

On Mon, Jan 11, 2010 at 3:53 AM, Anton Belyaev <anton.belyaev@gmail.com> wrote:

2010/1/9 Scott Marlowe <scott.marlowe@gmail.com>:

On Fri, Jan 8, 2010 at 3:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:

Basically, you have a couple of standard issues here:

1) You're using RAID-5, which is not known for good write performance.  Are
you sure the disk array performs well on writes?  And if you didn't
benchmark it, you can't be sure.

This can be doubly bad if he's now moved to a set of disks that are
properly obeying fsync but was on disks that were lying about it
before.

Scott, thanks for the interesting suggestion.

And how do I check this?
Old RAID-1 has "hardware" LSI controller.
I still have access to old server.

The old RAID card likely had a battery backed cache, which would make
the fsyncs much faster, as long as you hadn't run out of cache.

While RAID-5 can have reasonable throughput sequentially, it has a
much higher random write costs, (two reads and two writes for each
random write).
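As a back-of-envelope comparison (a sketch; assumes ~100 random IOPS per 7200 rpm drive and ignores stripe, controller, and cache effects):

```python
# Back-of-envelope random-write IOPS for small arrays.
# Assumes ~100 IOPS per 7200 rpm SATA drive; ignores caching,
# stripe alignment, and controller effects.
def raid_random_write_iops(n_disks, level, iops_per_disk=100):
    total = n_disks * iops_per_disk
    if level == "raid5":
        return total / 4  # read old data + parity, write new data + parity
    if level in ("raid1", "raid10"):
        return total / 2  # each write goes to both mirror halves
    raise ValueError(level)

print(raid_random_write_iops(3, "raid5"))   # 75.0
print(raid_random_write_iops(4, "raid10"))  # 200.0
```

Even without a caching controller, a 4-disk RAID-10 comes out well ahead of a 3-disk RAID-5 on random writes.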

Likely those two things, the lack of cache and the more expensive
writes, are why the new system seems so much slower.

If you can shoehorn in one more drive, you could run RAID-10 and get
much better performance.

#7Greg Smith
gsmith@gregsmith.com
In reply to: Scott Marlowe (#6)
Re: Huge iowait during checkpoint finish

Scott Marlowe wrote:

On Mon, Jan 11, 2010 at 3:53 AM, Anton Belyaev <anton.belyaev@gmail.com> wrote:

Old RAID-1 has "hardware" LSI controller.
I still have access to old server.

The old RAID card likely had a battery backed cache, which would make
the fsyncs much faster, as long as you hadn't run out of cache.

To be super clear here: it's possible to see a 100:1 performance drop
going from a system with a battery-backed write cache to one without.
This is one of the three main weak spots of software RAID that
still keeps hardware RAID vendors in business: it can't do anything to
speed up the type of writes done at transaction commit and at
checkpoint time. (The others are that it's hard to set up transparent
failover after a failure in software RAID so that it always works at
boot time, and that motherboard chipsets can easily lose their minds
and take down the whole system when one drive goes bad.)

If you can shoehorn one more drive, you could run RAID-10 and get much
better performance.

And throwing drives at the problem may not help. I've seen a system
with a 48-disk software RAID-10 that only got 100 TPS when running a
commit-heavy test, because it didn't have any way to cache writes
usefully for that purpose.

If the old system had a write caching card, and the new one doesn't,
that's certainly your most likely suspect for the source of the
slowdown. As for testing that specifically, if you have the old system
too you can look at the slides I've got for "Database Hardware
Benchmarking" at
http://www.westnet.com/~gsmith/content/postgresql/index.htm and use the
sysbench example I show on P26 to measure commit fsync rate. There's a
video of that presentation where I explain a lot of the background in
this area too.
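If sysbench isn't handy, a cruder stand-in is to time a tight write-and-fsync loop directly on the disk you care about (a sketch; the thresholds in the comments are rough rules of thumb, not exact numbers):

```python
# Minimal fsync-rate check: time a loop of small write+fsync calls.
# A drive honestly flushing its cache manages on the order of its
# rotational rate (roughly 100-170 fsyncs/s for a 7200 rpm disk);
# thousands per second suggest a write cache is absorbing the flushes.
import os
import tempfile
import time

def fsync_rate(path=None, iterations=200):
    fd, name = tempfile.mkstemp(dir=path)  # put path on the disk under test
    try:
        start = time.time()
        for _ in range(iterations):
            os.write(fd, b"x" * 8192)  # one 8 kB block, like a WAL write
            os.fsync(fd)
        return iterations / (time.time() - start)
    finally:
        os.close(fd)
        os.unlink(name)

print(f"{fsync_rate():.0f} fsyncs/sec")
```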

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#8Scott Marlowe
scott.marlowe@gmail.com
In reply to: Greg Smith (#7)
Re: Huge iowait during checkpoint finish

On Mon, Jan 11, 2010 at 2:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:

Scott Marlowe wrote:

If you can shoehorn one more drive, you could run RAID-10 and get much
better performance.

And throwing drives at the problem may not help.  I've seen a system with a
48-disk software RAID-10 that only got 100 TPS when running a commit-heavy
test, because it didn't have any way to cache writes usefully for that
purpose.

A 4-disk RAID-10 will be about 4 to 8 times faster than a 3-disk
RAID-5. It won't be as fast as a good-sized RAID-10 with HW caching,
but it would be a big improvement.

If the old system had a write caching card, and the new one doesn't, that's
certainly your most likely suspect for the source of the slowdown.  As for

Agreed. If the new machine is limited to 3 disks, and any one is big
enough to hold the db, I'd look at a two-disk mirror with a hot spare
on a HW RAID controller with battery-backed cache. If they can't get a
HW RAID controller, then switching to SW RAID-1 would be a positive
step.

#9Craig Ringer
craig@2ndquadrant.com
In reply to: Greg Smith (#7)
Re: Huge iowait during checkpoint finish

Greg Smith wrote:

If the old system had a write caching card, and the new one doesn't
that's certainly your most likely suspect for the source of the
slowdown.

Note that it's even possible that the old system had a card with write
caching enabled, but *no* battery backed cache. That's crazily
dangerous, but does get you the speed benefits of write caching without
the need for a BBU. I only mention this to help explain a possible
performance difference - you should NEVER do this if you care about your
data.

--
Craig Ringer