System Load analyze

Started by Peter Bauer over 18 years ago · 6 messages · general
#1 Peter Bauer
peter.bauer@apus.co.at

Hi all,

I have a system here with two 2.4GHz Xeon processors, 2GB RAM, ONE disk on
a battery-backed write cache SCSI controller, and PostgreSQL 8.1.4
running with the data on a DRBD device for high availability. The
database is also replicated to two similar machines with Slony-I.

Since the load average is between 1 (most of the time) and 10 (peaks), I
am worried about the load. I ran vmstat and iostat, which show that
1000-6000 blocks are written per second; please check the attached output
for further details.
top shows that the CPUs are at least 80% idle most of the time, so I
think there is an I/O bottleneck. I'm aware that this hardware setup is
probably not sufficient, but I would like to investigate how critical
the situation is.

thanks,
Peter

Attachments:

load.txt (text/plain)
postgresql.conf (text/plain)
#2 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Peter Bauer (#1)
Re: System Load analyze

On Nov 24, 2007 10:57 AM, Peter Bauer <peter.bauer@apus.co.at> wrote:

I have a system here with two 2.4GHz Xeon processors, 2GB RAM, ONE disk on
a battery-backed write cache SCSI controller, and PostgreSQL 8.1.4
running with the data on a DRBD device for high availability. The
database is also replicated to two similar machines with Slony-I.

Why are you running a version of PostgreSQL with known data-eating
bugs? If you care about your data, keep up to date on releases.
8.1.10 was released on 2007-09-17. 8.1.4 was released on 2006-05-23.
That's 16 months of bug fixes you're missing. Go here:
http://www.postgresql.org/docs/8.1/static/release.html and read up on
the fixes you're missing. Then update. Or just update.
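
Checking exactly which minor release is running is one psql call; the
update itself can then come from your package manager or a source build:

    psql -c "SELECT version();"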

OK, on to the issue at hand.

Since the load average is between 1 (most of the time) and 10 (peaks), I
am worried about the load. I ran vmstat and iostat, which show that
1000-6000 blocks are written per second; please check the attached output
for further details.
top shows that the CPUs are at least 80% idle most of the time, so I
think there is an I/O bottleneck. I'm aware that this hardware setup is
probably not sufficient, but I would like to investigate how critical
the situation is.

Yes. Battery-backed cache can only do so much; it's not magic pixie
dust. Once it's full, the drive becomes the bottleneck. Real db
servers have more than one disk drive. They usually have at least 4
or so, and often dozens to hundreds. Also, not all battery-backed
caching RAID controllers are created equal.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 398256 78328 140612 1063556 0 0 0 1472 2029 5081 4 3 92 0
0 2 398256 78268 140612 1063576 0 0 0 2304 1928 4216 0 2 98 0
1 2 398256 78100 140612 1063576 0 0 0 1464 1716 3994 1 3 96 0
0 0 398256 78704 140612 1063592 0 0 0 916 1435 3570 5 3 91 0
0 0 398256 77876 140612 1063616 0 0 0 0 305 1169 3 1 96 0

See that dip in the last line above, where blocks out drop to 0,
idle jumps to 96, and interrupts and context switches drop?
That's most likely where PostgreSQL is checkpointing. Checkpointing
is where it writes out all the dirty buffers to disk. If the bgwriter
is not tuned aggressively enough, checkpoints happen and make the whole
database slow down for a few seconds. If it's tuned too aggressively,
the db spends too much CPU time tracking the dirty buffers and
then writing them. If tuned just right, it will write out the dirty
buffers just fast enough that a checkpoint never has much work left to do.

You tune the bgwriter to your machine and I/O subsystem. If you're
planning on getting more hard drives, do that first. Then tune the
bgwriter.
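
As a side note, the live bgwriter settings on a running server can be
read straight from pg_settings (or with SHOW), for example:

    psql -c "SELECT name, setting FROM pg_settings WHERE name LIKE 'bgwriter%';"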

By the way, if this is "vmstat 1" output, it's showing a checkpoint
every 20 seconds or so, I think:

0 2 398256 79136 140612 1063964 0 0 0 1736 1959 4494 4 2 94 0

checkpoint here:

0 0 398256 79132 140612 1063964 0 0 0 4 260 1039 1 1 98 0
0 0 398256 79052 140612 1063980 0 0 0 2444 3084 6955 6 5 89 0
0 2 398256 79060 140612 1063988 0 0 0 948 1146 3616 3 1 96 0
0 1 398256 78268 140612 1064056 0 0 0 1908 1809 4086 6 5 88 0
0 1 398256 76728 140612 1064056 0 0 0 6256 6637 15472 5 5 90 0
0 2 398256 77000 140612 1064064 0 0 0 4916 5840 12107 1 4 95 0
0 2 398256 76956 140612 1064068 0 0 0 6468 7432 15211 1 3 96 0
0 6 398256 77388 140612 1064072 0 0 0 8116 7826 18265 1 8 91 0
0 2 398256 74312 140612 1064076 0 0 0 7032 6886 16136 2 7 91 0
0 2 398256 74264 140612 1064076 0 0 0 5680 7143 13411 0 5 95 0
0 2 398256 72980 140612 1064140 0 0 0 5396 6377 13251 6 6 88 0
0 3 398256 76972 140612 1064148 0 0 0 5652 6793 14079 4 9 87 0
0 2 398256 77836 140612 1064148 0 0 0 3968 5321 14187 10 8 82 0
1 0 398256 77280 140612 1064148 0 0 0 1608 3188 8974 21 12 67 0
1 0 398256 77832 140612 1064152 0 0 0 236 834 2625 7 5 87 0
0 0 398256 77464 140612 1064152 0 0 0 244 505 1378 2 4 94 0
1 0 398256 77828 140612 1064164 0 0 0 316 580 1954 7 2 91 0
0 0 398256 77804 140612 1064180 0 0 0 740 673 2248 2 2 96 0
0 0 398256 77000 140612 1064180 0 0 0 304 589 1739 1 3 96 0

20 rows later, checkpoint here:

0 0 398256 77000 140612 1064184 0 0 0 0 216 886 0 1 99 0
0 0 398256 75452 140612 1064184 0 0 0 432 755 2032 6 1 93 0

max_fsm_pages = 40000 # min max_fsm_relations*16, 6 bytes each, APUS

This seems a little low for a busy server.
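
One way to check instead of guessing: on 8.1, a database-wide VACUUM
VERBOSE run as superuser should print a summary of free space map usage
against the max_fsm_pages and max_fsm_relations limits at the end
("mydb" below is a placeholder for your database name):

    psql -U postgres -d mydb -c "VACUUM VERBOSE;"

If the page slots needed are anywhere near 40000, raise the setting.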

# - Background writer -

bgwriter_delay = 100 # 10-10000 milliseconds between rounds, APUS
bgwriter_lru_percent = 2.0 # 0-100% of LRU buffers scanned/round, APUS
bgwriter_lru_maxpages = 10 # 0-1000 buffers max written/round, APUS
bgwriter_all_percent = 1 # 0-100% of all buffers scanned/round, APUS
bgwriter_all_maxpages = 10 # 0-1000 buffers max written/round, APUS

So the bgwriter wakes up 10 times a second; each round it scans 2%
of the least recently used buffers and writes up to 10 of them, and it
scans only 1% of all buffers and writes at most 10 of those. This is
not aggressive enough, and given how much spare CPU you have left over,
you can be a fair bit more aggressive. The main things to increase are
the maxpages settings. Try changing them to the 100 to 300 range, and
maybe increase your percentages to 5% or so; a sketch follows below.
What we're shooting for is to see those checkpoints go away.
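
For illustration only, a starting point consistent with those ranges
might look like this in postgresql.conf (the exact numbers are guesses
to benchmark against, and a config reload is enough to apply them):

bgwriter_delay = 100             # unchanged
bgwriter_lru_percent = 5.0       # up from 2.0
bgwriter_lru_maxpages = 200      # up from 10, within the 100-300 range
bgwriter_all_percent = 5         # up from 1
bgwriter_all_maxpages = 200      # up from 10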

Then, when running your benchmark, after a few minutes, run a
checkpoint by hand and see if you get one of those slowdowns like we
saw in vmstat above. If your bgwriter is tuned properly, you should
get an almost instant response from the checkpoint and no noticeable
dip in the vmstat context-switch numbers.
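
A manual checkpoint is a single superuser SQL command, so it can be
fired from a second terminal while the benchmark and vmstat are running:

    psql -U postgres -c "CHECKPOINT;"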

Once you reach the point where the bgwriter is just keeping ahead of
checkpoints, there's little to be gained from tuning it further; you'll
just be chewing up memory and CPU bandwidth if you get too aggressive
with it.

#3 Peter Bauer
peter.bauer@apus.co.at
In reply to: Scott Marlowe (#2)
Re: System Load analyze


Hi Scott,

thank you for the great suggestions, I will keep the list informed.

br,
Peter

--
Peter Bauer
APUS Software G.m.b.H.
A-8074 Raaba, Bahnhofstrasse 1/1
Email: peter.bauer@apus.co.at
Tel: +43 316 401629 24
Fax: +43 316 401629 9

#4 Greg Smith
gsmith@gregsmith.com
In reply to: Peter Bauer (#1)
Re: System Load analyze

On Sat, 24 Nov 2007, Peter Bauer wrote:

top shows that the CPUs are at least 80% idle most of the time, so I
think there is an I/O bottleneck.

top also shows that you're never waiting for I/O, which is usually
evidence there isn't an I/O bottleneck. You passed along most of the
right data, but some useful additional things to know are:

-Actual brand/model of SCSI controller
-Operating system
-What time interval the vmstat and iostat information you gave was
produced at.
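
All three are quick to gather on a typical Linux box, for example
(output locations vary a bit by distribution):

    uname -a                 # kernel and OS
    lspci | grep -i raid     # controller brand/model; /proc/scsi/scsi is an alternative
    vmstat 1                 # the trailing number is the interval in seconds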

I agree with Scott that checkpoints should be considered as a possibility
here. I'd suggest you set checkpoint_warning to a high value so you get a
note in the logs every time one happens, then see if those notes coincide
with your high load average. More on that topic, and on adjusting the
background writer if that proves to be the cause of your slowdown, is at
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
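
The setting itself is one line in postgresql.conf and takes effect on
reload; the value just needs to be larger than the usual gap between
your checkpoints (3600 below is an arbitrary "high value"):

checkpoint_warning = 3600        # warn when checkpoints come closer together than this many seconds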

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#5 Peter Bauer
peter.bauer@apus.co.at
In reply to: Greg Smith (#4)
Re: System Load analyze

Hi Greg,

On Wednesday 28 November 2007, Greg Smith wrote:


top also shows that you're never waiting for I/O, which is usually
evidence there isn't an I/O bottleneck. You passed along most of the
right data, but some useful additional things to know are:

-Actual brand/model of SCSI controller
-Operating system
-What time interval the vmstat and iostat information you gave was
produced at.

Here are the hardware specs:
2x POWEREDGE 2850 - XEON 3.0GHZ/2MB, 800FSB
2048MB SINGLE RANK DDR2
73 GB SCSI disk, 15,000 rpm, UL
PERC4E/DI DC ULTRA320 SCSI RAID, 256MB

It's Debian sarge with kernel 2.4.26.050719-686 #1 SMP.

vmstat and iostat were running with 1 second intervals.


thank you,
Peter

--
Peter Bauer
APUS Software G.m.b.H.
A-8074 Raaba, Bahnhofstrasse 1/1
Email: peter.bauer@apus.co.at
Tel: +43 316 401629 24
Fax: +43 316 401629 9

#6 Greg Smith
gsmith@gregsmith.com
In reply to: Peter Bauer (#5)
Re: System Load analyze

On Wed, 28 Nov 2007, Peter Bauer wrote:

PERC4E/DI DC ULTRA320 SCSI RAID, 256MB
Its Debian sarge with kernel 2.4.26.050719-686 #1 SMP.

OK, so I'd expect you're using the megaraid2 driver. That and kernel
2.4.26 are a few years behind current at this point, and there have been
plenty of performance improvements in Linux and that driver since then.
Since you're asking about this before it's a serious problem and don't
need an immediate fix, you may want to consider whether upgrading to a
more modern 2.6 kernel is in your future; that change alone may resolve
some of the high load you're seeing. I doubt any 2.4 kernel is really
getting the best from your fairly modern server hardware.

Also, the PERC4E and similar Megaraid cards are known to be generally
sluggish on write throughput compared with some of the competing products
out there. I don't know that I'd replace it though, as spending the same
amount of money adding disks would probably be more useful.

There's actually a way to combine these two ideas and get an upgrade
transition plan. Buy a second disk, find a downtime window, install a
newer Linux onto it, and test. If that works well, switch to it. If it
doesn't, you still have the original unmodified system around and you can
at least split the disk load between the two drives.

vmstat and iostat were running with 1 second intervals.

Good, that means the responses you've already gotten made the right
assumptions. There are some additional fun statistics you can collect
with the Linux iostat, but sadly those need a newer kernel as well.
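
For reference, the extended statistics are most likely iostat's -x
report (per-device await, queue sizes, %util), which relies on the
newer kernel support mentioned above; with a 2.6 kernel and a current
sysstat package it's simply:

    iostat -x 5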

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD