Checkpoint Tuning Question

Started by Dan Armbrust in July 2009 · 28 messages · general
#1Dan Armbrust
daniel.armbrust.list@gmail.com

I'm running a steady state test where I am pushing about 600 queries
per second through a Postgres 8.3 system on an 8 CPU Linux system.
It's a mix of inserts, updates, and deletes on a few tables - the two
biggest ones probably have about 200,000 rows.

Hard drive is just a simple, run-of-the-mill desktop drive.

Here are parameters that I have changed from defaults:

shared_buffers = 100MB
synchronous_commit = off

And, after noting complaints in the log file about checkpoint intervals, I set:

checkpoint_segments = 10

Then I turned on slow query logging for queries that take more than 1
second, and checkpoint logging.
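In postgresql.conf, those two logging settings correspond to something like:

```
log_min_duration_statement = 1000   # in milliseconds; logs statements taking > 1 second
log_checkpoints = on
```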

Typically, I see no slow queries. The system handles the load just fine.
Once in a while, I'll see a query that takes 3 - 5 seconds.

However, once the checkpoint process begins, I get a whole flood of
queries that take between 1 and 10 seconds to complete. My throughput
crashes to near nothing. The checkpoint takes between 45 seconds and
a minute to complete.

After the checkpoint completes - the system returns to having very few
slow queries, and keeps up with the load fine.

Is there anything I can do to prevent the occasional slow query?

Is there anything I can do to prevent (or minimize) the performance
impact of the checkpoint?

Thanks,

Dan

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dan Armbrust (#1)
Re: Checkpoint Tuning Question

Dan Armbrust <daniel.armbrust.list@gmail.com> writes:

However, once the checkpoint process begins, I get a whole flood of
queries that take between 1 and 10 seconds to complete. My throughput
crashes to near nothing. The checkpoint takes between 45 seconds and
a minute to complete.

You sure this is 8.3? It should spread out checkpoints over a couple of
minutes by default. [thinks...] Maybe you need to increase
checkpoint_segments some more. If it's forcing the checkpoint in order
to hold down the number of WAL segments used up, that would explain a
fast checkpoint.

regards, tom lane

#3Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Tom Lane (#2)
Re: Checkpoint Tuning Question

On Wed, Jul 8, 2009 at 12:50 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Dan Armbrust <daniel.armbrust.list@gmail.com> writes:

However, once the checkpoint process begins, I get a whole flood of
queries that take between 1 and 10 seconds to complete.  My throughput
crashes to near nothing.  The checkpoint takes between 45 seconds and
a minute to complete.

You sure this is 8.3?  It should spread out checkpoints over a couple of
minutes by default.  [thinks...]  Maybe you need to increase
checkpoint_segments some more.  If it's forcing the checkpoint in order
to hold down the number of WAL segments used up, that would explain a
fast checkpoint.

                       regards, tom lane

Just checked - currently running 8.3.4 on the system I'm testing on.

With checkpoint_segments set to 10, the checkpoints appear to be
happening due to checkpoint_timeout - which I've left at the default
of 5 minutes.

If I double my test load, I end up with checkpoints happening about
every 4 minutes, along with a log message that I read as indicating it
used up all 10 segments.

So not much I can do to keep the checkpoint process from causing a
burst of slow queries?

Thanks,

Dan

#4Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Dan Armbrust (#1)
Re: Checkpoint Tuning Question

On Wed, Jul 8, 2009 at 1:23 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Dan Armbrust <daniel.armbrust.list@gmail.com> writes:

With checkpoint_segments set to 10, the checkpoints appear to be
happening due to checkpoint_timeout - which I've left at the default
of 5 minutes.

Well, you could increase both those settings so as to put the
checkpoints further apart, and/or increase checkpoint_completion_target
to spread the checkpoint I/O over a larger fraction of the cycle.

                       regards, tom lane

Sorry, didn't mean to get off list.

Wouldn't increasing the length between checkpoints result in the
checkpoint process taking even longer to complete?

The way my system processes and buffers incoming data, having
infrequent (but long and disruptive) checkpoints is bad, since it
causes the throughput to suffer so badly - my buffers can't hold the
flood, and I have to drop data. If I can reduce the impact of the
checkpoints, and have them occur more frequently, then my buffers
should be able to hold the incoming data during the short durations
that I have slow queries.

I'll go experiment with checkpoint_completion_target.

Thanks,

Dan

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dan Armbrust (#4)
Re: Checkpoint Tuning Question

Dan Armbrust <daniel.armbrust.list@gmail.com> writes:

On Wed, Jul 8, 2009 at 1:23 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:

Well, you could increase both those settings so as to put the
checkpoints further apart, and/or increase checkpoint_completion_target
to spread the checkpoint I/O over a larger fraction of the cycle.

Wouldn't increasing the length between checkpoints result in the
checkpoint process taking even longer to complete?

You don't really care how long it takes. What you want is for it not to
be chewing a bigger fraction of your I/O bandwidth than you can spare.
Hence, you want it to take longer. Trying to shorten it is just going
to make the spike worse.

regards, tom lane

#6Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Tom Lane (#5)
Re: Checkpoint Tuning Question

Wouldn't increasing the length between checkpoints result in the
checkpoint process taking even longer to complete?

You don't really care how long it takes.  What you want is for it not to
be chewing a bigger fraction of your I/O bandwidth than you can spare.
Hence, you want it to take longer.  Trying to shorten it is just going
to make the spike worse.

                       regards, tom lane

I bumped the segments up to 15, and the timeout up to 10 minutes, and
changed the completion target to .7.
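In postgresql.conf form, those changes are:

```
checkpoint_segments = 15
checkpoint_timeout = 10min
checkpoint_completion_target = 0.7
```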

What I observe now is that I get a short (1-2 second) period where I
get slow queries - I'm running about 30 queries in parallel at any
given time - it appears that all 30 queries get paused for a couple of
seconds at the moment that a checkpoint begins. However, after the
initial slowdown, I don't get any additional slow queries logged while
the checkpoint process runs.

My takeaway is that starting the checkpoint process is really
expensive - so I don't want to start it very frequently. And the only
downside to longer intervals between checkpoints is a longer recovery
time if the system crashes?

Thanks,

Dan

#7Greg Smith
gsmith@gregsmith.com
In reply to: Dan Armbrust (#3)
Re: Checkpoint Tuning Question

On Wed, 8 Jul 2009, Dan Armbrust wrote:

With checkpoint_segments set to 10, the checkpoints appear to be
happening due to checkpoint_timeout - which I've left at the default
of 5 minutes.

OK, then that's as far upwards as you probably need to tweak that for your
workload, even though most systems tuned for performance need 30+ instead.

If what you're getting nailed with is the pause during the initial
checkpoint processing, the only real option you have on the database side
is to lower shared_buffers. You might get some improvement playing with
operating system tunables too, to make it buffer less data and write more
aggressively instead. The idea is that when the checkpoint starts, there
should be little else buffered already. On Linux this is best
accomplished by tuning dirty_background_ratio downward, for example.
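On Linux that might look like the following sysctl fragment; the values here are illustrative only, not recommendations for any particular machine:

```
# /etc/sysctl.conf -- illustrative values only; tune for your hardware
vm.dirty_background_ratio = 1   # start background writeback after 1% of RAM is dirty
vm.dirty_ratio = 10             # force synchronous writeback at 10% dirty
```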

There's more on this subject than you probably ever wanted to know at
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm if you
haven't seen that already.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#8Greg Smith
gsmith@gregsmith.com
In reply to: Dan Armbrust (#6)
Re: Checkpoint Tuning Question

On Wed, 8 Jul 2009, Dan Armbrust wrote:

My takeaway is that starting the checkpoint process is really
expensive - so I don't want to start it very frequently. And the only
downside to longer intervals between checkpoints is a longer recovery
time if the system crashes?

Plus the additional disk space used to hold the write-ahead logs. You're
moving in the right direction here; the fewer checkpoints the better, as
long as you can stand the recovery time. What you'll discover if you bump
checkpoint_segments up high enough is that you have to lengthen your test
runs, because eventually you'll reach a point where no checkpoint happens
at all during some runs.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#7)
Re: Checkpoint Tuning Question

Greg Smith <gsmith@gregsmith.com> writes:
On Wed, 8 Jul 2009, Dan Armbrust wrote:

What I observe now is that I get a short (1-2 second) period where I
get slow queries - I'm running about 30 queries in parallel at any
given time - it appears that all 30 queries get paused for a couple of
seconds at the moment that a checkpoint begins. However, after the
initial slowdown, I don't get any additional slow queries logged while
the checkpoint process runs.

If what you're getting nailed with is the pause during the initial
checkpoint processing, the only real option you have on the database side
is to lower shared_buffers.

He's only got 100MB of shared buffers, which doesn't seem like much
considering it's apparently a fairly beefy system. I definitely
don't see how one CPU spinning over the buffer headers in BufferSync
is going to create the sort of hiccup he's describing.

Dan, are you sure that this hiccup is happening at the *start* of a
checkpoint? Do you have log_checkpoints turned on, and if so what
does it show?

regards, tom lane

#10John R Pierce
pierce@hogranch.com
In reply to: Tom Lane (#9)
Re: Checkpoint Tuning Question

Tom Lane wrote:

He's only got 100MB of shared buffers, which doesn't seem like much
considering it's apparently a fairly beefy system.

a beefy system with...

Harddrive is just a simple, run-of-the-mill desktop drive.

which is going to severely limit random write throughput....

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: John R Pierce (#10)
Re: Checkpoint Tuning Question

John R Pierce <pierce@hogranch.com> writes:

a beefy system with...

Harddrive is just a simple, run-of-the-mill desktop drive.

which is going to severely limit random write throughput....

True, which is why he's having to flail so hard to keep the checkpoint
from saturating his I/O. However, the latest report says that he
managed that, and yet there's still a one-or-two-second transient of
some sort. I'm wondering what's causing that. If it were at the *end*
of the checkpoint, it might be the disk again (failing to handle a bunch
of fsyncs, perhaps). But if it really is at the *start* then there's
something else odd happening.

regards, tom lane

#12Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#9)
Re: Checkpoint Tuning Question

On Wed, 8 Jul 2009, Tom Lane wrote:

He's only got 100MB of shared buffers, which doesn't seem like much
considering it's apparently a fairly beefy system. I definitely
don't see how one CPU spinning over the buffer headers in BufferSync
is going to create the sort of hiccup he's describing.

Agreed, it doesn't seem like a likely cause. If the problem reduces in
magnitude in proportion with the size of the buffer cache, we might have
to accept that it's true regardless; that's why I was curious to see
what impact that had on the test results.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#13Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Tom Lane (#11)
Re: Checkpoint Tuning Question

 However, the latest report says that he
managed that, and yet there's still a one-or-two-second transient of
some sort.  I'm wondering what's causing that.  If it were at the *end*
of the checkpoint, it might be the disk again (failing to handle a bunch
of fsyncs, perhaps).  But if it really is at the *start* then there's
something else odd happening.

                       regards, tom lane

Log output during heavy load - checkpoint logging on, and slow query
logging on for queries that take longer than 1 second.
Blank space inserted to show interesting bits....

Almost all of the slow query log messages are logged within about 3
seconds of the checkpoint starting message.

LOG: checkpoint starting: xlog

LOG: duration: 1101.419 ms execute <unnamed>: select
dynamichos0_.ethernetmacaddr as ethernet1_0_, dynamichos0_.ipaddr as
ipaddr0_, dynamichos0_.cpemac as cpemac0_, dynamichos0_.last_updated
as last3_0_, dynamichos0_.leasetime as leasetime0_,
dynamichos0_.regtime as regtime0_ from iphost dynamichos0_, cpe cpe1_
where dynamichos0_.cpemac=cpe1_.cpemac and 1=1 and
dynamichos0_.ethernetmacaddr=$1 and dynamichos0_.cpemac=$2 and
cpe1_.regBaseId=$3 and dynamichos0_.ipaddr<>$4
DETAIL: parameters: $1 = '01:01:01:56:01:7F', $2 = '00:55:00:82', $3
= '01:01:01:56', $4 = '000.000.000.000'
LOG: duration: 1101.422 ms execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL: parameters: $1 = '01:AA:00:A0', $2 = '2009-07-08
15:33:20.673', $3 = '2009-07-08 21:06:40.67', $4 = '2009-07-08
15:33:20.67', $5 = '01:01:02:AB:01:9D', $6 = '2.171.156.0'

<snip about 200 lines of similar messages>

LOG: duration: 1501.905 ms execute <unnamed>: select
dynamichos0_.ethernetmacaddr as ethernet1_0_, dynamichos0_.ipaddr as
ipaddr0_, dynamichos0_.cpemac as cpemac0_, dynamichos0_.last_updated
as last3_0_, dynamichos0_.leasetime as leasetime0_,
dynamichos0_.regtime as regtime0_ from iphost dynamichos0_, cpe cpe1_
where dynamichos0_.cpemac=cpe1_.cpemac and 1=1 and
dynamichos0_.ethernetmacaddr=$1 and dynamichos0_.cpemac=$2 and
cpe1_.regBaseId=$3 and dynamichos0_.ipaddr<>$4
DETAIL: parameters: $1 = '01:01:01:C3:01:8B', $2 = '00:C2:00:8E', $3
= '01:01:01:C3', $4 = '000.000.000.000'

LOG: checkpoint complete: wrote 9975 buffers (77.9%); 0 transaction
log file(s) added, 0 removed, 15 recycled; write=156.576 s, sync=0.065
s, total=156.662 s
LOG: checkpoint starting: xlog

LOG: duration: 1104.780 ms execute <unnamed>: delete from iphost
where ethernetmacaddr=$1 and ipaddr=$2 and last_updated=$3
DETAIL: parameters: $1 = '01:01:01:33:01:AA', $2 = '1.50.169.0', $3 =
'2009-07-08 15:32:57.131'
LOG: duration: 1106.499 ms execute <unnamed>: select cpe0_.cpemac as
cpemac2_0_, cpe0_.changeTime as changeTime2_0_, cpe0_.comment as
comment2_0_, cpe0_.configuration as configur4_2_0_, cpe0_.cpeconfigid
as cpeconf17_2_0_, cpe0_.cpefilterid as cpefilt18_2_0_, cpe0_.endTime
as endTime2_0_, cpe0_.ispId as ispId2_0_, cpe0_.last_updated as
last7_2_0_, cpe0_.reglocationid as regloca19_2_0_, cpe0_.modelId as
modelId2_0_, cpe0_.numberOfHosts as numberOf9_2_0_, cpe0_.regBaseId as
regBaseId2_0_, cpe0_.regTime as regTime2_0_, cpe0_.roamAllowed as
roamAll12_2_0_, cpe0_.serialNumber as serialN13_2_0_, cpe0_.slaid as
slaid2_0_, cpe0_.enable as enable2_0_, cpe0_.staticip as staticip2_0_,
cpe0_.subscriberid as subscri21_2_0_, cpe0_.swVersion as
swVersion2_0_, cpe0_.vlanid as vlanid2_0_, cpe0_.voipid as voipid2_0_
from cpe cpe0_ where cpe0_.cpemac=$1
DETAIL: parameters: $1 = '00:84:00:37'

<snip about 300 lines of similar messages>

LOG: duration: 1205.828 ms execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL: parameters: $1 = '02:31:00:25', $2 = '2009-07-08
15:39:53.718', $3 = '2009-07-08 21:13:13.715', $4 = '2009-07-08
15:39:53.715', $5 = '01:01:03:32:01:22', $6 = '3.51.33.0'
LOG: duration: 1203.287 ms execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL: parameters: $1 = '03:47:00:81', $2 = '2009-07-08
15:39:53.72', $3 = '2009-07-08 21:13:13.717', $4 = '2009-07-08
15:39:53.717', $5 = '01:01:04:48:01:7E', $6 = '4.74.125.0'
LOG: duration: 1201.480 ms execute <unnamed>: insert into iphost
(cpemac, last_updated, leasetime, regtime, ethernetmacaddr, ipaddr)
values ($1, $2, $3, $4, $5, $6)
DETAIL: parameters: $1 = '01:12:00:7F', $2 = '2009-07-08
15:39:53.725', $3 = '2009-07-08 21:13:13.721', $4 = '2009-07-08
15:39:53.721', $5 = '01:01:02:13:01:7C', $6 = '2.19.123.0'

LOG: checkpoint complete: wrote 9794 buffers (76.5%); 0 transaction
log file(s) added, 0 removed, 15 recycled; write=148.084 s, sync=0.062
s, total=148.172 s

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dan Armbrust (#13)
Re: Checkpoint Tuning Question

Dan Armbrust <daniel.armbrust.list@gmail.com> writes:

Almost all of the slow query log messages are logged within about 3
seconds of the checkpoint starting message.

LOG: checkpoint complete: wrote 9975 buffers (77.9%); 0 transaction
log file(s) added, 0 removed, 15 recycled; write=156.576 s, sync=0.065
s, total=156.662 s

Huh. And there's just about no daylight between the total checkpoint
time and the write+sync time (barely more than 20msec in both examples).
So that seems to wipe out the thought I had that maybe we'd
underestimated the work involved in one of the other steps of
checkpoint.

As Greg commented upthread, we seem to be getting forced to the
conclusion that the initial buffer scan in BufferSync() is somehow
causing this. There are a couple of things it'd be useful to try
here:

* see how the size of the hiccup varies with shared_buffers;

* try inserting a delay into that scan loop, as per attached
quick-and-dirty patch. (Numbers pulled from the air, but
we can worry about tuning after we see if this is really
where the problem is.)

regards, tom lane

#15Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Tom Lane (#14)
Re: Checkpoint Tuning Question

As Greg commented upthread, we seem to be getting forced to the
conclusion that the initial buffer scan in BufferSync() is somehow
causing this.  There are a couple of things it'd be useful to try
here:

* see how the size of the hiccup varies with shared_buffers;

I tried decreasing shared buffers - both 25MB and 50MB were too small
for my load - I had slow queries at all times.

So then I increased it from what I was using - 100MB - to 500MB - and
the hiccup roughly doubles in length.

At 100MB, the hiccup is about 2-3 seconds long.
At 500MB, the hiccup is about 6 seconds long.

* try inserting a delay into that scan loop, as per attached
 quick-and-dirty patch.  (Numbers pulled from the air, but
 we can worry about tuning after we see if this is really
 where the problem is.)

After finally getting this particular system into a state where I
could build Postgres (I was using the binary install), I built an 8.3.4
with your patch applied - but I didn't see any change in the behaviour.
I see hiccups that appear to be the same length as I saw on the binary
build of 8.3.4.

Thanks,

Dan

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#14)
Re: Checkpoint Tuning Question

On Wed, 2009-07-08 at 18:22 -0400, Tom Lane wrote:

As Greg commented upthread, we seem to be getting forced to the
conclusion that the initial buffer scan in BufferSync() is somehow
causing this. There are a couple of things it'd be useful to try
here:

Not sure why you're forced to that conclusion?

ISTM more likely to be a problem with checkpointing clog or subtrans.
That would block everybody and the scale of the problem is about right.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#16)
Re: Checkpoint Tuning Question

Simon Riggs <simon@2ndQuadrant.com> writes:

ISTM more likely to be a problem with checkpointing clog or subtrans.
That would block everybody and the scale of the problem is about right.

That's what I had been thinking too, but the log_checkpoint output
conclusively disproves it: those steps are taking less than 20msec.

regards, tom lane

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#17)
Re: Checkpoint Tuning Question

On Fri, 2009-07-10 at 10:27 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

ISTM more likely to be a problem with checkpointing clog or subtrans.
That would block everybody and the scale of the problem is about right.

That's what I had been thinking too, but the log_checkpoint output
conclusively disproves it: those steps are taking less than 20msec.

OK, I was looking at total - write, not total - write - sync.

I think it's a traffic jam.

After a checkpoint, in XLogInsert() we discover that we now have to back
up a block that we previously didn't need to. So we have to drop the lock
and then re-access WALInsertLock. So every backend has to go through the
queue twice the first time it tries to write WAL immediately after a
checkpoint. Also, suddenly, every block needs to be copied to WAL, so
the CRC checks make each lock holder take longer than normal, and the
whole queue begins to back up. Then, because wal_buffers is small,
we find that the increased volume of WAL being written causes
WALInsertLock to be held across I/O.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#18)
Re: Checkpoint Tuning Question

Simon Riggs <simon@2ndQuadrant.com> writes:

I think it's a traffic jam.

After a checkpoint, in XLogInsert() we discover that we now have to back
up a block that we previously didn't need to. So we have to drop the lock
and then re-access WALInsertLock. So every backend has to go through the
queue twice the first time it tries to write WAL immediately after a
checkpoint. Also, suddenly, every block needs to be copied to WAL, so
the CRC checks make each lock holder take longer than normal, and the
whole queue begins to back up. Then, because wal_buffers is small,
we find that the increased volume of WAL being written causes
WALInsertLock to be held across I/O.

Hm, I'm not sure I believe any of that except the last bit, seeing that
he's got plenty of excess CPU capability. But the last bit fits with
the wimpy-I/O problem, and it also offers something we could test.
Dan, please see what happens when you vary the wal_buffers setting.
(Note you need a postmaster restart to change that.)

regards, tom lane
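A wal_buffers change is a one-line edit plus a restart; for reference, something like the following (2MB is only an example value, and 64kB is the 8.3 default):

```
wal_buffers = 2MB   # 8.3 default is 64kB; change requires a postmaster restart
```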

#20Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Tom Lane (#19)
Re: Checkpoint Tuning Question

Hm, I'm not sure I believe any of that except the last bit, seeing that
he's got plenty of excess CPU capability.  But the last bit fits with
the wimpy-I/O problem, and it also offers something we could test.
Dan, please see what happens when you vary the wal_buffers setting.
(Note you need a postmaster restart to change that.)

                       regards, tom lane

Ok, I tried a few different values - 32kB, 64kB, 512kB, 2MB and 10MB.

I'm not seeing any highly noticeable change in behaviour with any
setting - it wasn't a scientific test, but I seem to have about the
same size hiccup with each setting. The hiccup may be slightly
shorter with the 10MB setting, but barely, if it is.

Thanks,

Dan

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Dan Armbrust (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#21)
#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#22)
#24Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Tom Lane (#22)
#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Dan Armbrust (#24)
#26Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Simon Riggs (#25)
#27Dan Armbrust
daniel.armbrust.list@gmail.com
In reply to: Dan Armbrust (#24)
#28tomrevam
tomer@fabrix.tv
In reply to: Dan Armbrust (#24)