Load distributed checkpoint V3

Started by ITAGAKI Takahiro · almost 19 years ago · 25 messages
#1 ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
1 attachment(s)

Folks,

Here is the latest version of Load distributed checkpoint patch.

I've fixed some bugs, including in cases of missing file errors
and overlapping of asynchronous checkpoint requests.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

checkpoint_v3.patch.gz (application/octet-stream)
#2 Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint V3

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:

Folks,

Here is the latest version of Load distributed checkpoint patch.

I've fixed some bugs, including in cases of missing file errors
and overlapping of asynchronous checkpoint requests.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3 Greg Smith
gsmith@gregsmith.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint V3

On Fri, 23 Mar 2007, ITAGAKI Takahiro wrote:

Here is the latest version of Load distributed checkpoint patch.

Couple of questions for you:

-Is it still possible to get the original behavior by adjusting your
tunables? It would be nice to do a before/after without having to
recompile, and I know I'd be concerned about something so different
becoming the new default behavior.

-Can you suggest a current test case to demonstrate the performance
improvement here? I've tried several variations on stretching out
checkpoints like you're doing here and they all made slow checkpoint
issues even worse on my Linux system. I'm trying to evaluate this fairly.

-This code operates on the assumption you have a good value for the
checkpoint timeout. Have you tested its behavior when checkpoints are
being triggered by checkpoint_segments being reached instead?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#4 ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Greg Smith (#3)
Re: Load distributed checkpoint V3

Greg Smith <gsmith@gregsmith.com> wrote:

Here is the latest version of Load distributed checkpoint patch.

Couple of questions for you:

-Is it still possible to get the original behavior by adjusting your
tunables? It would be nice to do a before/after without having to
recompile, and I know I'd be concerned about something so different
becoming the new default behavior.

Yes, if you want the original behavior, please set all of
checkpoint_[write|nap|sync]_percent to zero. They can be changed
at SIGHUP timing (pg_ctl reload). The new default configurations
are write/nap/sync = 50%/10%/20%. There might be room for discussion
in the choice of those values.
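In postgresql.conf terms, the knobs described above would look something like this (names and defaults as described in this thread; a sketch of the patch's GUCs, not committed syntax):

```ini
# Load-distributed checkpoint settings (patch defaults)
checkpoint_write_percent = 50   # spread buffer writes over 50% of the interval
checkpoint_nap_percent   = 10   # pause between the write and fsync phases
checkpoint_sync_percent  = 20   # spread fsyncs over 20% of the interval

# Setting all three to zero restores the original, unthrottled behavior;
# changes take effect on SIGHUP (pg_ctl reload):
# checkpoint_write_percent = 0
# checkpoint_nap_percent   = 0
# checkpoint_sync_percent  = 0
```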

-Can you suggest a current test case to demonstrate the performance
improvement here? I've tried several variations on stretching out
checkpoints like you're doing here and they all made slow checkpoint
issues even worse on my Linux system. I'm trying to evaluate this fairly.

You might need to increase checkpoint_segments and checkpoint_timeout.
Here is the results on my machine:
http://archives.postgresql.org/pgsql-hackers/2007-02/msg01613.php
I've set the values to 32 segs and 15 min to take advantage of it
in the case of pgbench -s100 then.

-This code operates on the assumption you have a good value for the
checkpoint timeout. Have you tested its behavior when checkpoints are
being triggered by checkpoint_segments being reached instead?

This patch does not work fully when checkpoints are triggered by segments.
The write phase still works because it refers to consumption of segments,
but the nap and fsync phases only check elapsed time. I'm assuming
checkpoints are triggered by timeout in normal use -- and it's my
recommended configuration whether the patch is installed or not.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#5 Greg Smith
gsmith@gregsmith.com
In reply to: ITAGAKI Takahiro (#4)
Re: Load distributed checkpoint V3

On Mon, 26 Mar 2007, ITAGAKI Takahiro wrote:

I'm assuming checkpoints are triggered by timeout in normal use -- and
it's my recommended configuration whether the patch is installed or not.

I'm curious what other people running fairly serious hardware do in this
area for write-heavy loads, whether it's timeout or segment limits that
normally trigger their checkpoints.

I'm testing on a slightly different class of machine than your sample
results, something that is in the 1500 TPS range running the pgbench test
you describe. Running that test, I always hit the checkpoint_segments
wall well before any reasonable timeout. With 64 segments, I get a
checkpoint every two minutes or so.

There's something I'm working on this week that may help out other people
trying to test your patch out. I've put together some simple scripts that
graph (patched) pgbench results, which make it very easy to see what
changes when you alter the checkpoint behavior. The edges are still rough
but the scripts work for me; I'll be polishing and testing over the next
few days:

http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm

(Note that the example graphs there aren't from the production system I
mentioned above, they're from my server at home, which is similar to the
system your results came from).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#6 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint V3

ITAGAKI Takahiro wrote:

Here is the latest version of Load distributed checkpoint patch.

Unfortunately because of the recent instrumentation and
CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD
anymore. Could you fix the bitrot and send an updated patch, please?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint V3

ITAGAKI Takahiro wrote:

Here is the latest version of Load distributed checkpoint patch.

Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time

Load distributed checkpoints will do 2. in a much better way than the
bgwriter_all_* guc options. I think we should remove that aspect of
bgwriter in favor of this patch.

The scheduling of bgwriter gets quite complicated with the patch. If I'm
reading it correctly, bgwriter will keep periodically writing buffers to
achieve 1. while the "write"-phase of checkpoint is in progress. That
makes sense; now that checkpoints take longer, we would miss goal 1.
otherwise. But we don't do that in the "sleep-between-write-and-fsync"-
and "fsync"-phases. We should, shouldn't we?

I'd suggest rearranging the code so that BgBufferSync and mdsync would
basically stay like they are without the patch; the signature wouldn't
change. To do the naps during a checkpoint, inject calls to new
functions like CheckpointWriteNap() and CheckpointFsyncNap() inside
BgBufferSync and mdsync. Those nap functions would check if enough
progress has been made since last call and sleep if so.

The piece of code that implements 1. would be refactored to a new
function, let's say BgWriteLRUBuffers(). The nap-functions would call
BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed
since last call to it.

This way the changes to CreateCheckpoint, BgBufferSync and mdsync would
be minimal, and bgwriter would keep cleaning buffers for normal backends
during the whole checkpoint.
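The nap helper proposed above could be sketched like this (the scheduling rule and signature are my assumptions based on this mail, not code from the patch; only the name CheckpointWriteNap comes from the proposal):

```c
#include <assert.h>

/*
 * Hypothetical sketch of the proposed nap helper: called from inside
 * the checkpoint write loop, it asks for a sleep whenever the write
 * phase is running ahead of its schedule.
 *
 * done_fraction:    buffers written so far / buffers to write
 * elapsed_fraction: portion of the write phase's time budget used
 * Returns milliseconds to sleep, or 0 to keep writing.
 */
int
CheckpointWriteNap(double done_fraction, double elapsed_fraction, int delay_ms)
{
    if (done_fraction > elapsed_fraction)
        return delay_ms;    /* ahead of schedule: nap (pg_usleep in caller) */
    return 0;               /* behind schedule: write the next buffer */
}
```

In Heikki's scheme this is also where BgWriteLRUBuffers() would be invoked if more than bgwriter_delay milliseconds have passed, so goal 1 keeps being served during the nap.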

Another thought is to have a separate checkpointer-process so that the
bgwriter process can keep cleaning dirty buffers while the checkpoint is
running in a separate process. One problem with that is that we
currently collect all the fsync requests in bgwriter. If we had a
separate checkpointer process, we'd need to do that in the checkpointer
instead, and bgwriter would need to send a message to the checkpointer
every time it flushes a buffer, which would be a lot of chatter.
Alternatively, bgwriter could somehow pass the pendingOpsTable to the
checkpointer process at the beginning of the checkpoint, but that's not
exactly trivial either.

PS. Great that you're working on this. It's a serious problem under
heavy load.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8 Greg Smith
gsmith@gregsmith.com
In reply to: Heikki Linnakangas (#6)
Re: Load distributed checkpoint V3

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

Unfortunately because of the recent instrumentation and CheckpointStartLock
patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix
the bitrot and send an updated patch, please?

The "Logging checkpoints and other slowdown causes" patch I submitted
touches some of the same code as well, that's another possible merge
coming depending on what order this all gets committed in. Running into
what I dubbed perpetual checkpoints was one of the reasons I started
logging timing information for the various portions of the checkpoint, to
tell when it was bogged down with slow writes versus being held up in sync
for various (possibly fixed with your CheckpointStartLock) issues.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#9 Greg Smith
gsmith@gregsmith.com
In reply to: Heikki Linnakangas (#7)
Re: Load distributed checkpoint V3

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time
Load distributed checkpoints will do 2. in a much better way than the
bgwriter_all_* guc options. I think we should remove that aspect of bgwriter
in favor of this patch.

My first question about the LDC patch was whether I could turn it off and
return to the existing mechanism. I would like to see a large pile of
data proving this new approach is better before the old one goes away. I
think everyone needs to do some more research and measurement here before
assuming the problem can be knocked out so easily.

The reason I've been busy working on patches to gather statistics on this
area of code is because I've tried most simple answers to getting the
background writer to work better and made little progress, and I'd like to
see everyone else doing the same at least collecting the right data.

Let me suggest a different way of looking at this problem. At any moment,
some percentage of your buffer pool is dirty. Whether it's 0% or 100%
dramatically changes what the background writer should be doing. Whether
most of the data is usage_count>0 or not also makes a difference. None of
the current code has any idea what type of buffer pool they're working
with, and therefore they don't have enough information to make a
well-informed prediction about what is going to happen in the near future.

I'll tell you what I did to the all-scan. I ran a few hundred hours worth
of background writer tests to collect data on what it does wrong, then
wrote a prototype automatic background writer that resets the all-scan
parameters based on what I found. It keeps a running estimate of how
dirty the pool at large is using a weighted average of the most recent
scan with the past history. From there, I have a simple model that
predicts how much of the buffer we can scan in any interval, and intends
to enforce a maximum bound on the amount of physical I/O you're willing to
stream out. The beta code is sitting at
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want to
see what I've done so far. The parts that are done work fine--as long as
you give it a reasonable % to scan by default, it will correct
all_max_pages and the interval in real-time to meet the scan rate
you requested given how much is currently dirty; the I/O rate is
computed but doesn't limit properly yet.

Why haven't I brought this all up yet? Two reasons. The first is because
it doesn't work on my system; checkpoints and overall throughput get worse
when you try to shorten them by running the background writer at optimal
aggressiveness. Under really heavy load, the writes slow down as all the
disk caches fill, the background writer fights with reads on the data that
isn't in the mostly dirty cache (introducing massive seek delays), it
stops cleaning effectively, and it's better for it to not even try. My
next generation of code was going to start with the LRU flush and then
only move onto the all-scan if there's time leftover.

The second is that I just started to get useful results here in the last
few weeks, and I assumed it's too big of a topic to start suggesting major
redesigns to the background writer mechanism at that point (from me at
least!). I was waiting for 8.3 to freeze before even trying. If you want
to push through a redesign there, maybe you can get away with it at this
late moment. But I ask that you please don't remove anything from the
current design until you have significant test results to back up that
change.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#10 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Greg Smith (#9)
Re: Load distributed checkpoint V3

Greg Smith wrote:

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a
write
2. smooth checkpoints by writing buffers ahead of time
Load distributed checkpoints will do 2. in a much better way than the
bgwriter_all_* guc options. I think we should remove that aspect of
bgwriter in favor of this patch.

...

Let me suggest a different way of looking at this problem. At any
moment, some percentage of your buffer pool is dirty. Whether it's 0%
or 100% dramatically changes what the background writer should be
doing. Whether most of the data is usage_count>0 or not also makes a
difference. None of the current code has any idea what type of buffer
pool they're working with, and therefore they don't have enough
information to make a well-informed prediction about what is going to
happen in the near future.

The purpose of the bgwriter_all_* settings is to shorten the duration of
the eventual checkpoint. The reason to shorten the checkpoint duration
is to limit the damage to other I/O activity it causes. My thinking is
that assuming the LDC patch is effective (agreed, needs more testing) at
smoothening the checkpoint, the duration doesn't matter anymore. Do you
want to argue there's other reasons to shorten the checkpoint duration?

I'll tell you what I did to the all-scan. I ran a few hundred hours
worth of background writer tests to collect data on what it does wrong,
then wrote a prototype automatic background writer that resets the
all-scan parameters based on what I found. It keeps a running estimate
of how dirty the pool at large is using a weighted average of the most
recent scan with the past history. From there, I have a simple model
that predicts how much of the buffer we can scan in any interval, and
intends to enforce a maximum bound on the amount of physical I/O you're
willing to stream out. The beta code is sitting at
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want
to see what I've done so far. The parts that are done work fine--as
long as you give it a reasonable % to scan by default, it will correct
all_max_pages and the interval in real-time to meet the scan rate
you requested given how much is currently dirty; the I/O rate is
computed but doesn't limit properly yet.

Nice. Enforcing a max bound on the I/O seems reasonable, if we accept
that shortening the checkpoint is a goal.

Why haven't I brought this all up yet? Two reasons. The first is
because it doesn't work on my system; checkpoints and overall throughput
get worse when you try to shorten them by running the background writer
at optimal aggressiveness. Under really heavy load, the writes slow
down as all the disk caches fill, the background writer fights with
reads on the data that isn't in the mostly dirty cache (introducing
massive seek delays), it stops cleaning effectively, and it's better for
it to not even try. My next generation of code was going to start with
the LRU flush and then only move onto the all-scan if there's time
leftover.

The second is that I just started to get useful results here in the last
few weeks, and I assumed it's too big of a topic to start suggesting
major redesigns to the background writer mechanism at that point (from
me at least!). I was waiting for 8.3 to freeze before even trying. If
you want to push through a redesign there, maybe you can get away with
it at this late moment. But I ask that you please don't remove anything
from the current design until you have significant test results to back
up that change.

Point taken. I need to start testing the LDC patch.

Since we're discussing this, let me tell what I've been thinking about
the lru cleaning behavior of bgwriter. ISTM that that's more
straightforward to tune automatically. Bgwriter basically needs to
ensure that the next X buffers with usage_count=0 in the clock sweep are
clean. X is the predicted number of buffers backends will evict until
the next bgwriter round.

The number of buffers evicted by normal backends in a bgwriter_delay
period is simple to keep track of, just increase a counter in
StrategyGetBuffer and reset it when bgwriter wakes up. We can use that
as an estimate of X with some safety margin.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#10)
Re: Load distributed checkpoint V3

Heikki Linnakangas <heikki@enterprisedb.com> writes:

The number of buffers evicted by normal backends in a bgwriter_delay
period is simple to keep track of, just increase a counter in
StrategyGetBuffer and reset it when bgwriter wakes up. We can use that
as an estimate of X with some safety margin.

You'd want some kind of moving-average smoothing in there, probably with
a lot shorter ramp-up than ramp-down time constant, but this seems
reasonable enough to try.
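Tom's asymmetric smoothing could be sketched like this (a minimal sketch with illustrative constants; the function name and values are invented, not from any patch):

```c
#include <assert.h>
#include <math.h>

/*
 * Smoothed estimate of buffers evicted per bgwriter round: a weighted
 * moving average that ramps up quickly when eviction demand spikes and
 * decays slowly when it falls off, as suggested in the mail above.
 */
double
smooth_evictions(double avg, double observed)
{
    const double ramp_up   = 0.5;   /* short time constant: react fast */
    const double ramp_down = 0.05;  /* long time constant: decay slowly */
    double alpha = (observed > avg) ? ramp_up : ramp_down;

    return avg + alpha * (observed - avg);
}
```

The safety margin Heikki mentions would then be applied on top of this estimate when deciding how many clean buffers to keep ahead of the clock sweep.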

regards, tom lane

#12 Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Load distributed checkpoint V3

Tom Lane wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

The number of buffers evicted by normal backends in a bgwriter_delay
period is simple to keep track of, just increase a counter in
StrategyGetBuffer and reset it when bgwriter wakes up. We can use that
as an estimate of X with some safety margin.

You'd want some kind of moving-average smoothing in there, probably with
a lot shorter ramp-up than ramp-down time constant, but this seems
reasonable enough to try.

Ironically, I just noticed that we already have a patch in the patch
queue that implements exactly that, again by Itagaki. I need to start
paying more attention :-). Keep up the good work!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13 Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint V3

Hello, long time no see.

I'm sorry to interrupt your discussion. I'm afraid the code is getting
more complicated in order to keep using fsync(). Though I don't intend to
say the current approach is wrong, could anyone evaluate again the O_SYNC
approach that commercial databases use and tell me if and why
PostgreSQL's fsync() approach is better than theirs?

This January, I got a good result with O_SYNC, which I haven't
reported here. I'll show it briefly. Please forgive me for my abrupt
email, because I don't have enough time.
# Personally, I want to work in the community, if I'm allowed.
And sorry again. I reported that O_SYNC resulted in very bad
performance last year. But that was wrong. The PC server I borrowed
was configured so that all the disks formed one RAID5 device. So, the disks
for data and WAL (/dev/sdd and /dev/sde) came from the same RAID5
device, resulting in I/O conflict.

What I modified is md.c only. I just added O_SYNC to the open flag in
mdopen() and _mdfd_openseg(), if am_bgwriter is true. I didn't want
backends to use O_SYNC because mdextend() does not have to transfer
data to disk.
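The md.c change described amounts to something like this minimal sketch (not the actual diff; `open_segment` and its arguments are invented for illustration, while the `am_bgwriter` flag is the one the mail refers to):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

bool am_bgwriter = false;   /* set in the background writer process */

/*
 * Sketch: open a relation segment, adding O_SYNC only in the bgwriter
 * so that backend-side mdextend() calls stay asynchronous.
 */
int
open_segment(const char *path, int base_flags)
{
    int flags = base_flags;

    if (am_bgwriter)
        flags |= O_SYNC;    /* every write() reaches stable storage */

    return open(path, flags, 0600);
}
```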

My evaluation environment was:

CPU: Intel Xeon 3.2GHz * 2 (HT on)
Memory: 4GB
Disk: Ultra320 SCSI (perhaps configured as write back)
OS: RHEL3.0 Update 6
Kernel: 2.4.21-37.ELsmp
PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:

shared_buffers = 2GB
wal_buffers = 1MB
wal_sync_method = open_sync
checkpoint_* and bgwriter_* parameters are left as their defaults.

I used pgbench, with the data of scaling factor 50.

[without O_SYNC, original behavior]
- pgbench -c1 -t16000
best response: 1ms
worst response: 6314ms
10th worst response: 427ms
tps: 318
- pgbench -c32 -t500
best response: 1ms
worst response: 8690ms
10th worst response: 8668ms
tps: 330

[with O_SYNC]
- pgbench -c1 -t16000
best response: 1ms
worst response: 350ms
10th worst response: 91ms
tps: 427
- pgbench -c32 -t500
best response: 1ms
worst response: 496ms
10th worst response: 435ms
tps: 1117

If the write back cache were disabled, the difference would be
smaller.
The Windows version showed similar improvements.

However, this approach has two big problems.

(1) Slows down bulk updates

Updates of large amounts of data get much slower because bgwriter seeks
and writes dirty buffers synchronously page-by-page. For example:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
without O_SYNC: 100sec
with O_SYNC: 1046sec
- UPDATE of all records of accounts
without O_SYNC: 139sec
with O_SYNC: 639sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
without O_SYNC: 24sec
with O_SYNC: 126sec

To mitigate this problem, I sorted dirty buffers by their relfilenode
and block numbers and wrote multiple pages that are adjacent both on
memory and on disk. The result was:

- COPY of accounts (5m records) and CHECKPOINT command after COPY

227sec
- UPDATE of all records of accounts
569sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
71sec

Still bad...
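The sort-and-coalesce idea above can be sketched as follows (toy structures invented for illustration; the real change operated on the shared buffer pool, not an array like this):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { unsigned rel; unsigned blk; } DirtyPage;

static int
cmp_page(const void *a, const void *b)
{
    const DirtyPage *x = a, *y = b;
    if (x->rel != y->rel) return (x->rel < y->rel) ? -1 : 1;
    if (x->blk != y->blk) return (x->blk < y->blk) ? -1 : 1;
    return 0;
}

/*
 * Sort dirty pages by (relfilenode, block number) and count the
 * contiguous runs, i.e. the number of write calls needed once
 * adjacent pages are batched into a single larger write.
 */
int
count_write_runs(DirtyPage *pages, int n)
{
    int runs = (n > 0);

    qsort(pages, n, sizeof(DirtyPage), cmp_page);
    for (int i = 1; i < n; i++)
        if (pages[i].rel != pages[i - 1].rel ||
            pages[i].blk != pages[i - 1].blk + 1)
            runs++;
    return runs;
}
```

Fewer, larger synchronous writes is exactly what reduced the O_SYNC checkpoint times reported above, though as the numbers show it was not enough by itself.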

(2) Can't utilize tablespaces

Though I didn't evaluate it, update activity would be much less efficient
with O_SYNC than with fsync() when using multiple tablespaces, because
there is only one bgwriter.

Can anyone solve these problems?
One of my ideas is to use scattered I/O. I hear that readv()/writev()
have been able to do real scattered I/O since kernel 2.6 (RHEL4.0); with
earlier kernels, readv()/writev() just performed the I/Os sequentially.
Windows has provided reliable scattered I/O for years.
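A gathered write via writev(), the interface suggested above, might look like this (a sketch; `write_two` is an invented helper that batches two buffers into one system call):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write two separate memory buffers to the file with a single writev()
 * call, in order, instead of two write() calls.
 */
ssize_t
write_two(int fd, void *buf1, size_t len1, void *buf2, size_t len2)
{
    struct iovec iov[2];

    iov[0].iov_base = buf1;
    iov[0].iov_len  = len1;
    iov[1].iov_base = buf2;
    iov[1].iov_len  = len2;

    return writev(fd, iov, 2);  /* one call, both buffers */
}
```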

Another idea is to use async I/O, possibly combined with a
multiple-bgwriter approach on platforms where async I/O is not available. How
about the chance Josh-san has brought?

#14 Greg Smith
gsmith@gregsmith.com
In reply to: Heikki Linnakangas (#10)
Re: Load distributed checkpoint V3

On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

The purpose of the bgwriter_all_* settings is to shorten the duration of
the eventual checkpoint. The reason to shorten the checkpoint duration
is to limit the damage to other I/O activity it causes. My thinking is
that assuming the LDC patch is effective (agreed, needs more testing) at
smoothening the checkpoint, the duration doesn't matter anymore. Do you
want to argue there's other reasons to shorten the checkpoint duration?

My testing results suggest that LDC doesn't smooth the checkpoint usefully
when under a high (>30 client here) load, because (on Linux at least) the
way the OS caches writes clashes badly with how buffers end up being
evicted if the buffer pool fills back up before the checkpoint is done.
In that context, anything that slows down the checkpoint duration is going
to make the problem worse rather than better, because it makes it more
likely that the tail end of the checkpoint will have to fight with the
clients for write bandwidth, at which point they both suffer. If you just
get the checkpoint done fast, the clients can't fill the pool as fast as
the BufferSync is writing it out, and things are as happy as they can be
without a major rewrite to all this code. I can get a tiny improvement in
some respects by delaying 2-5 seconds between finishing the writes and
calling fsync, because that gives Linux a moment to usefully spool some of
the data to the disk controller's cache; beyond that any additional delay
is a problem.

Since it's only the high load cases I'm having trouble dealing with, this
basically makes it a non-starter for me. The focus on checkpoint_timeout
and ignoring checkpoint_segments in the patch is also a big issue for me.
At the same time, I recognize that the approach taken in LDC probably is a
big improvement for many systems, it's just a step backwards for my
highest throughput one. I'd really enjoy hearing some results from
someone else.

The number of buffers evicted by normal backends in a bgwriter_delay period
is simple to keep track of, just increase a counter in StrategyGetBuffer and
reset it when bgwriter wakes up.

I see you've already found the other helpful Itagaki patch in this area.
I know I would like to see his code for tracking evictions commited, then
I'd like that to be added as another counter in pg_stat_bgwriter (I
mentioned that to Magnus in passing when he was setting the stats up but
didn't press it because of the patch dependency). Ideally, and this idea
was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug
hook, I think it's important that you know how every buffer written to
disk got there--was it a background writer, a checkpoint, or an eviction
that wrote it out? Track all those and you can really learn something
about your write performance, data that's impossible to collect right now.

However, as Itagaki himself points out, doing something useful with
bgwriter_lru_maxpages is only one piece of automatically tuning the
background writer. I hate to join in on chopping his patches up, but
without some additional work I don't think the exact auto-tuning logic he
then applies will work in all cases, which could make it more a problem
than the current crude yet predictable method. The whole way
bgwriter_lru_maxpages and num_to_clean play off each other in his code
currently has a number of failure modes I'm concerned about. I'm not sure
if a re-write using a moving-average approach (as I did in my auto-tuning
writer prototype and as Tom just suggested here) will be sufficient to fix
all of them. Was already on my to-do list to investigate that further.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#15 Greg Smith
gsmith@gregsmith.com
In reply to: Takayuki Tsunakawa (#13)
Re: Load distributed checkpoint V3

On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

could anyone evaluate again the O_SYNC approach that commercial databases
use and tell me if and why PostgreSQL's fsync() approach is better than
theirs?

I noticed a big improvement switching the WAL to use O_SYNC (+O_DIRECT)
instead of fsync on my big and my little servers with battery-backed
cache, so I know sync writes perform reasonably well on my hardware.
Since I've had problems with the fsync at checkpoint time, I did a similar
test to yours recently, adding O_SYNC to the open calls and pulling the
fsyncs out to get a rough idea how things would work.

Performance was reasonable most of the time, but when I hit a checkpoint
with a lot of the buffer cache dirty it was incredibly bad. It took
minutes to write everything out, compared with a few seconds for the
current case, and the background writer was too sluggish as well to help.
This appears to match your data.

If you compare how Oracle handles their writes and checkpoints to the
Postgres code, it's obvious they have a different architecture that
enables them to support sync writing usefully. I'd recommend the Database
Writer Process section of
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm
as an introduction for those not familiar with that; it's interesting
reading for anyone tinkering with background writer code.

It would be great to compare performance of the current PostgreSQL code
with a fancy multiple background writer version using the latest sync
methods or AIO; there have actually been multiple updates to improve
O_SYNC writes within Linux during the 2.6 kernel series that make this
more practical than ever on that platform. But as you've already seen,
the performance hurdle to overcome is significant, and it would have to be
optional as a result. When you add all this up--have to keep the current
non-sync writes around as well, need to redesign the whole background
writer/checkpoint approach around the idea of sync writes, and the
OS-specific parts that would come from things like AIO--it gets real
messy. Good luck drumming up support for all that when the initial
benchmarks suggest it's going to be a big step back.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#16 Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint V3

From: "Greg Smith" <gsmith@gregsmith.com>

If you compare how Oracle handles their writes and checkpoints to the
Postgres code, it's obvious they have a different architecture that
enables them to support sync writing usefully. I'd recommend the
Database Writer Process section of
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm
as an introduction for those not familiar with that; it's interesting
reading for anyone tinkering with background writer code.

Hmm... what makes you think that sync writes are useful for Oracle and
not for PostgreSQL? The process architecture is similar; bgwriter
performs most of the writes in PostgreSQL, while DBWn performs all writes
in Oracle. The difference is that Oracle can guarantee crash recovery
time by writing dirty buffers periodically in the order of their LSN.

It would be great to compare performance of the current PostgreSQL code
with a fancy multiple background writer version using the latest sync
methods or AIO; there have actually been multiple updates to improve
O_SYNC writes within Linux during the 2.6 kernel series that make this
more practical than ever on that platform. But as you've already seen,
the performance hurdle to overcome is significant, and it would have to be
optional as a result. When you add all this up--have to keep the current
non-sync writes around as well, need to redesign the whole background
writer/checkpoint approach around the idea of sync writes, and the
OS-specific parts that would come from things like AIO--it gets real
messy. Good luck drumming up support for all that when the initial
benchmarks suggest it's going to be a big step back.

I agree with you in that write method has to be optional until there's
enough data from the field that help determine which is better.

... It's a pity not to utilize async I/O and Josh-san's offer; I hope
it will be used some day. I think OS developers have been evolving async
I/O with databases in mind.

#17 Simon Riggs
simon@2ndquadrant.com
In reply to: Greg Smith (#15)
Re: Load distributed checkpoint V3

On Fri, 2007-04-06 at 02:53 -0400, Greg Smith wrote:

If you compare how Oracle handles their writes and checkpoints to the
Postgres code, it's obvious they have a different architecture that
enables them to support sync writing usefully. I'd recommend the
Database
Writer Process section of
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm
as an introduction for those not familiar with that; it's interesting
reading for anyone tinkering with background writer code.

Oracle does have a different checkpointing technique and we know it is
patented, so we need to go carefully there, especially when directly
referencing documentation.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#18Greg Smith
gsmith@gregsmith.com
In reply to: Takayuki Tsunakawa (#16)
Re: Load distributed checkpoint V3

On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

Hmm... what makes you think that sync writes are useful for Oracle and
not for PostgreSQL?

They do more to push checkpoint-time work in advance, batch writes up more
efficiently, and never let clients do the writing, all of which makes for
a different type of checkpoint.

Like Simon points out, even if it were conceivable to mimic their design
it might not even be legally feasible. The point I was trying to make is
this: you've been saying that Oracle's writing technology has better
performance in this area, which is probably true, and suggesting the cause
of that was their using O_SYNC writes. I wanted to believe that and even
tested out a prototype. The reality here appears to be that their
checkpoints go smoother *despite* using the slower sync writes because
they've built their design around the limitations of that write method.

I suspect it would take a similar scale of redesign to move Postgres in
that direction; the issues you identified (the same ones I ran into) are
not so easy to resolve. You're certainly not going to move anybody in
that direction by throwing a random comment into a discussion on the
patches list about a feature useful *right now* in this area.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#19ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Heikki Linnakangas (#7)
1 attachment(s)
Load distributed checkpoint V4

Here is an updated version of the LDC patch (V4).

- Refactored the code to minimize the impact of the changes.
- Checkpoint progress is now controlled based not only on checkpoint_timeout
  but also on checkpoint_segments, so it works better with a large
  checkpoint_timeout and a small checkpoint_segments.

We can control the delay of checkpoints using three parameters:
checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
If we set all of them to zero, checkpoints behave as they did before.
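As an illustrative postgresql.conf fragment (the parameter names come from the patch; the values shown are just the patch's defaults, not recommendations):

```ini
checkpoint_timeout = 5min
checkpoint_write_percent = 50.0  # spread buffer writes over 50% of the timeout
checkpoint_nap_percent = 10.0    # then give the kernel 10% to flush on its own
checkpoint_sync_percent = 20.0   # spread the fsyncs over another 20%

# Setting all three to zero restores the pre-patch, all-at-once checkpoint:
#checkpoint_write_percent = 0
#checkpoint_nap_percent = 0
#checkpoint_sync_percent = 0
```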

Heikki Linnakangas <heikki@enterprisedb.com> wrote:

I'd suggest rearranging the code so that BgBufferSync and mdsync would
basically stay like they are without the patch; the signature wouldn't
change. To do the naps during a checkpoint, inject calls to new
functions like CheckpointWriteNap() and CheckpointFsyncNap() inside
BgBufferSync and mdsync. Those nap functions would check if enough
progress has been made since last call and sleep if so.

Yeah, it makes LDC less intrusive. Now the code flow in checkpoints stays
as it was, and the nap-functions are called periodically in BufferSync()
and smgrsync(). But the signatures of some functions needed a small change:
the argument 'immediate' was added.

The nap-functions would call
BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed
since last call to it.

Only LRU buffers are written in the nap and sync phases in the new patch.
The ALL activity of the bgwriter was primarily designed to write dirty
buffers ahead of checkpoints, so those writes are not needed *in*
checkpoints.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

LDC_v4.patchapplication/octet-stream; name=LDC_v4.patchDownload
diff -cpr HEAD/doc/src/sgml/config.sgml LDC_v4/doc/src/sgml/config.sgml
*** HEAD/doc/src/sgml/config.sgml	Tue Apr 17 03:29:50 2007
--- LDC_v4/doc/src/sgml/config.sgml	Thu Apr 19 11:32:50 2007
*************** SET ENABLE_SEQSCAN TO OFF;
*** 1565,1570 ****
--- 1565,1619 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-checkpoint-write-percent" xreflabel="checkpoint_write_percent">
+       <term><varname>checkpoint_write_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_write_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         To spread works in checkpoints, each checkpoint spends the specified
+         time and delays to write out all dirty buffers in the shared buffer
+         pool. The default value is 50.0 (50% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-checkpoint-nap-percent" xreflabel="checkpoint_nap_percent">
+       <term><varname>checkpoint_nap_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_nap_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the delay between writing out all dirty buffers and flushing
+         all modified files. Make the kernel's disk writer to flush dirty buffers
+         during this time in order to reduce works in the next flushing phase.
+         The default value is 10.0 (10% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-checkpoint-sync-percent" xreflabel="checkpoint_sync_percent">
+       <term><varname>checkpoint_sync_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_sync_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         To spread works in checkpoints, each checkpoint spends the specified
+         time and delays to flush all modified files.
+         The default value is 20.0 (20% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
diff -cpr HEAD/src/backend/access/transam/xlog.c LDC_v4/src/backend/access/transam/xlog.c
*** HEAD/src/backend/access/transam/xlog.c	Wed Apr  4 01:34:35 2007
--- LDC_v4/src/backend/access/transam/xlog.c	Thu Apr 19 11:32:50 2007
*************** static void readRecoveryCommandFile(void
*** 399,405 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
--- 399,405 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
*************** GetRedoRecPtr(void)
*** 5274,5279 ****
--- 5274,5296 ----
  }
  
  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	XLogRecPtr		recptr;
+ 
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	INSERT_RECPTR(recptr, Insert, Insert->curridx);
+ 	SpinLockRelease(&xlogctl->info_lck);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
*************** CreateCheckPoint(bool shutdown, bool for
*** 5546,5552 ****
  	 */
  	END_CRIT_SECTION();
  
! 	CheckPointGuts(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
--- 5563,5569 ----
  	 */
  	END_CRIT_SECTION();
  
! 	CheckPointGuts(checkPoint.redo, force);
  
  	START_CRIT_SECTION();
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5652,5663 ****
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool();			/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
--- 5669,5680 ----
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool(immediate);		/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
*************** RecoveryRestartPoint(const CheckPoint *c
*** 5706,5712 ****
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
--- 5723,5729 ----
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, true);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
diff -cpr HEAD/src/backend/commands/dbcommands.c LDC_v4/src/backend/commands/dbcommands.c
*** HEAD/src/backend/commands/dbcommands.c	Fri Apr 13 00:04:35 2007
--- LDC_v4/src/backend/commands/dbcommands.c	Thu Apr 19 11:32:50 2007
*************** createdb(const CreatedbStmt *stmt)
*** 400,406 ****
  	 * up-to-date for the copy.  (We really only need to flush buffers for the
  	 * source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync();
  
  	/*
  	 * Once we start copying subdirectories, we need to be able to clean 'em
--- 400,406 ----
  	 * up-to-date for the copy.  (We really only need to flush buffers for the
  	 * source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync(true);
  
  	/*
  	 * Once we start copying subdirectories, we need to be able to clean 'em
*************** dbase_redo(XLogRecPtr lsn, XLogRecord *r
*** 1417,1423 ****
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync();
  
  		/*
  		 * Copy this subdirectory to the new location
--- 1417,1423 ----
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync(true);
  
  		/*
  		 * Copy this subdirectory to the new location
diff -cpr HEAD/src/backend/postmaster/bgwriter.c LDC_v4/src/backend/postmaster/bgwriter.c
*** HEAD/src/backend/postmaster/bgwriter.c	Sat Mar 31 03:34:55 2007
--- LDC_v4/src/backend/postmaster/bgwriter.c	Thu Apr 19 12:52:06 2007
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"
  
  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>
  
*************** typedef struct
*** 117,122 ****
--- 118,124 ----
  	sig_atomic_t ckpt_failed;	/* advances when checkpoint fails */
  
  	sig_atomic_t ckpt_time_warn;	/* warn if too soon since last ckpt? */
+ 	sig_atomic_t ckpt_force;		/* any waiter for the checkpoint? */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
*************** PgStat_MsgBgWriter BgWriterStats;
*** 138,143 ****
--- 140,148 ----
  int			BgWriterDelay = 200;
  int			CheckPointTimeout = 300;
  int			CheckPointWarning = 30;
+ double		checkpoint_write_percent = 50.0;
+ double		checkpoint_nap_percent = 10.0;
+ double		checkpoint_sync_percent = 20.0;
  
  /*
   * Flags set by interrupt handlers for later service in the main loop.
*************** static bool am_bg_writer = false;
*** 153,162 ****
--- 158,175 ----
  
  static bool ckpt_active = false;
  
+ static time_t		ckpt_start_time;
+ static XLogRecPtr	ckpt_start_recptr;
+ static double		ckpt_progress_at_sync_start;
+ 
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;
  
  
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(long msec);
+ static bool NextCheckpointRequested(void);
+ static double GetCheckpointProgress(void);
  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
  static void ReqCheckpointHandler(SIGNAL_ARGS);
*************** BackgroundWriterMain(void)
*** 343,349 ****
  		bool		force_checkpoint = false;
  		time_t		now;
  		int			elapsed_secs;
- 		long		udelay;
  
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
--- 356,361 ----
*************** BackgroundWriterMain(void)
*** 362,374 ****
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
- 		if (checkpoint_requested)
- 		{
- 			checkpoint_requested = false;
- 			do_checkpoint = true;
- 			force_checkpoint = true;
- 			BgWriterStats.m_requested_checkpoints++;
- 		}
  		if (shutdown_requested)
  		{
  			/*
--- 374,379 ----
*************** BackgroundWriterMain(void)
*** 389,399 ****
  		 */
  		now = time(NULL);
  		elapsed_secs = now - last_checkpoint_time;
! 		if (elapsed_secs >= CheckPointTimeout)
  		{
  			do_checkpoint = true;
! 			if (!force_checkpoint)
! 				BgWriterStats.m_timed_checkpoints++;
  		}
  
  		/*
--- 394,410 ----
  		 */
  		now = time(NULL);
  		elapsed_secs = now - last_checkpoint_time;
! 		if (checkpoint_requested)
! 		{
! 			checkpoint_requested = false;
! 			force_checkpoint = BgWriterShmem->ckpt_force;
! 			do_checkpoint = true;
! 			BgWriterStats.m_requested_checkpoints++;
! 		}
! 		else if (elapsed_secs >= CheckPointTimeout)
  		{
  			do_checkpoint = true;
! 			BgWriterStats.m_timed_checkpoints++;
  		}
  
  		/*
*************** BackgroundWriterMain(void)
*** 416,428 ****
--- 427,444 ----
  								elapsed_secs),
  						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
  			BgWriterShmem->ckpt_time_warn = false;
+ 			BgWriterShmem->ckpt_force = false;
  
  			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
+ 			elog(DEBUG1, "CHECKPOINT: start");
  			BgWriterShmem->ckpt_started++;
  
+ 			ckpt_start_time = now;
+ 			ckpt_start_recptr = GetInsertRecPtr();
+ 			ckpt_progress_at_sync_start = 0;
  			CreateCheckPoint(false, force_checkpoint);
  
  			/*
*************** BackgroundWriterMain(void)
*** 435,440 ****
--- 451,457 ----
  			 * Indicate checkpoint completion to any waiting backends.
  			 */
  			BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
+ 			elog(DEBUG1, "CHECKPOINT: end");
  			ckpt_active = false;
  
  			/*
*************** BackgroundWriterMain(void)
*** 451,458 ****
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
  		 * quick check using possibly-stale local state.
  		 */
! 		if (XLogArchiveTimeout > 0 &&
! 			(int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
  		{
  			/*
  			 * Update local state ... note that last_xlog_switch_time is the
--- 468,495 ----
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
  		 * quick check using possibly-stale local state.
  		 */
! 		CheckArchiveTimeout();
! 
! 		/* Nap for the configured time. */
! 		BgWriterNap(0);
! 	}
! }
! 
! /*
!  * CheckArchiveTimeout -- check for archive_timeout
!  */
! static void
! CheckArchiveTimeout(void)
! {
! 	time_t		now;
! 
! 	if (XLogArchiveTimeout <= 0)
! 		return;
! 
! 	now = time(NULL);
! 	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
! 		return;
! 
  		{
  			/*
  			 * Update local state ... note that last_xlog_switch_time is the
*************** BackgroundWriterMain(void)
*** 462,471 ****
  
  			last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
  
- 			/* if we did a checkpoint, 'now' might be stale too */
- 			if (do_checkpoint)
- 				now = time(NULL);
- 
  			/* Now we can do the real check */
  			if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
  			{
--- 499,504 ----
*************** BackgroundWriterMain(void)
*** 490,495 ****
--- 523,540 ----
  				last_xlog_switch_time = now;
  			}
  		}
+ }
+ 
+ /*
+  * BgWriterNap -- short nap in bgwriter
+  *
+  * Nap for the shorter time of the configured time or the mdelay unless
+  * it is zero. Return the actual nap time in msec.
+  */
+ static void
+ BgWriterNap(long mdelay)
+ {
+ 	long		udelay;
  
  		/*
  		 * Send off activity statistics to the stats collector
*************** BackgroundWriterMain(void)
*** 515,520 ****
--- 560,569 ----
  		else
  			udelay = 10000000L; /* Ten seconds */
  
+ 		/* Clamp the delay to the upper bound. */
+ 		if (mdelay > 0)
+ 			udelay = Min(udelay, mdelay * 1000L);
+ 
  		while (udelay > 999999L)
  		{
  			if (got_SIGHUP || checkpoint_requested || shutdown_requested)
*************** BackgroundWriterMain(void)
*** 526,534 ****
--- 575,727 ----
  
  		if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
  			pg_usleep(udelay);
+ }
+ 
+ /*
+  * CheckpointWriteDelay -- periodical sleep in checkpoint write phase
+  */
+ void
+ CheckpointWriteDelay(double progress)
+ {
+ 	if (!ckpt_active || checkpoint_write_percent <= 0)
+ 		return;
+ 
+ 	elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress);
+ 
+ 	if (!NextCheckpointRequested() &&
+ 		progress * checkpoint_write_percent > GetCheckpointProgress())
+ 	{
+ 		AbsorbFsyncRequests();
+ 		BgLruBufferSync();
+ 		BgWriterNap(0);
  	}
  }
  
+ /*
+  * CheckpointNapDelay -- sleep between checkpoint write and sync phases
+  */
+ void
+ CheckpointNapDelay(bool immediate)
+ {
+ 	if (!ckpt_active)
+ 		return;
+ 
+ 	if (!immediate)
+ 	{
+ 		double	ckpt_progress_at_nap_start = GetCheckpointProgress();
+ 		double	remain;
+ 
+ 		elog(DEBUG1, "CheckpointNapDelay: %f%%", checkpoint_nap_percent);
+ 
+ 		while (!NextCheckpointRequested() &&
+ 			(remain = ckpt_progress_at_nap_start + checkpoint_nap_percent
+ 				- GetCheckpointProgress()) > 0)
+ 		{
+ 			long	msec = (long) (CheckPointTimeout * 1000.0 * remain / 100.0);
+ 
+ 			AbsorbFsyncRequests();
+ 			BgLruBufferSync();
+ 			BgWriterNap(msec);
+ 		}
+ 	}
+ 
+ 	ckpt_progress_at_sync_start = GetCheckpointProgress();
+ }
+ 
+ /*
+  * CheckpointSyncDelay -- periodical sleep in checkpoint sync phase
+  */
+ void
+ CheckpointSyncDelay(double progress)
+ {
+ 	double	remain;
+ 
+ 	if (!ckpt_active || checkpoint_sync_percent <= 0)
+ 		return;
+ 
+ 	elog(DEBUG1, "CheckpointSyncDelay: progress=%.3f", progress);
+ 
+ 	while (!NextCheckpointRequested() &&
+ 		(remain = ckpt_progress_at_sync_start + progress *
+ 		checkpoint_sync_percent - GetCheckpointProgress()) > 0)
+ 	{
+ 		long	msec = (long) (CheckPointTimeout * 1000.0 * remain / 100.0);
+ 
+ 		AbsorbFsyncRequests();
+ 		BgLruBufferSync();
+ 		BgWriterNap(msec);
+ 	}
+ }
+ 
+ /*
+  * NextCheckpointRequested -- true iff the next checkpoint is requested
+  *
+  *	Do also check any signals received recently.
+  */
+ static bool
+ NextCheckpointRequested(void)
+ {
+ 	if (!am_bg_writer || !ckpt_active)
+ 		return true;
+ 
+ 	/* Don't sleep this checkpoint if next checkpoint is requested. */
+ 	if (checkpoint_requested || shutdown_requested ||
+ 		(time(NULL) - ckpt_start_time >= CheckPointTimeout))
+ 	{
+ 		elog(DEBUG1, "NextCheckpointRequested");
+ 		checkpoint_requested = true;
+ 		return true;
+ 	}
+ 
+ 	/* Process reload signals. */
+ 	if (got_SIGHUP)
+ 	{
+ 		got_SIGHUP = false;
+ 		ProcessConfigFile(PGC_SIGHUP);
+ 	}
+ 
+ 	/* Check for archive_timeout and nap for the configured time. */
+ 	CheckArchiveTimeout();
+ 
+ 	return false;
+ }
+ 
+ /*
+  * GetCheckpointProgress -- progress of the current checkpoint in range 0-100%
+  */
+ static double
+ GetCheckpointProgress(void)
+ {
+ 	struct timeval	now;
+ 	XLogRecPtr		recptr;
+ 	double			progress_in_time,
+ 					progress_in_xlog;
+ 	double			percent;
+ 
+ 	Assert(ckpt_active);
+ 
+ 	/* coordinate the progress with checkpoint_timeout */
+ 	gettimeofday(&now, NULL);
+ 	progress_in_time = ((double) (now.tv_sec - ckpt_start_time) +
+ 		now.tv_usec / 1000000.0) / CheckPointTimeout;
+ 
+ 	/* coordinate the progress with checkpoint_segments */
+ 	recptr = GetInsertRecPtr();
+ 	progress_in_xlog =
+ 		((double) (recptr.xlogid - ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+ 		 (double) (recptr.xrecoff - ckpt_start_recptr.xrecoff) / XLogSegSize) /
+ 		CheckPointSegments;
+ 
+ 	percent = 100.0 * Max(progress_in_time, progress_in_xlog);
+ 	if (percent > 100.0)
+ 		percent = 100.0;
+ 
+ 	elog(DEBUG2, "GetCheckpointProgress : time=%.3f, xlog=%.3f",
+ 		progress_in_time, progress_in_xlog);
+ 
+ 	return percent;
+ }
+ 
  
  /* --------------------------------
   *		signal handler routines
*************** RequestCheckpoint(bool waitforit, bool w
*** 668,673 ****
--- 861,868 ----
  	/* Set warning request flag if appropriate */
  	if (warnontime)
  		bgs->ckpt_time_warn = true;
+ 	if (waitforit)
+ 		bgs->ckpt_force = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
diff -cpr HEAD/src/backend/storage/buffer/bufmgr.c LDC_v4/src/backend/storage/buffer/bufmgr.c
*** HEAD/src/backend/storage/buffer/bufmgr.c	Sat Mar 31 03:34:55 2007
--- LDC_v4/src/backend/storage/buffer/bufmgr.c	Thu Apr 19 11:32:50 2007
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 947,957 ****
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
  	int			buf_id;
  	int			num_to_scan;
  	int			absorb_counter;
  
  	/*
  	 * Find out where to start the circular scan.
--- 947,960 ----
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(bool immediate)
  {
  	int			buf_id;
  	int			num_to_scan;
+ 	int			num_written;
  	int			absorb_counter;
+ 	int			writes_per_nap = (bgwriter_all_maxpages > 0 ?
+ 					bgwriter_all_maxpages : WRITES_PER_ABSORB);
  
  	/*
  	 * Find out where to start the circular scan.
*************** BufferSync(void)
*** 965,970 ****
--- 968,974 ----
  	 * Loop over all buffers.
  	 */
  	num_to_scan = NBuffers;
+ 	num_written = 0;
  	absorb_counter = WRITES_PER_ABSORB;
  	while (num_to_scan-- > 0)
  	{
*************** BufferSync(void)
*** 972,977 ****
--- 976,988 ----
  		{
  			BgWriterStats.m_buf_written_checkpoints++;
  
+ 			if (!immediate && ++num_written >= writes_per_nap)
+ 			{
+ 				num_written = 0;
+ 				CheckpointWriteDelay(
+ 					(double) (NBuffers - num_to_scan) / NBuffers);
+ 			}
+ 
  			/*
  			 * If in bgwriter, absorb pending fsync requests after each
  			 * WRITES_PER_ABSORB write operations, to prevent overflow of the
*************** void
*** 998,1004 ****
  BgBufferSync(void)
  {
  	static int	buf_id1 = 0;
- 	int			buf_id2;
  	int			num_to_scan;
  	int			num_written;
  
--- 1009,1014 ----
*************** BgBufferSync(void)
*** 1044,1049 ****
--- 1054,1072 ----
  		BgWriterStats.m_buf_written_all += num_written;
  	}
  
+ 	BgLruBufferSync();
+ }
+ 
+ /*
+  * BgLruBufferSync -- Write out some lru dirty buffers in the pool.
+  */
+ void
+ BgLruBufferSync(void)
+ {
+ 	int			buf_id2;
+ 	int			num_to_scan;
+ 	int			num_written;
+ 
  	/*
  	 * This loop considers only unpinned buffers close to the clock sweep
  	 * point.
*************** PrintBufferLeakWarning(Buffer buffer)
*** 1286,1295 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
! 	BufferSync();
! 	smgrsync();
  }
  
  
--- 1309,1324 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
! 	elog(DEBUG1, "CHECKPOINT: write phase");
! 	BufferSync(immediate || checkpoint_write_percent <= 0);
! 
! 	elog(DEBUG1, "CHECKPOINT: nap phase");
! 	CheckpointNapDelay(immediate || checkpoint_nap_percent <= 0);
! 
! 	elog(DEBUG1, "CHECKPOINT: sync phase");
! 	smgrsync(immediate || checkpoint_sync_percent <= 0);
  }
  
  
diff -cpr HEAD/src/backend/storage/smgr/md.c LDC_v4/src/backend/storage/smgr/md.c
*** HEAD/src/backend/storage/smgr/md.c	Fri Apr 13 02:10:55 2007
--- LDC_v4/src/backend/storage/smgr/md.c	Thu Apr 19 12:28:18 2007
*************** mdimmedsync(SMgrRelation reln)
*** 863,875 ****
   *	mdsync() -- Sync previous writes to stable storage.
   */
  void
! mdsync(void)
  {
  	static bool mdsync_in_progress = false;
  
  	HASH_SEQ_STATUS hstat;
  	PendingOperationEntry *entry;
  	int			absorb_counter;
  
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
--- 863,877 ----
   *	mdsync() -- Sync previous writes to stable storage.
   */
  void
! mdsync(bool immediate)
  {
  	static bool mdsync_in_progress = false;
  
  	HASH_SEQ_STATUS hstat;
  	PendingOperationEntry *entry;
  	int			absorb_counter;
+ 	double		progress = 0;	/* progress in bytes */
+ 	double		total = 0;		/* total filesize to fsync in bytes */
  
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
*************** mdsync(void)
*** 910,922 ****
  	 * From a performance point of view it doesn't matter anyway, as this
  	 * path will never be taken in a system that's functioning normally.
  	 */
! 	if (mdsync_in_progress)
  	{
  		/* prior try failed, so update any stale cycle_ctr values */
  		hash_seq_init(&hstat, pendingOpsTable);
  		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  		{
! 			entry->cycle_ctr = mdsync_cycle_ctr;
  		}
  	}
  
--- 912,940 ----
  	 * From a performance point of view it doesn't matter anyway, as this
  	 * path will never be taken in a system that's functioning normally.
  	 */
! 	if (mdsync_in_progress || (enableFsync && !immediate))
  	{
  		/* prior try failed, so update any stale cycle_ctr values */
  		hash_seq_init(&hstat, pendingOpsTable);
  		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  		{
! 			if (mdsync_in_progress)
! 				entry->cycle_ctr = mdsync_cycle_ctr;
! 			else if (enableFsync && !immediate && !entry->canceled)
! 			{
! 				SMgrRelation	reln;
! 				MdfdVec		   *seg;
! 				long			len;
! 
! 				/* Sum up lengths of the files on non-immediate case. */
! 				reln = smgropen(entry->tag.rnode);
! 				seg = _mdfd_getseg(reln,
! 						entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
! 						true, EXTENSION_RETURN_NULL);
! 				if (seg != NULL &&
! 					(len = FileSeek(seg->mdfd_vfd, 0, SEEK_END)) >= 0)
! 					total += len;
! 			}
  		}
  	}
  
*************** mdsync(void)
*** 931,936 ****
--- 949,956 ----
  	hash_seq_init(&hstat, pendingOpsTable);
  	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  	{
+ 		long			seglen = -1;
+ 
  		/*
  		 * If the entry is new then don't process it this time.  Note that
  		 * "continue" bypasses the hash-remove call at the bottom of the loop.
*************** mdsync(void)
*** 1010,1016 ****
--- 1030,1039 ----
  								   false, EXTENSION_RETURN_NULL);
  				if (seg != NULL &&
  					FileSync(seg->mdfd_vfd) >= 0)
+ 				{
+ 					seglen = FileSeek(seg->mdfd_vfd, 0, SEEK_END);
  					break;		/* success; break out of retry loop */
+ 				}
  
  				/*
  				 * XXX is there any point in allowing more than one retry?
*************** mdsync(void)
*** 1054,1059 ****
--- 1077,1091 ----
  		if (hash_search(pendingOpsTable, &entry->tag,
  						HASH_REMOVE, NULL) == NULL)
  			elog(ERROR, "pendingOpsTable corrupted");
+ 
+ 		/*
+ 		 * Nap some seconds according to the file size.
+ 		 */
+ 		if (seglen > 0 && total > 0)
+ 		{
+ 			progress += seglen;
+ 			CheckpointSyncDelay(progress / total);
+ 		}
  	}	/* end loop over hashtable entries */
  
  	/* Flag successful completion of mdsync */
diff -cpr HEAD/src/backend/storage/smgr/smgr.c LDC_v4/src/backend/storage/smgr/smgr.c
*** HEAD/src/backend/storage/smgr/smgr.c	Sat Jan  6 07:19:39 2007
--- LDC_v4/src/backend/storage/smgr/smgr.c	Thu Apr 19 11:32:50 2007
***************
*** 21,26 ****
--- 21,27 ----
  #include "access/xlogutils.h"
  #include "commands/tablespace.h"
  #include "pgstat.h"
+ #include "postmaster/bgwriter.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "storage/ipc.h"
*************** typedef struct f_smgr
*** 57,63 ****
  	void		(*smgr_immedsync) (SMgrRelation reln);
  	void		(*smgr_commit) (void);	/* may be NULL */
  	void		(*smgr_abort) (void);	/* may be NULL */
! 	void		(*smgr_sync) (void);	/* may be NULL */
  } f_smgr;
  
  
--- 58,64 ----
  	void		(*smgr_immedsync) (SMgrRelation reln);
  	void		(*smgr_commit) (void);	/* may be NULL */
  	void		(*smgr_abort) (void);	/* may be NULL */
! 	void		(*smgr_sync) (bool immediate);	/* may be NULL */
  } f_smgr;
  
  
*************** smgrabort(void)
*** 781,794 ****
   *	smgrsync() -- Sync files to disk at checkpoint time.
   */
  void
! smgrsync(void)
  {
  	int			i;
  
  	for (i = 0; i < NSmgr; i++)
  	{
  		if (smgrsw[i].smgr_sync)
! 			(*(smgrsw[i].smgr_sync)) ();
  	}
  }
  
--- 782,795 ----
   *	smgrsync() -- Sync files to disk at checkpoint time.
   */
  void
! smgrsync(bool immediate)
  {
  	int			i;
  
  	for (i = 0; i < NSmgr; i++)
  	{
  		if (smgrsw[i].smgr_sync)
! 			(*(smgrsw[i].smgr_sync))(immediate);
  	}
  }
  
diff -cpr HEAD/src/backend/utils/misc/guc.c LDC_v4/src/backend/utils/misc/guc.c
*** HEAD/src/backend/utils/misc/guc.c	Tue Apr 17 03:29:55 2007
--- LDC_v4/src/backend/utils/misc/guc.c	Thu Apr 19 11:32:50 2007
*************** static struct config_real ConfigureNames
*** 1821,1826 ****
--- 1821,1853 ----
  		0.1, 0.0, 100.0, NULL, NULL
  	},
  
+ 	{
+ 		{"checkpoint_write_percent", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration percentage of write phase in checkpoints."),
+ 			NULL
+ 		},
+ 		&checkpoint_write_percent,
+ 		50.0, 0.0, 100.0, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"checkpoint_nap_percent", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration percentage between write and sync phases in checkpoints."),
+ 			NULL
+ 		},
+ 		&checkpoint_nap_percent,
+ 		10.0, 0.0, 100.0, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"checkpoint_sync_percent", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration percentage of sync phase in checkpoints."),
+ 			NULL
+ 		},
+ 		&checkpoint_sync_percent,
+ 		20.0, 0.0, 100.0, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
diff -cpr HEAD/src/backend/utils/misc/postgresql.conf.sample LDC_v4/src/backend/utils/misc/postgresql.conf.sample
*** HEAD/src/backend/utils/misc/postgresql.conf.sample	Tue Apr 17 03:29:55 2007
--- LDC_v4/src/backend/utils/misc/postgresql.conf.sample	Thu Apr 19 11:32:50 2007
***************
*** 168,173 ****
--- 168,176 ----
  
  #checkpoint_segments = 3		# in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min		# range 30s-1h
+ #checkpoint_write_percent = 50.0		# duration percentage in write phase
+ #checkpoint_nap_percent = 10.0		# duration percentage between write and sync phases
+ #checkpoint_sync_percent = 20.0		# duration percentage in sync phase
  #checkpoint_warning = 30s		# 0 is off
  
  # - Archiving -
diff -cpr HEAD/src/include/access/xlog.h LDC_v4/src/include/access/xlog.h
*** HEAD/src/include/access/xlog.h	Sat Jan  6 07:19:51 2007
--- LDC_v4/src/include/access/xlog.h	Thu Apr 19 11:32:50 2007
*************** extern void InitXLOGAccess(void);
*** 165,170 ****
--- 165,171 ----
  extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  
  #endif   /* XLOG_H */
diff -cpr HEAD/src/include/postmaster/bgwriter.h LDC_v4/src/include/postmaster/bgwriter.h
*** HEAD/src/include/postmaster/bgwriter.h	Sat Jan  6 07:19:57 2007
--- LDC_v4/src/include/postmaster/bgwriter.h	Thu Apr 19 11:32:50 2007
***************
*** 20,29 ****
--- 20,35 ----
  extern int	BgWriterDelay;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
+ extern double	checkpoint_write_percent;
+ extern double	checkpoint_nap_percent;
+ extern double	checkpoint_sync_percent;
  
  extern void BackgroundWriterMain(void);
  
  extern void RequestCheckpoint(bool waitforit, bool warnontime);
+ extern void CheckpointWriteDelay(double progress);
+ extern void CheckpointNapDelay(bool immediate);
+ extern void CheckpointSyncDelay(double progress);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
diff -cpr HEAD/src/include/storage/bufmgr.h LDC_v4/src/include/storage/bufmgr.h
*** HEAD/src/include/storage/bufmgr.h	Sat Jan  6 07:19:57 2007
--- LDC_v4/src/include/storage/bufmgr.h	Thu Apr 19 11:32:50 2007
*************** extern char *ShowBufferUsage(void);
*** 125,131 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 125,131 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*************** extern void LockBufferForCleanup(Buffer 
*** 150,157 ****
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(void);
  extern void BgBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
  
--- 150,158 ----
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
  extern void BgBufferSync(void);
+ extern void BgLruBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
  
diff -cpr HEAD/src/include/storage/smgr.h LDC_v4/src/include/storage/smgr.h
*** HEAD/src/include/storage/smgr.h	Thu Jan 18 01:25:01 2007
--- LDC_v4/src/include/storage/smgr.h	Thu Apr 19 11:32:50 2007
*************** extern void AtSubAbort_smgr(void);
*** 82,88 ****
  extern void PostPrepare_smgr(void);
  extern void smgrcommit(void);
  extern void smgrabort(void);
! extern void smgrsync(void);
  
  extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
--- 82,88 ----
  extern void PostPrepare_smgr(void);
  extern void smgrcommit(void);
  extern void smgrabort(void);
! extern void smgrsync(bool immediate);
  
  extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
*************** extern void mdwrite(SMgrRelation reln, B
*** 103,109 ****
  extern BlockNumber mdnblocks(SMgrRelation reln);
  extern void mdtruncate(SMgrRelation reln, BlockNumber nblocks, bool isTemp);
  extern void mdimmedsync(SMgrRelation reln);
! extern void mdsync(void);
  
  extern void RememberFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void ForgetRelationFsyncRequests(RelFileNode rnode);
--- 103,109 ----
  extern BlockNumber mdnblocks(SMgrRelation reln);
  extern void mdtruncate(SMgrRelation reln, BlockNumber nblocks, bool isTemp);
  extern void mdimmedsync(SMgrRelation reln);
! extern void mdsync(bool immediate);
  
  extern void RememberFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void ForgetRelationFsyncRequests(RelFileNode rnode);
#20Heikki Linnakangas
hlinnaka@iki.fi
In reply to: ITAGAKI Takahiro (#19)
Re: Load distributed checkpoint V4

ITAGAKI Takahiro wrote:

Here is an updated version of LDC patch (V4).

Thanks! I'll start testing.

- Progress of checkpoint is controlled not only based on checkpoint_timeout
but also checkpoint_segments. -- Now it works better with large
checkpoint_timeout and small checkpoint_segments.

Great, much better now. I like the concept of "progress" used in the
calculations. We might want to call GetCheckpointProgress something
else, though. It doesn't return the amount of progress made, but rather
the amount of progress we should've made up to that point or we're in
danger of not completing the checkpoint in time.
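The "progress" figure being discussed can be sketched in a few lines. This is my reading of the patch's idea, not its exact code: the checkpoint must finish before either checkpoint_timeout elapses or checkpoint_segments WAL segments fill up, so the target is driven by whichever is closer; the function name and the max() combination are assumptions.

```python
def checkpoint_target_progress(elapsed_s, timeout_s, segs_written, ckpt_segments):
    """Return the progress (0-100%) a checkpoint *should* have reached by now.

    Hypothetical sketch of GetCheckpointTargetProgress: take the worse of
    the time-based and WAL-segment-based ratios, since running out of
    either one forces the next checkpoint.
    """
    progress_in_time = 100.0 * elapsed_s / timeout_s
    progress_in_xlog = 100.0 * segs_written / ckpt_segments
    return max(progress_in_time, progress_in_xlog)
```

For example, 60s into a 300s timeout with 2 of 3 segments already used, the WAL side dominates and the checkpoint should already be about two-thirds done.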

We can control the delay of checkpoints using three parameters:
checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
If we set all of the values to zero, checkpoint behaves as it was.

The nap and sync phases are pretty straightforward. The write phase,
however, behaves a bit differently.

In the nap phase, we just sleep until enough time/segments has passed,
where enough is defined by checkpoint_nap_percent. However, if we're
already past checkpoint_write_percent at the beginning of the nap, I
think we should clamp the nap time so that we don't run out of time
until the next checkpoint because of sleeping.

In the sync phase, we sleep between each fsync until enough
time/segments have passed, assuming that the time to fsync is
proportional to the file length. I'm not sure that's a very good
assumption. We might have one huge file with only very little changed
data, for example a logging table that is just occasionally appended to.
If we begin by fsyncing that, it'll take a very short time to finish,
and we'll then sleep for a long time. If we then have another large file
to fsync, but that one has all pages dirty, we risk running out of time
because of the unnecessarily long sleep. The segmentation of relations
limits the risk of that, though, by limiting the max. file size, and I
don't really have any better suggestions.

In the write phase, bgwriter_all_maxpages is also factored into the
sleeps. On each iteration, we write bgwriter_all_maxpages pages and then
we sleep for bgwriter_delay msecs. checkpoint_write_percent only
controls the maximum amount of time we try to spend in the write phase:
we skip the sleeps if we're exceeding checkpoint_write_percent, but it can
very well finish earlier. IOW, bgwriter_all_maxpages is the *minimum*
amount of pages to write between sleeps. If it's not set, we use
WRITERS_PER_ABSORB, which is hardcoded to 1000.

The approach of writing min. N pages per iteration seems sound to me. By
setting N we can control the maximum impact of a checkpoint under normal
circumstances. If there's very little work to do, it doesn't make sense
to stretch the write of say 10 buffers across a 15 min period; it's
indeed better to finish the checkpoint earlier. It's similar to
vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it
doesn't feel right, we should at least name it differently. The default
of 1000 is a bit high as well, with the default bgwriter_delay that adds
up to 39MB/s. That's ok for a decent I/O subsystem, but the default
really should be something that will still leave room for other I/O on a
small single-disk server.
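The min-pages pacing described above can be modeled in a few lines. This is a simplified simulation under stated assumptions, not the patch's code; the real loop also skips the sleeps once past the checkpoint_write_percent deadline.

```python
def write_phase_schedule(dirty_pages, min_pages_per_round, delay_ms):
    """Simulate write-phase pacing: write at least `min_pages_per_round`
    buffers, then sleep `delay_ms`, repeating until all dirty pages are
    written. Returns (rounds, total_sleep_ms)."""
    rounds = 0
    slept = 0
    remaining = dirty_pages
    while remaining > 0:
        remaining -= min(min_pages_per_round, remaining)
        rounds += 1
        if remaining > 0:  # no point sleeping after the last batch
            slept += delay_ms
    return rounds, slept

# With the defaults discussed above (1000 pages per round, 200 ms delay),
# the implied write rate is 1000 * 8192 bytes / 0.2 s ~= 39 MiB/s.
rate_mib_s = 1000 * 8192 / 0.2 / (1024 * 1024)
```

This makes the point above concrete: 2500 dirty pages are done in 3 rounds with only 400 ms of sleeping, so a nearly-clean buffer pool finishes far ahead of the checkpoint_write_percent budget.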

Should we try doing something similar for the sync phase? If there's
only 2 small files to fsync, there's no point sleeping for 5 minutes
between them just to use up the checkpoint_sync_percent budget.

Should we give a warning if you set the *_percent settings so that they
exceed 100%?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Heikki Linnakangas (#20)
Re: Load distributed checkpoint V4

Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Thanks for making my patch clearly understandable!

We might want to call GetCheckpointProgress something
else, though. It doesn't return the amount of progress made, but rather
the amount of progress we should've made up to that point or we're in
danger of not completing the checkpoint in time.

GetCheckpointProgress might be a bad name; it returns the progress we should
have made by that point, not the progress actually made. How about
GetCheckpointTargetProgress?

However, if we're
already past checkpoint_write_percent at the beginning of the nap, I
think we should clamp the nap time so that we don't run out of time
until the next checkpoint because of sleeping.

Yeah, I'm thinking the nap time should be clamped to (100.0 -
ckpt_progress_at_nap_start - checkpoint_sync_percent). I think an excess of
checkpoint_write_percent is not so important here, so I only care about the
end of the checkpoint.
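In code form, the clamp being discussed is trivial; variable names follow the discussion (a sketch, with an extra floor at zero so the nap never goes negative):

```python
def clamped_nap_percent(nap_percent, progress_at_nap_start, sync_percent):
    """Clamp the nap budget so that nap + sync still fit before 100% of
    the checkpoint interval. Never returns a negative nap."""
    return max(0.0, min(nap_percent,
                        100.0 - progress_at_nap_start - sync_percent))
```

With the defaults (nap 10%, sync 20%), a write phase that overran to 75% leaves only 5% for the nap, and one that overran to 95% skips the nap entirely.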

In the sync phase, we sleep between each fsync until enough
time/segments have passed, assuming that the time to fsync is
proportional to the file length. I'm not sure that's a very good
assumption. We might have one huge files with only very little changed
data, for example a logging table that is just occasionaly appended to.
If we begin by fsyncing that, it'll take a very short time to finish,
and we'll then sleep for a long time. If we then have another large file
to fsync, but that one has all pages dirty, we risk running out of time
because of the unnecessarily long sleep. The segmentation of relations
limits the risk of that, though, by limiting the max. file size, and I
don't really have any better suggestions.

It is difficult to estimate fsync costs. We need additional statistics to
do it. For example, if we recorded the number of write() calls for each
segment, we could use that value as the number of dirty pages per segment.
We don't have per-file write statistics now, but if we had that information,
we could use it to control checkpoints more cleverly.

Should we try doing something similar for the sync phase? If there's
only 2 small files to fsync, there's no point sleeping for 5 minutes
between them just to use up the checkpoint_sync_percent budget.

Hmmm... if we add a new parameter like kernel_write_throughput [kB/s] and
clamp the maximum sleep to size-of-segment / kernel_write_throughput (*1),
we can avoid unnecessary sleeping in the fsync phase. Do we want to have such
a new parameter? I think we have more and more GUC variables even now.
I don't want to add any new parameters if possible...

(*1) dirty-area-in-segment / kernel_write_throughput is better.
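The proposed clamp would look something like this. Note that kernel_write_throughput is only the hypothetical parameter floated in the paragraph above, not something in the patch:

```python
def max_sleep_before_fsync_s(segment_bytes, kernel_write_throughput_kb_s):
    """Upper bound on the sleep before fsyncing a segment: assume the
    kernel can flush a fully dirty segment at the given rate, so sleeping
    any longer than that gains nothing."""
    return segment_bytes / (kernel_write_throughput_kb_s * 1024.0)
```

For a full 1 GB relation segment at an assumed 10 MB/s, the sleep would be capped at about 102 seconds.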

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#22Heikki Linnakangas
heikki@enterprisedb.com
In reply to: ITAGAKI Takahiro (#21)
Re: Load distributed checkpoint V4

ITAGAKI Takahiro wrote:

Heikki Linnakangas <hlinnaka@iki.fi> wrote:

We might want to call GetCheckpointProgress something
else, though. It doesn't return the amount of progress made, but rather
the amount of progress we should've made up to that point or we're in
danger of not completing the checkpoint in time.

GetCheckpointProgress might be a bad name; It returns the progress we should
have done, not at that time. How about GetCheckpointTargetProgress?

Better. A bit long though. Not that I have any better suggestions ;-)

In the sync phase, we sleep between each fsync until enough
time/segments have passed, assuming that the time to fsync is
proportional to the file length. I'm not sure that's a very good
assumption. We might have one huge file with only very little changed
data, for example a logging table that is just occasionally appended to.
If we begin by fsyncing that, it'll take a very short time to finish,
and we'll then sleep for a long time. If we then have another large file
to fsync, but that one has all pages dirty, we risk running out of time
because of the unnecessarily long sleep. The segmentation of relations
limits the risk of that, though, by limiting the max. file size, and I
don't really have any better suggestions.

It is difficult to estimate fsync costs. We need additional statistics to
do it. For example, if we recorded the number of write() calls for each
segment, we could use that value as the number of dirty pages per segment.
We don't have per-file write statistics now, but if we had that information,
we could use it to control checkpoints more cleverly.

It's probably not worth it to be too clever with that. Even if we
recorded the number of writes we made, we still wouldn't know how many
of them haven't been flushed to disk yet.

I guess we're fine if we do just avoid excessive waiting per the
discussion in the next paragraph, and use a reasonable safety margin in
the default values.

Should we try doing something similar for the sync phase? If there's
only 2 small files to fsync, there's no point sleeping for 5 minutes
between them just to use up the checkpoint_sync_percent budget.

Hmmm... if we add a new parameter like kernel_write_throughput [kB/s] and
clamp the maximum sleep to size-of-segment / kernel_write_throughput (*1),
we can avoid unnecessary sleeping in the fsync phase. Do we want to have such
a new parameter? I think we have more and more GUC variables even now.

How about using the same parameter that controls the minimum write speed
of the write-phase (the patch used bgwriter_all_maxpages, but I
suggested renaming it)?

I don't want to add any new parameters if possible...

Agreed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#23Greg Smith
gsmith@gregsmith.com
In reply to: Heikki Linnakangas (#20)
Re: Load distributed checkpoint V4

On Thu, 19 Apr 2007, Heikki Linnakangas wrote:

In the sync phase, we sleep between each fsync until enough time/segments
have passed, assuming that the time to fsync is proportional to the file
length. I'm not sure that's a very good assumption.

I've been making scatter plots of fsync time vs. amount written to the
database for a couple of months now, and while there's a trend there it's
not a linear one based on data written. Under Linux, to make a useful
prediction about how long a fsync will take you first need to consider how
much dirty data is already in the OS cache (the "Dirty:" figure in
/proc/meminfo) before the write begins, relative to the kernel parameters
that control write behavior. Combine that with some knowledge of the
caching behavior of the controller/disk combination you're using, and it's
just barely possible to make a reasonable estimate. Any less information
than all that and you really have very little basis on which to guess how
long it's going to take.
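For reference, the "Dirty:" figure Greg mentions can be pulled out of /proc/meminfo with a few lines (Linux-only; a trivial sketch):

```python
def dirty_kb(meminfo_text):
    """Extract the amount of dirty page-cache data (in kB) from the
    contents of /proc/meminfo. Returns None if the field is absent
    (e.g. on non-Linux systems)."""
    for line in meminfo_text.splitlines():
        if line.startswith("Dirty:"):
            return int(line.split()[1])  # e.g. "Dirty:    1234 kB"
    return None

# On Linux: dirty_kb(open("/proc/meminfo").read())
```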

Other operating systems are going to give completely different behavior
here, which of course makes the problem even worse.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#24ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Heikki Linnakangas (#22)
1 attachment(s)
Load distributed checkpoint V4.1

Here is an updated version of LDC patch (V4.1).
In this release, checkpoints finish quickly if there are only a few dirty
pages in the buffer pool, following the suggestion from Heikki. Thanks.

If the last write phase finished more quickly than configured,
the next nap phase is also shortened at the same rate. For example, if we
set checkpoint_write_percent = 50% and the write phase actually finished
in 25% of the checkpoint time, the nap duration is adjusted to
checkpoint_nap_percent * 25% / 50%.

In the sync phase, we cut down the duration if there are only a few files
to fsync. We assume that we have storage whose throughput is at least
10 * bgwriter_all_maxpages (this is arguable). For example, when
bgwriter_delay=200ms and bgwriter_all_maxpages=5, we assume that
we can use 2MB/s of flush throughput (10 * 5 pages * 8kB / 200ms).
If there are 200MB of files to fsync, the duration of the sync phase is
cut down to 100sec even if that is shorter than
checkpoint_sync_percent * checkpoint_timeout.
I use bgwriter_all_maxpages as something like 'reserved bandwidth of storage
for the bgwriter' here. If there is a better name for it, please rename it.
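The arithmetic above can be checked with a few lines (decimal kB/MB, as in the message; a sketch of the calculation, not the patch's C code):

```python
def assumed_flush_throughput_kb_s(bgwriter_all_maxpages, bgwriter_delay_ms,
                                  page_kb=8):
    """Assumed 'reserved bandwidth': 10 * maxpages pages per delay interval."""
    return 10 * bgwriter_all_maxpages * page_kb / (bgwriter_delay_ms / 1000.0)

def min_sync_duration_s(total_fsync_kb, throughput_kb_s):
    """Cap on the sync phase: time to flush everything at the assumed rate."""
    return total_fsync_kb / throughput_kb_s

tput = assumed_flush_throughput_kb_s(5, 200)  # 400 kB per 200 ms = 2 MB/s
dur = min_sync_duration_s(200 * 1000, tput)   # 200 MB at 2 MB/s
```

So with the defaults from the message, the sync phase is capped at roughly 100 seconds regardless of the checkpoint_sync_percent budget.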

Heikki Linnakangas <heikki@enterprisedb.com> wrote:

I guess we're fine if we do just avoid excessive waiting per the
discussion in the next paragraph, and use a reasonable safety margin in
the default values.

Should we try doing something similar for the sync phase? If there's
only 2 small files to fsync, there's no point sleeping for 5 minutes
between them just to use up the checkpoint_sync_percent budget.

Hmmm... if we add a new parameter like kernel_write_throughput [kB/s] and
clamp the maximum sleeping to size-of-segment / kernel_write_throuput (*1),
we can avoid unnecessary sleeping in fsync phase. Do we want to have such
a new parameter? I think we have many and many guc variables even now.

How about using the same parameter that controls the minimum write speed
of the write-phase (the patch used bgwriter_all_maxpages, but I
suggested renaming it)?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

LDC_V41.patchapplication/octet-stream; name=LDC_V41.patchDownload
diff -cpr HEAD/doc/src/sgml/config.sgml LDC_V41/doc/src/sgml/config.sgml
*** HEAD/doc/src/sgml/config.sgml	Fri Apr 20 11:37:37 2007
--- LDC_V41/doc/src/sgml/config.sgml	Wed Apr 25 15:07:11 2007
*************** SET ENABLE_SEQSCAN TO OFF;
*** 1565,1570 ****
--- 1565,1619 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-checkpoint-write-percent" xreflabel="checkpoint_write_percent">
+       <term><varname>checkpoint_write_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_write_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         To spread works in checkpoints, each checkpoint spends the specified
+         time and delays to write out all dirty buffers in the shared buffer
+         pool. The default value is 50.0 (50% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-checkpoint-nap-percent" xreflabel="checkpoint_nap_percent">
+       <term><varname>checkpoint_nap_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_nap_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Specifies the delay between writing out all dirty buffers and flushing
+         all modified files. Make the kernel's disk writer to flush dirty buffers
+         during this time in order to reduce works in the next flushing phase.
+         The default value is 10.0 (10% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
+      <varlistentry id="guc-checkpoint-sync-percent" xreflabel="checkpoint_sync_percent">
+       <term><varname>checkpoint_sync_percent</varname> (<type>floating point</type>)</term>
+       <indexterm>
+        <primary><varname>checkpoint_sync_percent</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         To spread works in checkpoints, each checkpoint spends the specified
+         time and delays to flush all modified files.
+         The default value is 20.0 (20% of <varname>checkpoint_timeout</>).
+         This parameter can only be set in the <filename>postgresql.conf</>
+         file or on the server command line.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
        <term><varname>checkpoint_warning</varname> (<type>integer</type>)</term>
        <indexterm>
diff -cpr HEAD/src/backend/access/transam/xlog.c LDC_V41/src/backend/access/transam/xlog.c
*** HEAD/src/backend/access/transam/xlog.c	Wed Apr  4 01:34:35 2007
--- LDC_V41/src/backend/access/transam/xlog.c	Wed Apr 25 15:07:11 2007
*************** static void readRecoveryCommandFile(void
*** 399,405 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
--- 399,405 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
*************** GetRedoRecPtr(void)
*** 5274,5279 ****
--- 5274,5296 ----
  }
  
  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	XLogRecPtr		recptr;
+ 
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	INSERT_RECPTR(recptr, Insert, Insert->curridx);
+ 	SpinLockRelease(&xlogctl->info_lck);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
*************** CreateCheckPoint(bool shutdown, bool for
*** 5546,5552 ****
  	 */
  	END_CRIT_SECTION();
  
! 	CheckPointGuts(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
--- 5563,5569 ----
  	 */
  	END_CRIT_SECTION();
  
! 	CheckPointGuts(checkPoint.redo, force);
  
  	START_CRIT_SECTION();
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5652,5663 ****
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool();			/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
--- 5669,5680 ----
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool immediate)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool(immediate);		/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
*************** RecoveryRestartPoint(const CheckPoint *c
*** 5706,5712 ****
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
--- 5723,5729 ----
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, true);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
diff -cpr HEAD/src/backend/commands/dbcommands.c LDC_V41/src/backend/commands/dbcommands.c
*** HEAD/src/backend/commands/dbcommands.c	Fri Apr 13 00:04:35 2007
--- LDC_V41/src/backend/commands/dbcommands.c	Wed Apr 25 15:07:11 2007
*************** createdb(const CreatedbStmt *stmt)
*** 400,406 ****
  	 * up-to-date for the copy.  (We really only need to flush buffers for the
  	 * source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync();
  
  	/*
  	 * Once we start copying subdirectories, we need to be able to clean 'em
--- 400,406 ----
  	 * up-to-date for the copy.  (We really only need to flush buffers for the
  	 * source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync(true);
  
  	/*
  	 * Once we start copying subdirectories, we need to be able to clean 'em
*************** dbase_redo(XLogRecPtr lsn, XLogRecord *r
*** 1417,1423 ****
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync();
  
  		/*
  		 * Copy this subdirectory to the new location
--- 1417,1423 ----
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync(true);
  
  		/*
  		 * Copy this subdirectory to the new location
diff -cpr HEAD/src/backend/postmaster/bgwriter.c LDC_V41/src/backend/postmaster/bgwriter.c
*** HEAD/src/backend/postmaster/bgwriter.c	Sat Mar 31 03:34:55 2007
--- LDC_V41/src/backend/postmaster/bgwriter.c	Wed Apr 25 15:07:11 2007
***************
*** 44,49 ****
--- 44,50 ----
  #include "postgres.h"
  
  #include <signal.h>
+ #include <sys/time.h>
  #include <time.h>
  #include <unistd.h>
  
*************** typedef struct
*** 117,122 ****
--- 118,124 ----
  	sig_atomic_t ckpt_failed;	/* advances when checkpoint fails */
  
  	sig_atomic_t ckpt_time_warn;	/* warn if too soon since last ckpt? */
+ 	sig_atomic_t ckpt_force;		/* any waiter for the checkpoint? */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
*************** PgStat_MsgBgWriter BgWriterStats;
*** 138,143 ****
--- 140,148 ----
  int			BgWriterDelay = 200;
  int			CheckPointTimeout = 300;
  int			CheckPointWarning = 30;
+ double		checkpoint_write_percent = 50.0;
+ double		checkpoint_nap_percent = 10.0;
+ double		checkpoint_sync_percent = 20.0;
  
  /*
   * Flags set by interrupt handlers for later service in the main loop.
*************** static bool am_bg_writer = false;
*** 153,162 ****
--- 158,175 ----
  
  static bool ckpt_active = false;
  
+ static time_t		ckpt_start_time;
+ static XLogRecPtr	ckpt_start_recptr;
+ static double		ckpt_progress_at_sync_start;
+ 
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;
  
  
+ static void CheckArchiveTimeout(void);
+ static void BgWriterNap(long msec);
+ static bool NextCheckpointRequested(void);
+ static double GetCheckpointTargetProgress(void);
  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
  static void ReqCheckpointHandler(SIGNAL_ARGS);
*************** BackgroundWriterMain(void)
*** 343,349 ****
  		bool		force_checkpoint = false;
  		time_t		now;
  		int			elapsed_secs;
- 		long		udelay;
  
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
--- 356,361 ----
*************** BackgroundWriterMain(void)
*** 362,374 ****
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
- 		if (checkpoint_requested)
- 		{
- 			checkpoint_requested = false;
- 			do_checkpoint = true;
- 			force_checkpoint = true;
- 			BgWriterStats.m_requested_checkpoints++;
- 		}
  		if (shutdown_requested)
  		{
  			/*
--- 374,379 ----
*************** BackgroundWriterMain(void)
*** 389,399 ****
  		 */
  		now = time(NULL);
  		elapsed_secs = now - last_checkpoint_time;
! 		if (elapsed_secs >= CheckPointTimeout)
  		{
  			do_checkpoint = true;
! 			if (!force_checkpoint)
! 				BgWriterStats.m_timed_checkpoints++;
  		}
  
  		/*
--- 394,410 ----
  		 */
  		now = time(NULL);
  		elapsed_secs = now - last_checkpoint_time;
! 		if (checkpoint_requested)
  		{
+ 			checkpoint_requested = false;
+ 			force_checkpoint = BgWriterShmem->ckpt_force;
  			do_checkpoint = true;
! 			BgWriterStats.m_requested_checkpoints++;
! 		}
! 		else if (elapsed_secs >= CheckPointTimeout)
! 		{
! 			do_checkpoint = true;
! 			BgWriterStats.m_timed_checkpoints++;
  		}
  
  		/*
*************** BackgroundWriterMain(void)
*** 416,428 ****
--- 427,444 ----
  								elapsed_secs),
  						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
  			BgWriterShmem->ckpt_time_warn = false;
+ 			BgWriterShmem->ckpt_force = false;
  
  			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
+ 			elog(DEBUG1, "CHECKPOINT: start");
  			BgWriterShmem->ckpt_started++;
  
+ 			ckpt_start_time = now;
+ 			ckpt_start_recptr = GetInsertRecPtr();
+ 			ckpt_progress_at_sync_start = 0;
  			CreateCheckPoint(false, force_checkpoint);
  
  			/*
*************** BackgroundWriterMain(void)
*** 435,440 ****
--- 451,457 ----
  			 * Indicate checkpoint completion to any waiting backends.
  			 */
  			BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
+ 			elog(DEBUG1, "CHECKPOINT: end");
  			ckpt_active = false;
  
  			/*
*************** BackgroundWriterMain(void)
*** 451,458 ****
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
  		 * quick check using possibly-stale local state.
  		 */
! 		if (XLogArchiveTimeout > 0 &&
! 			(int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
  		{
  			/*
  			 * Update local state ... note that last_xlog_switch_time is the
--- 468,495 ----
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
  		 * quick check using possibly-stale local state.
  		 */
! 		CheckArchiveTimeout();
! 
! 		/* Nap for the configured time. */
! 		BgWriterNap(0);
! 	}
! }
! 
! /*
!  * CheckArchiveTimeout -- check for archive_timeout
!  */
! static void
! CheckArchiveTimeout(void)
! {
! 	time_t		now;
! 
! 	if (XLogArchiveTimeout <= 0)
! 		return;
! 
! 	now = time(NULL);
! 	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
! 		return;
! 
  		{
  			/*
  			 * Update local state ... note that last_xlog_switch_time is the
*************** BackgroundWriterMain(void)
*** 462,471 ****
  
  			last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
  
- 			/* if we did a checkpoint, 'now' might be stale too */
- 			if (do_checkpoint)
- 				now = time(NULL);
- 
  			/* Now we can do the real check */
  			if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
  			{
--- 499,504 ----
*************** BackgroundWriterMain(void)
*** 490,495 ****
--- 523,540 ----
  				last_xlog_switch_time = now;
  			}
  		}
+ }
+ 
+ /*
+  * BgWriterNap -- short nap in bgwriter
+  *
+  * Nap for the shorter time of the configured time or the mdelay unless
+  * it is zero. Return the actual nap time in msec.
+  */
+ static void
+ BgWriterNap(long mdelay)
+ {
+ 	long		udelay;
  
  		/*
  		 * Send off activity statistics to the stats collector
*************** BackgroundWriterMain(void)
*** 515,520 ****
--- 560,569 ----
  		else
  			udelay = 10000000L; /* Ten seconds */
  
+ 		/* Clamp the delay to the upper bound. */
+ 		if (mdelay > 0)
+ 			udelay = Min(udelay, mdelay * 1000L);
+ 
  		while (udelay > 999999L)
  		{
  			if (got_SIGHUP || checkpoint_requested || shutdown_requested)
*************** BackgroundWriterMain(void)
*** 526,534 ****
--- 575,745 ----
  
  		if (!(got_SIGHUP || checkpoint_requested || shutdown_requested))
  			pg_usleep(udelay);
+ }
+ 
+ /*
+  * CheckpointWriteDelay -- periodical sleep in checkpoint write phase
+  */
+ void
+ CheckpointWriteDelay(double progress)
+ {
+ 	if (!ckpt_active || checkpoint_write_percent <= 0)
+ 		return;
+ 
+ 	elog(DEBUG1, "CheckpointWriteDelay: progress=%.3f", progress);
+ 
+ 	if (!NextCheckpointRequested() &&
+ 		progress * checkpoint_write_percent > GetCheckpointTargetProgress())
+ 	{
+ 		AbsorbFsyncRequests();
+ 		BgLruBufferSync();
+ 		BgWriterNap(0);
  	}
  }
  
+ /*
+  * CheckpointNapDelay -- sleep between checkpoint write and sync phases
+  */
+ void
+ CheckpointNapDelay(double percent)
+ {
+ 	if (!ckpt_active)
+ 		return;
+ 
+ 	if (percent > 0)
+ 	{
+ 		double	ckpt_progress_at_nap_start = GetCheckpointTargetProgress();
+ 		double	remain;
+ 
+ 		/*
+ 		 * If the write phase was finished in less time than the configuration,
+ 		 * we also shorten this nap phase.
+ 		 */
+ 		if (ckpt_progress_at_nap_start < checkpoint_write_percent)
+ 			percent *= ckpt_progress_at_nap_start / checkpoint_write_percent;
+ 
+ 		/*
+ 		 * If we're already past the deadline in the last write phase,
+ 		 * clamp the nap time so that we don't run out of time.
+ 		 */
+ 		percent = Min(percent,
+ 			100.0 - ckpt_progress_at_nap_start - checkpoint_sync_percent);
+ 
+ 		if (percent > 0)
+ 		{
+ 			elog(DEBUG1, "CheckpointNapDelay: %f%%", percent);
+ 
+ 			while (!NextCheckpointRequested() &&
+ 				(remain = ckpt_progress_at_nap_start + percent
+ 					- GetCheckpointTargetProgress()) > 0)
+ 			{
+ 				long	msec = (long)
+ 					(CheckPointTimeout * 1000.0 * remain / 100.0);
+ 
+ 				AbsorbFsyncRequests();
+ 				BgLruBufferSync();
+ 				BgWriterNap(msec);
+ 			}
+ 		}
+ 	}
+ 
+ 	ckpt_progress_at_sync_start = GetCheckpointTargetProgress();
+ }
+ 
+ /*
+  * CheckpointSyncDelay -- periodical sleep in checkpoint sync phase
+  */
+ void
+ CheckpointSyncDelay(double progress, double percent)
+ {
+ 	double	remain;
+ 
+ 	if (!ckpt_active || percent <= 0)
+ 		return;
+ 
+ 	elog(DEBUG1, "CheckpointSyncDelay: progress=%.3f", progress);
+ 
+ 	while (!NextCheckpointRequested() &&
+ 		(remain = ckpt_progress_at_sync_start + progress * percent
+ 			- GetCheckpointTargetProgress()) > 0)
+ 	{
+ 		long	msec = (long) (CheckPointTimeout * 1000.0 * remain / 100.0);
+ 
+ 		AbsorbFsyncRequests();
+ 		BgLruBufferSync();
+ 		BgWriterNap(msec);
+ 	}
+ }
+ 
+ /*
+  * NextCheckpointRequested -- true iff the next checkpoint is requested
+  *
+  *	Do also check any signals received recently.
+  */
+ static bool
+ NextCheckpointRequested(void)
+ {
+ 	if (!am_bg_writer || !ckpt_active)
+ 		return true;
+ 
+ 	/* Don't sleep this checkpoint if next checkpoint is requested. */
+ 	if (checkpoint_requested || shutdown_requested ||
+ 		(time(NULL) - ckpt_start_time >= CheckPointTimeout))
+ 	{
+ 		elog(DEBUG1, "NextCheckpointRequested");
+ 		checkpoint_requested = true;
+ 		return true;
+ 	}
+ 
+ 	/* Process reload signals. */
+ 	if (got_SIGHUP)
+ 	{
+ 		got_SIGHUP = false;
+ 		ProcessConfigFile(PGC_SIGHUP);
+ 	}
+ 
+ 	/* Check for archive_timeout and nap for the configured time. */
+ 	CheckArchiveTimeout();
+ 
+ 	return false;
+ }
+ 
+ /*
+  * GetCheckpointTargetProgress -- progress of the current checkpoint in range 0-100%
+  */
+ static double
+ GetCheckpointTargetProgress(void)
+ {
+ 	struct timeval	now;
+ 	XLogRecPtr		recptr;
+ 	double			progress_in_time,
+ 					progress_in_xlog;
+ 	double			percent;
+ 
+ 	Assert(ckpt_active);
+ 
+ 	/* coordinate the progress with checkpoint_timeout */
+ 	gettimeofday(&now, NULL);
+ 	progress_in_time = ((double) (now.tv_sec - ckpt_start_time) +
+ 		now.tv_usec / 1000000.0) / CheckPointTimeout;
+ 
+ 	/* coordinate the progress with checkpoint_segments */
+ 	recptr = GetInsertRecPtr();
+ 	progress_in_xlog =
+ 		(((double) recptr.xlogid - (double) ckpt_start_recptr.xlogid) * XLogSegsPerFile +
+ 		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+ 		CheckPointSegments;
+ 
+ 	percent = 100.0 * Max(progress_in_time, progress_in_xlog);
+ 	if (percent > 100.0)
+ 		percent = 100.0;
+ 
+ 	elog(DEBUG2, "GetCheckpointTargetProgress: time=%.3f, xlog=%.3f",
+ 		progress_in_time, progress_in_xlog);
+ 
+ 	return percent;
+ }
+ 
  
  /* --------------------------------
   *		signal handler routines
*************** RequestCheckpoint(bool waitforit, bool w
*** 668,673 ****
--- 879,886 ----
  	/* Set warning request flag if appropriate */
  	if (warnontime)
  		bgs->ckpt_time_warn = true;
+ 	if (waitforit)
+ 		bgs->ckpt_force = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
diff -cpr HEAD/src/backend/storage/buffer/bufmgr.c LDC_V41/src/backend/storage/buffer/bufmgr.c
*** HEAD/src/backend/storage/buffer/bufmgr.c	Sat Mar 31 03:34:55 2007
--- LDC_V41/src/backend/storage/buffer/bufmgr.c	Wed Apr 25 15:07:11 2007
*************** UnpinBuffer(volatile BufferDesc *buf, bo
*** 947,957 ****
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(void)
  {
  	int			buf_id;
  	int			num_to_scan;
  	int			absorb_counter;
  
  	/*
  	 * Find out where to start the circular scan.
--- 947,960 ----
   * This is called at checkpoint time to write out all dirty shared buffers.
   */
  void
! BufferSync(bool immediate)
  {
  	int			buf_id;
  	int			num_to_scan;
+ 	int			num_written;
  	int			absorb_counter;
+ 	int			writes_per_nap = (bgwriter_all_maxpages > 0 ?
+ 					bgwriter_all_maxpages : WRITES_PER_ABSORB);
  
  	/*
  	 * Find out where to start the circular scan.
*************** BufferSync(void)
*** 965,970 ****
--- 968,974 ----
  	 * Loop over all buffers.
  	 */
  	num_to_scan = NBuffers;
+ 	num_written = 0;
  	absorb_counter = WRITES_PER_ABSORB;
  	while (num_to_scan-- > 0)
  	{
*************** BufferSync(void)
*** 972,977 ****
--- 976,988 ----
  		{
  			BgWriterStats.m_buf_written_checkpoints++;
  
+ 			if (!immediate && ++num_written >= writes_per_nap)
+ 			{
+ 				num_written = 0;
+ 				CheckpointWriteDelay(
+ 					(double) (NBuffers - num_to_scan) / NBuffers);
+ 			}
+ 
  			/*
  			 * If in bgwriter, absorb pending fsync requests after each
  			 * WRITES_PER_ABSORB write operations, to prevent overflow of the
*************** void
*** 998,1004 ****
  BgBufferSync(void)
  {
  	static int	buf_id1 = 0;
- 	int			buf_id2;
  	int			num_to_scan;
  	int			num_written;
  
--- 1009,1014 ----
*************** BgBufferSync(void)
*** 1044,1049 ****
--- 1054,1072 ----
  		BgWriterStats.m_buf_written_all += num_written;
  	}
  
+ 	BgLruBufferSync();
+ }
+ 
+ /*
+  * BgLruBufferSync -- Write out some lru dirty buffers in the pool.
+  */
+ void
+ BgLruBufferSync(void)
+ {
+ 	int			buf_id2;
+ 	int			num_to_scan;
+ 	int			num_written;
+ 
  	/*
  	 * This loop considers only unpinned buffers close to the clock sweep
  	 * point.
*************** PrintBufferLeakWarning(Buffer buffer)
*** 1286,1295 ****
   * flushed.
   */
  void
! FlushBufferPool(void)
  {
! 	BufferSync();
! 	smgrsync();
  }
  
  
--- 1309,1324 ----
   * flushed.
   */
  void
! FlushBufferPool(bool immediate)
  {
! 	elog(DEBUG1, "CHECKPOINT: write phase");
! 	BufferSync(immediate || checkpoint_write_percent <= 0);
! 
! 	elog(DEBUG1, "CHECKPOINT: nap phase");
! 	CheckpointNapDelay(immediate ? 0 : checkpoint_nap_percent);
! 
! 	elog(DEBUG1, "CHECKPOINT: sync phase");
! 	smgrsync(immediate || checkpoint_sync_percent <= 0);
  }
  
  
diff -cpr HEAD/src/backend/storage/smgr/md.c LDC_V41/src/backend/storage/smgr/md.c
*** HEAD/src/backend/storage/smgr/md.c	Fri Apr 13 02:10:55 2007
--- LDC_V41/src/backend/storage/smgr/md.c	Wed Apr 25 15:07:11 2007
*************** mdimmedsync(SMgrRelation reln)
*** 863,875 ****
   *	mdsync() -- Sync previous writes to stable storage.
   */
  void
! mdsync(void)
  {
  	static bool mdsync_in_progress = false;
  
  	HASH_SEQ_STATUS hstat;
  	PendingOperationEntry *entry;
  	int			absorb_counter;
  
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
--- 863,878 ----
   *	mdsync() -- Sync previous writes to stable storage.
   */
  void
! mdsync(bool immediate)
  {
  	static bool mdsync_in_progress = false;
  
  	HASH_SEQ_STATUS hstat;
  	PendingOperationEntry *entry;
  	int			absorb_counter;
+ 	double		sync_percent = checkpoint_sync_percent;
+ 	double		progress = 0;	/* progress in bytes */
+ 	double		total = 0;		/* total filesize to fsync in bytes */
  
  	/*
  	 * This is only called during checkpoints, and checkpoints should only
*************** mdsync(void)
*** 910,922 ****
  	 * From a performance point of view it doesn't matter anyway, as this
  	 * path will never be taken in a system that's functioning normally.
  	 */
! 	if (mdsync_in_progress)
  	{
  		/* prior try failed, so update any stale cycle_ctr values */
  		hash_seq_init(&hstat, pendingOpsTable);
  		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  		{
! 			entry->cycle_ctr = mdsync_cycle_ctr;
  		}
  	}
  
--- 913,958 ----
  	 * From a performance point of view it doesn't matter anyway, as this
  	 * path will never be taken in a system that's functioning normally.
  	 */
! 	if (mdsync_in_progress || (enableFsync && !immediate))
  	{
  		/* prior try failed, so update any stale cycle_ctr values */
  		hash_seq_init(&hstat, pendingOpsTable);
  		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  		{
! 			if (mdsync_in_progress)
! 				entry->cycle_ctr = mdsync_cycle_ctr;
! 			else if (enableFsync && !immediate && !entry->canceled)
! 			{
! 				SMgrRelation	reln;
! 				MdfdVec		   *seg;
! 				long			len;
! 
! 				/* Sum up lengths of the files on non-immediate case. */
! 				reln = smgropen(entry->tag.rnode);
! 				seg = _mdfd_getseg(reln,
! 						entry->tag.segno * ((BlockNumber) RELSEG_SIZE),
! 						true, EXTENSION_RETURN_NULL);
! 				if (seg != NULL &&
! 					(len = FileSeek(seg->mdfd_vfd, 0, SEEK_END)) >= 0)
! 					total += len;
! 			}
! 		}
! 
! 		/*
! 		 * Cut down the duration of the phase if there is only small area
! 		 * to be fsync-ed.
! 		 */
! 		if (total > 0 && bgwriter_all_maxpages > 0)
! 		{
! 			double	flush_per_delay = bgwriter_all_maxpages * 10;
! 
! 			sync_percent = total / BLCKSZ * BgWriterDelay / flush_per_delay
! 				/ 1000.0 / CheckPointTimeout * 100.0;
! 
! 			if (sync_percent > checkpoint_sync_percent)
! 				sync_percent = checkpoint_sync_percent;
! 
! 			elog(DEBUG1, "CHECKPOINT: adjust checkpoint_sync_percent to %f", sync_percent);
  		}
  	}
  
*************** mdsync(void)
*** 931,936 ****
--- 967,974 ----
  	hash_seq_init(&hstat, pendingOpsTable);
  	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
  	{
+ 		long			seglen = -1;
+ 
  		/*
  		 * If the entry is new then don't process it this time.  Note that
  		 * "continue" bypasses the hash-remove call at the bottom of the loop.
*************** mdsync(void)
*** 1010,1016 ****
--- 1048,1058 ----
  								   false, EXTENSION_RETURN_NULL);
  				if (seg != NULL &&
  					FileSync(seg->mdfd_vfd) >= 0)
+ 				{
+ 					if (total > 0)
+ 						seglen = FileSeek(seg->mdfd_vfd, 0, SEEK_END);
  					break;		/* success; break out of retry loop */
+ 				}
  
  				/*
  				 * XXX is there any point in allowing more than one retry?
*************** mdsync(void)
*** 1054,1059 ****
--- 1096,1110 ----
  		if (hash_search(pendingOpsTable, &entry->tag,
  						HASH_REMOVE, NULL) == NULL)
  			elog(ERROR, "pendingOpsTable corrupted");
+ 
+ 		/*
+ 		 * Nap some seconds according to the file size.
+ 		 */
+ 		if (seglen > 0 && total > 0)
+ 		{
+ 			progress += seglen;
+ 			CheckpointSyncDelay(progress / total, sync_percent);
+ 		}
  	}	/* end loop over hashtable entries */
  
  	/* Flag successful completion of mdsync */
diff -cpr HEAD/src/backend/storage/smgr/smgr.c LDC_V41/src/backend/storage/smgr/smgr.c
*** HEAD/src/backend/storage/smgr/smgr.c	Sat Jan  6 07:19:39 2007
--- LDC_V41/src/backend/storage/smgr/smgr.c	Wed Apr 25 15:07:11 2007
***************
*** 21,26 ****
--- 21,27 ----
  #include "access/xlogutils.h"
  #include "commands/tablespace.h"
  #include "pgstat.h"
+ #include "postmaster/bgwriter.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "storage/ipc.h"
*************** typedef struct f_smgr
*** 57,63 ****
  	void		(*smgr_immedsync) (SMgrRelation reln);
  	void		(*smgr_commit) (void);	/* may be NULL */
  	void		(*smgr_abort) (void);	/* may be NULL */
! 	void		(*smgr_sync) (void);	/* may be NULL */
  } f_smgr;
  
  
--- 58,64 ----
  	void		(*smgr_immedsync) (SMgrRelation reln);
  	void		(*smgr_commit) (void);	/* may be NULL */
  	void		(*smgr_abort) (void);	/* may be NULL */
! 	void		(*smgr_sync) (bool immediate);	/* may be NULL */
  } f_smgr;
  
  
*************** smgrabort(void)
*** 781,794 ****
   *	smgrsync() -- Sync files to disk at checkpoint time.
   */
  void
! smgrsync(void)
  {
  	int			i;
  
  	for (i = 0; i < NSmgr; i++)
  	{
  		if (smgrsw[i].smgr_sync)
! 			(*(smgrsw[i].smgr_sync)) ();
  	}
  }
  
--- 782,795 ----
   *	smgrsync() -- Sync files to disk at checkpoint time.
   */
  void
! smgrsync(bool immediate)
  {
  	int			i;
  
  	for (i = 0; i < NSmgr; i++)
  	{
  		if (smgrsw[i].smgr_sync)
! 			(*(smgrsw[i].smgr_sync))(immediate);
  	}
  }
  
diff -cpr HEAD/src/backend/utils/misc/guc.c LDC_V41/src/backend/utils/misc/guc.c
*** HEAD/src/backend/utils/misc/guc.c	Sun Apr 22 12:52:40 2007
--- LDC_V41/src/backend/utils/misc/guc.c	Wed Apr 25 15:07:11 2007
*************** static struct config_real ConfigureNames
*** 1832,1837 ****
--- 1832,1864 ----
  		0.1, 0.0, 100.0, NULL, NULL
  	},
  
+ 	{
+ 		{"checkpoint_write_percent", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration percentage of write phase in checkpoints."),
+ 			NULL
+ 		},
+ 		&checkpoint_write_percent,
+ 		50.0, 0.0, 100.0, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"checkpoint_nap_percent", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration percentage between write and sync phases in checkpoints."),
+ 			NULL
+ 		},
+ 		&checkpoint_nap_percent,
+ 		10.0, 0.0, 100.0, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"checkpoint_sync_percent", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration percentage of sync phase in checkpoints."),
+ 			NULL
+ 		},
+ 		&checkpoint_sync_percent,
+ 		20.0, 0.0, 100.0, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL
diff -cpr HEAD/src/backend/utils/misc/postgresql.conf.sample LDC_V41/src/backend/utils/misc/postgresql.conf.sample
*** HEAD/src/backend/utils/misc/postgresql.conf.sample	Thu Apr 19 01:44:18 2007
--- LDC_V41/src/backend/utils/misc/postgresql.conf.sample	Wed Apr 25 15:07:11 2007
***************
*** 168,173 ****
--- 168,176 ----
  
  #checkpoint_segments = 3		# in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min		# range 30s-1h
+ #checkpoint_write_percent = 50.0		# duration percentage in write phase
+ #checkpoint_nap_percent = 10.0		# duration percentage between write and sync phases
+ #checkpoint_sync_percent = 20.0		# duration percentage in sync phase
  #checkpoint_warning = 30s		# 0 is off
  
  # - Archiving -
diff -cpr HEAD/src/include/access/xlog.h LDC_V41/src/include/access/xlog.h
*** HEAD/src/include/access/xlog.h	Sat Jan  6 07:19:51 2007
--- LDC_V41/src/include/access/xlog.h	Wed Apr 25 15:07:11 2007
*************** extern void InitXLOGAccess(void);
*** 165,170 ****
--- 165,171 ----
  extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  
  #endif   /* XLOG_H */
diff -cpr HEAD/src/include/postmaster/bgwriter.h LDC_V41/src/include/postmaster/bgwriter.h
*** HEAD/src/include/postmaster/bgwriter.h	Sat Jan  6 07:19:57 2007
--- LDC_V41/src/include/postmaster/bgwriter.h	Wed Apr 25 15:07:11 2007
***************
*** 20,29 ****
--- 20,35 ----
  extern int	BgWriterDelay;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
+ extern double	checkpoint_write_percent;
+ extern double	checkpoint_nap_percent;
+ extern double	checkpoint_sync_percent;
  
  extern void BackgroundWriterMain(void);
  
  extern void RequestCheckpoint(bool waitforit, bool warnontime);
+ extern void CheckpointWriteDelay(double progress);
+ extern void CheckpointNapDelay(double percent);
+ extern void CheckpointSyncDelay(double progress, double percent);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
diff -cpr HEAD/src/include/storage/bufmgr.h LDC_V41/src/include/storage/bufmgr.h
*** HEAD/src/include/storage/bufmgr.h	Sat Jan  6 07:19:57 2007
--- LDC_V41/src/include/storage/bufmgr.h	Wed Apr 25 15:07:11 2007
*************** extern char *ShowBufferUsage(void);
*** 125,131 ****
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(void);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
--- 125,131 ----
  extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
! extern void FlushBufferPool(bool immediate);
  extern BlockNumber BufferGetBlockNumber(Buffer buffer);
  extern BlockNumber RelationGetNumberOfBlocks(Relation relation);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
*************** extern void LockBufferForCleanup(Buffer 
*** 150,157 ****
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(void);
  extern void BgBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
  
--- 150,158 ----
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(bool immediate);
  extern void BgBufferSync(void);
+ extern void BgLruBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
  
diff -cpr HEAD/src/include/storage/smgr.h LDC_V41/src/include/storage/smgr.h
*** HEAD/src/include/storage/smgr.h	Thu Jan 18 01:25:01 2007
--- LDC_V41/src/include/storage/smgr.h	Wed Apr 25 15:07:11 2007
*************** extern void AtSubAbort_smgr(void);
*** 82,88 ****
  extern void PostPrepare_smgr(void);
  extern void smgrcommit(void);
  extern void smgrabort(void);
! extern void smgrsync(void);
  
  extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
--- 82,88 ----
  extern void PostPrepare_smgr(void);
  extern void smgrcommit(void);
  extern void smgrabort(void);
! extern void smgrsync(bool immediate);
  
  extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
*************** extern void mdwrite(SMgrRelation reln, B
*** 103,109 ****
  extern BlockNumber mdnblocks(SMgrRelation reln);
  extern void mdtruncate(SMgrRelation reln, BlockNumber nblocks, bool isTemp);
  extern void mdimmedsync(SMgrRelation reln);
! extern void mdsync(void);
  
  extern void RememberFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void ForgetRelationFsyncRequests(RelFileNode rnode);
--- 103,109 ----
  extern BlockNumber mdnblocks(SMgrRelation reln);
  extern void mdtruncate(SMgrRelation reln, BlockNumber nblocks, bool isTemp);
  extern void mdimmedsync(SMgrRelation reln);
! extern void mdsync(bool immediate);
  
  extern void RememberFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void ForgetRelationFsyncRequests(RelFileNode rnode);
#25Heikki Linnakangas
heikki@enterprisedb.com
In reply to: ITAGAKI Takahiro (#24)
Re: Load distributed checkpoint V4.1

ITAGAKI Takahiro wrote:

Here is an updated version of LDC patch (V4.1).
In this release, checkpoints finish quickly if there are only a few dirty pages
in the buffer pool, following the suggestion from Heikki. Thanks.

Excellent, thanks! I was just looking at the results from my test runs
with version 4. I'll kick off some more tests with this version.

If the last write phase finished more quickly than configured, the next
nap phase is also shortened at the same rate. For example, if we set
checkpoint_write_percent = 50% and the write phase actually finished
in 25% of the checkpoint time, the duration of the nap is adjusted to
checkpoint_nap_percent * 25% / 50%.

You mean checkpoint_nap_percent * 50%, I presume, where 50% =
(actual time spent in write phase)/(checkpoint_write_percent)? Sounds
good to me.

In the sync phase, we cut down the duration if there are only a few files
to fsync. We assume that we have storage whose throughput is at least
10 * bgwriter_all_maxpages (this is arguable). For example, when
bgwriter_delay=200ms and bgwriter_all_maxpages=5, we assume that
we can use 2MB/s of flush throughput (10 * 5 pages * 8kB / 200ms).
If there is 200MB of files to fsync, the duration of the sync phase is
cut down to 100 sec even if that is shorter than
checkpoint_sync_percent * checkpoint_timeout.

Sounds reasonable. 10 * bgwriter_all_maxpages is indeed quite arbitrary,
but it should be enough to eliminate ridiculously long waits if there's
very little work to do. Or we could do the same thing you did with the
nap phase, scaling down the time allocated for sync phase by the ratio
of (actual time spent in write phase)/(checkpoint_write_percent). Using
the same mechanism in nap and sync phases sounds like a good idea.

I use bgwriter_all_maxpages as something like a 'reserved bandwidth of
storage for bgwriter' here. If there is a better name for it, please rename it.

How about checkpoint_aggressiveness? Or checkpoint_throughput? I think
the correct metric is (k/M)bytes/sec, making it independent of
bgwriter_delay.

Do we want the same setting to be used for bgwriter_all_maxpages? I
don't think we have a reason to believe the same value is good for both.
In fact I think we should just get rid of bgwriter_all_* eventually, but
as Greg Smith pointed out we need more testing before we can do that :).

There's one more optimization I'd like to have. Checkpoint scans through
*all* dirty buffers and writes them out. However, some of those dirty
buffers might have been dirtied *after* the start of the checkpoint, and
flushing them is a waste of I/O if they get dirtied again before the
next checkpoint. Even if they don't, it seems better to not force them
to disk at checkpoint, checkpoint is heavy enough without any extra I/O.
It didn't make much difference without LDC, because we tried to complete
the writes as soon as possible so there wasn't a big window for that to
happen, but now that we spread out the writes it makes a lot of sense. I
wrote a quick & dirty patch to implement that, and at least in my test
case it does make some difference.

Here's results of some tests I ran with LDC v4.0:

http://community.enterprisedb.com/ldc/

Imola-164 is a baseline run with CVS HEAD, with
bgwriter_all_maxpages and bgwriter_all_percent set to zero. I've
disabled think times in the test to make the checkpoint problem more
severe. Imola-162 is the same test with LDC patch applied. In Imola-163,
bgwriter_all_maxpages was set to 10. These runs show that the patch
clearly works; the response times during a checkpoint are much better.
Imola-163 is even better, which demonstrates that using
WRITES_PER_ABSORB (1000) in the absence of bgwriter_all_maxpages isn't a
good idea.

Imola-165 is the same as imola-163, but it has the optimization applied
I mentioned above. Only those dirty pages are written that are necessary
for a coherent checkpoint. The results look roughly the same, except
that imola-165 achieves a slightly higher total TPM, and the pits in the
TPM graph are slightly shallower.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com