Load distributed checkpoint

Started by ITAGAKI Takahiro · 109 messages
#1ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp

This is a proposal for load distributed checkpoint.
(It is presented on postgresql anniversary summit in last summer.)

We often encounter a performance drop during checkpoints. The reason is write
bursts: storage devices are overworked during a checkpoint, so they cannot
keep up with normal transaction processing.

A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step, in which all dirty buffers are written with no pause.

1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file

I suggest writing pages with sleeps in the 2nd step, using the normal activity
of the background writer. It is something like cost-based vacuum delay. The
background writer has two pointers, 'ALL' and 'LRU', indicating where to
write out in the buffer pool. We can wait for the ALL clock-hand to go all
the way around to guarantee that every page has been written.

Here is pseudo-code for the proposed method. The internal loop is just the
same as the bgwriter's activity.

PrepareCheckPoint();                 -- do step 1
Reset num_of_scanned_pages by ALL activity;
do {
    BgBufferSync();                  -- do a part of step 2
    sleep(bgwriter_delay);
} while (num_of_scanned_pages < shared_buffers);
CreateCheckPoint();                  -- do steps 3 and 4
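
[Editor's note: as a compilable illustration of the loop's shape, here is a
toy C version. It is not the actual patch: BgBufferSync() is replaced by a
stub, and all constants are invented.]

#include <stdio.h>
#include <unistd.h>

#define SHARED_BUFFERS    20000    /* hypothetical pool size, in pages */
#define PAGES_PER_ROUND     100    /* stand-in for one bgwriter round  */
#define BGWRITER_DELAY_US 200000   /* 200 ms, like bgwriter_delay      */

/* Stub for BgBufferSync(): scan/write one round of buffers and report
 * how far the ALL clock-hand advanced. */
static int bg_buffer_sync(void)
{
    return PAGES_PER_ROUND;
}

int main(void)
{
    int scanned = 0;

    /* Step 1: PrepareCheckPoint() -- record REDO pointer, next XID, etc. */

    /* Step 2, spread out: keep doing small bgwriter-style rounds with
     * sleeps in between until the ALL clock-hand has gone all the way
     * around, which guarantees every buffer was visited once. */
    while (scanned < SHARED_BUFFERS)
    {
        scanned += bg_buffer_sync();
        usleep(BGWRITER_DELAY_US);
    }
    printf("sweep complete after %d pages\n", scanned);

    /* Steps 3 and 4: CreateCheckPoint() -- fsync files, update pg_control. */
    return 0;
}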

We could instead accelerate the background writer to reduce the work left at
checkpoint time, but that introduces another performance problem: extra
pressure is constantly put on the storage devices to keep the number of dirty
pages low.

I'm working on automatically pacing the progress of a checkpoint against
checkpoint_timeout and the WAL segment limit, to avoid two checkpoints
overlapping. I'll post a patch sometime soon.

Comments and suggestions welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#2Takayuki Tsunakawa
tunakawa@soft.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Hello, Itagaki-san

This is a proposal for load distributed checkpoint.
We often encounter a performance drop during checkpoints. The reason is write
bursts: storage devices are overworked during a checkpoint, so they cannot
keep up with normal transaction processing.

Good! You are focusing on a very important problem. System designers
don't like unsteady performance -- sudden slowdowns. Commercial
database systems have made efforts to provide steady performance. I've
seen a report somewhere that Oracle provides stable throughput even
during checkpoints. I wonder how it is implemented.

I'm working on automatically pacing the progress of a checkpoint against
checkpoint_timeout and the WAL segment limit, to avoid two checkpoints
overlapping.

Have you already tried your patch? What's your first impression of the
improvement? I'm very interested. On my machine, pgbench shows 210
tps at first; however, it drops to 70 tps during checkpoints.

A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step, in which all dirty buffers are written with no pause.

1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file

Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing
backends.

As you know, PostgreSQL does not transfer data to disk when
write()ing. The actual transfer occurs when fsync()ing at checkpoints,
unless the filesystem cache runs short. So, the disk is overworked at
fsync()s.

Which bgwriter operations (shared locking of buffers, write(),
fsync(), flushing of the WAL log, etc.) do you suspect (or did you
detect) as disturbing which backend operations (exclusive locking
of buffers, putting log records into WAL buffers, flushing the log at
commit, writing dirty buffers when shared buffers run short)?

#3Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

On Thu, Dec 7, 2006 at 12:05 AM, in message
<20061207144843.6269.ITAGAKI.TAKAHIRO@oss.ntt.co.jp>, ITAGAKI Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

We often encounter a performance drop during checkpoints. The reason is write
bursts: storage devices are overworked during a checkpoint, so they cannot
keep up with normal transaction processing.

When we first switched our web site to PostgreSQL, this was one of our biggest problems. Queries which normally run in a few milliseconds were hitting the 20 second limit we impose in our web application. These were happening in bursts which suggested that they were caused by checkpoints. We adjusted the background writer configuration and nearly eliminated the problem.

bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20

Between the xfs caching and the battery-backed cache in the RAID controller, the disk writes seemed to settle out pretty well.

A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step, in which all dirty buffers are written with no pause.

1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file

I suggest writing pages with sleeps in the 2nd step, using the normal activity
of the background writer. It is something like cost-based vacuum delay. The
background writer has two pointers, 'ALL' and 'LRU', indicating where to
write out in the buffer pool. We can wait for the ALL clock-hand to go all
the way around to guarantee that every page has been written.

Here is pseudo-code for the proposed method. The internal loop is just the
same as the bgwriter's activity.

PrepareCheckPoint();                 -- do step 1
Reset num_of_scanned_pages by ALL activity;
do {
    BgBufferSync();                  -- do a part of step 2
    sleep(bgwriter_delay);
} while (num_of_scanned_pages < shared_buffers);
CreateCheckPoint();                  -- do steps 3 and 4

Would the background writer be disabled during this extended checkpoint? How is it better to concentrate step 2 in an extended checkpoint periodically rather than consistently in the background writer?

We could instead accelerate the background writer to reduce the work left at
checkpoint time, but that introduces another performance problem: extra
pressure is constantly put on the storage devices to keep the number of dirty
pages low.

Doesn't the file system caching logic combined with a battery-backed cache in the controller cover this, or is your patch to help out those who don't have a battery-backed controller cache? What would the impact of your patch be on environments like ours? Will there be any effect on PITR techniques, in terms of how current the copied WAL files would be?


I'm working on automatically pacing the progress of a checkpoint against
checkpoint_timeout and the WAL segment limit, to avoid two checkpoints
overlapping.
I'll post a patch sometime soon.

Comments and suggestions welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#4Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Takayuki Tsunakawa (#2)
Re: Load distributed checkpoint

Takayuki Tsunakawa wrote:

Hello, Itagaki-san

A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step, in which all dirty buffers are written with no pause.
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file

Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing
backends.

As you know, PostgreSQL does not transfer data to disk when
write()ing. The actual transfer occurs when fsync()ing at checkpoints,
unless the filesystem cache runs short. So, the disk is overworked at
fsync()s.

It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.

In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs
and dirty_writeback_centisecs and possibly dirty_background_ratio
would affect this. If those numbers are high, ISTM most write()s
from step 2 would wait for the flush in step 3. If I understand
correctly, if the dirty_expire_centisecs number is low, most write()s
from step 2 would happen before step 3 because of the pdflush daemons.
I expect other OS's would have different but similar knobs to tune this.
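
[Editor's note: for reference, here is a tiny standalone Linux-only program,
not from the thread, that prints these knobs by reading the standard /proc
paths; this is equivalent to sysctl vm.<name>.]

#include <stdio.h>

static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f != NULL)
    {
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("%s = %s", path, buf);  /* value includes its newline */
        fclose(f);
    }
}

int main(void)
{
    show("/proc/sys/vm/dirty_expire_centisecs");
    show("/proc/sys/vm/dirty_writeback_centisecs");
    show("/proc/sys/vm/dirty_background_ratio");
    return 0;
}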

It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.
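
[Editor's note: a rough sketch of that idea follows. It is only a sketch:
the function name, batch size, and error handling are invented, and nothing
like this exists in the backend.]

#include <unistd.h>

#define PAGE_SIZE   8192
#define FSYNC_BATCH  128   /* invented: pages written between forced syncs */

/* Write npages pages from "pages" to fd, fsync()ing every FSYNC_BATCH
 * pages so the data is forced out in small batches rather than in one
 * storm at the final flush.  Returns 0 on success, -1 on error. */
int write_with_interleaved_fsync(int fd, const char *pages, int npages)
{
    int i;

    for (i = 0; i < npages; i++)
    {
        if (write(fd, pages + (size_t) i * PAGE_SIZE, PAGE_SIZE) != PAGE_SIZE)
            return -1;
        if ((i + 1) % FSYNC_BATCH == 0 && fsync(fd) != 0)
            return -1;
    }
    return fsync(fd);       /* catch the final partial batch */
}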

#5Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Hello,

As Mr. Mayer points out, which of step 2 or 3 actually causes I/O
depends on the VM settings, and the amount of RAM available for file
system cache.

"Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote in message
news:45786549.2000602@cheapcomplexdevices.com...

It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.

I'm afraid it is difficult for system designers to achieve steady
throughput/response time as long as PostgreSQL depends on the
flushing of the file system cache. How does Oracle provide stable
performance?
Though I'm not sure, isn't the key to use O_SYNC so that write()s
transfer data to disk? That way, PostgreSQL completely controls the
timing of data transfer. Moreover, if possible, it's better to bypass
the file system cache, using flags such as O_DIRECT for open() on UNIX
and FILE_FLAG_NO_BUFFERING for CreateFile() on Windows. As far as
I know, SQL Server and Oracle do this. I think commercial DBMSs do
the same thing to control and anticipate I/O activity without
being influenced by VM policy.
If PostgreSQL is to use these, the writing of dirty buffers has to be
improved. To decrease the number of I/O operations, pages that are adjacent
on disk and also adjacent in memory should be written with one write().
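
[Editor's note: to make the flags concrete, here is a minimal Linux example.
It is an illustration only, not PostgreSQL code; the file name and sizes are
made up. O_SYNC makes write() wait until the data reaches the device;
O_DIRECT additionally bypasses the page cache, but requires aligned buffers.]

#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("datafile", O_RDWR | O_CREAT | O_SYNC | O_DIRECT, 0600);

    if (fd < 0)
        return 1;
    if (posix_memalign(&buf, 4096, 8192) != 0)  /* alignment for O_DIRECT */
        return 1;
    memset(buf, 0, 8192);

    /* With O_SYNC this write() does not return until the data (and the
     * metadata needed to retrieve it) is on stable storage. */
    if (write(fd, buf, 8192) != 8192)
        return 1;

    free(buf);
    close(fd);
    return 0;
}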

From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>

Would the background writer be disabled during this extended

checkpoint? How is it better to concentrate step 2 in an extended
checkpoint periodically rather than consistently in the background
writer?

Will there be any affect on PITR techniques, in terms of how current

the copied WAL files would be?

Extending the checkpoint can also extend downtime, to put it in an
extreme way. I understand that checkpoints occur during crash recovery
and PITR, so those operations would take longer. A checkpoint also
occurs at server shutdown. However, these cases could be distinguished,
and the undesirable extension avoided.

#6Greg Smith
gsmith@gregsmith.com
In reply to: Kevin Grittner (#3)
Re: Load distributed checkpoint

On Thu, 7 Dec 2006, Kevin Grittner wrote:

Between the xfs caching and the battery-backed cache in the RAID...

Mmmmm, battered cache. You can deep fry anything nowadays.

Would the background writer be disabled during this extended checkpoint?

The background writer is the same process that does the full buffer sweep
at checkpoint time. You wouldn't have to disable it because it would be
busy doing this extended checkpoint instead of its normal job.

How is it better to concentrate step 2 in an extended checkpoint
periodically rather than consistently in the background writer?

Right now, when the checkpoint flush is occurring, there is no background
writer active--that process is handling the checkpoint. Itagaki's
suggestion is basically to take the current checkpoint code, which runs
all in one burst, and spread it out over time. I like the concept, as
I've seen the behavior he's describing (even after tuning the background
writer as you suggest and doing the Linux disk tuning Ron describes), but
I think solving the problem is a little harder than suggested.

I have two concerns with the logic behind this approach. The first is
that if the background writer isn't keeping up with writing out all the
dirty pages, what makes you think that running the checkpoint with a
similar level of activity is going to? If your checkpoint is taking a
long time, it's because the background writer has an overwhelming load and
needs to be bailed out. Slowing down the writes with a lazier checkpoint
process introduces the possibility that you'll hit a second checkpoint
request before you're even finished cleaning up the first one, and then
you're really in trouble.

Second, the assumption here is that it's writing the dirty buffers out
that is the primary cause of the ugly slowdown. I too believe it could
just as easily be the fsync when it's done that's killing you, and slowing
down the writes isn't necessarily going to make that faster.

Doesn't the file system caching logic combined with a battery backed
cache in the controller cover this, or is your patch to help out those
who don't have battery backed controller cache?

Unless your shared buffer pool is so small that you can write it all out
onto the cache, that won't help much with this problem.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#7Greg Smith
gsmith@gregsmith.com
In reply to: Takayuki Tsunakawa (#5)
Re: Load distributed checkpoint

On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote:

Though I'm not sure, isn't it the key to use O_SYNC so that write()s
transfer data to disk?

If disk writes near checkpoint time aren't happening fast enough now, I
doubt forcing a sync after every write will make that better.

If PostgreSQL is to use these, the writing of dirty buffers has to be
improved. To decrease the number of I/O operations, pages that are adjacent
on disk and also adjacent in memory should be written with one write().

Sorting out which pages are next to one another on disk is one of the jobs
the file system cache does; bypassing it will then make all that
complicated sorting logic the job of the database engine. And unlike the
operating system, the engine doesn't even really know anything about the
filesystem or physical disks involved, so what are the odds it's going to
do a better job? That's the concern with assuming direct writes are the
solution here--you have to be smarter than the OS is for that to be an
improvement, and you have a lot less information to make your decisions
with than it does.

I like Itagaki's idea of some automatic tuning of the checkpoint timeout
and wal segment parameters to help out with checkpoint behavior; that's an
idea that would help a lot of people. I'm not so sure that trying to make
PostgreSQL take over operations that it's relying on the OS to handle
intelligently right now will be as helpful, and it's a big programming job
to even try.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#8ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Ron Mayer (#4)
Re: Load distributed checkpoint

Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:

1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file

Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing
backends.

It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.

if the dirty_expire_centisecs number is low, most write()s
from step 2 would happen before step 3 because of the pdflush daemons.

Exactly. It depends on the OS, kernel settings, and filesystem.
I tested the patch on Linux kernel 2.6.9-39, default settings, and ext3.
Maybe the pdflush daemons were aggressive enough to write out dirty buffers
in the kernel, so step 2 was the main cost and step 3 was not.

There are technical issues in distributing step 3. We can write buffers
on a page basis, which is granular enough. However, fsync() works on a file
basis (1 GB segments), so we can only control the granularity of fsync
roughly. sync_file_range (http://lwn.net/Articles/178199/) or similar
special APIs would help, but there are portability issues...
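
[Editor's note: a sketch of the call in question. sync_file_range() is
Linux-specific; the file name and range here are illustrative.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int rc;
    int fd = open("relation_segment", O_RDWR);

    if (fd < 0)
        return 1;

    /* Start writeback of the first 1 MB and wait for it to finish,
     * without forcing out the rest of the (up to 1 GB) segment. */
    rc = sync_file_range(fd, 0, 1024 * 1024,
                         SYNC_FILE_RANGE_WAIT_BEFORE |
                         SYNC_FILE_RANGE_WRITE |
                         SYNC_FILE_RANGE_WAIT_AFTER);

    close(fd);
    return rc == 0 ? 0 : 1;
}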

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#7)
Re: Load distributed checkpoint

Greg Smith <gsmith@gregsmith.com> writes:

On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote:

Though I'm not sure, isn't the key to use O_SYNC so that write()s
transfer data to disk?

If disk writes near checkpoint time aren't happening fast enough now, I
doubt forcing a sync after every write will make that better.

I think the idea would be to force the writes to actually occur, rather
than just being scheduled (and then forced en-masse by an fsync at
checkpoint time). Since the point of the bgwriter is to try to force
writes to occur *outside* checkpoint times, this seems to make sense.
I share your doubts about the value of slowing down checkpoints --- but
to the extent that bgwriter-issued writes are delayed by the kernel
until the next checkpoint, we are certainly not getting the desired
effect of leveling the write load.

To decrease the number of I/O operations, pages that are adjacent on disk
and also adjacent in memory should be written with one write().

Sorting out which pages are next to one another on disk is one of the jobs
the file system cache does; bypassing it will then make all that
complicated sorting logic the job of the database engine.

Indeed --- the knowledge that we don't know the physical layout has
always been the strongest argument against using O_SYNC in this way.
But I don't think anyone's made any serious tests. A properly tuned
bgwriter should be eating only a "background" level of I/O effort
between checkpoints, so maybe it doesn't matter too much if it's not
optimally scheduled.

regards, tom lane

#10ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Takayuki Tsunakawa (#5)
Re: Load distributed checkpoint

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

I'm afraid it is difficult for system designers to achieve steady
throughput/response time as long as PostgreSQL depends on the
flushing of the file system cache. How does Oracle provide stable
performance?
Though I'm not sure, isn't the key to use O_SYNC so that write()s
transfer data to disk?

AFAIK, other databases use write() and fsync() in combination. They call
fsync() immediately after they write buffers in small batches. Alternatively,
they use asynchronous and direct I/O options. Either way, the number of dirty
pages in kernel buffers is kept low at all times.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#11ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Kevin Grittner (#3)
Re: Load distributed checkpoint

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

We adjusted the background writer configuration and nearly eliminated
the problem.

bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20

Between the xfs caching and the battery-backed cache in the RAID
controller, the disk writes seemed to settle out pretty well.

Yes, a higher bgwriter_all_maxpages is better for stability; I have done
the same until now. However, if some process dirties a lot of buffers in a
short time (e.g. VACUUM), the bgwriter begins writing many pages and hurts
response time.

Load distributed checkpoint will let us set bgwriter_all_maxpages to a
lower value. It widens the bgwriter's tuning range.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#12Jim C. Nasby
jim@nasby.net
In reply to: Kevin Grittner (#3)
Re: Load distributed checkpoint

On Thu, Dec 07, 2006 at 10:03:05AM -0600, Kevin Grittner wrote:

When we first switched our web site to PostgreSQL, this was one of our biggest problems. Queries which normally run in a few milliseconds were hitting the 20 second limit we impose in our web application. These were happening in bursts which suggested that they were caused by checkpoints. We adjusted the background writer configuration and nearly eliminated the problem.

bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20

Bear in mind that bgwriter settings should be considered in conjunction
with the shared_buffers and checkpoint_timeout settings. For example, if you
have 60,000 shared buffers and a 300 second checkpoint interval, those
settings are going to be pretty aggressive.

Generally, I try to configure the all* settings so that you get one
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.
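
[Editor's note: a back-of-the-envelope illustration of that rule of thumb;
the arithmetic is the editor's, not from the thread. With bgwriter_delay =
200 ms, one full sweep per 300-second checkpoint_timeout works out to
roughly 0.067% of the pool per round.]

#include <stdio.h>

int main(void)
{
    double delay_s   = 0.2;     /* bgwriter_delay = 200 ms    */
    double timeout_s = 300.0;   /* checkpoint_timeout = 5 min */
    double rounds    = timeout_s / delay_s;   /* rounds per interval */

    /* Percentage of the buffer pool each round must scan so that the
     * whole pool is covered exactly once per checkpoint interval. */
    printf("bgwriter_all_percent = %.3f\n", 100.0 / rounds);
    return 0;
}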
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#13Martijn van Oosterhout
kleptog@svana.org
In reply to: ITAGAKI Takahiro (#10)
Re: Load distributed checkpoint

On Fri, Dec 08, 2006 at 02:22:14PM +0900, ITAGAKI Takahiro wrote:

AFAIK, other databases use write() and fsync() in combination. They call
fsync() immediately after they write buffers in small batches. Alternatively,
they use asynchronous and direct I/O options. Either way, the number of dirty
pages in kernel buffers is kept low at all times.

The "easy" solution I can think of is, when a session/backend is
exiting cleanly (client sent quit command), execute fsync() on some of
the descriptors before actually closing. At this point the user isn't
waiting anymore, so it can take its time.

The problem with fsync() remains that it can cause a write spike,
althoguh the more often you do it the less of an effect it would have.

A longer-term solution may be to create a daemon with system-specific
knowledge that monitors the load and tweaks parameters in response.
Not just postgresql parameters, but also system parameters. Even if it
never becomes part of postgresql, it would provide a way to test all
these "hunches" people have about optimising the system.

BTW, has anyone ever considered having the bgwriter do a NOTIFY
whenever it starts/ends a checkpoint, so clients could monitor the
activity without reading the logs?
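
[Editor's note: no such notification exists, but if it did, a client could
watch for it with libpq's ordinary async-notify machinery. A sketch: the
channel name "checkpoint" is hypothetical; the libpq calls are real.]

#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("");   /* connection params from environment */
    PGresult *res;
    PGnotify *note;

    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    res = PQexec(conn, "LISTEN checkpoint");   /* hypothetical channel */
    PQclear(res);

    for (;;)
    {
        int sock = PQsocket(conn);
        fd_set mask;

        FD_ZERO(&mask);
        FD_SET(sock, &mask);
        if (select(sock + 1, &mask, NULL, NULL, NULL) < 0)
            break;
        PQconsumeInput(conn);
        while ((note = PQnotifies(conn)) != NULL)
        {
            printf("notify from backend pid %d: %s\n",
                   note->be_pid, note->relname);
            PQfreemem(note);
        }
    }
    PQfinish(conn);
    return 0;
}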

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#14Inaam Rana
inaamrana@gmail.com
In reply to: Ron Mayer (#4)
Re: Load distributed checkpoint

On 12/7/06, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:

Takayuki Tsunakawa wrote:

Hello, Itagaki-san

A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step, in which all dirty buffers are written with no pause.
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file

Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing
backends.

As you know, PostgreSQL does not transfer data to disk when
write()ing. The actual transfer occurs when fsync()ing at checkpoints,
unless the filesystem cache runs short. So, the disk is overworked at
fsync()s.

It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.

In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs

dirty_expire_centisecs will have little, if any, effect on a box with a
consistent workload. Under uniform load the bgwriter keeps pushing buffers
into the fs cache, which results in eviction/flushing of pages to disk.
Quick aging of pages can lower the ceiling on dirty pages, but it cannot
handle the sudden spike at checkpoint time.

and dirty_writeback_centisecs

Again, on a system that chokes on I/O at checkpoints, pdflush is
presumably working like crazy at that time. Reducing the gap between its
wakeup calls will probably have very little impact on checkpoint
performance.

and possibly dirty_background_ratio

I have seen this put a real cap on the number of dirty pages during normal
running. As regards checkpoints, this again seems to have little effect.

The problem in dealing with checkpoints is that we are dealing with two
starkly different types of I/O load. The larger the shared_buffers setting,
the greater the spike in I/O activity at checkpoint. AFAICS no vm tunable
can smooth out checkpoint spikes by itself. There has to be some
intelligence in the bgwriter to even the load out.

would affect this. If those numbers are high, ISTM most write()s
from step 2 would wait for the flush in step 3. If I understand
correctly, if the dirty_expire_centisecs number is low, most write()s
from step 2 would happen before step 3 because of the pdflush daemons.
I expect other OS's would have different but similar knobs to tune this.

It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.


#15Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Jim C. Nasby (#12)
Re: Load distributed checkpoint

On Fri, Dec 8, 2006 at 1:13 AM, in message
<20061208071305.GG44124@nasby.net>, "Jim C. Nasby" <jim@nasby.net> wrote:

On Thu, Dec 07, 2006 at 10:03:05AM -0600, Kevin Grittner wrote:

We adjusted the background writer configuration
and nearly eliminated the problem.

bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20

Bear in mind that bgwriter settings should be considered in conjunction
with the shared_buffers and checkpoint_timeout settings. For example, if you
have 60,000 shared buffers and a 300 second checkpoint interval, those
settings are going to be pretty aggressive.

Generally, I try to configure the all* settings so that you get one
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.

We have 20,000 shared buffers and a 300 second checkpoint interval.

We got to these numbers somewhat scientifically. I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes per 200 ms without causing problems. I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.

Since my values were such a dramatic change from the default, I boosted
the production settings a little bit each day and looked for feedback
from our web team. Things improved with each incremental increase.
When I got to my calculated values (above) they reported that these
timeouts had dropped to an acceptable level -- a few per day on a
website with 2 million hits per day. We may benefit from further
adjustments, but since the problem is negligible with these settings,
there are bigger fish to fry at the moment.

By the way, if I remember correctly, these boxes have 256 MB battery
backed cache, while 20,000 buffers is 156.25 MB.

-Kevin

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#15)
Re: Load distributed checkpoint

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

"Jim C. Nasby" <jim@nasby.net> wrote:

Generally, I try to configure the all* settings so that you get one
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.

We got to these numbers somewhat scientifically. I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes per 200 ms without causing problems. I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.

I like Kevin's settings better than what Jim suggests. If the bgwriter
only makes one sweep between checkpoints then it's hardly going to make
any impact at all on the number of dirty buffers the checkpoint will
have to write. The point of the bgwriter is to reduce the checkpoint
I/O spike by doing writes between checkpoints, and to have any
meaningful impact on that, you'll need it to make the cycle several times.

Another point here is that you want checkpoints to be pretty far apart
to minimize the WAL load from full-page images. So again, a bgwriter
that's only making one loop per checkpoint is not gonna be doing much.

I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches? Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later). But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints. We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().

regards, tom lane

#17Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#16)
Re: Load distributed checkpoint

On Fri, Dec 8, 2006 at 10:43 AM, in message
<6439.1165596207@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches? Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later).

But wouldn't the ones written earlier stand a much better chance of
being scheduled for a physical write before the fsync? They aren't ALL
re-dirtied. In our environment, we seem to be getting a lot written
before the fsync with our current settings.

But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints. We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().

We don't have a way to force writes before the fsync, but early writes
to the file system encourage it. After reading this thread, I'm
tempted to nudge our settings a little higher -- especially the
percentages. How much overhead is there in checking whether buffer
pages are dirty?

-Kevin

#18Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#17)
Re: Load distributed checkpoint

On Fri, Dec 8, 2006 at 11:01 AM, in message
<4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:

I'm going to correct a previous statement, based on having just chatted
with a member of the web team.

The remaining dribble of these was only occurring on the Windows
machines. (We were running two servers from Windows and two from Linux
for a while to see how things compared.) The settings I previously
posted have completely eliminated the problem on Linux, and we are
moving all servers to Linux, so this has become a total non-issue for
us. I really encourage people to try some much more aggressive
background writer settings, and possibly try tuning the OS dirty write
settings.

Unless PostgreSQL gets much closer to the hardware than the community
consensus seems to support, I don't understand what you could do in the
checkpoint phase that would improve on that. (That, of course, doesn't
mean I'm not missing something, just that the arguments made so far
haven't shown me that the suggested changes would do anything but move
the problem around a little bit.)

-Kevin

#19Brad Nicholson
bnichols@ca.afilias.info
In reply to: Kevin Grittner (#18)
Re: Load distributed checkpoint

On Fri, 2006-12-08 at 11:55 -0600, Kevin Grittner wrote:

On Fri, Dec 8, 2006 at 11:01 AM, in message
<4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:

I'm going to correct a previous statement, based on having just chatted
with a member of the web team.

The remaining dribble of these was only occurring on the Windows
machines. (We were running two servers from Windows and two from Linux
for a while to see how things compared.) The settings I previously
posted have completely eliminated the problem on Linux, and we are
moving all servers to Linux, so this has become a total non-issue for
us. I really encourage people to try some much more aggressive
background writer settings, and possibly try tuning the OS dirty write
settings.

How much increased I/O usage have you seen in regular operation with
those settings?

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.

#20Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Brad Nicholson (#19)
Re: Load distributed checkpoint

On Fri, Dec 8, 2006 at 12:18 PM, in message
<1165601938.10248.56.camel@dba5.int.libertyrms.com>, Brad Nicholson
<bnichols@ca.afilias.info> wrote:

How much increased I/O usage have you seen in regular operation with
those settings?

We have not experienced any increase in I/O, just a smoothing. Keep in
mind that the file system cache will collapse repeated writes to the
same location until things settle, and the controller's cache also has a
chance of doing so. If we just push dirty pages out to the OS as soon
as possible, and let the file system do its job, I think we're in better
shape than if we try to micro-manage it within our buffer pages.

Your mileage may vary, of course, but I'm curious whether any real-world
production examples exist where this approach is a loser.

-Kevin

#21Inaam Rana
inaamrana@gmail.com
In reply to: Kevin Grittner (#18)
Re: Load distributed checkpoint

Unless PostgreSQL gets much closer to the hardware than the community
consensus seems to support, I don't understand what you could do in the
checkpoint phase that would improve on that. (That, of course, doesn't
mean I'm not missing something, just that the arguments made so far
haven't shown me that the suggested changes would do anything but move
the problem around a little bit.)

If we try out things like O_SYNC and/or O_DIRECT, that will give us
exact information about the dirty pages that need to be written to disk at
any given point in time. We can then use this information to design
lightweight checkpoints.

#22Simon Riggs
simon@2ndquadrant.com
In reply to: Takayuki Tsunakawa (#5)
Re: Load distributed checkpoint

On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:

I understand that checkpoints occur during crash
recovery and PITR, so time for those operations would get longer.

A restorepoint happens during recovery, not a checkpoint. The recovery
is merely repeating the work of the checkpoint that occurred in the
original WAL stream. Elongating the checkpoint would not have any effect
on a restorepoint: we only record the checkpoint when it is complete and
we only create a restorepoint when we see the checkpoint record.

Crash recovery and PITR use almost exactly the same code path (by
design), so there isn't anything special to say about PITR either.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#23ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Kevin Grittner (#20)
Re: Load distributed checkpoint

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

We have not experienced any increase in I/O, just a smoothing. Keep in
mind that the file system cache will collapse repeated writes to the
same location until things settle, and the controller's cache also has a
chance of doing so. If we just push dirty pages out to the OS as soon
as possible, and let the file system do its job, I think we're in better
shape than if we try to micro-manage it within our buffer pages.

Maybe we have two entirely different tuning approaches:
1. Retain dirty buffers in the database, and keep OS buffers clean.
2. Keep the database clean, and entrust the OS to manage dirty buffers.

I suggested the 1st one, and you did the 2nd. The bottleneck at checkpoint
differs between the approaches: write() is worse in the 1st, fsync() in
the 2nd.

Distributing write() is easier than distributing fsync(), because we can
issue write() on a page basis, but fsync() only on a file basis.
Also, the database has its own access-frequency information for its buffers,
so I think the 1st approach handles re-dirtying of buffers better.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#24Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Hello,

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

I'm afraid it is difficult for system designers to expect steady
throughput/response time, as long as PostgreSQL depends on the
flushing of file system cache. How does Oracle provide stable
performance?
Though I'm not sure, isn't it the key to use O_SYNC so that

write()s

transfer data to disk?

AFAIK, other databases use write() and fsync() in combination. They

call

fsync() immediately after they write buffers in some small batches.

Otherwise,

they uses asynchronous and direct I/O options. Therefore, dirty

pages in

kernel buffers are keeped to be low at any time.

Oracle seems to use O_SYNC. I've found a related page in the Oracle
manuals.

--------------------------------------------------
http://download-west.oracle.com/docs/cd/B19306_01/win.102/b15688/ap_unix.htm
Direct Writes to Disk
On both UNIX and Windows platforms, bypassing the file system buffer
cache ensures data is written to disk.
On UNIX, Oracle Database uses the O_SYNC flag to bypass the file
system buffer cache. The flag name depends on the UNIX port.
On Windows, Oracle Database bypasses the file system buffer cache
completely.
--------------------------------------------------

As Itagaki-san says, asynchronous + direct I/O provides the best
performance, I believe. Oracle supplies the combination as follows:

--------------------------------------------------
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref870
Oracle Database supports kernel asynchronous I/O. This feature is
disabled by default.
By default, the DISK_ASYNCH_IO initialization parameter in the
parameter file is set to TRUE to enable asynchronous I/O on raw
devices. To enable asynchronous I/O on file system files:
Ensure that all Oracle Database files are located on file systems that
support asynchronous I/O.
Set the FILESYSTEMIO_OPTIONS initialization parameter in the parameter
file to one of the following values:

Linux Distribution                 Recommended Value
SUSE Linux Enterprise Server 9     SETALL
Other distributions                ASYNCH
--------------------------------------------------

I believe SQL Server also uses direct + asynchronous I/O, because
Microsoft recommends in MSDN that this combination, plus appropriate
multi-threading, provides the best performance.

I tested Oracle9i on RHEL 2.1. I straced DBWn (the database writer, which
is like the bgwriter of PostgreSQL) while creating tables, indexes, etc.
and shutting down the database server. Oracle does use O_SYNC, as shown
below, but it doesn't use fsync().

24462 open("/work4/ora/tuna/users01.dbf", O_RDWR|O_SYNC|O_LARGEFILE) = 16

I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
referring to DB2?

#25Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Mr. Riggs,

Thank you for teaching me the following. I seem to have misunderstood.
I'll learn more.

From: "Simon Riggs" <simon@2ndquadrant.com>

On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:

I understand that checkpoints occur during crash
recovery and PITR, so time for those operations would get longer.

A restorepoint happens during recovery, not a checkpoint. The

recovery

is merely repeating the work of the checkpoint that occurred in the
original WAL stream. Elongating the checkpoint would not have any

effect

on a restorepoint: we only record the checkpoint when it is complete

and

we only create a restorepoint when we see the checkpoint record.

Regards,


#26Inaam Rana
inaamrana@gmail.com
In reply to: Takayuki Tsunakawa (#24)
Re: Load distributed checkpoint

I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
referring to DB2?

DB2 also opens data files with the O_SYNC option, and its page_cleaners
(the counterparts of the bgwriter) exploit AIO if available on the system.

Inaam Rana
EnterpriseDB http://www.enterprisedb.com

#27Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: ITAGAKI Takahiro (#23)
Re: Load distributed checkpoint

ITAGAKI Takahiro wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

...the file system cache will collapse repeated writes to the
same location until things settle ...
If we just push dirty pages out to the OS as soon as possible,
and let the file system do its job, I think we're in better
shape...

Maybe we have two entirely different tuning approaches:
1. Retain dirty buffers in the database, and keep OS buffers clean.
2. Keep the database clean, and entrust the OS to manage dirty buffers.

I suggested the 1st one, and you did the 2nd. The bottleneck at checkpoint
differs between the approaches: write() is worse in the 1st, fsync() in the 2nd.

The fsync() won't necessarily be worse in the second approach. OSes have
quite a few tunable parameters that can encourage the system to physically
write the pending write()s before the fsync() - either in the background
or by the process doing the write() itself when there are too many
dirty pages.

For checkpoints, I think the main question is whether postgresql's
background writer is smarter or less smart than pdflush or the
equivalent on your system for database workloads.

Also, the database has its own access-frequency information for its buffers,
so I think the 1st approach handles re-dirtying of buffers better.

I'm curious what access-frequency info the OS and the database has.

One thing I do worry about is that both postgresql and the OS
may be delaying write()s in the hope of collapsing them
at the same time. If so, this would cause both to experience
bigger delays than expected, and make checkpoints worse.

I'm guessing that if you use approach 1 you might be better off
turning down the amount of buffering the OS does for
dirty pages - and if you use approach 2, you might be better
off turning down the delays that postgresql adds.

#28Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Ron Mayer (#27)
Re: Load distributed checkpoint

On Mon, Dec 11, 2006 at 3:31 PM, in message
<457DCE4D.7090402@cheapcomplexdevices.com>, Ron Mayer
<rm_pg@cheapcomplexdevices.com> wrote:

One thing I do worry about is that both postgresql and the OS
may be delaying write()s in the hope of collapsing them
at the same time. If so, this would cause both to experience
bigger delays than expected, and make checkpoints worse.

That is my concern. Letting 30 seconds worth of dirty pages accumulate
between checkpoints and then trying to smooth the writes within
checkpoint code seems like a doomed effort. Either we take over control
of everything, like many other products, and schedule it all ourselves,
or we push it out to the file system fairly quickly and let it do its
job. Let's not stand with one foot on the pier and one on the boat.

It is entirely possible that the current setup, where a delay of a
couple seconds may collapse several OS API calls, could have a
performance benefit; but, let's be clear that we're talking about saving
that overhead, not disk I/O.

-Kevin

#29Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Kevin Grittner (#28)
Re: Load distributed checkpoint

One thing I do worry about is that both postgresql and the OS
may be delaying write()s in the hope of collapsing them
at the same time. If so, this would cause both to experience
bigger delays than expected, and make checkpoints worse.

That is my concern. Letting 30 seconds worth of dirty pages accumulate
between checkpoints and then trying to smooth the writes within
checkpoint code seems like a doomed effort.

30 seconds??? You are supposed to avoid excessive checkpoints.
If you are reducing checkpoint_timeout to avoid the spike, you are imho
definitely tuning the wrong knob. You are supposed to increase
checkpoint_timeout as far as you can while still having an acceptable
recovery delay after a crash.

If you cannot afford a huge spike during a checkpoint, say every 30
minutes, you have to make the bgwriter more aggressive. This is generally
true for both variants, db-driven direct I/O and buffered filesystem I/O.

Andreas

#30Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Zeugswetter Andreas ADI SD (#29)
Re: Load distributed checkpoint

On Tue, Dec 12, 2006 at 3:22 AM, in message
<E1539E0ED7043848906A8FF995BDA5790198ECC8@m0143.s-mxs.net>, "Zeugswetter
Andreas ADI SD" <ZeugswetterA@spardat.at> wrote:

One thing I do worry about is that both postgresql and the OS
may be delaying write()s in the hope of collapsing them
at the same time. If so, this would cause both to experience
bigger delays than expected, and make checkpoints worse.

That is my concern. Letting 30 seconds worth of dirty pages accumulate
between checkpoints and then trying to smooth the writes within
checkpoint code seems like a doomed effort.

30 seconds??? You are supposed to avoid excessive checkpoints.
If you are reducing checkpoint_timeout to avoid the spike, you are imho
definitely tuning the wrong knob.

Sorry for the fuzzy language -- I was intending to describe a situation
where dirty pages accumulate by checkpoint time which would take 30
seconds to write to disk. We were into this situation (and worse) with
the default bgwriter settings.

you have to make the bgwriter more aggressive.

This is what I've been saying.

I've also been saying that if the PostgreSQL way is to let the file
system handle the caching and I/O scheduling, we should trust it to know
what to do with dirty pages, and not try to second-guess it. (Of course
there are knobs to tune the file system if needed.) Our checkpoint
performance issues went away when we went to settings which basically
never leave a dirty page hidden from the file system for more than two
seconds.

-Kevin

#31Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#16)
Re: Load distributed checkpoint

I have thought a while about this and I have some ideas.

Ideally, we would be able to trickle the sync of individual blocks
during the checkpoint, but we can't, because we rely on the kernel to
sync all dirty blocks that haven't made it to disk using fsync(). We
could trickle the fsync() calls, but that just extends the amount of
data we are writing that has been dirtied post-checkpoint. In an ideal
world, we would be able to fsync() only part of a file at a time, and
only those blocks that were dirtied pre-checkpoint, but I don't see that
happening anytime soon (and that is one reason why many commercial
databases bypass the kernel cache).

So, in the real world, one conclusion seems to be that our existing
method of tuning the background writer just isn't good enough for the
average user:

#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round

These settings control what the bgwriter does, but they do not clearly
relate to the checkpoint timing, which is the purpose of the bgwriter,
and they don't change during the checkpoint interval, which is also less
than ideal. If set too aggressively, the bgwriter writes too much, and if
set too low, the checkpoint does too much I/O.

We clearly need more bgwriter activity as the checkpoint approaches, and
behavior that is more auto-tuned, like many of our other parameters. I think
we created these settings to see how they worked in the field, so it is
probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far it is to the next
checkpoint, and use that information to increase write activity.
Basically now, during a checkpoint, the bgwriter does a full buffer scan
and fsync's all dirty files, so it changes from the configuration
parameter-defined behavior right to 100% activity. I think it would be
ideal if we could ramp up the writes so that when it is 95% to the next
checkpoint, it can be operating at 95% of the activity it would do
during a checkpoint.
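
[Editor's note: a toy calculation to make the ramp concrete. This is the
editor's sketch; the quota numbers are invented and this is not how the
bgwriter is actually driven.]

#include <stdio.h>

#define BASE_MAXPAGES        5      /* invented quota just after a checkpoint */
#define CHECKPOINT_MAXPAGES  600    /* invented quota at checkpoint time      */

/* Scale the per-round write quota linearly with progress toward the
 * next checkpoint, so activity reaches checkpoint level as it nears. */
static int bgwriter_quota(double elapsed_s, double timeout_s)
{
    double progress = elapsed_s / timeout_s;

    if (progress > 1.0)
        progress = 1.0;
    return BASE_MAXPAGES +
           (int) (progress * (CHECKPOINT_MAXPAGES - BASE_MAXPAGES));
}

int main(void)
{
    /* 95% of the way to a 300-second checkpoint: ~95% of the peak rate. */
    printf("quota at 95%%: %d pages/round\n", bgwriter_quota(285.0, 300.0));
    return 0;
}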

My guess is if we can do that, we will have much smoother performance
because we have more WAL writes just after checkpoint for newly-dirtied
pages, and the new setup will give us more write activity just before
checkpoint.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
the kernel cache, so we are sure data will be on disk by checkpoint
time. This was avoided in the past because of the expense of
second-guessing the kernel disk I/O scheduling algorithms.

---------------------------------------------------------------------------

Tom Lane wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

"Jim C. Nasby" <jim@nasby.net> wrote:

Generally, I try and configure the all* settings so that you'll get 1
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.

We got to these numbers somewhat scientifically. I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes per 200 ms without causing problems. I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.

I like Kevin's settings better than what Jim suggests. If the bgwriter
only makes one sweep between checkpoints then it's hardly going to make
any impact at all on the number of dirty buffers the checkpoint will
have to write. The point of the bgwriter is to reduce the checkpoint
I/O spike by doing writes between checkpoints, and to have any
meaningful impact on that, you'll need it to make the cycle several times.

Another point here is that you want checkpoints to be pretty far apart
to minimize the WAL load from full-page images. So again, a bgwriter
that's only making one loop per checkpoint is not gonna be doing much.

I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches? Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later). But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints. We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().

regards, tom lane


--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#32Gregory Stark
stark@enterprisedb.com
In reply to: Bruce Momjian (#31)
Re: Load distributed checkpoint

Tom Lane wrote:

I like Kevin's settings better than what Jim suggests. If the bgwriter
only makes one sweep between checkpoints then it's hardly going to make
any impact at all on the number of dirty buffers the checkpoint will
have to write. The point of the bgwriter is to reduce the checkpoint
I/O spike by doing writes between checkpoints, and to have any
meaningful impact on that, you'll need it to make the cycle several times.

Another point here is that you want checkpoints to be pretty far apart
to minimize the WAL load from full-page images. So again, a bgwriter
that's only making one loop per checkpoint is not gonna be doing much.

I missed the previous message but it sounds like you're operating under a
different set of assumptions than the original poster. If you do a single
sweep through all of the buffers *and sync them* then you've just finished a
checkpoint -- the *previous* checkpoint. Not the subsequent one.

That is, rather than trying to spread the load of the checkpoint out by
getting the writes into the kernel sooner while making no attempt to sync
them until checkpoint time, start the checkpoint as soon as the previous
checkpoint finishes, and dribble the blocks of the checkpoint out slowly
throughout an entire checkpoint cycle, syncing them immediately using
O_SYNC/O_DIRECT.

It's a fundamental shift in the idea of the bgwriter's purpose. Instead of
trying to suck I/O away from the subsequent checkpoint, it would be
responsible for all the I/O of the previous checkpoint, which would still
be in progress for the entire checkpoint_timeout. It would only complete
when the bgwriter had finished its one full sweep.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#32)
Re: Load distributed checkpoint

Gregory Stark <stark@enterprisedb.com> writes:

It's a fundamental shift in the idea of the bgwriter's purpose. Instead of
trying to suck I/O away from the subsequent checkpoint, it would be
responsible for all the I/O of the previous checkpoint, which would still
be in progress for the entire checkpoint_timeout. It would only complete
when the bgwriter had finished its one full sweep.

I think that's basically the same as the original suggestion, which is
to do checkpoints using less than the full I/O bandwidth of the machine
--- tying it exactly to the default bgwriter rate may or may not be
appropriate.

The difficulty with such schemes is that if you go over to using O_DIRECT
writes instead of fsync in the bgwriter, it's hard to avoid doing the
same when a random backend has to write a dirty buffer --- yet you'd
really rather that such a backend not have to wait for the ensuing I/O.
And this gets a lot worse if checkpoints are slowed down, because it gets
more likely that the bufmgr will run out of clean buffers and have to do
a write() from a backend.

regards, tom lane

#34Jim C. Nasby
jim@nasby.net
In reply to: Tom Lane (#16)
Re: Load distributed checkpoint

On Fri, Dec 08, 2006 at 11:43:27AM -0500, Tom Lane wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

"Jim C. Nasby" <jim@nasby.net> wrote:

Generally, I try to configure the all* settings so that you get one
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.

We got to these numbers somewhat scientifically. I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes per 200 ms without causing problems. I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.

I like Kevin's settings better than what Jim suggests. If the bgwriter
only makes one sweep between checkpoints then it's hardly going to make
any impact at all on the number of dirty buffers the checkpoint will
have to write. The point of the bgwriter is to reduce the checkpoint
I/O spike by doing writes between checkpoints, and to have any
meaningful impact on that, you'll need it to make the cycle several times.

It would be good if the docs included more detailed info on how exactly
the bgwriter goes about flushing stuff to disk. You can certainly read
them and think that the bgwriter just goes through and issues writes for
any dirty buffers it finds. Though, looking at BgBufferSync, I think it
actually does write out pages during the "all" scan, regardless of what
usage_count says.

I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches? Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later). But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints. We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().

Would the ramp-up of write activity push the kernel to actually write
stuff? My understanding is that most OSes have a time limit on how long
they'll let a write-request sit in cache, so ISTM a better way to smooth
out disk IO is to write things in a steady stream.

If the bgwriter takes the buffer access counter into account when
deciding what to write out, it might make sense to write more recently
accessed pages as checkpoint nears. The idea being that odds are good
those buffers are about to get flushed by BufferSync() anyway.
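
As a concrete rendering of that ramp-up idea (purely illustrative; PostgreSQL has no such function and the names are invented), the bgwriter's per-round scan quota could grow with the elapsed fraction of the checkpoint interval:

#include <time.h>

/* Scale the per-round quota from 1x to 4x of its base value as the next
 * checkpoint approaches; early writes are the most likely to be wasted
 * by re-dirtying, so most of the effort is deferred to late rounds. */
int
ramped_scan_quota(int base_quota, time_t last_ckpt_time, int ckpt_timeout)
{
    double frac = (double) (time(NULL) - last_ckpt_time) / ckpt_timeout;

    if (frac < 0.0)
        frac = 0.0;
    if (frac > 1.0)
        frac = 1.0;
    return (int) (base_quota * (1.0 + 3.0 * frac));
}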

Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track
of what buffer the bgwriter_all scan is looking at, which means that
it should remember where it was at the end of the last scan; yet it's
initialized to 0 every time BgBufferSync is called. Is there someplace
else that is remembering where the complete scan is leaving off when
bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan
in fact just keep re-scanning the beginning of the buffers?
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#35Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Hello,

From: "Jim C. Nasby" <jim@nasby.net>
Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track
of what buffer the bgwriter_all scan is looking at, which means that
it should remember where it was at the end of the last scan; yet it's
initialized to 0 every time BgBufferSync is called. Is there someplace
else that is remembering where the complete scan is leaving off when
bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan
in fact just keep re-scanning the beginning of the buffers?

No. BgBufferSync() correctly keeps track of the position to restart
scanning at. buf_id1 is not initialized to 0 every time BgBufferSync()
is called, because buf_id1 is a static local variable. Please see the
following code. It prints:

a=0
a=1
a=2

#include <stdio.h>

void func(void)
{
static int a = 0;

printf("a=%d\n", a);
a++;
}

int main(void)
{
func();
func();
func();

return 0;
}

#36Jim C. Nasby
jim@nasby.net
In reply to: Takayuki Tsunakawa (#35)
Re: Load distributed checkpoint

On Wed, Dec 13, 2006 at 06:27:38PM +0900, Takayuki Tsunakawa wrote:

No. BgBufferSync() correctly keeps track of the position to restart
scanning at. buf_id1 is not initialized to 0 every time BgBufferSync()
is called, because buf_id1 is a static local variable. Please see the
following code. It prints:

Doh, I completely missed the static part of the declaration. Thanks for
the C tutorial. :)
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#37Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
1 attachment(s)
Re: Load distributed checkpoint

Hello, Itagaki-san, all

Sorry for my long mail. I've had trouble sending this mail because
it's too long for pgsql-hackers to accept (I couldn't find out how large
a mail may be). So I'm trying to send it in several attempts.
Please see the attachment for the content.

Attachments:

mail_ckpt.zip (application/x-zip-compressed)
#38ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: ITAGAKI Takahiro (#1)
1 attachment(s)
Load distributed checkpoint patch

This is a patch for load distributed checkpoint discussed in
http://archives.postgresql.org/pgsql-hackers/2006-12/msg00337.php

Only write() calls are smoothed; fsync() calls are not.
The existing checkpoint method is called "immediate checkpoint" in the patch,
and the new method is called "asynchronous checkpoint".

I'm working on more robust control of checkpoint progress. A lot of WAL is
written just after a checkpoint because of full-page writes, and the volume
decreases gradually. Linear control seems not to be enough in some cases.
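
To make the pacing concrete, here is the scheduling rule from BgGetScanPages() in the attached patch, rendered as a standalone toy program (the numbers are only an example): the target position in the buffer pool is the larger of the elapsed-time fraction and the consumed-WAL fraction, inflated by a 10% safety margin and clamped to the pool size.

#include <stdio.h>

#define CHECKPOINT_MARGIN 0.1    /* safety margin, as in the patch */

static int
target_buffer_pos(int nbuffers, double progress_in_time,
                  double progress_in_xlog)
{
    double p = (progress_in_time > progress_in_xlog)
                ? progress_in_time : progress_in_xlog;
    int    pos = (int) (nbuffers * (1.0 + CHECKPOINT_MARGIN) * p);

    return (pos > nbuffers) ? nbuffers : pos;
}

int main(void)
{
    /* e.g. 40% of checkpoint_timeout elapsed, 55% of the WAL budget used */
    printf("%d\n", target_buffer_pos(100000, 0.40, 0.55));  /* prints 60500 */
    return 0;
}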

Comments, suggestions and testing are appreciated.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

checkpoint.patch (application/octet-stream)
Index: src/include/postmaster/bgwriter.h
===================================================================
*** src/include/postmaster/bgwriter.h	(HEAD)
--- src/include/postmaster/bgwriter.h	(modified)
*************** extern int	CheckPointWarning;
*** 23,29 ****
  
  extern void BackgroundWriterMain(void);
  
! extern void RequestCheckpoint(bool waitforit, bool warnontime);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 23,29 ----
  
  extern void BackgroundWriterMain(void);
  
! extern void RequestCheckpoint(bool waitforit);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
Index: src/include/access/xlog.h
===================================================================
*** src/include/access/xlog.h	(HEAD)
--- src/include/access/xlog.h	(modified)
*************** extern void BootStrapXLOG(void);
*** 162,170 ****
--- 162,172 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
+ extern bool PrepareCheckPoint(bool shutdown, bool force);
  extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  
  #endif   /* XLOG_H */
Index: src/include/storage/bufmgr.h
===================================================================
*** src/include/storage/bufmgr.h	(HEAD)
--- src/include/storage/bufmgr.h	(modified)
*************** extern void AbortBufferIO(void);
*** 151,157 ****
  
  extern void BufmgrCommit(void);
  extern void BufferSync(void);
! extern void BgBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
  
--- 151,157 ----
  
  extern void BufmgrCommit(void);
  extern void BufferSync(void);
! extern int	BgBufferSync(int num_to_scan);
  
  extern void AtProcExit_LocalBuffers(void);
  
Index: src/backend/commands/dbcommands.c
===================================================================
*** src/backend/commands/dbcommands.c	(HEAD)
--- src/backend/commands/dbcommands.c	(modified)
*************** createdb(const CreatedbStmt *stmt)
*** 504,510 ****
  		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
  		 * we can avoid this.
  		 */
! 		RequestCheckpoint(true, false);
  
  		/*
  		 * Close pg_database, but keep lock till commit (this is important to
--- 504,510 ----
  		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
  		 * we can avoid this.
  		 */
! 		RequestCheckpoint(true);
  
  		/*
  		 * Close pg_database, but keep lock till commit (this is important to
*************** dropdb(const char *dbname, bool missing_
*** 648,654 ****
  	 * open files, which would cause rmdir() to fail.
  	 */
  #ifdef WIN32
! 	RequestCheckpoint(true, false);
  #endif
  
  	/*
--- 648,654 ----
  	 * open files, which would cause rmdir() to fail.
  	 */
  #ifdef WIN32
! 	RequestCheckpoint(true);
  #endif
  
  	/*
Index: src/backend/postmaster/bgwriter.c
===================================================================
*** src/backend/postmaster/bgwriter.c	(HEAD)
--- src/backend/postmaster/bgwriter.c	(modified)
***************
*** 90,99 ****
   *	6. If ckpt_failed is different from the originally saved value,
   *	   assume request failed; otherwise it was definitely successful.
   *
!  * An additional field is ckpt_time_warn; this is also sig_atomic_t for
   * simplicity, but is only used as a boolean.  If a backend is requesting
!  * a checkpoint for which a checkpoints-too-close-together warning is
!  * reasonable, it should set this field TRUE just before sending the signal.
   *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
--- 90,101 ----
   *	6. If ckpt_failed is different from the originally saved value,
   *	   assume request failed; otherwise it was definitely successful.
   *
!  * An additional field is ckpt_immediate; this is also sig_atomic_t for
   * simplicity, but is only used as a boolean.  If a backend is requesting
!  * an immediate checkpoint, set this field TRUE just before sending the
!  * signal.  If so, the checkpoint is performed as soon as possible.
!  * Otherwise, the checkpoint is run asynchronously to avoid I/O impacts,
!  * and a checkpoints-too-close-together warning may be reported.
   *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
*************** typedef struct
*** 115,121 ****
  	sig_atomic_t ckpt_done;		/* advances when checkpoint done */
  	sig_atomic_t ckpt_failed;	/* advances when checkpoint fails */
  
! 	sig_atomic_t ckpt_time_warn;	/* warn if too soon since last ckpt? */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
--- 117,123 ----
  	sig_atomic_t ckpt_done;		/* advances when checkpoint done */
  	sig_atomic_t ckpt_failed;	/* advances when checkpoint fails */
  
! 	sig_atomic_t ckpt_immediate;	/* immediate checkpoint requested? */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
*************** static bool am_bg_writer = false;
*** 145,154 ****
--- 147,167 ----
  
  static bool ckpt_active = false;
  
+ /* Number of scanned pages during asynchronous checkpoint, otherwise -1. */
+ static int	ckpt_progress = -1;
+ 
+ /* WAL position at starting asynchronous checkpoint. */
+ static XLogRecPtr ckpt_recptr;
+ 
+ /* safety margin for smgrsync */
+ #define CHECKPOINT_MARGIN		0.1
+ 
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;
  
  
+ static void BgCreateCheckPoint(bool immediate);
+ static int	BgGetScanPages(void);
  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
  static void ReqCheckpointHandler(SIGNAL_ARGS);
*************** BackgroundWriterMain(void)
*** 285,290 ****
--- 298,306 ----
  			ckpt_active = false;
  		}
  
+ 		/* Discard prepared checkpoint */
+ 		ckpt_progress = -1;
+ 
  		/*
  		 * Now return to normal top-level context and clear ErrorContext for
  		 * next time.
*************** BackgroundWriterMain(void)
*** 327,333 ****
  	for (;;)
  	{
  		bool		do_checkpoint = false;
! 		bool		force_checkpoint = false;
  		time_t		now;
  		int			elapsed_secs;
  		long		udelay;
--- 343,349 ----
  	for (;;)
  	{
  		bool		do_checkpoint = false;
! 		bool		immediate = false;
  		time_t		now;
  		int			elapsed_secs;
  		long		udelay;
*************** BackgroundWriterMain(void)
*** 353,359 ****
  		{
  			checkpoint_requested = false;
  			do_checkpoint = true;
! 			force_checkpoint = true;
  		}
  		if (shutdown_requested)
  		{
--- 369,376 ----
  		{
  			checkpoint_requested = false;
  			do_checkpoint = true;
! 			immediate = BgWriterShmem->ckpt_immediate;
! 			BgWriterShmem->ckpt_immediate = false;
  		}
  		if (shutdown_requested)
  		{
*************** BackgroundWriterMain(void)
*** 385,423 ****
  		if (do_checkpoint)
  		{
  			/*
- 			 * We will warn if (a) too soon since last checkpoint (whatever
- 			 * caused it) and (b) somebody has set the ckpt_time_warn flag
- 			 * since the last checkpoint start.  Note in particular that this
- 			 * implementation will not generate warnings caused by
- 			 * CheckPointTimeout < CheckPointWarning.
- 			 */
- 			if (BgWriterShmem->ckpt_time_warn &&
- 				elapsed_secs < CheckPointWarning)
- 				ereport(LOG,
- 						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
- 								elapsed_secs),
- 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
- 			BgWriterShmem->ckpt_time_warn = false;
- 
- 			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
  			BgWriterShmem->ckpt_started++;
  
! 			CreateCheckPoint(false, force_checkpoint);
! 
! 			/*
! 			 * After any checkpoint, close all smgr files.	This is so we
! 			 * won't hang onto smgr references to deleted files indefinitely.
! 			 */
! 			smgrcloseall();
  
! 			/*
! 			 * Indicate checkpoint completion to any waiting backends.
! 			 */
! 			BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
! 			ckpt_active = false;
  
  			/*
  			 * Note we record the checkpoint start time not end time as
--- 402,457 ----
  		if (do_checkpoint)
  		{
  			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
  			BgWriterShmem->ckpt_started++;
  
! 			if (immediate)
! 			{
! 				/* Cancel previous asynchronous checkpoint if running */
! 				if (ckpt_progress >= 0)
! 				{
! 					elog(DEBUG1, "cancel previous asynchronous checkpoint");
! 					ckpt_progress = -1;
! 				}
! 				BgCreateCheckPoint(true);
! 			}
! 			else
! 			{
! 				/*
! 				 * We will warn if (a) too soon since last checkpoint
! 				 * (whatever caused it) and (b) the checkpoint is not forced.
! 				 * Note in particular that this implementation will not generate
! 				 * warnings caused by CheckPointTimeout < CheckPointWarning.
! 				 */
! 				if (elapsed_secs < CheckPointTimeout &&
! 					elapsed_secs < CheckPointWarning)
! 					ereport(LOG,
! 						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 								elapsed_secs),
! 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
  
! 				/* Finish previous asynchronous checkpoint if running. */
! 				if (ckpt_progress >= 0)
! 				{
! 					int	num_to_scan = NBuffers - ckpt_progress;
! 					if (num_to_scan > 0)
! 					{
! 						elog(DEBUG1, "finish previous checkpoint, writing %d buffers", num_to_scan);
! 						BgBufferSync(num_to_scan);
! 					}
! 					BgCreateCheckPoint(false);
! 				}
! 
! 				/* Start an asynchronous checkpoint */
! 				elog(DEBUG1, "prepare asynchronous checkpoint");
! 				if (PrepareCheckPoint(false, false))
! 				{
! 					ckpt_progress = 0;	/* >=0 means 'prepared' */
! 					ckpt_recptr = GetInsertRecPtr();
! 				}
! 			}
  
  			/*
  			 * Note we record the checkpoint start time not end time as
*************** BackgroundWriterMain(void)
*** 426,433 ****
  			 */
  			last_checkpoint_time = now;
  		}
  		else
! 			BgBufferSync();
  
  		/*
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
--- 460,482 ----
  			 */
  			last_checkpoint_time = now;
  		}
+ 		else if (ckpt_progress >= 0)
+ 		{
+ 			/* asynchronous checkpoint is running */
+ 			ckpt_progress += BgBufferSync(BgGetScanPages());
+ 
+ 			/*
+ 			 * If all dirty buffers have been written, we can continue
+ 			 * the rest of the checkpoint.
+ 			 */
+ 			if (ckpt_progress >= NBuffers)
+ 				BgCreateCheckPoint(false);
+ 		}
  		else
! 		{
! 			/* Normal activity, not in checkpoints */
! 			BgBufferSync(0);
! 		}
  
  		/*
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
*************** BackgroundWriterMain(void)
*** 506,511 ****
--- 555,627 ----
  	}
  }
  
+ /*
+  * BgCreateCheckPoint -- create checkpoint and close all smgr files.
+  */
+ static void
+ BgCreateCheckPoint(bool immediate)
+ {
+ 	elog(DEBUG1, "start %s checkpoint",
+ 		(immediate ? "immediate" : "asynchronous"));
+ 
+ 	ckpt_active = true;
+ 
+ 	CreateCheckPoint(false, immediate);
+ 
+ 	/*
+ 	 * After any checkpoint, close all smgr files.	This is so we
+ 	 * won't hang onto smgr references to deleted files indefinitely.
+ 	 */
+ 	smgrcloseall();
+ 
+ 	/*
+ 	 * Indicate checkpoint completion to any waiting backends.
+ 	 */
+ 	BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
+ 
+ 	ckpt_progress = -1;
+ 	ckpt_active = false;
+ }
+ 
+ /*
+  * BgGetScanPages -- Returns the number of pages recommended to be scanned.
+  */
+ static int
+ BgGetScanPages(void)
+ {
+ 	struct timeval	now;
+ 	XLogRecPtr		recptr;
+ 	double			progress_in_time,
+ 					progress_in_xlog;
+ 	int				pos;
+ 
+ 	/* coordinate the progress with checkpoint_timeout */
+ 	gettimeofday(&now, NULL);
+ 	progress_in_time = ((double)(int)(now.tv_sec - last_checkpoint_time) +
+ 						now.tv_usec / 1000000.0) / CheckPointTimeout;
+ 
+ 	/* coordinate the progress with checkpoint_segments */
+ 	recptr = GetInsertRecPtr();
+ 	progress_in_xlog =
+ 		((double) (recptr.xlogid - ckpt_recptr.xlogid) * XLogSegsPerFile +
+ 		 (double) (recptr.xrecoff - ckpt_recptr.xrecoff) / XLogSegSize) /
+ 		CheckPointSegments;
+ 
+ 	pos = (int) (NBuffers * (1.0 + CHECKPOINT_MARGIN) *
+ 					Max(progress_in_time, progress_in_xlog));
+ 	if (pos > NBuffers)
+ 		pos = NBuffers;
+ 
+ 	elog(DEBUG1, "checkpoint progress : %d->%d(%+d) (time=%.3f, xlog=%.3f)",
+ 		ckpt_progress, pos, pos - ckpt_progress,
+ 		progress_in_time, progress_in_xlog);
+ 
+ 	if (pos > ckpt_progress)
+ 		return pos - ckpt_progress;
+ 	else
+ 		return 0;
+ }
+ 
  
  /* --------------------------------
   *		signal handler routines
*************** BgWriterShmemInit(void)
*** 612,625 ****
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
!  * immediately.
!  *
!  * If warnontime is true, and it's "too soon" since the last checkpoint,
!  * the bgwriter will log a warning.  This should be true only for checkpoints
!  * caused due to xlog filling, else the warning will be misleading.
   */
  void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
--- 728,739 ----
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
!  * immediately.  If waitforit is false, the checkpoint was requested due
!  * to xlog filling, so the bgwriter will log a warning if it comes "too
!  * soon" after the last checkpoint; otherwise the warning would be misleading.
   */
  void
! RequestCheckpoint(bool waitforit)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
*************** RequestCheckpoint(bool waitforit, bool w
*** 642,650 ****
  		return;
  	}
  
! 	/* Set warning request flag if appropriate */
! 	if (warnontime)
! 		bgs->ckpt_time_warn = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
--- 756,764 ----
  		return;
  	}
  
! 	/* Set immediate request flag if appropriate */
! 	if (waitforit)
! 		bgs->ckpt_immediate = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
Index: src/backend/access/transam/xlog.c
===================================================================
*** src/backend/access/transam/xlog.c	(HEAD)
--- src/backend/access/transam/xlog.c	(modified)
***************
*** 43,48 ****
--- 43,49 ----
  #include "storage/fd.h"
  #include "storage/pmsignal.h"
  #include "storage/procarray.h"
+ #include "storage/smgr.h"
  #include "storage/spin.h"
  #include "utils/builtins.h"
  #include "utils/nabstime.h"
*************** static TransactionId recoveryStopXid;
*** 191,196 ****
--- 192,200 ----
  static time_t recoveryStopTime;
  static bool recoveryStopAfter;
  
+ /* checkpoint */
+ static CheckPoint checkPoint;
+ 
  /*
   * During normal operation, the only timeline we care about is ThisTimeLineID.
   * During recovery, however, things are more complicated.  To simplify life
*************** static void readRecoveryCommandFile(void
*** 469,475 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
--- 473,479 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool flush);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
*************** XLogWrite(XLogwrtRqst WriteRqst, bool fl
*** 1666,1672 ****
  						if (XLOG_DEBUG)
  							elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 						RequestCheckpoint(false, true);
  					}
  				}
  			}
--- 1670,1676 ----
  						if (XLOG_DEBUG)
  							elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 						RequestCheckpoint(false);
  					}
  				}
  			}
*************** GetRedoRecPtr(void)
*** 5343,5348 ****
--- 5347,5369 ----
  }
  
  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	XLogRecPtr		recptr;
+ 
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	INSERT_RECPTR(recptr, Insert, Insert->curridx);
+ 	SpinLockRelease(&xlogctl->info_lck);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
*************** ShutdownXLOG(int code, Datum arg)
*** 5417,5448 ****
  }
  
  /*
!  * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
! void
! CreateCheckPoint(bool shutdown, bool force)
  {
- 	CheckPoint	checkPoint;
- 	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecData rdata;
  	uint32		freespace;
- 	uint32		_logId;
- 	uint32		_logSeg;
- 	int			nsegsadded = 0;
- 	int			nsegsremoved = 0;
- 	int			nsegsrecycled = 0;
- 
- 	/*
- 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
- 	 * (This is just pro forma, since in the present system structure there is
- 	 * only one process that is allowed to issue checkpoints at any given
- 	 * time.)
- 	 */
- 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
  	/*
  	 * Use a critical section to force system panic if we have trouble.
--- 5438,5453 ----
  }
  
  /*
!  * PrepareCheckPoint -- prepare a checkpoint
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
! bool
! PrepareCheckPoint(bool shutdown, bool force)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	uint32		freespace;
  
  	/*
  	 * Use a critical section to force system panic if we have trouble.
*************** CreateCheckPoint(bool shutdown, bool for
*** 5501,5509 ****
  		{
  			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointStartLock);
- 			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
! 			return;
  		}
  	}
  
--- 5506,5514 ----
  		{
  			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointStartLock);
  			END_CRIT_SECTION();
! 			elog(DEBUG1, "checkpoint is not needed");
! 			return false;
  		}
  	}
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5552,5561 ****
  
  	LWLockRelease(CheckpointStartLock);
  
- 	if (!shutdown)
- 		ereport(DEBUG2,
- 				(errmsg("checkpoint starting")));
- 
  	/*
  	 * Get the other info we need for the checkpoint record.
  	 */
--- 5557,5562 ----
*************** CreateCheckPoint(bool shutdown, bool for
*** 5578,5583 ****
--- 5579,5624 ----
  							 &checkPoint.nextMulti,
  							 &checkPoint.nextMultiOffset);
  
+ 	END_CRIT_SECTION();
+ 
+ 	if (!shutdown)
+ 		ereport(DEBUG2,
+ 				(errmsg("checkpoint starting")));
+ 
+ 	return true;
+ }
+ 
+ /*
+  * CreateCheckPoint -- finish a checkpoint
+  */
+ void
+ CreateCheckPoint(bool shutdown, bool force)
+ {
+ 	XLogRecPtr	recptr;
+ 	XLogRecData rdata;
+ 	uint32		_logId;
+ 	uint32		_logSeg;
+ 	int			nsegsadded = 0;
+ 	int			nsegsremoved = 0;
+ 	int			nsegsrecycled = 0;
+ 
+ 	Assert(!shutdown || (shutdown && force));	/* must force on shutdown */
+ 
+ 	/* Do a force checkpoint if not prepared. (should not happen) */
+ 	if (!TransactionIdIsValid(checkPoint.nextXid))
+ 		force = true;
+ 
+ 	/*
+ 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
+ 	 * (This is just pro forma, since in the present system structure there is
+ 	 * only one process that is allowed to issue checkpoints at any given
+ 	 * time.)
+ 	 */
+ 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+ 
+ 	if (shutdown || force)
+ 		PrepareCheckPoint(shutdown, force);
+ 
  	/*
  	 * Having constructed the checkpoint record, ensure all shmem disk buffers
  	 * and commit-log buffers are flushed to disk.
*************** CreateCheckPoint(bool shutdown, bool for
*** 5586,5594 ****
  	 * complete the checkpoint, but there is no reason to force a system
  	 * panic. Accordingly, exit critical section while doing it.
  	 */
! 	END_CRIT_SECTION();
! 
! 	CheckPointGuts(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
--- 5627,5633 ----
  	 * complete the checkpoint, but there is no reason to force a system
  	 * panic. Accordingly, exit critical section while doing it.
  	 */
! 	CheckPointGuts(checkPoint.redo, force);
  
  	START_CRIT_SECTION();
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5684,5689 ****
--- 5723,5730 ----
  				(errmsg("checkpoint complete; %d transaction log file(s) added, %d removed, %d recycled",
  						nsegsadded, nsegsremoved, nsegsrecycled)));
  
+ 	checkPoint.nextXid = InvalidTransactionId;	/* done */
+ 
  	LWLockRelease(CheckpointLock);
  }
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5694,5705 ****
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool();			/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
--- 5735,5752 ----
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool flush)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 
! 	elog(DEBUG1, "CheckPointGuts %s flush", (flush ? "WITH" : "WITHOUT"));
! 
! 	if (flush)
! 		BufferSync();
! 	smgrsync();	/* performs all required fsyncs */
! 
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
*************** RecoveryRestartPoint(const CheckPoint *c
*** 5748,5754 ****
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
--- 5795,5801 ----
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, true);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
*************** pg_start_backup(PG_FUNCTION_ARGS)
*** 6167,6173 ****
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
  		 */
! 		RequestCheckpoint(true, false);
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
--- 6214,6220 ----
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
  		 */
! 		RequestCheckpoint(true);
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
*** src/backend/storage/buffer/bufmgr.c	(HEAD)
--- src/backend/storage/buffer/bufmgr.c	(modified)
*************** BufferSync(void)
*** 985,1005 ****
   * BgBufferSync -- Write out some dirty buffers in the pool.
   *
   * This is called periodically by the background writer process.
   */
! void
! BgBufferSync(void)
  {
  	static int	buf_id1 = 0;
  	int			buf_id2;
- 	int			num_to_scan;
  	int			num_written;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * To minimize work at checkpoint time, we want to try to keep all the
! 	 * buffers clean; this motivates a scan that proceeds sequentially through
  	 * all buffers.  But we are also charged with ensuring that buffers that
  	 * will be recycled soon are clean when needed; these buffers are the ones
  	 * just ahead of the StrategySyncStart point.  We make a separate scan
--- 985,1009 ----
   * BgBufferSync -- Write out some dirty buffers in the pool.
   *
   * This is called periodically by the background writer process.
+  * num_to_scan buffers are always scanned and written by the 'all' activity.
+  * Returns the number of buffers scanned by the 'all' activity.
   */
! int
! BgBufferSync(int num_to_scan)
  {
  	static int	buf_id1 = 0;
  	int			buf_id2;
  	int			num_written;
+ 	int			num_scanned = 0;
+ 
+ 	Assert(num_to_scan >= 0);
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * To distribute work at checkpoint time, we want to try to write buffers
! 	 * by increments; this motivates a scan that proceeds sequentially through
  	 * all buffers.  But we are also charged with ensuring that buffers that
  	 * will be recycled soon are clean when needed; these buffers are the ones
  	 * just ahead of the StrategySyncStart point.  We make a separate scan
*************** BgBufferSync(void)
*** 1007,1027 ****
  	 */
  
  	/*
! 	 * This loop runs over all buffers, including pinned ones.	The starting
! 	 * point advances through the buffer pool on successive calls.
  	 *
  	 * Note that we advance the static counter *before* trying to write. This
  	 * ensures that, if we have a persistent write failure on a dirty buffer,
  	 * we'll still be able to make progress writing other buffers. (The
  	 * bgwriter will catch the error and just call us again later.)
  	 */
! 	if (bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0)
  	{
! 		num_to_scan = (int) ((NBuffers * bgwriter_all_percent + 99) / 100);
! 		num_written = 0;
  
  		while (num_to_scan-- > 0)
  		{
  			if (++buf_id1 >= NBuffers)
  				buf_id1 = 0;
  			if (SyncOneBuffer(buf_id1, false))
--- 1011,1043 ----
  	 */
  
  	/*
! 	 * These two loops run over all buffers, including pinned ones.	The
! 	 * starting point advances through the buffer pool on successive calls.
  	 *
  	 * Note that we advance the static counter *before* trying to write. This
  	 * ensures that, if we have a persistent write failure on a dirty buffer,
  	 * we'll still be able to make progress writing other buffers. (The
  	 * bgwriter will catch the error and just call us again later.)
  	 */
! 
! 	num_written = 0;
! 	while (num_to_scan-- > 0)
  	{
! 		num_scanned += 1;
! 		if (++buf_id1 >= NBuffers)
! 			buf_id1 = 0;
! 		if (SyncOneBuffer(buf_id1, false))
! 			++num_written;
! 	}
! 
! 	if (bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > num_written)
! 	{
! 		num_to_scan = (int) ((NBuffers * bgwriter_all_percent + 99) / 100)
! 					- num_scanned;
  
  		while (num_to_scan-- > 0)
  		{
+ 			num_scanned += 1;
  			if (++buf_id1 >= NBuffers)
  				buf_id1 = 0;
  			if (SyncOneBuffer(buf_id1, false))
*************** BgBufferSync(void)
*** 1054,1059 ****
--- 1070,1077 ----
  				buf_id2 = 0;
  		}
  	}
+ 
+ 	return num_scanned;
  }
  
  /*
Index: src/backend/tcop/utility.c
===================================================================
*** src/backend/tcop/utility.c	(HEAD)
--- src/backend/tcop/utility.c	(modified)
*************** ProcessUtility(Node *parsetree,
*** 1055,1061 ****
  				ereport(ERROR,
  						(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  						 errmsg("must be superuser to do CHECKPOINT")));
! 			RequestCheckpoint(true, false);
  			break;
  
  		case T_ReindexStmt:
--- 1055,1061 ----
  				ereport(ERROR,
  						(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  						 errmsg("must be superuser to do CHECKPOINT")));
! 			RequestCheckpoint(true);
  			break;
  
  		case T_ReindexStmt:
#39ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

I performed some simple tests, and I'll show the results below.

(1) The default case
235 80 226 77 240
(2) No write case
242 250 244 253 280
(3) No checkpoint case
229 252 256 292 276
(4) No fsync() case
236 112 215 216 221
(5) No write by PostgreSQL, but fsync() by another program case
9 223 260 283 292
(6) case (5) + O_SYNC by write_fsync
97 114 126 112 125
(7) O_SYNC case
182 103 41 50 74

I posted a patch to PATCHES. Please try it out.
It does write() smoothly, but fsync() in a burst.
I suppose the result will be between (3) and (5).

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#40Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#38)
Re: Load distributed checkpoint patch

ITAGAKI Takahiro wrote:

This is a patch for load distributed checkpoint discussed in
http://archives.postgresql.org/pgsql-hackers/2006-12/msg00337.php

Only write() calls are smoothed; fsync() calls are not.
The existing checkpoint method is called "immediate checkpoint" in the patch,
and the new method is called "asynchronous checkpoint".

I'm working on more robust control of checkpoint progress. A lot of WAL is
written just after a checkpoint because of full-page writes, and the volume
decreases gradually. Linear control seems not to be enough in some cases.

Comments, suggestions and testing are appreciated.

OK, if I understand correctly, instead of doing a buffer scan, write(),
and fsync(), and recycling the WAL files at checkpoint time, you delay the
scan/write part with some delay. Do you use the same delay
autovacuum uses? As I remember, often the checkpoint is caused because
we are using the last WAL file. Doesn't this delay the creation of new
WAL files by renaming the old ones to higher numbers (we can't rename
them until the checkpoint is complete)?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#41ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#40)
Re: [PATCHES] Load distributed checkpoint patch

Bruce Momjian <bruce@momjian.us> wrote:

OK, if I understand correctly, instead of doing a buffer scan, write(),
and fsync(), and recycling the WAL files at checkpoint time, you delay the
scan/write part with some delay.

Exactly. The actual behavior of the checkpoint is not changed by the patch;
compared with existing checkpoints, it just takes longer in the scan/write part.

Do you use the same delay autovacuum uses?

What do you mean by 'the same delay'? Autovacuum does VACUUM, not CHECKPOINT.
If you mean cost-based delay, I think we cannot use it here. It's hard to
estimate how much a checkpoint is delayed by cost-based sleeping, but we should
finish an asynchronous checkpoint by the start of the next one. So I gave
priority to punctuality over load smoothing.

As I remember, often the checkpoint is caused because
we are using the last WAL file. Doesn't this delay the creation of new
WAL files by renaming the old ones to higher numbers (we can't rename
them until the checkpoint is complete)?

A checkpoint should be finished before the next one starts, so we need WAL
files for two checkpoints. It is the same as now.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#42Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Hello, Itagaki-san

I posted a patch to PATCHES. Please try it out.

Really!? I've just joined pgsql-patches. When did you post it,
yesterday? I couldn't find the patch on the following page, which
lists this month's mail to pgsql-patches:

http://archives.postgresql.org/pgsql-patches/2006-12/index.php

Could you send me the patch if it has not been registered on the
above page yet? I want to try the patch by all means, because smoothing
response times is very important if we want to recommend PostgreSQL
to system designers. I really hope your patch will be the real
solution.

It does write() smoothly, but fsync() in a burst.
I suppose the result will be between (3) and (5).

Hmm... I think some logical reasoning is needed to win the
understanding of community members (sorry if they
have already agreed). Excuse me for repeating myself, but I'm afraid
fsync() will turn evil at some point in some environments. Success in
one test environment is not real success.

#43Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Hello, Itagaki-san, all

I have to report a sad result. Your patch didn't work. Let's
consider the solution together. What you are addressing is very
important for the system designers in the real world -- smoothing
response time.

Recall that unpatched PostgreSQL showed the following tps's in case
(1) (i.e. with default bgwriter_* and checkpoint_* settings.)

235 80 226 77 240

The patched PostgreSQL showed the following tps's:

230 228 77 209 66

[disk usage]
The same tendency can be seen as with the unpatched PostgreSQL. That
is:
When the tps is low, the %util of the disk for data files is high, and
the %util of the disk for WAL is low. Why is transaction logging disturbed
by the cleaning and/or syncing activity? While the bgwriter is
fsync()ing, it does not lock any data structures that the transactions
want to access. Even though they share the same SCSI controller and
bus, they are different disks. The bandwidth does not appear to be
exhausted, since Ultra320 is said to offer 256 MB/s of bandwidth in
practice.
(Recall that WAL is on sdd and data files are on sde.)

Device:  rrqm/s  wrqm/s   r/s   w/s     rsec/s  wsec/s    rkB/s  wkB/s     avgrq-sz  avgqu-sz  await     svctm   %util
sdd      0.00    810.78   0.00  102.20  0.00    7306.99   0.00   3653.49   71.50     1.12      10.95     7.32    74.77
sde      0.00    25.35    0.00  6.19    0.00    252.30    0.00   126.15    40.77     0.50      81.32     5.94    3.67
sdd      0.00    884.20   0.00  126.00  0.00    8080.00   0.00   4040.00   64.13     1.26      10.00     7.11    89.64
sde      0.00    21.40    0.00  5.00    0.00    211.20    0.00   105.60    42.24     0.31      62.56     6.52    3.26
sdd      0.00    924.80   0.00  116.20  0.00    8326.40   0.00   4163.20   71.66     1.23      10.59     7.37    85.64
sde      0.00    27.60    0.00  26.60   0.00    433.60    0.00   216.80    16.30     4.24      159.29    2.44    6.50
sdd      0.00    721.20   0.00  102.40  0.00    6588.80   0.00   3294.40   64.34     0.99      9.71      7.07    72.40
sde      0.00    1446.80  0.00  101.60  0.00    20289.60  0.00   10144.80  199.70    1192.40   572.45    2.29    23.30
sdd      0.00    0.00     0.00  0.20    0.00    1.60      0.00   0.80      8.00      0.11      539.00    539.00  10.80
sde      0.00    0.00     0.00  452.10  0.00    0.00      0.00   0.00      0.00      3829.57   3715.83   2.22    100.22
sdd      0.00    0.00     0.00  0.00    0.00    0.00      0.00   0.00      0.00      0.00      0.00      0.00    0.00
sde      0.00    0.00     0.00  349.80  0.00    0.00      0.00   0.00      0.00      1745.52   8515.74   2.86    100.02
sdd      0.00    442.40   0.00  51.00   0.00    3948.80   0.00   1974.40   77.43     0.60      11.73     7.54    38.46
sde      0.00    2.80     0.00  184.00  0.00    25.60     0.00   12.80     0.14      277.52    12629.41  3.19    58.74
sdd      0.00    898.00   0.00  124.80  0.00    8182.40   0.00   4091.20   65.56     1.30      10.40     7.24    90.30
sde      0.00    19.20    0.00  3.80    0.00    184.00    0.00   92.00     48.42     0.24      62.11     14.11   5.36
sdd      0.00    842.28   0.00  109.02  0.00    7612.02   0.00   3806.01   69.82     1.33      12.26     8.35    91.02
sde      0.00    45.49    0.00  46.89   0.00    739.08    0.00   369.54    15.76     9.04      192.73    3.38    15.85
sdd      0.00    1198.41  0.00  71.51   0.00    10505.18  0.00   5252.59   146.90    128.19    99.76     13.48   96.43
sde      0.00    1357.77  0.00  199.80  0.00    19263.75  0.00   9631.87   96.41     2251.09   1179.42   2.39    47.81
sdd      0.00    0.00     0.00  7.20    0.00    0.00      0.00   0.00      0.00      203.87    5671.83   138.92  100.02
sde      0.00    0.00     0.00  409.60  0.00    0.00      0.00   0.00      0.00      3171.04   4779.83   2.44    100.02
sdd      0.00    0.00     0.00  17.80   0.00    0.00      0.00   0.00      0.00      137.87    10240.90  56.19   100.02
sde      0.00    0.00     0.00  240.60  0.00    0.00      0.00   0.00      0.00      1573.85   9815.29   4.16    100.02
sdd      0.00    109.80   0.00  35.40   0.00    1012.80   0.00   506.40    28.61     42.14     7974.47   27.86   98.64
sde      0.00    2.80     0.00  198.80  0.00    30.40     0.00   15.20     0.15      428.49    14474.39  4.30    85.56
sdd      0.00    466.20   0.00  62.80   0.00    4230.40   0.00   2115.20   67.36     0.59      9.49      6.79    42.62
sde      0.00    5.20     0.00  0.80    0.00    48.00     0.00   24.00     60.00     0.01      16.25     11.25   0.90
sdd      0.00    0.00     0.00  0.20    0.00    1.60      0.00   0.80      8.00      0.01      35.00     35.00   0.70
sde      0.00    0.00     0.00  0.00    0.00    0.00      0.00   0.00      0.00      0.00      0.00      0.00    0.00

I suspect that fsync() is the culprit, as I've feared. I'll show
you some interesting data. I collected the stack traces of backend
processes while a checkpoint is happening.

[bgwriter]
Oh, he is fsync()ing hard.

#0 0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1 0x00000000005742a1 in mdsync ()
#2 0x00000000005753d7 in smgrsync ()
#3 0x0000000000564d65 in FlushBufferPool ()
...

[some backends]
They are forced to wait for some lock pertaining to WAL when they try
to insert a log record.

#0 0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1 0x000000000054c9c6 in PGSemaphoreLock ()
#2 0x0000000000572374 in LWLockAcquire ()
#3 0x000000000046ab10 in XLogInsert ()

[some backends]
They are waiting for some lock when they commit.

#0 0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1 0x000000000054c9c6 in PGSemaphoreLock ()
#2 0x0000000000572374 in LWLockAcquire ()
#3 0x000000000046a746 in XLogFlush ()
#4 0x000000000046228e in RecordTransactionCommit ()

[the second culprit backend]
He is making other backends wait for the WALWRITE lock, because he is
performing a long operation - fdatasync() - while holding the WALWRITE
lock.
But I think he is also a victim; he is probably forced to wait by the
bgwriter's fsync() activity.

#0 0x0000003a629bfc59 in fdatasync () from /lib64/tls/libc.so.6
#1 0x0000000000469ac0 in issue_xlog_fsync ()
#2 0x000000000046a210 in XLogWrite ()
#3 0x000000000046a7df in XLogFlush ()
#4 0x000000000046228e in RecordTransactionCommit ()

The following stack trace sometimes appears as the second culprit
instead of the above:

#0 0x0000003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6
#1 0x000000000046530e in UpdateControlFile ()
#2 0x000000000046a05d in XLogWrite ()
#3 0x000000000046a7df in XLogFlush ()
#4 0x000000000046228e in RecordTransactionCommit ()

[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync(). We can't ignore what commercial databases
have done so far. The kernel does as much as it likes when PostgreSQL
asks it to fsync().
One question is the disk utilization. While the bgwriter is fsync()ing,
the %util of the WAL disk drops to almost 0, but the bandwidth of
Ultra320 SCSI does not appear to be fully used. Any idea?
So what should we do next?

#44Martijn van Oosterhout
kleptog@svana.org
In reply to: Takayuki Tsunakawa (#43)
Re: Load distributed checkpoint

On Wed, Dec 20, 2006 at 08:10:56PM +0900, Takayuki Tsunakawa wrote:

One question is the disk utilization. While the bgwriter is fsync()ing,
the %util of the WAL disk drops to almost 0, but the bandwidth of
Ultra320 SCSI does not appear to be fully used. Any idea?

That implies that fsyncing a datafile blocks fsyncing the WAL. That
seems terribly unlikely (although...). What OS/kernel/filesystem is
this? I note a sync bug in Linux for ext3 that may have relevance.

Have a nice day,

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#45Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#38)
Re: [PATCHES] Load distributed checkpoint patch

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>

Bruce Momjian <bruce@momjian.us> wrote:

Do you use the same delay autovacuum uses?

What do you mean by 'the same delay'? Autovacuum does VACUUM, not CHECKPOINT.
If you mean cost-based delay, I think we cannot use it here. It's hard to
estimate how much a checkpoint is delayed by cost-based sleeping, but we should
finish an asynchronous checkpoint by the start of the next one. So I gave
priority to punctuality over load smoothing.

I consider that smoothing the load (more meaningfully, response time)
has higher priority than checkpoint punctuality in a practical sense,
because the users of a system benefit from good, steady response and
give the system a good reputation. If the checkpoint processing is
not punctual, crash recovery will take longer. But which would
you give higher priority, the unlikely event (= a crash of the system) or
the likely event (= the system's peak hours)? I believe the latter should
be regarded. The system can write dirty buffers after the peak hours
pass. User experience should be taken much care of.

#46Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

That implies that fsyncing a datafile blocks fsyncing the WAL. That
seems terribly unlikely (although...). What OS/kernel/filesystem is
this? I note a sync bug in Linux for ext3 that may have relevance.

Oh, really? What bug? I've heard that ext3 reports wrong data to
iostat when it performs writes (the data is correct when performing
reads.)
My env is:

OS: RHEL 4.0 for AMD64/EM64T
kernel: 2.6.9-42.ELsmp
The file system is ext3.

Terribly unlikely? But I've seen that disk utilization pattern quite often.

#47Martijn van Oosterhout
kleptog@svana.org
In reply to: Takayuki Tsunakawa (#46)
Re: Load distributed checkpoint

On Wed, Dec 20, 2006 at 09:14:50PM +0900, Takayuki Tsunakawa wrote:

That implies that fsyncing a datafile blocks fsyncing the WAL. That
seems terribly unlikely (although...). What OS/kernel/filesystem is
this? I note a sync bug in Linux for ext3 that may have relevance.

Oh, really? What bug? I've heard that ext3 reports wrong data to
iostat when it performs writes (the data is correct when performing
reads.)

I was referring to this in the 2.6.6 changelog:

http://www.linuxhq.com/kernel/changelog/v2.6/6/index.html

ext3's fsync/fdatasync implementation is currently syncing the inode via a
full journal commit even if it was unaltered.

However, you're running a later version, so that's not it.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#48Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#41)
Re: [PATCHES] Load distributed checkpoint patch

ITAGAKI Takahiro wrote:

Bruce Momjian <bruce@momjian.us> wrote:

OK, if I understand correctly, instead of doing a buffer scan, write(),
and fsync(), and recycling the WAL files at checkpoint time, you delay the
scan/write part with some delay.

Exactly. The actual behavior of the checkpoint is not changed by the patch;
compared with existing checkpoints, it just takes longer in the scan/write part.

Do you use the same delay autovacuum uses?

Sorry, I meant bgwriter delay, not autovauum.

What do you mean by 'the same delay'? Autovacuum does VACUUM, not CHECKPOINT.
If you mean cost-based delay, I think we cannot use it here. It's hard to
estimate how much a checkpoint is delayed by cost-based sleeping, but we should
finish an asynchronous checkpoint by the start of the next one. So I gave
priority to punctuality over load smoothing.

OK.

As I remember, often the checkpoint is caused because
we are using the last WAL file. Doesn't this delay the creation of new
WAL files by renaming the old ones to higher numbers (we can't rename
them until the checkpoint is complete)?

A checkpoint should be finished before the next one starts, so we need WAL
files for two checkpoints. It is the same as now.

Ah, OK, so we already reserve a full set of WAL files while we are
waiting for the checkpoint to complete.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#49Inaam Rana
inaamrana@gmail.com
In reply to: Takayuki Tsunakawa (#43)
Re: Load distributed checkpoint

On 12/20/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync(). We can't ignore what commercial databases
have done so far. The kernel does as much as he likes when PostgreSQL
requests him to fsync().

I am new to the community and am very interested in the tests that you have
done. I am also working on resolving the sudden IO spikes at checkpoint
time. I agree with you that fsync() is the core issue here.

Being a new member I was wondering if someone on this list has done testing
with O_DIRECT and/or O_SYNC for datafiles as that seems to be the most
logical way of dealing with fsync() flood at checkpoint time. If so, I'll be
very interested in the results. As mentioned in this thread, a single
bgwriter with O_DIRECT will not be able to keep pace with the cleaning effort,
causing backend writes. I think (i.e. IMHO) multiple bgwriters and/or
async I/O with O_DIRECT can resolve this issue.

Talking of bgwriter_* parameters, I think we are missing a crucial internal
counter, i.e. the number of dirty pages. How much work the bgwriter has to do
at each wakeup call should be a function of the total buffers and the currently
dirty buffers. Relying on both of these values instead of just the one static
NBuffers should allow the bgwriter to adapt more quickly to workload changes
and ensure that not much work is accumulated for the checkpoint.
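
A hypothetical sizing rule along those lines (the names are invented; this is not PostgreSQL code) might interpolate the per-round quota between a floor and a ceiling according to the observed dirty fraction:

/* Let the bgwriter's per-round write quota grow with the current dirty
 * fraction instead of being a fixed share of the static NBuffers. */
int
bgwriter_round_quota(int nbuffers, int ndirty, int min_pages, int max_pages)
{
    double dirty_frac = (double) ndirty / (double) nbuffers;

    return min_pages + (int) ((max_pages - min_pages) * dirty_frac);
}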

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

#50Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

On 12/20/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync(). We can't ignore what commercial databases
have done so far. The kernel does as much as it likes when PostgreSQL
asks it to fsync().

From: Inaam Rana

I am new to the community and am very interested in the tests that you

have done. I am also working on resolving the sudden IO spikes at checkpoint
time. I agree with you that fsync() is the core issue here.

Thank you for understanding my bad English correctly. Yes, what I've been
insisting is that it is necessary to avoid fsync()/fdatasync() and to use
O_SYNC (plus O_DIRECT if supported on the target platform) to really
eliminate the big spikes.
In my mail, the following sentence contained a small mistake.

"I believe that the problem cannot be solved in a real sense by avoiding
fsync/fdatasync()."

The correct sentence is:

"I believe that the problem cannot be solved in a real sense without
avoiding fsync/fdatasync()."

Being a new member I was wondering if someone on this list has done testing
with O_DIRECT and/or O_SYNC for datafiles as that seems to be the most
logical way of dealing with fsync() flood at checkpoint time. If so, I'll be
very interested in the results.

Could you see the mail I sent on Dec 18? Its content was so long that I
zipped it and attached it to the mail. I performed the same test, simply
adding O_SYNC to open() in mdopen() and another function in md.c. I couldn't
succeed in running with O_DIRECT because O_DIRECT requires the shared buffers
to be aligned on a sector-size boundary. To perform the O_DIRECT test, a
little more modification is necessary where the shared buffers are allocated.
The result was bad, but that's just a starting point. We need some of the
improvements that commercial databases have made. I think some approaches
we should take are:

(1) two-checkpoint (described in Jim Gray's textbook "Transaction
Processing: Concepts and Techniques")
(2) what Oracle suggests in its manual (see my previous mails)
(3) write multiple contiguous buffers with one write() to decrease the count
of write() calls (see the sketch below)
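
Roughly, idea (3) could look like the following sketch (invented names; it batches a run of blocks that are contiguous on disk into one lseek() plus one vectored write instead of one write() per 8 kB block):

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ  8192
#define MAX_RUN 16              /* cap one run at 16 blocks */

ssize_t
write_contiguous_run(int fd, char *blocks[], int nblocks, off_t start_offset)
{
    struct iovec iov[MAX_RUN];
    int          n = (nblocks < MAX_RUN) ? nblocks : MAX_RUN;
    int          i;

    for (i = 0; i < n; i++)
    {
        iov[i].iov_base = blocks[i];
        iov[i].iov_len = BLCKSZ;
    }
    if (lseek(fd, start_offset, SEEK_SET) < 0)
        return -1;
    return writev(fd, iov, n);  /* one syscall instead of n write()s */
}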

As mentioned in this thread, a single bgwriter with O_DIRECT will not be able
to keep pace with the cleaning effort, causing backend writes. I think
(i.e. IMHO) multiple bgwriters and/or async I/O with O_DIRECT can resolve this
issue.

I agree with you. Oracle provides a parameter called DB_WRITER_PROCESSES to
set the number of database writer processes. Oracle also provides
asynchronous I/O to solve the problem you are describing. Please see
section 10.3.9 of the following page:

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14211/instance_tune.htm#sthref1049

Talking of bgwriter_* parameters, I think we are missing a crucial internal
counter, i.e. the number of dirty pages. How much work the bgwriter has to do
at each wakeup call should be a function of the total buffers and the currently
dirty buffers. Relying on both of these values instead of just the one static
NBuffers should allow the bgwriter to adapt more quickly to workload changes
and ensure that not much work is accumulated for the checkpoint.

I agree with you in the sense that the current bgwriter is a bit careless
about the system load. I believe that PostgreSQL should be more gentle to
OLTP transactions -- and, as a result, to the many users of the system. I
think the speed of WAL accumulation should also be taken into account. Let's
list the problems and ideas.

--

#51ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Takayuki Tsunakawa (#43)
Re: Load distributed checkpoint

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

I have to report a sad result. Your patch didn't work. Let's
consider the solution together. What you are addressing is very
important for the system designers in the real world -- smoothing
response time.

You were running the test on a very memory-dependent machine.

shared_buffers = 4GB / The scaling factor is 50, 800MB of data.

That would be why the patch did not work. I tested it with DBT-2, 10GB of
data and 2GB of memory. Storage is always the main determinant of performance
here, even outside checkpoints.

If you use Linux, it has very unpleasant behavior in fsync(): it locks all
metadata of the file being fsync-ed, so we have to wait for the completion of
the fsync when we do read(), write(), and even lseek().

Almost all of your data is in the accounts table, and it is stored in a single
file. All transactions must wait for the fsync of that single largest file,
so you saw the bottleneck in the fsync.

[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync().

I think so, too. However, I assume we can resolve part of the
checkpoint spikes by smoothing write() alone.

BTW, can we apply the same approach to fsync? We call fsync() on all modified
files without rest in mdsync(), but it's not difficult at all to insert
sleeps between fsync()s. Do you think that would help us? One of the issues is
that we have to sleep at file granularity, which may be rather coarse.
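
A minimal sketch of that fsync-with-pauses loop (not the real mdsync(); the names and error handling are simplified):

#include <stdio.h>
#include <unistd.h>

void
sync_files_with_pauses(const int *fds, int nfiles, unsigned int pause_usec)
{
    int i;

    for (i = 0; i < nfiles; i++)
    {
        if (fsync(fds[i]) != 0)
            perror("fsync");    /* report, then keep going */
        if (i + 1 < nfiles)
            usleep(pause_usec); /* rest between files; per-file
                                 * granularity may still be coarse */
    }
}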

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#52Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#39)
Re: Load distributed checkpoint

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>

You were running the test on a very memory-dependent machine.

shared_buffers = 4GB / The scaling factor is 50, 800MB of data.

That would be why the patch did not work. I tested it with DBT-2, 10GB of
data and 2GB of memory. Storage is always the main determinant of performance
here, even outside checkpoints.

Yes, I used half the size of RAM as the shared buffers, which is
reasonable. And I cached all the data. That makes the effect of fsync() an
even heavier offence, doesn't it? System administrators would say "I have
enough memory. The data hasn't exhausted the DB cache yet. But the
users complain to me about the response. Why? What should I do?
What? Checkpoint?? Why doesn't PostgreSQL take care of frontend
users?"
BTW, is DBT-2 an OLTP benchmark which randomly accesses some parts of the
data, or a batch application which accesses all the data? I'm not
familiar with it. I know that IPA makes it available to the public.

If you use Linux, it has very unpleasant behavior in fsync(): it locks all
metadata of the file being fsync'ed, so we have to wait for the completion
of the fsync when we do read(), write(), and even lseek().

Almost all of your data is in the accounts table, and it was stored in a
single file. All transactions must wait for the fsync to that single largest
file, so you saw the bottleneck in fsync.

Oh, really, what an evil fsync is! Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed. But why does a
backend which is syncing WAL/pg_control have to wait for the data file
to be synced? They are, needless to say, different files, and the WAL and
data files are stored on separate disks.

[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync().

I think so, too. However, I assume we can resolve a part of the
checkpoint spikes by smoothing write() alone.

First, what's the goal (numerically, if possible)? Have you explained
to the community members why the patch would help many people? At least,
I haven't heard that fsync() can be seriously bad and that we would close
our eyes to what fsync() does.
By the way, what results did you get with DBT-2? If you don't
mind, can you show us?

BTW, can we use the same approach for fsync? We call fsync() on all
modified files without rest in mdsync(), but it's not difficult at all to
insert sleeps between the fsync()s. Do you think that would help us? One of
the issues is that we would have to sleep per file, which may be too rough a
granularity.

No, it definitely won't help us. There is no reason why it should.
It might help in some limited environments, but then, how can we
characterize such environments? Can we say "our approach helps our
environments, but it won't help you. The kernel VM settings may help
you. Good luck!"?
We have to consider this seriously. I think it's time to face the problem,
and we should follow the approaches of experts like Jim Gray and the DBMS
vendors, unless we have a new clever idea like theirs.

#53ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Takayuki Tsunakawa (#52)
Re: Load distributed checkpoint

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

If you use Linux, it has very unpleasant behavior in fsync(): it locks all
metadata of the file being fsync'ed, so we have to wait for the completion
of the fsync when we do read(), write(), and even lseek().

Oh, really, what an evil fsync is! Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed. But why does a
backend which is syncing WAL/pg_control have to wait for the data file
to be synced? They are, needless to say, different files, and the WAL and
data files are stored on separate disks.

Backends call lseek() in planning, so they have to wait for fsync() on
the tables they will access. Even if all of the data in the file is in
the cache, lseek() conflicts with fsync(). You can see a lot of backends
waiting in the planning phase during checkpoints, not the execution phase.
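(To make the mechanism concrete: the planner's size estimate ultimately
reduces to an lseek() on each segment file, roughly as in this simplified
sketch -- not the actual mdnblocks() source:)

#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Sketch: the relation-size estimate asks the kernel for the file length.
 * On kernels that serialize this against an in-progress fsync() of the
 * same file, every (re)plan touching the table stalls at this call. */
static off_t
relation_size_in_blocks(int fd)
{
    off_t   len = lseek(fd, 0, SEEK_END);

    return (len < 0) ? 0 : len / BLCKSZ;
}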

it's not difficult at all to insert sleeps between the fsync()s.

Can we say "our approach helps our
environments, but it won't help you. The kernel VM settings may help
you. Good luck!"?

I didn't say such a thing at all.
There are several opinions in the discussion:
1. A high bgwriter setting is enough!
2. Change your OS :-)
3. Use O_SYNC or O_DIRECT, but with very poor performance.
4. We may settle for a single fsync(), but not many fsync()s in a short time.
I just suggested 4.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#54Martijn van Oosterhout
kleptog@svana.org
In reply to: ITAGAKI Takahiro (#53)
Re: Load distributed checkpoint

On Thu, Dec 21, 2006 at 06:46:36PM +0900, ITAGAKI Takahiro wrote:

Oh, really, what an evil fsync is! Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed. But why does a
backend which is syncing WAL/pg_control have to wait for the data file
to be synced? They are, needless to say, different files, and the WAL and
data files are stored on separate disks.

Backends call lseek() in planning, so they have to wait for fsync() on
the tables they will access. Even if all of the data in the file is in
the cache, lseek() conflicts with fsync(). You can see a lot of backends
waiting in the planning phase during checkpoints, not the execution phase.

Hmm, there are other ways to sync parts of a file. For example, doing an
mmap()/msync()/munmap() cycle to start an asynchronous flush. But given
what you're saying, that might suffer from the same problem.
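(A minimal sketch of that cycle, assuming a page-aligned offset; whether it
actually avoids the metadata lock is exactly the untested part:)

#include <sys/types.h>
#include <sys/mman.h>

/* Sketch: map a range of the file, ask the kernel to start writing it
 * back asynchronously, and unmap.  MS_ASYNC requests a flush of just
 * this range and returns immediately. */
static int
async_flush_range(int fd, off_t offset, size_t len)
{
    void   *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);

    if (p == MAP_FAILED)
        return -1;
    if (msync(p, len, MS_ASYNC) != 0)
    {
        (void) munmap(p, len);
        return -1;
    }
    return munmap(p, len);
}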

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#55Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#51)
Re: Load distributed checkpoint

To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
Cc: <pgsql-hackers@postgresql.org>
Sent: Thursday, December 21, 2006 6:46 PM
Subject: Re: [HACKERS] Load distributed checkpoint

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

Oh, really, what an evil fsync is! Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed. But why does

the

backend which is syncing WAL/pg_control have to wait for syncing

the

data file? They are, not to mention, different files, and WAL and
data files are stored on separate disks.

Backends call lseek() in planning, so they have to wait fsync() to
the table that they will access. Even if all of data in the file is

in

the cache, lseek() conflict with fsync(). You can see a lot of

backends

are waiting in planning phase in checkpoints, not executing phase.

I see. I found one backend like the following, but only one out of 16
backends in my case. Most of the others were waiting to acquire
WALWriteLock.

#0 0x0000003a629c6902 in __lseek_nocancel () from /lib64/tls/libc.so.6
#1 0x000000000056789f in FileSeek ()
#2 0x0000000000574053 in mdnblocks ()
#3 0x0000000000574f4a in smgrnblocks ()
#4 0x00000000005489e8 in estimate_rel_size ()
#5 0x0000000000548bee in get_relation_info ()
#6 0x000000000054aa3d in build_simple_rel ()
#7 0x0000000000539c6b in add_base_rels_to_query ()
#8 0x000000000053b955 in query_planner ()
#9 0x000000000053c1c9 in grouping_planner ()
#10 0x000000000053d3b4 in subquery_planner ()
#11 0x000000000053d5b3 in planner ()
#12 0x00000000005778fc in pg_plan_query ()
#13 0x000000000057798c in pg_plan_queries ()
#14 0x0000000000577c53 in exec_simple_query ()

#56Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Takayuki Tsunakawa (#52)
Re: Load distributed checkpoint

You were running the test on a very memory-dependent machine.

shared_buffers = 4GB / The scaling factor is 50, 800MB of data.

That would be why the patch did not work. I tested it with DBT-2, 10GB of
data and 2GB of memory. Storage is always the main part of performance
there, even outside checkpoints.

Yes, I used half the size of RAM as the shared buffers, which is
reasonable. And I cached all the data.

For pg, half of RAM for shared_buffers is too much. That ratio is good for
other db software that does not use the OS cache.

Andreas

#57Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: Zeugswetter Andreas ADI SD (#56)
Re: Load distributed checkpoint

From: "Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at>

Yes, I used half the size of RAM as the shared buffers, which is
reasonable. And I cached all the data.

For pg, half of RAM for shared_buffers is too much. That ratio is good for
other db software that does not use the OS cache.

What percentage of RAM is recommended for shared buffers in general?
40%? 30%? Or is the general recommendation like "According to the
amount of your data, this much RAM should be left for the kernel
cache. But that's the story on Linux; it may be different for other
OSes."?
Hmm, if it is so, it sounds hard for system designers/administrators
to judge.


#58Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Takayuki Tsunakawa (#45)
Re: [PATCHES] Load distributed checkpoint patch

On Wed, Dec 20, 2006 at 6:05 AM, in message
<03be01c7242f$2b4ce130$19527c0a@OPERAO>, "Takayuki Tsunakawa"
<tsunakawa.takay@jp.fujitsu.com> wrote:

I consider that smoothing the load (more meaningfully, response time)
has higher priority over checkpoint punctuality in a practical sense,
because the users of a system benefit from good steady response and
give a good reputation to the system.

I agree with that.

If the checkpoint processing is not punctual, crash recovery would take
a longer time. But which would you give higher priority: the unlikely
event (= a crash of the system) or the likely event (= the peak hours of
the system)? I believe the latter should be regarded.

I'm still with you here.

The system can write dirty buffers after the peak hours pass.

I don't see that in our busiest environment.

We have 3,000 "directly connected" users, various business partner
interfaces, and public web entry doing OLTP in 72 databases distributed
around the state, with real-time replication to central databases which
are considered derived copies. If all the pages modified on the central
databases were held in buffers or cache until after peak hours, query
performance would suffer -- assuming it would all even fit in cache. We
must have a way for dirty pages to be written under load while
responding to hundreds of thousands of queries per hour without
disturbing "freezes" during checkpoints.

On top of that, we monitor database requests on the source machines,
and during "idle time" we synchronize the data with all of the targets
to identify, log, and correct "drift". So even if we could shift all
our disk writes to the end of the day, that would have its own down
side, in extending our synchronization cycle.

I raise this only to be sure that such environments are considered with
these changes, not to discourage improvements in the checkpoint
techniques. We have effectively eliminated checkpoint problems in our
environment with a combination of battery backed controller cache and
aggressive background writer configuration. When you have a patch which
seems to help those who still have problems, I'll try to get time
approved to run a transaction replication stream onto one of our servers
(in "catch up mode") while we do a web "stress test" by playing back
requests from our production log. That should indicate how the patch
will affect us.

-Kevin

#59ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Kevin Grittner (#58)
Re: [PATCHES] Load distributed checkpoint patch

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

I consider that smoothing the load (more meaningfully, response time)
has higher priority over checkpoint punctuality in a practical sense,

I agree with that.

I agree that checkpoint_timeout is not so important, but we should
respect checkpoint_segments, or else new WAL files would be
created unboundedly, as Bruce pointed out.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#60ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Takayuki Tsunakawa (#57)
Re: Load distributed checkpoint

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

For pg, half of RAM for shared_buffers is too much. That ratio is good for
other db software that does not use the OS cache.

What percentage of RAM is recommended for shared buffers in general?
40%? 30%? Or is the general recommendation like "According to the
amount of your data, this much RAM should be left for the kernel
cache. But that's the story on Linux; it may be different for other
OSes."?
Hmm, if it is so, it sounds hard for system designers/administrators
to judge.

If you use Linux, try the following settings:
1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
2. Increase wal_buffers to reduce WAL flushing.
3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().
4. Separate data and WAL files onto different partitions or disks.

I suppose 1 is important for you, because the kernel will not write dirty
buffers until 10% of memory becomes dirty under the default settings.
You have a lot of memory (8GB) but a small data set (800MB), so the kernel
almost never writes buffers except at checkpoints. The accumulated dirty
buffers are then written in a burst by fsync().
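(For concreteness, items 1-3 would look something like this; the values are
just the ones discussed in this thread, not general recommendations:)

# /etc/sysctl.conf -- make the kernel start writeback much sooner
vm.dirty_background_ratio = 1
vm.dirty_ratio = 4

# postgresql.conf
wal_buffers = 1024kB            # item 2
wal_sync_method = open_sync     # item 3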

We would be happy if we could be free from such a difficult combination
of tuning. If you have *ideas for improvements*, please suggest them.
I think we've already understood the *problem itself*.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#61Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#38)
Re: [PATCHES] Load distributed checkpoint patch

Hello, Mr. Grittner,

From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>

We have 3,000 "directly connected" users, various business partner
interfaces, and public web entry doing OLTP in 72 databases

distributed

around the state, with real-time replication to central databases

which

are considered derived copies.

What a big system you have.

If all the pages modified on the central
databases were held in buffers or cache until after peak hours,

query

performance would suffer -- assuming it would all even fit in cache.

We

must have a way for dirty pages to be written under load while
responding to hundreds of thousands of queries per hour without
disturbing "freezes" during checkpoints.

I agree with you. My words were not good. I consider it is necessary
to always advance checkpoints even under heavy load, caring OLTP
transactions.

I raise this only to be sure that such environments are considered

with

these changes, not to discourage improvements in the checkpoint
techniques. We have effectively eliminated checkpoint problems in

our

environment with a combination of battery backed controller cache

and

aggressive background writer configuration. When you have a patch

which

seems to help those who still have problems, I'll try to get time
approved to run a transaction replication stream onto one of our

servers

(in "catch up mode") while we do a web "stress test" by playing back
requests from our production log. That should indicate how the

patch

will affect us.

Thank you very much for your kind offer.

#62Greg Smith
gsmith@gregsmith.com
In reply to: Inaam Rana (#49)
Re: Load distributed checkpoint

On Wed, 20 Dec 2006, Inaam Rana wrote:

Talking of bgwriter_* parameters I think we are missing a crucial
internal counter i.e. number of dirty pages. How much work bgwriter has
to do at each wakeup call should be a function of total buffers and
currently dirty buffers.

This is actually a question I'd been meaning to throw out myself to this
list. How hard would it be to add an internal counter to the buffer
management scheme that kept track of the current number of dirty pages?
I've been looking at the bufmgr code lately trying to figure out how to
insert one as part of building an auto-tuning bgwriter, but it's unclear
to me how I'd lock such a resource properly and scalably. I have a
feeling I'd be inserting a single-process locking bottleneck into that
code with any of the naive implementations I considered.

The main problem I've been seeing is also long waits stuck behind a slow
fsync on Linux. What I've been moving toward testing is an approach
slightly different from the proposals here. What if all the database page
writes (background writer, buffer eviction, or checkpoint scan) were
counted, and periodic fsync requests sent to the bgwriter based on that count?
For example, when I know I have a battery-backed caching controller that
will buffer 64MB worth of data for me, if I forced a fsync after every
6000 8K writes, no single fsync would get stuck waiting for the disk to
write for longer than I'd like.

Give the admin a max_writes_before_sync parameter, make the default of 0
work just like the current behavior, and off you go; a simple tunable that
doesn't require a complicated scheme to implement or break anybody's
existing setup. Combined with a properly tuned background writer, that
would solve the issues I've been running into. It would even make the
problem of Linux caching too many writes until checkpoint time go away (I
know how to eliminate that by adjusting caching policy, but I have to be
root to do it; a DBA should be able to work around that issue even if they
don't have access to the kernel tunables.)
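(A rough sketch of the bookkeeping, with hypothetical names throughout --
the counter, the GUC, and the sync-request hook below don't exist today:)

static int  writes_since_sync = 0;
static int  max_writes_before_sync = 6000;  /* 0 = current behavior */

/* Hypothetical hook: queue a sync request for the bgwriter to process. */
static void
request_fsync_from_bgwriter(void)
{
}

/* Sketch: every database page write bumps the counter; after every
 * max_writes_before_sync writes we request a sync, so no single fsync()
 * ever has more than that many pages queued behind it. */
static void
count_page_write(void)
{
    if (max_writes_before_sync > 0 &&
        ++writes_since_sync >= max_writes_before_sync)
    {
        request_fsync_from_bgwriter();
        writes_since_sync = 0;
    }
}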

While I'm all for testing to prove me wrong, my gut feel is that going all
the way to sync writes a la Oracle is a doomed approach, particularly on
low-end hardware where they're super expensive. Following The Oracle Way
is a good roadmap for a lot of things, but I wouldn't put building a lean
enough database to run on modest hardware on that list. You can do sync
writes with perfectly good performance on systems with a good
battery-backed cache, but I think you'll get creamed in comparisons
against MySQL on IDE disks if you start walking down that path; since
right now a fair comparison with similar logging behavior is an even match
there, that's a step backwards.

Also on the topic of sync writes to the database proper: wouldn't using
O_DIRECT for those be potentially counter-productive? I was under the
impression that one of the behaviors counted on by Postgres was that data
evicted from its buffer cache, eventually intended for writing to disk,
was still kept around for a bit in the OS buffer cache. A subsequent read
because the data was needed again might find the data already in the OS
buffer, therefore avoiding an actual disk read; that substantially reduces
the typical penalty for the database engine making a bad choice on what to
evict. I fear a move to direct writes would put more pressure on the LRU
implementation to be very smart, and that's code that you really don't
want to be more complicated.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#63Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: Zeugswetter Andreas ADI SD (#56)
Re: Load distributed checkpoint

Hello, Itagaki-san,

Thank you for an interesting piece of information.

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>

If you use Linux, try the following settings:
1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
2. Increase wal_buffers to reduce WAL flushing.
3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync().
4. Separate data and WAL files onto different partitions or disks.

I suppose 1 is important for you, because the kernel will not write dirty
buffers until 10% of memory becomes dirty under the default settings.
You have a lot of memory (8GB) but a small data set (800MB), so the kernel
almost never writes buffers except at checkpoints. The accumulated dirty
buffers are then written in a burst by fsync().

I'll show the results of this tuning to share the information with people
who don't have experience of this kind.
The numbers shown below are the tps when running "pgbench -c16 -t100
postgres" five times in succession.

(1) Default case (shown again for comparison and as a reminder)
The bgwriter_* and checkpoint_* parameters are set to their defaults.
wal_buffers and wal_sync_method are also set to their defaults (64kB
and fdatasync, respectively).

235 80 226 77 240

(2) Default + WAL 1MB case
The configuration is the same as case (1) except that wal_buffers is
set to 1024kB.

302 328 82 330 85

This is a bigger improvement than I expected.

(3) Default + wal_sync_method=open_sync case
The configuration is the same as case (1) except that wal_sync_method
is set to open_sync.

162 67 176 67 164

Too bad compared to case (2). Do you know the reason?

(4) (2)+(3) case

322 350 85 321 84

This is good, too.

(5) (4) + /proc/sys/vm/dirty* tuning
dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
changed from 40 to 4.

308 349 84 349 84

The tuning of the kernel cache doesn't appear to bring a performance
improvement in my environment. Does the kernel still wait too long before
it starts flushing dirty buffers because the cache is so large? If so, the
increasing amount of available RAM may cause trouble more frequently in the
near future. Do the dirty_*_ratio parameters accept values less than 1?

BTW, in case (1), the best response time of a transaction was 6
milliseconds. On the other hand, the worst response time was 13
seconds.

We would be happy if we could be free from such a difficult combination
of tuning. If you have *ideas for improvements*, please suggest them.
I think we've already understood the *problem itself*.

I agree with you. Let's make the ideas more concrete by doing some
experiments.

#64Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: Zeugswetter Andreas ADI SD (#56)
Re: Load distributed checkpoint

From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>

(5) (4) + /proc/sys/vm/dirty* tuning
dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
changed from 40 to 4.

308 349 84 349 84

Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showed the following tps for case (5).

323 350 340 59 225

The best response time was 4 msec, and the worst one was 16 seconds.


#65Inaam Rana
inaamrana@gmail.com
In reply to: Takayuki Tsunakawa (#64)
Re: Load distributed checkpoint

On 12/22/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>

(5) (4) + /proc/sys/vm/dirty* tuning
dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
changed from 40 to 4.

308 349 84 349 84

Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showed the following tps for case (5).

323 350 340 59 225

The best response time was 4 msec, and the worst one was 16 seconds.

Which IO scheduler (elevator) are you using?

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

#66Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: Zeugswetter Andreas ADI SD (#56)
Re: Load distributed checkpoint

From: Inaam Rana

Which IO scheduler (elevator) are you using?

Elevator? Sorry, I'm not familiar with the kernel implementation, so I
don't know what it is. My Linux distribution is Red Hat Enterprise Linux 4.0
for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp. I probably haven't
changed any kernel settings, except for the IPC settings to run PostgreSQL.

#67ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Takayuki Tsunakawa (#63)
Re: Load distributed checkpoint

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

(1) Default case(this is show again for comparison and reminder)
235 80 226 77 240
(2) Default + WAL 1MB case
302 328 82 330 85
(3) Default + wal_sync_method=open_sync case
162 67 176 67 164
(4) (2)+(3) case
322 350 85 321 84
(5) (4) + /proc/sys/vm/dirty* tuning
308 349 84 349 84

(3) is very strange. Your machine seems to be so constrained by the WAL
that other factors cannot be measured properly.

I'll send results on my machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)
- shared_buffers=1GB
- wal_sync_method = open_datasync
- wal_buffers = 1MB
- checkpoint_segments = 16
- checkpoint_timeout = 5min

I repeated "pgbench -c16 -t500 -s50"
and picked up results around checkpoints.

[HEAD]
...
560.8
373.5 <- checkpoint is here
570.8
...

[with patch]
...
562.0
528.4 <- checkpoint (fsync) is here
547.0
...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#68Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

From: "Greg Smith" <gsmith@gregsmith.com>

This is actually a question I'd been meaning to throw out myself to this
list. How hard would it be to add an internal counter to the buffer
management scheme that kept track of the current number of dirty pages?
I've been looking at the bufmgr code lately trying to figure out how to
insert one as part of building an auto-tuning bgwriter, but it's unclear
to me how I'd lock such a resource properly and scalably. I have a
feeling I'd be inserting a single-process locking bottleneck into that
code with any of the naive implementations I considered.

To put it in an extreme way, how about making bgwriter count the dirty
buffers by periodically scanning all the buffers? Do you know the book
"Principles of Transaction Processing"? Jim Gray was one of the
reviewers of this book.

http://www.amazon.com/gp/aa.html?HMAC=&CartId=&Operation=ItemLookup&&ItemId=1558604154&ResponseGroup=Request,Large,Variations&bStyle=aaz.jpg&MerchantId=All&isdetail=true&bsi=Books&logo=foo&Marketplace=us&AssociateTag=pocketpc

In chapter 8, the author describes fuzzy checkpoints combined with a
two-checkpoint approach. In his explanation, the recovery manager (which
would be bgwriter in PostgreSQL) scans the buffers and records the
list of dirty buffers at each checkpoint. This won't need any locking
in PostgreSQL, if I understand correctly. Then the recovery manager
performs the next checkpoint after writing those dirty buffers. In the
two-checkpoint approach, crash recovery starts redoing from the second
to last checkpoint. Two-checkpoint is described in Jim Gray's book,
too. But they don't refer to how the recovery manager tunes the speed
of writing.
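(As a rough sketch of that bookkeeping -- all structures here are
illustrative, and a stale read of a dirty flag only makes the list
conservative, which is why no locking would be needed:)

#include <stdbool.h>

#define NBUFFERS 100000

static bool dirty_flags[NBUFFERS];  /* stand-in for shared buffer state */
static int  dirty_list[NBUFFERS];   /* dirty buffers as of the checkpoint */
static int  dirty_list_len;

/* Sketch: at each checkpoint, snapshot which buffers are dirty; those
 * are the pages to write out before the *next* checkpoint.  Recovery
 * then redoes from the second-to-last checkpoint. */
static void
record_dirty_buffers(void)
{
    int     i;

    dirty_list_len = 0;
    for (i = 0; i < NBUFFERS; i++)
        if (dirty_flags[i])
            dirty_list[dirty_list_len++] = i;
}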

slightly different from the proposals here. What if all the database page
writes (background writer, buffer eviction, or checkpoint scan) were
counted, and periodic fsync requests sent to the bgwriter based on that
count? For example, when I know I have a battery-backed caching controller
that will buffer 64MB worth of data for me, if I forced a fsync after every
6000 8K writes, no single fsync would get stuck waiting for the disk to
write for longer than I'd like.

That seems interesting.

You can do sync writes with perfectly good performance on systems with a
good battery-backed cache, but I think you'll get creamed in comparisons
against MySQL on IDE disks if you start walking down that path; since
right now a fair comparison with similar logging behavior is an even match
there, that's a step backwards.

I wonder what characteristics SATA disks have compared to IDE. Recent
PCs are equipped with SATA disks, aren't they?
How do you think your approach compares to MySQL on IDE disks?

Also on the topic of sync writes to the database proper: wouldn't using
O_DIRECT for those be potentially counter-productive? I was under the
impression that one of the behaviors counted on by Postgres was that data
evicted from its buffer cache, eventually intended for writing to disk,
was still kept around for a bit in the OS buffer cache. A subsequent read
because the data was needed again might find the data already in the OS
buffer, therefore avoiding an actual disk read; that substantially reduces
the typical penalty for the database engine making a bad choice on what to
evict. I fear a move to direct writes would put more pressure on the LRU
implementation to be very smart, and that's code that you really don't
want to be more complicated.

I'm worried about this, too.

#69Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#60)
Re: Load distributed checkpoint

(3) is very strange. Your machine seems to be so constrained by the WAL
that other factors cannot be measured properly.

Right... It takes as long as 15 seconds to fsync a 1GB file. It's
strange. This is a borrowed PC server, so the disk may be RAID 5?
However, the WAL disk and DB disks show the same throughput. I'll
investigate. I may have to find another machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)

Oh, Windows. Maybe the fsync() problem Itagaki-san pointed out does
not exist there.
BTW, your environment is showing an attractive result, isn't it?


#70Simon Riggs
simon@2ndquadrant.com
In reply to: ITAGAKI Takahiro (#53)
Re: Load distributed checkpoint

On Thu, 2006-12-21 at 18:46 +0900, ITAGAKI Takahiro wrote:

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

If you use Linux, it has very unpleasant behavior in fsync(): it locks all
metadata of the file being fsync'ed, so we have to wait for the completion
of the fsync when we do read(), write(), and even lseek().

Oh, really, what an evil fsync is! Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed. But why does a
backend which is syncing WAL/pg_control have to wait for the data file
to be synced? They are, needless to say, different files, and the WAL and
data files are stored on separate disks.

Backends call lseek() in planning, so they have to wait for fsync() on
the tables they will access. Even if all of the data in the file is in
the cache, lseek() conflicts with fsync(). You can see a lot of backends
waiting in the planning phase during checkpoints, not the execution phase.

It isn't clear to me why you are doing planning during a test at all.

If you are doing replanning during test execution then the real
performance problem will be the planning, not the fact that the fsync
stops planning from happening.

Prepared queries are only replanned manually, so the chances of
replanning during a checkpoint are fairly low. So although it sounds
worrying, I'm not sure that we'll want to alter the use of lseek during
planning - though there may be other arguments also.

I have also seen cases where the WAL drive, even when separated, appears
to spike upwards during a checkpoint. My best current theory, so far
untested, is that the WAL and data drives are using the same CFQ
scheduler and that the scheduler actively slows down WAL requests when
it need not. Mounting the drives as separate block drives with separate
schedulers, CFQ for data and Deadline for WAL should help.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#71Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Takayuki Tsunakawa (#63)
Re: Load distributed checkpoint

If you use linux, try the following settings:
1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.

You will need to pair this with the bgwriter_* settings, or else too few
pages are written to the OS in between checkpoints.

2. Increase wal_buffers to redule WAL flushing.

You will want single group writes to be able to reach 256kB; the default
is thus not enough when you have enough RAM. You also want enough so that
new txns don't need to wait for an empty buffer (one that is only freed by
a write).

3. Set wal_sync_method to open_sync; O_SYNC is faster then fsync().

O_SYNC's only advantage over fdatasync is that it saves a system call,
since it still passes through the OS cache, but the disadvantage is that it
does not let the OS group writes. Thus it is more susceptible to too
few wal_buffers. What you want is O_DIRECT + enough wal_buffers to allow
256kB writes.

4. Separate data and WAL files into different partitions or disks.

While this is generally suggested, I somehow doubt its validity when you
only have a few disk spindles. If e.g. you only have 2-3 (mirrored) disks,
I wouldn't do it (at least on the average 70/30 read/write systems).

Andreas

#72Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#31)
Re: Load distributed checkpoint

I have a new idea. Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution. I am very familiar
with the BSD kernel, and it seems they have a similar issue in trying to
smooth writes:

http://www.brno.cas.cz/cgi-bin/bsdi-man?proto=1.1&query=update&msection=4&apropos=0

UPDATE(4) BSD Programmer's Manual UPDATE(4)

NAME
update - trickle sync filesystem caches to disk

DESCRIPTION
At system boot time, the kernel starts filesys_syncer, process 3.
This process helps protect the integrity of disk volumes by ensuring
that volatile cached filesystem data are written to disk within the
vfs.generic.syncdelay interval, which defaults to thirty seconds (see
sysctl(8)). When a vnode is first written it is placed
vfs.generic.syncdelay seconds down on the trickle sync queue. If it
still exists and has dirty data when it reaches the top of the queue,
filesys_syncer writes it to disk. This approach evens out the load on
the underlying I/O system and avoids writing short-lived files. The
papers on trickle-sync tend to favor aging based on buffers rather
than files. However, BSD/OS synchronizes on file age rather than
buffer age because the data structures are much smaller, as there are
typically far fewer files than buffers. Although this can make the
I/O bursty when a big file is written to disk, it is still much
better than the wholesale writes that were being done by the historic
update process, which wrote all dirty data buffers every 30 seconds.
It also adapts much better to the soft update code, which wants to
control aging to improve performance (inodes age in one third of
vfs.generic.syncdelay seconds, directories in one half of
vfs.generic.syncdelay seconds). This ordering ensures that most
dependencies are gone (e.g., inodes are written when directory
entries want to go to disk), reducing the amount of work that the
soft update code needs to do.

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync(). Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files. The goal here is
that during the 30-second delay, the kernel will be forcing data to the
disk, so the fsync() we eventually do will only be for the write() of
buffers during the 30-second delay, and because we wrote all dirty
buffers 30 seconds ago, there shouldn't be too many of them.

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us. The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.

---------------------------------------------------------------------------

Bruce Momjian wrote:

I have thought a while about this and I have some ideas.

Ideally, we would be able to trickle the sync of individuals blocks
during the checkpoint, but we can't because we rely on the kernel to
sync all dirty blocks that haven't made it to disk using fsync(). We
could trickle the fsync() calls, but that just extends the amount of
data we are writing that has been dirtied post-checkpoint. In an ideal
world, we would be able to fsync() only part of a file at a time, and
only those blocks that were dirtied pre-checkpoint, but I don't see that
happening anytime soon (and one reason why many commercial databases
bypass the kernel cache).

So, in the real world, one conclusion seems to be that our existing
method of tuning the background writer just isn't good enough for the
average user:

#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round

These settings control what the bgwriter does, but they do not clearly
relate to the checkpoint timing, which is the purpose of the bgwriter,
and they don't change during the checkpoint interval, which is also less
than ideal. If set too aggressively, it writes too much; if set too low,
the checkpoint does too much I/O.

We clearly need more bgwriter activity as the checkpoint approaches, and
one that is more auto-tuned, like many of our other parameters. I think
we created these settings to see how they worked in the field, so it
probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far it is to the next
checkpoint, and use that information to increase write activity.
Basically now, during a checkpoint, the bgwriter does a full buffer scan
and fsync's all dirty files, so it changes from the configuration
parameter-defined behavior right to 100% activity. I think it would be
ideal if we could ramp up the writes so that when it is 95% to the next
checkpoint, it can be operating at 95% of the activity it would do
during a checkpoint.
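(Sketched naively, with made-up names, that ramp might be as simple as:)

/* Sketch: scale bgwriter effort by progress through the checkpoint
 * interval, so at 95% of the way to the next checkpoint it performs
 * 95% of checkpoint-level write activity. */
static int
ramped_write_quota(double elapsed_secs, double interval_secs,
                   int checkpoint_level_pages)
{
    double  progress = elapsed_secs / interval_secs;

    if (progress > 1.0)
        progress = 1.0;
    return (int) (checkpoint_level_pages * progress);
}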

My guess is if we can do that, we will have much smoother performance
because we have more WAL writes just after checkpoint for newly-dirtied
pages, and the new setup will give us more write activity just before
checkpoint.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
the kernel cache, so we are sure data will be on disk by checkpoint
time. This was avoided in the past because of the expense of
second-guessing the kernel disk I/O scheduling algorithms.

---------------------------------------------------------------------------

Tom Lane wrote:

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

"Jim C. Nasby" <jim@nasby.net> wrote:

Generally, I try and configure the all* settings so that you'll get 1
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.

We got to these numbers somewhat scientifically. I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes in per 200 ms without causing problems. I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.

I like Kevin's settings better than what Jim suggests. If the bgwriter
only makes one sweep between checkpoints then it's hardly going to make
any impact at all on the number of dirty buffers the checkpoint will
have to write. The point of the bgwriter is to reduce the checkpoint
I/O spike by doing writes between checkpoints, and to have any
meaningful impact on that, you'll need it to make the cycle several times.

Another point here is that you want checkpoints to be pretty far apart
to minimize the WAL load from full-page images. So again, a bgwriter
that's only making one loop per checkpoint is not gonna be doing much.

I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches? Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later). But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints. We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().

regards, tom lane


--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#73Inaam Rana
inaamrana@gmail.com
In reply to: Takayuki Tsunakawa (#66)
Re: Load distributed checkpoint

On 12/22/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

From: Inaam Rana

Which IO scheduler (elevator) are you using?

Elevator? Sorry, I'm not familiar with the kernel implementation, so I
don't know what it is. My Linux distribution is Red Hat Enterprise Linux 4.0
for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp. I probably haven't
changed any kernel settings, except for the IPC settings to run PostgreSQL.

There are four IO schedulers in Linux: anticipatory, CFQ (the default),
deadline, and noop. For typical OLTP-type loads, deadline is generally
recommended. If you are constrained on CPU and you have a good controller,
then it's better to use noop.
Deadline attempts to merge requests by maintaining two red-black trees in
sector sort order, and it also ensures that a request is serviced in a given
time by using a FIFO. I don't expect it to do magic, but I was wondering
whether it might dilute the issue of fsync() elbowing out WAL writes.

You can look into /sys/block/<device>/queue/scheduler to see which scheduler
you are using.

regards,
inaam

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

#74Gregory Stark
stark@enterprisedb.com
In reply to: Bruce Momjian (#72)
Re: Load distributed checkpoint

"Bruce Momjian" <bruce@momjian.us> writes:

I have a new idea. Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution. I am very familiar
with the BSD kernel, and it seems they have a similar issue in trying to
smooth writes:

Just to give a bit of context for this. The traditional mechanism for syncing
buffers to disk on BSD which this daemon was a replacement for was to simply
call "sync" every 30s. Compared to that this daemon certainly smooths the I/O
out over the 30s window...

Linux has a more complex solution to this (of course) which has undergone a
few generations over time. Older kernels had a user space daemon called
bdflush which called an undocumented syscall every 5s. More recent ones have a
kernel thread called pdflush. I think both have various mostly undocumented
tuning knobs but neither makes any sort of guarantee about the amount of time
a dirty buffer might live before being synced.

Your thinking is correct but that's already the whole point of bgwriter isn't
it? To get the buffers out to the kernel early in the checkpoint interval so
that come checkpoint time they're hopefully already flushed to disk. As long
as your checkpoint interval is well over 30s only the last 30s (or so, it's a
bit fuzzier on Linux) should still be at risk of being pending.

I think the main problem with an additional pause in the hopes of getting more
buffers synced is that during the 30s pause on a busy system there would be a
continual stream of new dirty buffers being created as bgwriter works and
other backends need to reuse pages. So when the fsync is eventually called
there will still be a large amount of i/o to do. Fundamentally the problem is
that fsync is too blunt an instrument. We only need to fsync the buffers we
care about, not the entire file.
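(If syncing only part of a file is the missing primitive, newer Linux
kernels do offer one -- sync_file_range(2), added around 2.6.17, so its
availability is an assumption here; note it also gives no durability
guarantee for file metadata. A minimal sketch:)

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>

#define BLCKSZ 8192

/* Sketch: start writeback of just the byte range holding the buffers we
 * care about, instead of fsync()ing the whole file.  SYNC_FILE_RANGE_WRITE
 * initiates the writes without waiting for them to complete. */
static int
flush_block_range(int fd, long first_block, long nblocks)
{
    return sync_file_range(fd,
                           (off64_t) first_block * BLCKSZ,
                           (off64_t) nblocks * BLCKSZ,
                           SYNC_FILE_RANGE_WRITE);
}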

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#75Greg Smith
gsmith@gregsmith.com
In reply to: Simon Riggs (#70)
Re: Load distributed checkpoint

On Fri, 22 Dec 2006, Simon Riggs wrote:

I have also seen cases where the WAL drive, even when separated, appears
to spike upwards during a checkpoint. My best current theory, so far
untested, is that the WAL and data drives are using the same CFQ
scheduler and that the scheduler actively slows down WAL requests when
it need not. Mounting the drives as separate block drives with separate
schedulers, CFQ for data and Deadline for WAL should help.

The situation I've been seeing is that the database needs a new block to
complete a query and issues a read request to get it, but that read is
behind the big checkpoint fsync. Client sits there for quite some time
waiting for the fsync to finish before it gets the data it needs, and now
your trivial select took seconds to complete. It's fairly easy to
replicate this problem using pgbench on Linux--I've seen a query sit there
for 15 seconds when going out of my way to aggravate the behavior. One of
Takayuki's posts here mentioned a worst-case delay of 13 seconds; that's
the problem rearing its ugly head.

You may be right that what you're seeing would be solved with more
complicated tuning on a per-device basis (which, by the way, isn't
available unless you're running a more recent Linux kernel than many
distributions ship). You can tune the schedulers all day and not make a
lick of difference to what I've been running into; I know, I tried.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#76Bruce Momjian
bruce@momjian.us
In reply to: Gregory Stark (#74)
Re: Load distributed checkpoint

Gregory Stark wrote:

"Bruce Momjian" <bruce@momjian.us> writes:

I have a new idea. Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution. I am very familiar
with the BSD kernel, and it seems they have a similar issue in trying to
smooth writes:

Just to give a bit of context for this. The traditional mechanism for syncing
buffers to disk on BSD which this daemon was a replacement for was to simply
call "sync" every 30s. Compared to that this daemon certainly smooths the I/O
out over the 30s window...

Linux has a more complex solution to this (of course) which has undergone a
few generations over time. Older kernels had a user space daemon called
bdflush which called an undocumented syscall every 5s. More recent ones have a
kernel thread called pdflush. I think both have various mostly undocumented
tuning knobs but neither makes any sort of guarantee about the amount of time
a dirty buffer might live before being synced.

Your thinking is correct but that's already the whole point of bgwriter isn't
it? To get the buffers out to the kernel early in the checkpoint interval so
that come checkpoint time they're hopefully already flushed to disk. As long
as your checkpoint interval is well over 30s only the last 30s (or so, it's a
bit fuzzier on Linux) should still be at risk of being pending.

I think the main problem with an additional pause in the hopes of getting more
buffers synced is that during the 30s pause on a busy system there would be a
continual stream of new dirty buffers being created as bgwriter works and
other backends need to reuse pages. So when the fsync is eventually called
there will still be a large amount of i/o to do. Fundamentally the problem is
that fsync is too blunt an instrument. We only need to fsync the buffers we
care about, not the entire file.

Well, one idea would be for the bgwriter not to do many write()'s
between the massive checkpoint write()'s and the fsync()'s. That would
cut down on the extra I/O that fsync() would have to do.

The problem I see with making the bgwriter do more writes between
checkpoints is the overhead of those scans, and the overhead of doing
writes that will later be dirtied again before the checkpoint. With the
delay-between-stages idea, we don't need to guess how aggressive the
bgwriter needs to be --- we can just do the writes, and wait for a
while.

On an idle system, would someone dirty a large file, and watch the disk
I/O to see how long it takes for the I/O to complete to disk?

In what we have now, we are either having the bgwriter do too much I/O
between checkpoints, or guaranteeing an I/O storm during a checkpoint by
doing lots of write()'s and then calling fsync() right away. I don't
see how we are ever going to get that properly tuned.

Would someone code up a patch and test it?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#77Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: Zeugswetter Andreas ADI SD (#56)
Re: Load distributed checkpoint

Hello, Inaam-san,

There are four IO schedulers in Linux: anticipatory, CFQ (the default),
deadline, and noop. For typical OLTP-type loads, deadline is generally
recommended. If you are constrained on CPU and you have a good controller,
then it's better to use noop.

Deadline attempts to merge requests by maintaining two red-black trees in
sector sort order, and it also ensures that a request is serviced in a given
time by using a FIFO. I don't expect it to do magic, but I was wondering
whether it might dilute the issue of fsync() elbowing out WAL writes.

You can look into /sys/block/<device>/queue/scheduler to see which
scheduler you are using.

Thank you for your information.
I could only find the following files in /sys/block/<device>/queue/:

iosched
max_hw_sectors_kb
max_sectors_kb
nr_requests
read_ahead_kb

In iosched, the following files exist:

quantum (the content is "4")
queued (the content is "8")

#78Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: Bruce Momjian (#76)
Re: Load distributed checkpoint

From: "Bruce Momjian" <bruce@momjian.us>

On an idle system, would someone dirty a large file, and watch the disk
I/O to see how long it takes for the I/O to complete to disk?

I ran "dd if=/dev/zero of=<file on DB disk> bs=8k count=`expr 1048576
/ 8`, that is, writing 1GB file with 8KB write()'s. It took about 175
seconds for the kernel to flush buffers. I'll put the result of
"iostat -x 5" which was started at the same time as dd. 175 seconds
means 35 rows with nonzero %util * 5 seconds of interval.

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sde 0.00 981.56 0.00 50.90 0.00 8439.28 0.00 4219.64 165.80 34.38 525.01 6.25 31.82
sde 0.00 25737.00 0.00 118.60 0.00 219820.80 0.00 109910.40 1853.46 6529.74 1587.08 8.43 100.02
sde 0.00 1912.97 0.00 127.74 0.00 16325.75 0.00 8162.87 127.80 8192.67 6502.98 7.81 99.82
sde 0.00 1728.00 0.00 117.80 0.00 14745.60 0.00 7372.80 125.17 8209.36 11227.29 8.49 100.02
sde 0.00 1536.00 0.00 103.80 0.00 13107.20 0.00 6553.60 126.27 8209.10 16729.09 9.64 100.02
sde 0.00 1344.00 0.00 91.60 0.00 11468.80 0.00 5734.40 125.21 8208.69 21481.22 10.92 100.02
sde 0.00 1532.80 0.00 98.00 0.00 13081.60 0.00 6540.80 133.49 8209.34 26583.57 10.21 100.02
sde 0.00 1632.00 0.00 107.00 0.00 13926.40 0.00 6963.20 130.15 8208.89 31662.93 9.35 100.02
sde 0.00 1536.00 0.00 106.40 0.00 13107.20 0.00 6553.60 123.19 8209.66 36443.07 9.40 100.02
sde 0.00 1442.89 0.00 95.59 0.00 12312.63 0.00 6156.31 128.81 8227.23 41446.78 10.49 100.24
sde 0.00 1532.80 0.00 103.00 0.00 13081.60 0.00 6540.80 127.01 8210.77 46606.03 9.71 100.04
sde 0.00 1440.00 0.00 92.00 0.00 12288.00 0.00 6144.00 133.57 8208.82 51421.78 10.87 100.02
sde 0.00 1344.00 0.00 91.80 0.00 11468.80 0.00 5734.40 124.93 8209.86 56524.37 10.90 100.02
sde 0.00 1539.08 0.00 101.00 0.00 13133.47 0.00 6566.73 130.03 8225.59 61477.93 9.92 100.22
sde 0.00 1436.80 0.00 95.40 0.00 12262.40 0.00 6131.20 128.54 8208.88 66566.42 10.48 100.02
sde 0.00 1344.00 0.00 92.40 0.00 11468.80 0.00 5734.40 124.12 8209.47 71466.12 10.82 100.02
sde 0.00 1102.40 0.00 101.00 0.00 9408.38 0.00 4704.19 93.15 8174.36 76538.41 9.88 99.82
sde 0.00 0.00 0.00 89.00 0.00 0.00 0.00 0.00 0.00 7855.72 80795.64 11.24 100.02
sde 0.00 0.00 0.00 91.38 0.00 0.00 0.00 0.00 0.00 7422.53 81823.89 10.97 100.22
sde 0.00 0.00 0.00 96.80 0.00 0.00 0.00 0.00 0.00 6919.07 83194.91 10.33 100.02
sde 0.00 0.00 0.00 91.78 0.00 0.00 0.00 0.00 0.00 6480.85 84657.04 10.92 100.22
sde 0.00 0.00 0.00 92.60 0.00 0.00 0.00 0.00 0.00 5997.22 84749.79 10.80 100.02
sde 0.00 0.00 0.00 91.02 0.00 0.00 0.00 0.00 0.00 5528.97 85345.07 10.97 99.82
sde 0.00 0.00 0.00 102.61 0.00 0.00 0.00 0.00 0.00 5059.61 85057.91 9.77 100.22
sde 0.00 0.00 0.00 93.20 0.00 0.00 0.00 0.00 0.00 4572.57 85284.49 10.73 100.02
sde 0.00 0.00 0.00 98.20 0.00 0.00 0.00 0.00 0.00 4109.15 86086.50 10.21 100.22
sde 0.00 0.00 0.00 91.42 0.00 0.00 0.00 0.00 0.00 3611.72 86405.24 10.92 99.82
sde 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 3135.62 86292.49 10.00 100.02
sde 0.00 0.00 0.00 100.40 0.00 0.00 0.00 0.00 0.00 2652.63 86609.79 9.96 100.02
sde 0.00 0.00 0.00 92.80 0.00 0.00 0.00 0.00 0.00 2153.69 86168.58 10.78 100.02
sde 0.00 0.00 0.00 88.80 0.00 0.00 0.00 0.00 0.00 1694.74 86275.58 11.26 100.02
sde 0.00 0.00 0.00 98.20 0.00 0.00 0.00 0.00 0.00 1241.41 86708.40 10.19 100.02
sde 0.00 0.00 0.00 95.20 0.00 0.00 0.00 0.00 0.00 747.52 86505.59 10.51 100.02
sde 0.00 0.00 0.00 89.20 0.00 0.00 0.00 0.00 0.00 283.44 86551.11 11.21 100.02
sde 0.00 51.20 0.00 17.20 0.00 449.60 0.00 224.80 26.14 4.30 61572.65 9.05 15.56
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
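
For anyone who wants to repeat this without dd, here is a minimal C
sketch of an equivalent measurement (the file name is arbitrary; the
sizes match the test above). It times the explicit fsync() instead of
waiting for the kernel's background flush, but it exposes the same
write backlog:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int
main(void)
{
    char    buf[8192];
    int     fd, i;
    time_t  start;

    memset(buf, 0, sizeof(buf));    /* same data as /dev/zero */

    fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* 131072 x 8KB write()s = 1GB, all absorbed by the OS cache */
    start = time(NULL);
    for (i = 0; i < 131072; i++)
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        {
            perror("write");
            return 1;
        }
    printf("write() phase: %ld seconds\n", (long) (time(NULL) - start));

    /* now see how long the disk needs to absorb the backlog */
    start = time(NULL);
    if (fsync(fd) != 0)
        perror("fsync");
    printf("fsync() phase: %ld seconds\n", (long) (time(NULL) - start));

    close(fd);
    return 0;
}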

#79ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#72)
Re: Load distributed checkpoint

Bruce Momjian <bruce@momjian.us> wrote:

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync(). Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files.

I think there are two platforms that have different problems in checkpoints.
It's in fsync() on one platform, and in write() on another. It is complex
depending on OS, the amount of memory, disks, writeback-cache and so on.

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us. The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.

The two proposals do not conflict with each other. Also, a solution for
either platform has no bad effect on the other platform. Can we employ
both of them?

I tested your proposal but it did not work on a write-critical machine.
However, if the idea works well on BSD or some platforms, we would be
better off buying it.

[pgbench results]
...
566.973777
327.158222 <- (1) write()
560.773868 <- (2) sleep
544.106645 <- (3) fsync()
...

[changes in code]
(This is a bad implementation because shutdown takes a long time!)

void
FlushBufferPool(void)
{
    BufferSync();                       // (1) write -- about 20s

    time_t start = time(NULL);
    while (time(NULL) - start < 30)     // (2) sleep -- 30s
    {
        pg_usleep(BgWriterDelay * 1000L);
        BgBufferSync();
        AbsorbFsyncRequests();
    }

    smgrsync();                         // (3) fsync -- less than 200ms
}

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#80Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#79)
Re: Load distributed checkpoint

ITAGAKI Takahiro wrote:

Bruce Momjian <bruce@momjian.us> wrote:

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync(). Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files.

I think there are two platforms that have different problems in checkpoints.
It's in fsync() on one platform, and in write() on another. It is complex
depending on OS, the amount of memory, disks, writeback-cache and so on.

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us. The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.

The two proposals do not conflict with each other. Also, a solution for
either platform has no bad effect on the other platform. Can we employ
both of them?

I tested your proposal but it did not work on a write-critical machine.
However, if the idea works well on BSD or some platforms, we would be
better off buying it.

[pgbench results]
...
566.973777
327.158222 <- (1) write()
560.773868 <- (2) sleep
544.106645 <- (3) fsync()

OK, so you are saying that performance dropped only during the write(),
and not during the fsync()? Interesting. I would like to know the
results of a few tests just like you reported them above:

1a) write spread out over 30 seconds
1b) write with no delay

2a) sleep(0)
2b) sleep(30)

3) fsync

I would like to know the performance at each stage for each combination,
e.g. when using 1b, 2a, 3, performance during the write() phase was X,
during the sleep it was Y, and during the fsync it was Z. (Of course,
sleep(0) has no stage timing.)

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#81Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#72)
Re: Load distributed checkpoint

On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync().

Well, its correct to say that the transfer to disk is the source of the
problem, but that doesn't only occur when we fsync(). There are actually
two disk storms that occur, because of the way the fs cache works. [Ron
referred to this effect uplist]

Linux 2.6+ will attempt to write to disk any dirty blocks in excess of a
certain threshold, the dirty _background_ratio, which defaults to 10% of
RAM. So when the checkpoint issues lots of write() calls, we generally
exceed the threshold and then begin io storm number 1 to get us back
down to the dirty_background_ratio. When we issue the fsync() calls, we
then begin io storm number 2, which takes us down close to zero dirty
blocks (on a dedicated server). [Thanks to my colleague Richard Kennedy
for confirming this via investigation; he's been on leave throughout
this discussion, regrettably].

Putting delays in is very simple and does help, however, how much it
helps depends upon:
- the number of dirty blocks in shared_buffers
- the dirty_background_ratio
- the number of dirty blocks in each file when we fsync()

For example, on a system with very large RAM, and yet a medium write
workload, the dirty_background_ratio may never be exceeded. In that
case, all of the I/O happens during storm 2. If you set
dirty_background_ratio lower, then most of the writes happen during
storm 1.

During storm 2, the fsync calls write all dirty blocks in a file to
disk. In many cases, a few tables/files have all of the writes, so
adding a delay between fsyncs doesn't spread out the writes like you
would hope it would.

Most of the time, storm 1 and storm 2 run together in a continuous
stream, but sometimes you see a double peak. There is an overlap of a
few seconds between 1 and 2 in many cases.

Linux will also write blocks to disk after a period of inactivity,
controlled by dirty_expire_centisecs, which defaults to 30 seconds. So
putting a delay between storm 1 and storm 2 should help matters
somewhat, but 30 secs is probably almost exactly the wrong number (by
chance), though I do like the idea.

Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files.

So yes, putting a short delay, say 10 seconds, in at that point should
help matters somewhat, sometimes. (But the exact number depends upon how
the OS is tuned.)

The goal here is
that during the 30-second delay, the kernel will be forcing data to the
disk, so the fsync() we eventually do will only be for the write() of
buffers during the 30-second delay, and because we wrote all dirty
buffers 30 seconds ago, there shouldn't be too many of them.

...but not for that reason.

IMHO the best thing to do is to
1. put a short delay between the write() steps in the checkpoint
2. put a longer delay in between the write() phase and the fsync() phase
3. tune the OS writeback mechanism to help smoothing
Either set (1) and (2) as GUCs, or have code that reads the OS settings
and acts accordingly (a rough sketch of the latter follows below).

Or alternatively put, both Bruce and Itagaki-san have good ideas.
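
To make the "reads the OS settings" variant concrete, here is a rough,
Linux-only sketch (the function names are hypothetical, not from any
patch): derive the nap between the write() and fsync() phases from the
kernel's own writeback tunables rather than hard-coding 30 seconds.

#include <stdio.h>

static int
read_vm_tunable(const char *path, int fallback)
{
    FILE   *f = fopen(path, "r");
    int     val = fallback;

    if (f != NULL)
    {
        if (fscanf(f, "%d", &val) != 1)
            val = fallback;
        fclose(f);
    }
    return val;
}

/* Suggested nap (in seconds) between the write() and fsync() phases. */
static int
suggest_checkpoint_nap(void)
{
    /* fallbacks match common 2.6 defaults: 30s expiry, 5s wakeups */
    int     expire = read_vm_tunable(
                "/proc/sys/vm/dirty_expire_centisecs", 3000);
    int     wakeup = read_vm_tunable(
                "/proc/sys/vm/dirty_writeback_centisecs", 500);

    /* worst case before pdflush touches our pages: expiry + one wakeup */
    return (expire + wakeup) / 100;
}

On platforms without these /proc files, the fallbacks simply apply.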

So, in the real world, one conclusion seems to be that our existing
method of tuning the background writer just isn't good enough for the
average user:

#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round

These settings control what the bgwriter does, but they do not clearly
relate to the checkpoint timing, which is the purpose of the bgwriter,
and they don't change during the checkpoint interval, which is also less
than ideal. If set too aggressively, it writes too much, and if set too
low, the checkpoint does too much I/O.

Yes, that's very clear.

We clearly need more bgwriter activity as the checkpoint approaches

I'd put it that we should write a block to disk prior to checkpoint if
it appears that it won't be dirtied again if we do so. That doesn't
necessarily translate directly into *more* activity.

, and

one that is more auto-tuned, like many of our other parameters. I think
we created these settings to see how they worked in the field, so it
probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far it is to the next
checkpoint, and use that information to increase write activity.
Basically now, during a checkpoint, the bgwriter does a full buffer scan
and fsync's all dirty files, so it changes from the configuration
parameter-defined behavior right to 100% activity. I think it would be
ideal if we could ramp up the writes so that when it is 95% to the next
checkpoint, it can be operating at 95% of the activity it would do
during a checkpoint.

My guess is if we can do that, we will have much smoother performance
because we have more WAL writes just after checkpoint for newly-dirtied
pages, and the new setup will give us more write activity just before
checkpoint.
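
(For concreteness, the ramp-up Bruce describes could be sketched
roughly like this; the names below are hypothetical. Each bgwriter
round, write enough buffers that the fraction of the pool scanned
matches the fraction of the checkpoint interval already elapsed.)

#include <time.h>

/*
 * How many buffers to write this round so that progress through the
 * buffer pool tracks progress toward the next checkpoint.
 */
static int
buffers_to_write(int nbuffers, int scanned_so_far,
                 time_t last_checkpoint, int checkpoint_timeout)
{
    double  elapsed = (double) (time(NULL) - last_checkpoint)
                      / checkpoint_timeout;
    int     target = (int) (nbuffers * elapsed);

    if (target > nbuffers)
        target = nbuffers;              /* clamp at 100% of the pool */
    return (target > scanned_so_far) ? (target - scanned_so_far) : 0;
}

(Itagaki-san's patch later in the thread, #100, does essentially this
in BgGetScanPages(), additionally pacing against xlog segment
consumption.)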

Well, as long as the kernel ignores Postgres and Postgres ignores the
kernel, things will never be smooth (literally). If we write more, but
are still below the dirty_background_ratio, it won't make the slightest
bit of difference.

Trying to get trustworthy/explicable performance test reports is already
difficult for this reason.

IMHO, the best approach would be one that takes into account how the OS
behaves, so we can work with it. Regrettably, I can't see any way of
doing this other than OS-specific code of some shape or form.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
the kernel cache, so we are sure data will be on disk by checkpoint
time. This was avoided in the past because of the expense of
second-guessing the kernel disk I/O scheduling algorithms.

Seems like a longer term best approach to me.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#82Simon Riggs
simon@2ndquadrant.com
In reply to: Takayuki Tsunakawa (#37)
Re: Load distributed checkpoint

On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote:

Hello, Itagaki-san, all

Sorry for my long mail. I've had trouble in sending this mail because
it's too long for pgsql-hackers to accept (I couldn't find how large
mail is accepted.) So I'm trying to send several times.
Please see the attachment for the content.

Your results for fsync are interesting.

I've noticed that a checkpoint seems to increase the activity on the WAL
drive as well as increasing I/O wait times. That doesn't correspond to
any real increase in WAL traffic I'm aware of.

Have you tried setting deadline scheduler on the WAL device and CFQ on
the data device? That should allow the I/Os to move through different
queues and prevent interference.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#83Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#81)
Re: Load distributed checkpoint

On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote:

On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync().

Well, its correct to say that the transfer to disk is the source of the
problem, but that doesn't only occur when we fsync(). There are actually
two disk storms that occur, because of the way the fs cache works. [Ron
referred to this effect uplist]

As someone looking from the outside:

fsync only works on one file, so presumably the checkpoint process is
opening each file one by one and fsyncing them. Does that make any
difference here? Could you adjust the timing here?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#84Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#83)
Re: Load distributed checkpoint

On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote:

On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote:

On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync().

Well, its correct to say that the transfer to disk is the source of the
problem, but that doesn't only occur when we fsync(). There are actually
two disk storms that occur, because of the way the fs cache works. [Ron
referred to this effect uplist]

As someone looking from the outside:

fsync only works on one file, so presumably the checkpoint process is
opening each file one by one and fsyncing them.

Yes

Does that make any
difference here? Could you adjust the timing here?

That's the hard bit with I/O storm 2. When you fsync a file you don't
actually know how many blocks you're writing, plus there's no way to
slow down those writes by putting delays between them (although it's
possible your controller might know how to do this, I've never heard of
one that does).

If we put a delay after each fsync, that will space out all the ones
that don't need spacing out and do nothing to the ones that most need
it. Unfortunately.

IMHO there isn't any simple scheme that works all the time, for all OS
settings, default configurations and mechanisms.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#85ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#80)
Re: Load distributed checkpoint

Bruce Momjian <bruce@momjian.us> wrote:

566.973777
327.158222 <- (1) write()
560.773868 <- (2) sleep
544.106645 <- (3) fsync()

OK, so you are saying that performance dropped only during the write(),
and not during the fsync()? Interesting.

Almost yes, but there is a small drop in fsync. (560->540)

I would like to know the
results of a few tests just like you reported them above:

1a) write spread out over 30 seconds
1b) write with no delay

2a) sleep(0)
2b) sleep(30)

3) fsync

I would like to know the performance at each stage for each combination,
e.g. when using 1b, 2a, 3, performance during the write() phase was X,
during the sleep it was Y, and during the fsync it was Z. (Of course,
sleep(0) has no stage timing.)

I'm thinking about generalizing your idea; Adding three parameters
(checkpoint_write, checkpoint_naptime and checkpoint_fsync)
to control sleeps in each stage.

1) write() spread out over 'checkpoint_write' seconds.
2) sleep 'checkpoint_naptime' seconds between write() and fsync().
3) fsync() spread out over 'checkpoint_fsync' seconds.

If all three parameters are zero, checkpoints behave the same as now.
If checkpoint_write = checkpoint_timeout and the other two are zero,
it is just like my earlier proposal.

As you might expect, I intend the above only for development purposes.
Three additional parameters are hard for users to tune. If we can pull
out some proper values from the tests, we'd better set those values as
the defaults. I assume we can derive them from the existing
checkpoint_timeout.
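
Roughly, the three stages would hang together like this (a sketch only;
the paced helpers here are hypothetical stand-ins, not functions from
the patch, and the nap loop mirrors the test code posted earlier):

static void
nap_between_phases(int seconds)
{
    time_t  start = time(NULL);

    while (time(NULL) - start < seconds)
    {
        pg_usleep(BgWriterDelay * 1000L);
        AbsorbFsyncRequests();          /* keep draining backend requests */
    }
}

static void
SmoothedFlushBufferPool(void)
{
    WriteDirtyBuffersPaced(checkpoint_write);   /* (1) paced write()s */
    nap_between_phases(checkpoint_naptime);     /* (2) idle nap */
    SyncFilesPaced(checkpoint_fsync);           /* (3) paced fsync()s */
}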

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#86Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Gregory Stark (#74)
Re: Load distributed checkpoint

Gregory Stark wrote:

"Bruce Momjian" <bruce@momjian.us> writes:

I have a new idea. ...the BSD kernel...similar issue...to smooth writes:

Linux has a more complex solution to this (of course) which has undergone a
few generations over time. Older kernels had a user space daemon called
bdflush which called an undocumented syscall every 5s. More recent ones have a
kernel thread called pdflush. I think both have various mostly undocumented
tuning knobs but neither makes any sort of guarantee about the amount of time
a dirty buffer might live before being synced.

Earlier in this thread (around the 7th) was a discussion of
/proc/sys/vm/dirty_expire_centisecs and
/proc/sys/vm/dirty_writeback_centisecs, which seem to be the tunables
that matter here. Googling suggests that

dirty_expire_centisecs specifies that "data which has been dirty in
memory for longer than this interval will be written out next time a
pdflush daemon wakes up"

and

dirty_writeback_centisecs "expresses the interval between those wakeups".

It seems to me that the sum of the two times determines the maximum
time before the kernel will start syncing a dirtied page (with the
defaults of 3000 and 500 centiseconds, up to 35 seconds).

Bottom line, though, is that it seems both postgresql and the OSes are
trying to delay writes in the hope of collapsing them, and that the
actual delay is the sum of the OS's delay and postgresql's delay. I
think Kevin Grittner's experimentation earlier in the thread did indeed
suggest that getting writes to the OS faster and letting it handle the
collapsing of the writes was an effective method of reducing painful
checkpoints.

#87Jim C. Nasby
jim@nasby.net
In reply to: Simon Riggs (#84)
Re: Load distributed checkpoint

On Wed, Dec 27, 2006 at 10:54:57PM +0000, Simon Riggs wrote:

On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote:

On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote:

On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync().

Well, its correct to say that the transfer to disk is the source of the
problem, but that doesn't only occur when we fsync(). There are actually
two disk storms that occur, because of the way the fs cache works. [Ron
referred to this effect uplist]

As someone looking from the outside:

fsync only works on one file, so presumably the checkpoint process is
opening each file one by one and fsyncing them.

Yes

Does that make any
difference here? Could you adjust the timing here?

Thats the hard bit with io storm 2. When you fsync a file you don't
actually know how many blocks you're writing, plus there's no way to
slow down those writes by putting delays between them (although its
possible your controller might know how to do this, I've never heard of
one that does).

Any controller that sophisticated would likely also have a BBU and write
caching, which should greatly reduce the impact of at least the fsync
storm... unless you fill the cache. I suspect we might need a way to
control how much data we try and push out at a time to avoid that...

As for settings, I really like the simplicity of the Oracle system...
"Just try to ensure recovery takes about X amount of seconds". I like
the idea of a creeping checkpoint even more; only writing a buffer out
when we need to checkpoint it makes a lot more sense to me than trying
to guess when we'll next dirty a buffer. Such a system would probably
also be a lot easier to tune than the current bgwriter, even if we
couldn't simplify it all the way to "seconds for recovery".
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#88Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#85)
Re: Load distributed checkpoint

ITAGAKI Takahiro wrote:

Bruce Momjian <bruce@momjian.us> wrote:

566.973777
327.158222 <- (1) write()
560.773868 <- (2) sleep
544.106645 <- (3) fsync()

OK, so you are saying that performance dropped only during the write(),
and not during the fsync()? Interesting.

Almost yes, but there is a small drop in fsync. (560->540)

I would like to know the
results of a few tests just like you reported them above:

1a) write spread out over 30 seconds
1b) write with no delay

2a) sleep(0)
2b) sleep(30)

3) fsync

I would like to know the performance at each stage for each combination,
e.g. when using 1b, 2a, 3, performance during the write() phase was X,
during the sleep it was Y, and during the fsync it was Z. (Of course,
sleep(0) has no stage timing.)

I'm thinking about generalizing your idea; Adding three parameters
(checkpoint_write, checkpoint_naptime and checkpoint_fsync)
to control sleeps in each stage.

1) write() spread out over 'checkpoint_write' seconds.
2) sleep 'checkpoint_naptime' seconds between write() and fsync().
3) fsync() spread out over 'checkpoint_fsync' seconds.

If all three parameters are zero, checkpoints behave the same as now.
If checkpoint_write = checkpoint_timeout and the other two are zero,
it is just like my earlier proposal.

As you might expect, I intend the above only for development purposes.
Three additional parameters are hard for users to tune. If we can pull
out some proper values from the tests, we'd better set those values as
the defaults. I assume we can derive them from the existing
checkpoint_timeout.

Great idea, though I wouldn't bother with checkpoint_fsync. I think
Simon's previous email spelled out the problems of trying to delay
fsync()s --- in most cases, there will be one file with most of the I/O,
and that fsync is going to be the flood. Basically, I think the
variability of table access is too great for the fsync delay to ever be
tunable by users.

To summarize, if we could have fsync() only write the dirty buffers that
happened as part of the checkpoint, we could delay the write() for the
entire time between checkpoints, but we can't do that, so we have to
make it user-tunable.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#89Jim C. Nasby
jim@nasby.net
In reply to: Bruce Momjian (#88)
Re: Load distributed checkpoint

On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote:

To summarize, if we could have fsync() only write the dirty buffers that
happened as part of the checkpoint, we could delay the write() for the
entire time between checkpoints, but we can't do that, so we have to
make it user-tunable.

What about the mmap/msync(?)/munmap idea someone mentioned?
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#90Bruce Momjian
bruce@momjian.us
In reply to: Jim C. Nasby (#89)
Re: Load distributed checkpoint

Jim C. Nasby wrote:

On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote:

To summarize, if we could have fsync() only write the dirty buffers that
happened as part of the checkpoint, we could delay the write() for the
entire time between checkpoints, but we can't do that, so we have to
make it user-tunable.

What about the mmap/msync(?)/munmap idea someone mentioned?

I see that as similar to using O_DIRECT during checkpoint, which had
poor performance.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#91Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#90)
Re: Load distributed checkpoint

Bruce Momjian <bruce@momjian.us> writes:

Jim C. Nasby wrote:

What about the mmap/msync(?)/munmap idea someone mentioned?

I see that as similar to using O_DIRECT during checkpoint, which had
poor performance.

That's a complete nonstarter on portability grounds, even if msync gave
us the desired semantics, which it doesn't. It's no better than fsync
for our purposes.

To my mind the problem with fsync is not that it gives us too little
control but that it gives too much: we have to specify a particular
order of writing out files. What we'd really like is a version of
sync(2) that tells us when it's done but doesn't constrain the I/O
scheduler's choices at all. Unfortunately there's no such API ...

regards, tom lane

#92Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#91)
Re: Load distributed checkpoint

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Jim C. Nasby wrote:

What about the mmap/msync(?)/munmap idea someone mentioned?

I see that as similar to using O_DIRECT during checkpoint, which had
poor performance.

That's a complete nonstarter on portability grounds, even if msync gave
us the desired semantics, which it doesn't. It's no better than fsync
for our purposes.

To my mind the problem with fsync is not that it gives us too little
control but that it gives too much: we have to specify a particular
order of writing out files. What we'd really like is a version of
sync(2) that tells us when it's done but doesn't constrain the I/O
scheduler's choices at all. Unfortunately there's no such API ...

Yea, we used to use sync() but that did all files, not just the
PostgreSQL ones.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#93Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Tom Lane (#91)
Re: Load distributed checkpoint

Tom Lane wrote:

To my mind the problem with fsync is not that it gives us too little
control but that it gives too much: we have to specify a particular
order of writing out files. What we'd really like is a version of
sync(2) that tells us when it's done but doesn't constrain the I/O
scheduler's choices at all. Unfortunately there's no such API ...

The problem I see with fsync is that it causes an immediate I/O storm as
the OS tries to flush everything out as quickly as possible. But we're
not in a hurry. What we'd need is a lazy fsync, that would tell the
operating system "let me know when all these dirty buffers are written
to disk, but I'm not in a hurry, take your time". It wouldn't change the
scheduling of the writes, just inform the caller when they're done.

If we wanted more precise control of the flushing, we could use
sync_file_range on Linux, but that's not portable. Nevertheless, I think
it would be OK to have an ifdef and use it on platforms that support
it, if it gave a benefit.
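
For illustration, the ifdef'd use could look something like this
(Linux-specific, per sync_file_range(2); treat it as a sketch, not a
tested patch): start writeback of a file's dirty pages early without
waiting, and keep the eventual fsync() as the portable durability
guarantee.

#define _GNU_SOURCE
#include <fcntl.h>

static void
start_lazy_writeback(int fd)
{
#ifdef SYNC_FILE_RANGE_WRITE
    /* Ask the kernel to start writing the whole file's dirty pages
     * (offset 0, nbytes 0 = to end of file); returns immediately. */
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#endif
    /* On other platforms this is a no-op; the later fsync() still
     * provides the actual guarantee everywhere. */
}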

As a side note, with full_page_writes on, a checkpoint wouldn't actually
need to fsync those pages that have been written to WAL after the
checkpoint started. Doesn't make much difference in most cases, but we
could take that into account if we start taking more control of the
flushing.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#94Jim C. Nasby
jim@nasby.net
In reply to: Heikki Linnakangas (#93)
Re: Load distributed checkpoint

On Thu, Dec 28, 2006 at 09:28:48PM +0000, Heikki Linnakangas wrote:

Tom Lane wrote:

To my mind the problem with fsync is not that it gives us too little
control but that it gives too much: we have to specify a particular
order of writing out files. What we'd really like is a version of
sync(2) that tells us when it's done but doesn't constrain the I/O
scheduler's choices at all. Unfortunately there's no such API ...

The problem I see with fsync is that it causes an immediate I/O storm as
the OS tries to flush everything out as quickly as possible. But we're
not in a hurry. What we'd need is a lazy fsync, that would tell the
operating system "let me know when all these dirty buffers are written
to disk, but I'm not in a hurry, take your time". It wouldn't change the
scheduling of the writes, just inform the caller when they're done.

If we wanted more precise control of the flushing, we could use
sync_file_range on Linux, but that's not portable. Nevertheless, I think
it would be OK to have an ifdef and use it on platforms that support
it, if it gave a benefit.

I believe there's something similar for OS X as well. The question is:
would it be better to do that, or to just delay calling fsync until the
OS has had a chance to write things out.

As a side note, with full_page_writes on, a checkpoint wouldn't actually
need to fsync those pages that have been written to WAL after the
checkpoint started. Doesn't make much difference in most cases, but we
could take that into account if we start taking more control of the
flushing.

Hrm, interesting point, but I suspect the window involved there is too
small to be worth worrying about.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#95Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#94)
Re: Load distributed checkpoint

"Jim C. Nasby" <jim@nasby.net> writes:

I believe there's something similar for OS X as well. The question is:
would it be better to do that, or to just delay calling fsync until the
OS has had a chance to write things out.

A delay is not going to help unless you can suppress additional writes
to the file, which I don't think you can unless there's very little
going on in the database --- dirty buffers have to get written to make
room for other pages, checkpoint in progress or no.

regards, tom lane

#96Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#95)
Re: Load distributed checkpoint

Tom Lane wrote:

"Jim C. Nasby" <jim@nasby.net> writes:

I believe there's something similar for OS X as well. The question is:
would it be better to do that, or to just delay calling fsync until the
OS has had a chance to write things out.

A delay is not going to help unless you can suppress additional writes
to the file, which I don't think you can unless there's very little
going on in the database --- dirty buffers have to get written to make
room for other pages, checkpoint in progress or no.

I am afraid a delay between write() and fsync() is the only portable
option we have right now --- there is hope that, since the checkpoint
write()s were already issued, there will not be a huge number of dirty
buffers left to flush by the time we fsync().

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#97Jim C. Nasby
jim@nasby.net
In reply to: Bruce Momjian (#96)
Re: Load distributed checkpoint

On Fri, Dec 29, 2006 at 09:02:11PM -0500, Bruce Momjian wrote:

Tom Lane wrote:

"Jim C. Nasby" <jim@nasby.net> writes:

I believe there's something similar for OS X as well. The question is:
would it be better to do that, or to just delay calling fsync until the
OS has had a chance to write things out.

A delay is not going to help unless you can suppress additional writes
to the file, which I don't think you can unless there's very little
going on in the database --- dirty buffers have to get written to make
room for other pages, checkpoint in progress or no.

I am afraid a delay between write and fsync is the only portable option
we have right now --- there is hope that since the check point write, we
will not have a huge number of dirty buffers at the start of the
checkpoint that need to be written out.

We could potentially control that behavior, too. For example, trying to
suppress writes to a given file when we're getting ready to checkpoint
it (of course, separating checkpointing into a file-by-file operation is
a non-trivial change, but it should be possible).
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#98Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#95)
Re: Load distributed checkpoint

I believe there's something similar for OS X as well. The question is:
would it be better to do that, or to just delay calling fsync until the
OS has had a chance to write things out.

A delay is not going to help unless you can suppress additional writes
to the file, which I don't think you can unless there's very little
going on in the database --- dirty buffers have to get written to make
room for other pages, checkpoint in progress or no.

But the checkpoint first writes all dirty pages, so we have more than
the average number of pages that can be replaced without a write. Thus
we have a window where we can wait until dirty pages have to be
replaced again. Since the bgwriter is sleeping until the fsync, only
pages that have to be replaced will be written.
The question is, how do we time that window.

Andreas

#99Takayuki Tsunakawa
tsunakawa.takay@jp.fujitsu.com
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Happy new year

From: "Simon Riggs" <simon@2ndquadrant.com>

Have you tried setting deadline scheduler on the WAL device and CFQ on
the data device? That should allow the I/Os to move through different
queues and prevent interference.

No, I've not tried yet. Inaam-san told me that Linux had a few I/O
schedulers but I'm not familiar with them. I'll find information
about them (how to change the scheduler settings) and try the same
test.

----- Original Message -----
From: "Simon Riggs" <simon@2ndquadrant.com>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
Cc: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>;
<pgsql-hackers@postgresql.org>
Sent: Thursday, December 28, 2006 7:07 AM
Subject: Re: [HACKERS] Load distributed checkpoint

On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote:

Hello, Itagaki-san, all

Sorry for my long mail. I've had trouble in sending this mail because
it's too long for pgsql-hackers to accept (I couldn't find how large
mail is accepted.) So I'm trying to send several times.
Please see the attachment for the content.

Your results for fsync are interesting.

I've noticed that a checkpoint seems to increase the activity on the
WAL drive as well as increasing I/O wait times. That doesn't correspond
to any real increase in WAL traffic I'm aware of.

Have you tried setting deadline scheduler on the WAL device and CFQ on
the data device? That should allow the I/Os to move through different
queues and prevent interference.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#100ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: ITAGAKI Takahiro (#38)
1 attachment(s)
Re: Load distributed checkpoint patch

The attached is a revised patch for smoothing checkpoints. Three of the
four added GUC parameters control sleeps in each stage of a checkpoint;
the fourth is an approach to replacing one big fsync() with
finer-grained synchronization.

1. checkpoint_write_duration (default=0, in seconds)
Sets the duration of write() phase in checkpoints.
2. checkpoint_nap_duration (default=0, in seconds)
Sets the duration between write() and fsync() phases in checkpoints.
3. checkpoint_sync_duration (default=0, in seconds)
Sets the duration of fsync() phase in checkpoints.
4. checkpoint_sync_size (default=0, in KB)
Sets the synchronization unit of data files in checkpoints.

I'll send the detail to -hackers.
Any comments and testing reports will be appreciated.

This is a patch for load distributed checkpoint discussed in
http://archives.postgresql.org/pgsql-hackers/2006-12/msg00337.php

Only write() calls are smoothed; fsync() calls are not. The existing
checkpoint method is called "immediate checkpoint" in the patch, and
the new method is called "asynchronous checkpoint".
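
For anyone testing, an illustrative configuration (the values here are
made up, purely to show the units) might be:

checkpoint_write_duration = 120     # spread write() over 120 seconds
checkpoint_nap_duration = 30        # nap 30 seconds before fsync()
checkpoint_sync_duration = 10       # spread fsync() over 10 seconds
checkpoint_sync_size = 1024         # sync data files in 1MB units

All four default to 0, which preserves the current immediate behavior.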

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

checkpoint.patch (application/octet-stream)
diff -cprN head/src/backend/access/transam/xlog.c checkpoint/src/backend/access/transam/xlog.c
*** head/src/backend/access/transam/xlog.c	Mon Dec 11 13:52:47 2006
--- checkpoint/src/backend/access/transam/xlog.c	Fri Jan  5 14:15:58 2007
***************
*** 43,48 ****
--- 43,49 ----
  #include "storage/fd.h"
  #include "storage/pmsignal.h"
  #include "storage/procarray.h"
+ #include "storage/smgr.h"
  #include "storage/spin.h"
  #include "utils/builtins.h"
  #include "utils/nabstime.h"
*************** bool		XLOG_DEBUG = false;
*** 159,164 ****
--- 160,166 ----
  /* these are derived from XLOG_sync_method by assign_xlog_sync_method */
  int			sync_method = DEFAULT_SYNC_METHOD;
  static int	open_sync_bit = DEFAULT_SYNC_FLAGBIT;
+ int			checkpoint_nap_duration = 0;
  
  #define XLOG_SYNC_BIT  (enableFsync ? open_sync_bit : 0)
  
*************** static TransactionId recoveryStopXid;
*** 191,196 ****
--- 193,201 ----
  static time_t recoveryStopTime;
  static bool recoveryStopAfter;
  
+ /* checkpoint */
+ static CheckPoint checkPoint;
+ 
  /*
   * During normal operation, the only timeline we care about is ThisTimeLineID.
   * During recovery, however, things are more complicated.  To simplify life
*************** static void readRecoveryCommandFile(void
*** 469,475 ****
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
--- 474,480 ----
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
! static void CheckPointGuts(XLogRecPtr checkPointRedo, bool flush);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
*************** XLogWrite(XLogwrtRqst WriteRqst, bool fl
*** 1666,1672 ****
  						if (XLOG_DEBUG)
  							elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 						RequestCheckpoint(false, true);
  					}
  				}
  			}
--- 1671,1677 ----
  						if (XLOG_DEBUG)
  							elog(LOG, "time for a checkpoint, signaling bgwriter");
  #endif
! 						RequestCheckpoint(false);
  					}
  				}
  			}
*************** GetRedoRecPtr(void)
*** 5343,5348 ****
--- 5348,5370 ----
  }
  
  /*
+  * GetInsertRecPtr -- Returns the current insert position.
+  */
+ XLogRecPtr
+ GetInsertRecPtr(void)
+ {
+ 	volatile XLogCtlData *xlogctl = XLogCtl;
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	XLogRecPtr		recptr;
+ 
+ 	SpinLockAcquire(&xlogctl->info_lck);
+ 	INSERT_RECPTR(recptr, Insert, Insert->curridx);
+ 	SpinLockRelease(&xlogctl->info_lck);
+ 
+ 	return recptr;
+ }
+ 
+ /*
   * Get the time of the last xlog segment switch
   */
  time_t
*************** ShutdownXLOG(int code, Datum arg)
*** 5417,5448 ****
  }
  
  /*
!  * Perform a checkpoint --- either during shutdown, or on-the-fly
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
! void
! CreateCheckPoint(bool shutdown, bool force)
  {
- 	CheckPoint	checkPoint;
- 	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
- 	XLogRecData rdata;
  	uint32		freespace;
- 	uint32		_logId;
- 	uint32		_logSeg;
- 	int			nsegsadded = 0;
- 	int			nsegsremoved = 0;
- 	int			nsegsrecycled = 0;
- 
- 	/*
- 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
- 	 * (This is just pro forma, since in the present system structure there is
- 	 * only one process that is allowed to issue checkpoints at any given
- 	 * time.)
- 	 */
- 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
  	/*
  	 * Use a critical section to force system panic if we have trouble.
--- 5439,5454 ----
  }
  
  /*
!  * PrepareCheckPoint -- prepare a checkpoint
   *
   * If force is true, we force a checkpoint regardless of whether any XLOG
   * activity has occurred since the last one.
   */
! bool
! PrepareCheckPoint(bool shutdown, bool force)
  {
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	uint32		freespace;
  
  	/*
  	 * Use a critical section to force system panic if we have trouble.
*************** CreateCheckPoint(bool shutdown, bool for
*** 5501,5509 ****
  		{
  			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointStartLock);
- 			LWLockRelease(CheckpointLock);
  			END_CRIT_SECTION();
! 			return;
  		}
  	}
  
--- 5507,5515 ----
  		{
  			LWLockRelease(WALInsertLock);
  			LWLockRelease(CheckpointStartLock);
  			END_CRIT_SECTION();
! 			elog(DEBUG1, "checkpoint is not needed");
! 			return false;
  		}
  	}
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5552,5561 ****
  
  	LWLockRelease(CheckpointStartLock);
  
- 	if (!shutdown)
- 		ereport(DEBUG2,
- 				(errmsg("checkpoint starting")));
- 
  	/*
  	 * Get the other info we need for the checkpoint record.
  	 */
--- 5558,5563 ----
*************** CreateCheckPoint(bool shutdown, bool for
*** 5578,5583 ****
--- 5580,5625 ----
  							 &checkPoint.nextMulti,
  							 &checkPoint.nextMultiOffset);
  
+ 	END_CRIT_SECTION();
+ 
+ 	if (!shutdown)
+ 		ereport(DEBUG2,
+ 				(errmsg("checkpoint starting")));
+ 
+ 	return true;
+ }
+ 
+ /*
+  * CreateCheckPoint -- finish a checkpoint
+  */
+ void
+ CreateCheckPoint(bool shutdown, bool force)
+ {
+ 	XLogRecPtr	recptr;
+ 	XLogRecData rdata;
+ 	uint32		_logId;
+ 	uint32		_logSeg;
+ 	int			nsegsadded = 0;
+ 	int			nsegsremoved = 0;
+ 	int			nsegsrecycled = 0;
+ 
+ 	Assert(!shutdown || (shutdown && force));	/* must force on shutdown */
+ 
+ 	/* Do a force checkpoint if not prepared. (should not happen) */
+ 	if (!TransactionIdIsValid(checkPoint.nextXid))
+ 		force = true;
+ 
+ 	/*
+ 	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
+ 	 * (This is just pro forma, since in the present system structure there is
+ 	 * only one process that is allowed to issue checkpoints at any given
+ 	 * time.)
+ 	 */
+ 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+ 
+ 	if (shutdown || force)
+ 		PrepareCheckPoint(shutdown, force);
+ 
  	/*
  	 * Having constructed the checkpoint record, ensure all shmem disk buffers
  	 * and commit-log buffers are flushed to disk.
*************** CreateCheckPoint(bool shutdown, bool for
*** 5586,5594 ****
  	 * complete the checkpoint, but there is no reason to force a system
  	 * panic. Accordingly, exit critical section while doing it.
  	 */
! 	END_CRIT_SECTION();
! 
! 	CheckPointGuts(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
--- 5628,5634 ----
  	 * complete the checkpoint, but there is no reason to force a system
  	 * panic. Accordingly, exit critical section while doing it.
  	 */
! 	CheckPointGuts(checkPoint.redo, force);
  
  	START_CRIT_SECTION();
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5684,5689 ****
--- 5724,5731 ----
  				(errmsg("checkpoint complete; %d transaction log file(s) added, %d removed, %d recycled",
  						nsegsadded, nsegsremoved, nsegsrecycled)));
  
+ 	checkPoint.nextXid = InvalidTransactionId;	/* done */
+ 
  	LWLockRelease(CheckpointLock);
  }
  
*************** CreateCheckPoint(bool shutdown, bool for
*** 5694,5705 ****
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 	FlushBufferPool();			/* performs all required fsyncs */
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
--- 5736,5771 ----
   * recovery restartpoints.
   */
  static void
! CheckPointGuts(XLogRecPtr checkPointRedo, bool flush)
  {
  	CheckPointCLOG();
  	CheckPointSUBTRANS();
  	CheckPointMultiXact();
! 
! 	elog(DEBUG1, "CheckPointGuts %s flush", (flush ? "WITH" : "WITHOUT"));
! 
! 	if (flush)
! 		BufferSync();
! 	else if (checkpoint_nap_duration > 0)
! 	{
! 		/* naptime between write() and fsync() */
! 
! 		time_t	start;
! 
! 		elog(DEBUG1, "CHECKPOINT: nap between write() and fsync()");
! 		start = time(NULL);
! 		while (time(NULL) - start < checkpoint_nap_duration)
! 		{
! 			pg_usleep(BgWriterDelay * 1000L);
! 			BgBufferSync(0);
! 			AbsorbFsyncRequests();
! 		}
! 	}
! 
! 	elog(DEBUG1, "CHECKPOINT: fsync() start");
! 	smgrsync(flush);	/* performs all required fsyncs */
! 	elog(DEBUG1, "CHECKPOINT: fsync() finish");
! 
  	/* We deliberately delay 2PC checkpointing as long as possible */
  	CheckPointTwoPhase(checkPointRedo);
  }
*************** RecoveryRestartPoint(const CheckPoint *c
*** 5748,5754 ****
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
--- 5814,5820 ----
  	/*
  	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, true);
  
  	/*
  	 * Update pg_control so that any subsequent crash will restart from this
*************** pg_start_backup(PG_FUNCTION_ARGS)
*** 6167,6173 ****
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
  		 */
! 		RequestCheckpoint(true, false);
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
--- 6233,6239 ----
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
  		 */
! 		RequestCheckpoint(true);
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
diff -cprN head/src/backend/commands/dbcommands.c checkpoint/src/backend/commands/dbcommands.c
*** head/src/backend/commands/dbcommands.c	Tue Nov  7 09:40:50 2006
--- checkpoint/src/backend/commands/dbcommands.c	Tue Dec 19 11:12:15 2006
*************** createdb(const CreatedbStmt *stmt)
*** 504,510 ****
  		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
  		 * we can avoid this.
  		 */
! 		RequestCheckpoint(true, false);
  
  		/*
  		 * Close pg_database, but keep lock till commit (this is important to
--- 504,510 ----
  		 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
  		 * we can avoid this.
  		 */
! 		RequestCheckpoint(true);
  
  		/*
  		 * Close pg_database, but keep lock till commit (this is important to
*************** dropdb(const char *dbname, bool missing_
*** 648,654 ****
  	 * open files, which would cause rmdir() to fail.
  	 */
  #ifdef WIN32
! 	RequestCheckpoint(true, false);
  #endif
  
  	/*
--- 648,654 ----
  	 * open files, which would cause rmdir() to fail.
  	 */
  #ifdef WIN32
! 	RequestCheckpoint(true);
  #endif
  
  	/*
diff -cprN head/src/backend/postmaster/bgwriter.c checkpoint/src/backend/postmaster/bgwriter.c
*** head/src/backend/postmaster/bgwriter.c	Mon Dec  4 16:08:41 2006
--- checkpoint/src/backend/postmaster/bgwriter.c	Fri Jan  5 17:21:26 2007
***************
*** 46,51 ****
--- 46,52 ----
  #include <signal.h>
  #include <time.h>
  #include <unistd.h>
+ #include <sys/time.h>
  
  #include "access/xlog_internal.h"
  #include "libpq/pqsignal.h"
***************
*** 90,99 ****
   *	6. If ckpt_failed is different from the originally saved value,
   *	   assume request failed; otherwise it was definitely successful.
   *
!  * An additional field is ckpt_time_warn; this is also sig_atomic_t for
   * simplicity, but is only used as a boolean.  If a backend is requesting
!  * a checkpoint for which a checkpoints-too-close-together warning is
!  * reasonable, it should set this field TRUE just before sending the signal.
   *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
--- 91,102 ----
   *	6. If ckpt_failed is different from the originally saved value,
   *	   assume request failed; otherwise it was definitely successful.
   *
!  * An additional field is ckpt_immediate; this is also sig_atomic_t for
   * simplicity, but is only used as a boolean.  If a backend is requesting
!  * an immediate checkpoint, set this field TRUE just before sending the
!  * signal. If do so, checkpoint is performed as soon as possible. Otherwise,
!  * checkpoint is run asynchronously to avoid i/o impacts and a checkpoints-
!  * too-close-together warning may be reported.
   *
   * The requests array holds fsync requests sent by backends and not yet
   * absorbed by the bgwriter.  Unlike the checkpoint fields, the requests
*************** typedef struct
*** 115,121 ****
  	sig_atomic_t ckpt_done;		/* advances when checkpoint done */
  	sig_atomic_t ckpt_failed;	/* advances when checkpoint fails */
  
! 	sig_atomic_t ckpt_time_warn;	/* warn if too soon since last ckpt? */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
--- 118,124 ----
  	sig_atomic_t ckpt_done;		/* advances when checkpoint done */
  	sig_atomic_t ckpt_failed;	/* advances when checkpoint fails */
  
! 	sig_atomic_t ckpt_immediate;	/* immediate checkpoint requested? */
  
  	int			num_requests;	/* current # of requests */
  	int			max_requests;	/* allocated array size */
*************** static BgWriterShmemStruct *BgWriterShme
*** 130,135 ****
--- 133,139 ----
  int			BgWriterDelay = 200;
  int			CheckPointTimeout = 300;
  int			CheckPointWarning = 30;
+ int			checkpoint_write_duration = 0;
  
  /*
   * Flags set by interrupt handlers for later service in the main loop.
*************** static bool am_bg_writer = false;
*** 145,154 ****
--- 149,166 ----
  
  static bool ckpt_active = false;
  
+ /* Number of scanned pages during asynchronous checkpoint, otherwise -1. */
+ static int	ckpt_progress = -1;
+ 
+ /* WAL position at starting asynchronous checkpoint. */
+ static XLogRecPtr ckpt_recptr;
+ 
  static time_t last_checkpoint_time;
  static time_t last_xlog_switch_time;
  
  
+ static void BgCreateCheckPoint(bool immediate);
+ static int	BgGetScanPages(void);
  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
  static void ReqCheckpointHandler(SIGNAL_ARGS);
*************** BackgroundWriterMain(void)
*** 285,290 ****
--- 297,305 ----
  			ckpt_active = false;
  		}
  
+ 		/* Discard prepared checkpoint */
+ 		ckpt_progress = -1;
+ 
  		/*
  		 * Now return to normal top-level context and clear ErrorContext for
  		 * next time.
*************** BackgroundWriterMain(void)
*** 327,333 ****
  	for (;;)
  	{
  		bool		do_checkpoint = false;
! 		bool		force_checkpoint = false;
  		time_t		now;
  		int			elapsed_secs;
  		long		udelay;
--- 342,348 ----
  	for (;;)
  	{
  		bool		do_checkpoint = false;
! 		bool		immediate = false;
  		time_t		now;
  		int			elapsed_secs;
  		long		udelay;
*************** BackgroundWriterMain(void)
*** 353,359 ****
  		{
  			checkpoint_requested = false;
  			do_checkpoint = true;
! 			force_checkpoint = true;
  		}
  		if (shutdown_requested)
  		{
--- 368,375 ----
  		{
  			checkpoint_requested = false;
  			do_checkpoint = true;
! 			immediate = BgWriterShmem->ckpt_immediate;
! 			BgWriterShmem->ckpt_immediate = false;
  		}
  		if (shutdown_requested)
  		{
*************** BackgroundWriterMain(void)
*** 385,423 ****
  		if (do_checkpoint)
  		{
  			/*
- 			 * We will warn if (a) too soon since last checkpoint (whatever
- 			 * caused it) and (b) somebody has set the ckpt_time_warn flag
- 			 * since the last checkpoint start.  Note in particular that this
- 			 * implementation will not generate warnings caused by
- 			 * CheckPointTimeout < CheckPointWarning.
- 			 */
- 			if (BgWriterShmem->ckpt_time_warn &&
- 				elapsed_secs < CheckPointWarning)
- 				ereport(LOG,
- 						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
- 								elapsed_secs),
- 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
- 			BgWriterShmem->ckpt_time_warn = false;
- 
- 			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
  			BgWriterShmem->ckpt_started++;
  
! 			CreateCheckPoint(false, force_checkpoint);
! 
! 			/*
! 			 * After any checkpoint, close all smgr files.	This is so we
! 			 * won't hang onto smgr references to deleted files indefinitely.
! 			 */
! 			smgrcloseall();
  
! 			/*
! 			 * Indicate checkpoint completion to any waiting backends.
! 			 */
! 			BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
! 			ckpt_active = false;
  
  			/*
  			 * Note we record the checkpoint start time not end time as
--- 401,456 ----
  		if (do_checkpoint)
  		{
  			/*
  			 * Indicate checkpoint start to any waiting backends.
  			 */
  			ckpt_active = true;
  			BgWriterShmem->ckpt_started++;
  
! 			if (immediate)
! 			{
! 				/* Cancel previous asynchronous checkpoint if running */
! 				if (ckpt_progress >= 0)
! 				{
! 					elog(DEBUG1, "cancel previous asynchronous checkpoint");
! 					ckpt_progress = -1;
! 				}
! 				BgCreateCheckPoint(true);
! 			}
! 			else
! 			{
! 				/*
! 				 * We will warn if (a) too soon since last checkpoint
! 				 * (whatever caused it) and (b) the checkpoint is not forced.
! 				 * Note in particular that this implementation will not generate
! 				 * warnings caused by CheckPointTimeout < CheckPointWarning.
! 				 */
! 				if (elapsed_secs < CheckPointTimeout &&
! 					elapsed_secs < CheckPointWarning)
! 					ereport(LOG,
! 						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 								elapsed_secs),
! 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
  
! 				/* Finish previous asynchronous checkpoint if running. */
! 				if (ckpt_progress >= 0)
! 				{
! 					int	num_to_scan = NBuffers - ckpt_progress;
! 					if (num_to_scan > 0)
! 					{
! 						elog(DEBUG1, "finish previous checkpoint, writing %d buffers", num_to_scan);
! 						BgBufferSync(num_to_scan);
! 					}
! 					BgCreateCheckPoint(false);
! 				}
! 
! 				/* Start an asynchronous checkpoint */
! 				elog(DEBUG1, "prepare asynchronous checkpoint");
! 				if (PrepareCheckPoint(false, false))
! 				{
! 					ckpt_progress = 0;	/* >=0 means 'prepared' */
! 					ckpt_recptr = GetInsertRecPtr();
! 				}
! 			}
  
  			/*
  			 * Note we record the checkpoint start time not end time as
*************** BackgroundWriterMain(void)
*** 426,433 ****
  			 */
  			last_checkpoint_time = now;
  		}
  		else
! 			BgBufferSync();
  
  		/*
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
--- 459,481 ----
  			 */
  			last_checkpoint_time = now;
  		}
+ 		else if (ckpt_progress >= 0)
+ 		{
+ 			/* asynchronous checkpoint is running */
+ 			ckpt_progress += BgBufferSync(BgGetScanPages());
+ 
+ 			/*
+ 			 * If all dirty buffers have been written, we can continue
+ 			 * the rest of the checkpoint.
+ 			 */
+ 			if (ckpt_progress >= NBuffers)
+ 				BgCreateCheckPoint(false);
+ 		}
  		else
! 		{
! 			/* Normal activity, not in checkpoints */
! 			BgBufferSync(0);
! 		}
  
  		/*
  		 * Check for archive_timeout, if so, switch xlog files.  First we do a
*************** BackgroundWriterMain(void)
*** 506,511 ****
--- 554,628 ----
  	}
  }
  
+ /*
+  * BgCreateCheckPoint -- create checkpoint and close all smgr files.
+  */
+ static void
+ BgCreateCheckPoint(bool immediate)
+ {
+ 	elog(DEBUG1, "start %s checkpoint",
+ 		(immediate ? "immediate" : "asynchronous"));
+ 
+ 	ckpt_active = true;
+ 
+ 	CreateCheckPoint(false, immediate);
+ 
+ 	/*
+ 	 * After any checkpoint, close all smgr files.	This is so we
+ 	 * won't hang onto smgr references to deleted files indefinitely.
+ 	 */
+ 	smgrcloseall();
+ 
+ 	/*
+ 	 * Indicate checkpoint completion to any waiting backends.
+ 	 */
+ 	BgWriterShmem->ckpt_done = BgWriterShmem->ckpt_started;
+ 
+ 	ckpt_progress = -1;
+ 	ckpt_active = false;
+ }
+ 
+ /*
+  * BgGetScanPages -- Returns the number of pages recommended to be scanned.
+  */
+ static int
+ BgGetScanPages(void)
+ {
+ 	struct timeval	now;
+ 	XLogRecPtr		recptr;
+ 	double			progress_in_time,
+ 					progress_in_xlog;
+ 	int				pos;
+ 
+ 	if (checkpoint_write_duration == 0)
+ 		return NBuffers;	/* write all pages at once */
+ 
+ 	/* coordinate the progress with checkpoint_timeout */
+ 	gettimeofday(&now, NULL);
+ 	progress_in_time = ((double)(int)(now.tv_sec - last_checkpoint_time) +
+ 						now.tv_usec / 1000000.0) / checkpoint_write_duration;
+ 
+ 	/* coordinate the progress with checkpoint_segments */
+ 	recptr = GetInsertRecPtr();
+ 	progress_in_xlog =
+ 		((double) (recptr.xlogid - ckpt_recptr.xlogid) * XLogSegsPerFile +
+ 		 (double) (recptr.xrecoff - ckpt_recptr.xrecoff) / XLogSegSize) /
+ 		CheckPointSegments;
+ 
+ 	pos = (int) (NBuffers * Max(progress_in_time, progress_in_xlog));
+ 	if (pos > NBuffers)
+ 		pos = NBuffers;
+ 
+ 	elog(DEBUG2, "checkpoint progress : %d->%d(%+d) (time=%.3f, xlog=%.3f)",
+ 		ckpt_progress, pos, pos - ckpt_progress,
+ 		progress_in_time, progress_in_xlog);
+ 
+ 	if (pos > ckpt_progress)
+ 		return pos - ckpt_progress;
+ 	else
+ 		return 0;
+ }
+ 
  
  /* --------------------------------
   *		signal handler routines
*************** BgWriterShmemInit(void)
*** 612,625 ****
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
!  * immediately.
!  *
!  * If warnontime is true, and it's "too soon" since the last checkpoint,
!  * the bgwriter will log a warning.  This should be true only for checkpoints
!  * caused due to xlog filling, else the warning will be misleading.
   */
  void
! RequestCheckpoint(bool waitforit, bool warnontime)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
--- 729,740 ----
   *
   * If waitforit is true, wait until the checkpoint is completed
   * before returning; otherwise, just signal the request and return
!  * immediately.  A waitforit of false means the checkpoint was caused
!  * by xlog filling; in that case, if it's "too soon" since the last
!  * checkpoint, the bgwriter will log a warning.
   */
  void
! RequestCheckpoint(bool waitforit)
  {
  	/* use volatile pointer to prevent code rearrangement */
  	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
*************** RequestCheckpoint(bool waitforit, bool w
*** 642,650 ****
  		return;
  	}
  
! 	/* Set warning request flag if appropriate */
! 	if (warnontime)
! 		bgs->ckpt_time_warn = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
--- 757,765 ----
  		return;
  	}
  
! 	/* Set immediate request flag if appropriate */
! 	if (waitforit)
! 		bgs->ckpt_immediate = true;
  
  	/*
  	 * Send signal to request checkpoint.  When waitforit is false, we
diff -cprN head/src/backend/storage/buffer/bufmgr.c checkpoint/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	Thu Oct 26 12:48:40 2006
--- checkpoint/src/backend/storage/buffer/bufmgr.c	Fri Jan  5 11:26:35 2007
*************** BufferSync(void)
*** 985,1005 ****
   * BgBufferSync -- Write out some dirty buffers in the pool.
   *
   * This is called periodically by the background writer process.
   */
! void
! BgBufferSync(void)
  {
  	static int	buf_id1 = 0;
  	int			buf_id2;
- 	int			num_to_scan;
  	int			num_written;
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * To minimize work at checkpoint time, we want to try to keep all the
! 	 * buffers clean; this motivates a scan that proceeds sequentially through
  	 * all buffers.  But we are also charged with ensuring that buffers that
  	 * will be recycled soon are clean when needed; these buffers are the ones
  	 * just ahead of the StrategySyncStart point.  We make a separate scan
--- 985,1009 ----
   * BgBufferSync -- Write out some dirty buffers in the pool.
   *
   * This is called periodically by the background writer process.
+  * num_to_scan buffers are always scanned and written by the 'all' activity.
+  * Returns the number of scanned buffers by the 'all' activity.
   */
! int
! BgBufferSync(int num_to_scan)
  {
  	static int	buf_id1 = 0;
  	int			buf_id2;
  	int			num_written;
+ 	int			num_scanned = 0;
+ 
+ 	Assert(num_to_scan >= 0);
  
  	/* Make sure we can handle the pin inside SyncOneBuffer */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
  
  	/*
! 	 * To distribute work at checkpoint time, we want to try to write buffers
! 	 * by increments; this motivates a scan that proceeds sequentially through
  	 * all buffers.  But we are also charged with ensuring that buffers that
  	 * will be recycled soon are clean when needed; these buffers are the ones
  	 * just ahead of the StrategySyncStart point.  We make a separate scan
*************** BgBufferSync(void)
*** 1007,1027 ****
  	 */
  
  	/*
! 	 * This loop runs over all buffers, including pinned ones.	The starting
! 	 * point advances through the buffer pool on successive calls.
  	 *
  	 * Note that we advance the static counter *before* trying to write. This
  	 * ensures that, if we have a persistent write failure on a dirty buffer,
  	 * we'll still be able to make progress writing other buffers. (The
  	 * bgwriter will catch the error and just call us again later.)
  	 */
! 	if (bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > 0)
  	{
! 		num_to_scan = (int) ((NBuffers * bgwriter_all_percent + 99) / 100);
! 		num_written = 0;
  
  		while (num_to_scan-- > 0)
  		{
  			if (++buf_id1 >= NBuffers)
  				buf_id1 = 0;
  			if (SyncOneBuffer(buf_id1, false))
--- 1011,1043 ----
  	 */
  
  	/*
! 	 * These two loops run over all buffers, including pinned ones.	The
! 	 * starting point advances through the buffer pool on successive calls.
  	 *
  	 * Note that we advance the static counter *before* trying to write. This
  	 * ensures that, if we have a persistent write failure on a dirty buffer,
  	 * we'll still be able to make progress writing other buffers. (The
  	 * bgwriter will catch the error and just call us again later.)
  	 */
! 
! 	num_written = 0;
! 	while (num_to_scan-- > 0)
  	{
! 		num_scanned += 1;
! 		if (++buf_id1 >= NBuffers)
! 			buf_id1 = 0;
! 		if (SyncOneBuffer(buf_id1, false))
! 			++num_written;
! 	}
! 
! 	if (bgwriter_all_percent > 0.0 && bgwriter_all_maxpages > num_written)
! 	{
! 		num_to_scan = (int) ((NBuffers * bgwriter_all_percent + 99) / 100)
! 					- num_scanned;
  
  		while (num_to_scan-- > 0)
  		{
+ 			num_scanned += 1;
  			if (++buf_id1 >= NBuffers)
  				buf_id1 = 0;
  			if (SyncOneBuffer(buf_id1, false))
*************** BgBufferSync(void)
*** 1054,1059 ****
--- 1070,1077 ----
  				buf_id2 = 0;
  		}
  	}
+ 
+ 	return num_scanned;
  }
  
  /*
*************** void
*** 1273,1279 ****
  FlushBufferPool(void)
  {
  	BufferSync();
! 	smgrsync();
  }
  
  
--- 1291,1297 ----
  FlushBufferPool(void)
  {
  	BufferSync();
! 	smgrsync(true);
  }
  
  
diff -cprN head/src/backend/storage/file/fd.c checkpoint/src/backend/storage/file/fd.c
*** head/src/backend/storage/file/fd.c	Wed Nov  8 14:10:42 2006
--- checkpoint/src/backend/storage/file/fd.c	Fri Jan  5 17:21:26 2007
***************
*** 40,51 ****
--- 40,59 ----
  
  #include "postgres.h"
  
+ #ifndef _GNU_SOURCE
+ #define _GNU_SOURCE
+ #endif
+ 
  #include <sys/file.h>
  #include <sys/param.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <fcntl.h>
  
+ #ifdef _POSIX_MAPPED_FILES
+ #include <sys/mman.h>
+ #endif
+ 
  #include "miscadmin.h"
  #include "access/xact.h"
  #include "storage/fd.h"
*************** FileSync(File file)
*** 1122,1127 ****
--- 1130,1190 ----
  	return pg_fsync(VfdCache[file].fd);
  }
  
+ int
+ FileSyncRange(File file, long offset, long amount)
+ {
+ 	int			returnCode;
+ 
+ 	Assert(FileIsValid(file));
+ 
+ 	DO_DB(elog(LOG, "FileSyncRange: %d (%s) %ld %ld",
+ 			   file, VfdCache[file].fileName, offset, amount));
+ 
+ 	returnCode = FileAccess(file);
+ 	if (returnCode < 0)
+ 		return returnCode;
+ 
+ #if defined(SYNC_FILE_RANGE_WRITE)
+ 	returnCode = sync_file_range(VfdCache[file].fd, offset, amount,
+ 		SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
+ #elif _POSIX_MAPPED_FILES > 0
+ 	{
+ 		void *p = mmap(NULL, amount, PROT_READ | PROT_WRITE, MAP_SHARED,
+ 			VfdCache[file].fd, offset);
+ 		if (p == MAP_FAILED)
+ 			return -1;
+ 		returnCode = msync(p, amount, MS_SYNC);
+ 		munmap(p, amount);
+ 	}
+ #elif defined(WIN32)
+ 	{
+ 		HANDLE	mapping = NULL;
+ 		LPVOID	p = NULL;
+ 
+ 		if (!(mapping = CreateFileMapping(
+ 					(HANDLE) _get_osfhandle(VfdCache[file].fd),
+ 					NULL, PAGE_READWRITE, 0, 0, NULL)) ||
+ 			!(p = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS,
+ 					0, offset, amount)) ||
+ 			!FlushViewOfFile(p, amount))
+ 		{
+ 			returnCode = -1;
+ 			errno = GetLastError();
+ 		}
+ 
+ 		if (p)
+ 			UnmapViewOfFile(p);
+ 		if (mapping)
+ 			CloseHandle(mapping);
+ 	}
+ #else
+ 	elog(WARNING, "Native sync_file_range() is not supported");
+ 	returnCode = pg_fsync(VfdCache[file].fd);
+ #endif
+ 
+ 	return returnCode;
+ }
+ 
  long
  FileSeek(File file, long offset, int whence)
  {
diff -cprN head/src/backend/storage/smgr/md.c checkpoint/src/backend/storage/smgr/md.c
*** head/src/backend/storage/smgr/md.c	Fri Jan  5 15:25:19 2007
--- checkpoint/src/backend/storage/smgr/md.c	Fri Jan  5 17:21:26 2007
***************
*** 17,26 ****
--- 17,28 ----
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/file.h>
+ #include <sys/time.h>
  
  #include "catalog/catalog.h"
  #include "miscadmin.h"
  #include "postmaster/bgwriter.h"
+ #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
*************** typedef struct _MdfdVec
*** 86,92 ****
  } MdfdVec;
  
  static MemoryContext MdCxt;		/* context for all md.c allocations */
! 
  
  /*
   * In some contexts (currently, standalone backends and the bgwriter process)
--- 88,95 ----
  } MdfdVec;
  
  static MemoryContext MdCxt;		/* context for all md.c allocations */
! int		checkpoint_sync_size = 0;
! int		checkpoint_sync_duration = 0;
  
  /*
   * In some contexts (currently, standalone backends and the bgwriter process)
*************** mdimmedsync(SMgrRelation reln)
*** 825,830 ****
--- 828,856 ----
  #endif
  }
  
+ static void
+ sleep_between_sync(long len, uint64 total_len)
+ {
+ 	struct timeval start;
+ 	struct timeval now;
+ 	uint64	usec;
+ 
+ 	if (len <= 0 || total_len == 0 || checkpoint_sync_duration == 0)
+ 		return;
+ 
+ 	usec = (uint64) checkpoint_sync_duration * 1000000 * len / total_len;
+ 
+ 	gettimeofday(&start, NULL);
+ 	do
+ 	{
+ 		pg_usleep(BgWriterDelay * 1000L);
+ 		BgBufferSync(0);
+ 		AbsorbFsyncRequests();
+ 		gettimeofday(&now, NULL);
+ 	} while((uint64) (now.tv_sec - start.tv_sec) * 1000000 +
+ 			(now.tv_usec - start.tv_usec) < usec);
+ }
+ 
  /*
   *	mdsync() -- Sync previous writes to stable storage.
   *
*************** mdsync(void)
*** 837,842 ****
--- 863,869 ----
  	HASH_SEQ_STATUS hstat;
  	PendingOperationEntry *entry;
  	int			absorb_counter;
+ 	uint64		total_len;
  
  	if (!pendingOpsTable)
  		elog(ERROR, "cannot sync without a pendingOpsTable");
*************** mdsync(void)
*** 849,854 ****
--- 876,903 ----
  	 */
  	AbsorbFsyncRequests();
  
+ 	/*
+ 	 * Calculate the total length of the files to fsync.
+ 	 */
+ 	total_len = 0;
+ 	if (checkpoint_sync_duration > 0)
+ 	{
+ 		hash_seq_init(&hstat, pendingOpsTable);
+ 		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+ 		{
+ 			SMgrRelation	reln;
+ 			MdfdVec		   *seg;
+ 			long			len;
+ 
+ 			reln = smgropen(entry->rnode);
+ 			seg = _mdfd_getseg(reln,
+ 							entry->segno * ((BlockNumber) RELSEG_SIZE),
+ 							true, EXTENSION_RETURN_NULL);
+ 			if (seg && (len = FileSeek(seg->mdfd_vfd, 0, SEEK_END)) >= 0)
+ 				total_len += len;
+ 		}
+ 	}
+ 
  	absorb_counter = FSYNCS_PER_ABSORB;
  	hash_seq_init(&hstat, pendingOpsTable);
  	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
*************** mdsync(void)
*** 905,919 ****
  							   false, EXTENSION_RETURN_NULL);
  			if (seg)
  			{
! 				if (FileSync(seg->mdfd_vfd) < 0 &&
! 					errno != ENOENT)
! 					ereport(ERROR,
  							(errcode_for_file_access(),
  							 errmsg("could not fsync segment %u of relation %u/%u/%u: %m",
  									entry->segno,
  									entry->rnode.spcNode,
  									entry->rnode.dbNode,
  									entry->rnode.relNode)));
  			}
  		}
  
--- 954,1011 ----
  							   false, EXTENSION_RETURN_NULL);
  			if (seg)
  			{
! 				if (checkpoint_sync_size > 0 && checkpoint_sync_duration > 0)
! 				{
! 					long	len;
! 					long	off;
! 					long	sync_bytes = checkpoint_sync_size * 1024;
! 
! 					len = FileSeek(seg->mdfd_vfd, 0, SEEK_END);
! 					for (off = 0; off < len; off += sync_bytes)
! 					{
! 						long	amount;
! 
! 						if (off + sync_bytes < len)
! 							amount = sync_bytes;
! 						else
! 							amount = len - off;
! 
! 						elog(DEBUG2, "FileSyncRange : segment %u of relation %u/%u/%u range %ld/%ld",
! 							entry->segno,
! 							entry->rnode.spcNode,
! 							entry->rnode.dbNode,
! 							entry->rnode.relNode,
! 							off, amount);
! 
! 						if (FileSyncRange(seg->mdfd_vfd, off, amount) < 0 && errno != ENOENT)
! 							ereport(ERROR,
! 								(errcode_for_file_access(),
! 								 errmsg("could not fsync segment %u of relation %u/%u/%u range %ld/%ld: %m",
! 									entry->segno,
! 									entry->rnode.spcNode,
! 									entry->rnode.dbNode,
! 									entry->rnode.relNode,
! 									off, amount)));
! 
! 						if (total_len > 0)
! 							sleep_between_sync(amount, total_len);
! 					}
! 				}
! 				else
! 				{
! 					if (FileSync(seg->mdfd_vfd) < 0 && errno != ENOENT)
! 						ereport(ERROR,
  							(errcode_for_file_access(),
  							 errmsg("could not fsync segment %u of relation %u/%u/%u: %m",
  									entry->segno,
  									entry->rnode.spcNode,
  									entry->rnode.dbNode,
  									entry->rnode.relNode)));
+ 
+ 					if (total_len > 0)
+ 						sleep_between_sync(
+ 							FileSeek(seg->mdfd_vfd, 0, SEEK_END), total_len);
+ 				}
  			}
  		}
  
diff -cprN head/src/backend/storage/smgr/smgr.c checkpoint/src/backend/storage/smgr/smgr.c
*** head/src/backend/storage/smgr/smgr.c	Fri Jan  5 15:25:19 2007
--- checkpoint/src/backend/storage/smgr/smgr.c	Fri Jan  5 15:28:49 2007
***************
*** 27,32 ****
--- 27,33 ----
  #include "storage/smgr.h"
  #include "utils/memutils.h"
  
+ extern int	checkpoint_sync_duration;
  
  /*
   * This struct of function pointers defines the API between smgr.c and
*************** smgrabort(void)
*** 781,795 ****
   *	smgrsync() -- Sync files to disk at checkpoint time.
   */
  void
! smgrsync(void)
  {
  	int			i;
  
  	for (i = 0; i < NSmgr; i++)
  	{
  		if (smgrsw[i].smgr_sync)
  			(*(smgrsw[i].smgr_sync)) ();
  	}
  }
  
  
--- 782,804 ----
   *	smgrsync() -- Sync files to disk at checkpoint time.
   */
  void
! smgrsync(bool immediate)
  {
  	int			i;
+ 	long	prev_checkpoint_sync_duration = checkpoint_sync_duration;
+ 
	/* FIXME: Use a more clever way. */
+ 	if (immediate)
+ 		checkpoint_sync_duration = 0;
  
  	for (i = 0; i < NSmgr; i++)
  	{
  		if (smgrsw[i].smgr_sync)
  			(*(smgrsw[i].smgr_sync)) ();
  	}
+ 
+ 	if (immediate)
+ 		checkpoint_sync_duration = prev_checkpoint_sync_duration;
  }
  
  
diff -cprN head/src/backend/tcop/utility.c checkpoint/src/backend/tcop/utility.c
*** head/src/backend/tcop/utility.c	Tue Oct 17 18:18:32 2006
--- checkpoint/src/backend/tcop/utility.c	Tue Dec 19 11:12:15 2006
*************** ProcessUtility(Node *parsetree,
*** 1055,1061 ****
  				ereport(ERROR,
  						(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  						 errmsg("must be superuser to do CHECKPOINT")));
! 			RequestCheckpoint(true, false);
  			break;
  
  		case T_ReindexStmt:
--- 1055,1061 ----
  				ereport(ERROR,
  						(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  						 errmsg("must be superuser to do CHECKPOINT")));
! 			RequestCheckpoint(true);
  			break;
  
  		case T_ReindexStmt:
diff -cprN head/src/backend/utils/misc/guc.c checkpoint/src/backend/utils/misc/guc.c
*** head/src/backend/utils/misc/guc.c	Thu Jan  4 14:56:19 2007
--- checkpoint/src/backend/utils/misc/guc.c	Fri Jan  5 17:21:26 2007
*************** extern int	CommitDelay;
*** 99,104 ****
--- 99,108 ----
  extern int	CommitSiblings;
  extern char *default_tablespace;
  extern bool fullPageWrites;
+ extern int	checkpoint_write_duration;
+ extern int	checkpoint_nap_duration;
+ extern int	checkpoint_sync_duration;
+ extern int	checkpoint_sync_size;
  
  #ifdef TRACE_SORT
  extern bool trace_sort;
*************** static struct config_int ConfigureNamesI
*** 1423,1428 ****
--- 1427,1472 ----
  	},
  
  	{
+ 		{"checkpoint_write_duration", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration of write() phase in checkpoints."),
+ 			NULL,
+ 			GUC_UNIT_S
+ 		},
+ 		&checkpoint_write_duration,
+ 		0, 0, 3600, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"checkpoint_nap_duration", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration between write() and fsync() phases in checkpoints."),
+ 			NULL,
+ 			GUC_UNIT_S
+ 		},
+ 		&checkpoint_nap_duration,
+ 		0, 0, 3600, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"checkpoint_sync_duration", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the duration of fsync() phase in checkpoints."),
+ 			NULL,
+ 			GUC_UNIT_S
+ 		},
+ 		&checkpoint_sync_duration,
+ 		0, 0, 3600, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"checkpoint_sync_size", PGC_SIGHUP, WAL_CHECKPOINTS,
+ 			gettext_noop("Sets the synchronization unit of data files in checkpoints."),
+ 			NULL,
+ 			GUC_UNIT_KB
+ 		},
+ 		&checkpoint_sync_size,
+ 		0, 0, RELSEG_SIZE * BLCKSZ / 1024, NULL, NULL
+ 	},
+ 
+ 	{
  		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
  			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
  			NULL,
diff -cprN head/src/backend/utils/misc/postgresql.conf.sample checkpoint/src/backend/utils/misc/postgresql.conf.sample
*** head/src/backend/utils/misc/postgresql.conf.sample	Wed Nov 29 15:21:18 2006
--- checkpoint/src/backend/utils/misc/postgresql.conf.sample	Fri Jan  5 11:26:35 2007
***************
*** 164,169 ****
--- 164,173 ----
  #checkpoint_segments = 3		# in logfile segments, min 1, 16MB each
  #checkpoint_timeout = 5min		# range 30s-1h
  #checkpoint_warning = 30s		# 0 is off
+ #checkpoint_write_duration = 0s		# duration of the write() phase in checkpoints
+ #checkpoint_nap_duration = 0s		# duration between the write() and fsync() phases
+ #checkpoint_sync_duration = 0s		# duration of the fsync() phase in checkpoints
+ #checkpoint_sync_size = 0MB		# synchronization unit of files; 0 means all at once
  
  # - Archiving -
  
diff -cprN head/src/include/access/xlog.h checkpoint/src/include/access/xlog.h
*** head/src/include/access/xlog.h	Tue Nov  7 09:40:50 2006
--- checkpoint/src/include/access/xlog.h	Tue Dec 19 11:12:15 2006
*************** extern void BootStrapXLOG(void);
*** 162,170 ****
--- 162,172 ----
  extern void StartupXLOG(void);
  extern void ShutdownXLOG(int code, Datum arg);
  extern void InitXLOGAccess(void);
+ extern bool PrepareCheckPoint(bool shutdown, bool force);
  extern void CreateCheckPoint(bool shutdown, bool force);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr GetRedoRecPtr(void);
+ extern XLogRecPtr GetInsertRecPtr(void);
  extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
  
  #endif   /* XLOG_H */
diff -cprN head/src/include/postmaster/bgwriter.h checkpoint/src/include/postmaster/bgwriter.h
*** head/src/include/postmaster/bgwriter.h	Tue Oct 10 16:16:01 2006
--- checkpoint/src/include/postmaster/bgwriter.h	Tue Dec 19 11:12:15 2006
*************** extern int	CheckPointWarning;
*** 23,29 ****
  
  extern void BackgroundWriterMain(void);
  
! extern void RequestCheckpoint(bool waitforit, bool warnontime);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
--- 23,29 ----
  
  extern void BackgroundWriterMain(void);
  
! extern void RequestCheckpoint(bool waitforit);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno);
  extern void AbsorbFsyncRequests(void);
diff -cprN head/src/include/storage/bufmgr.h checkpoint/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	Tue Oct 10 16:16:01 2006
--- checkpoint/src/include/storage/bufmgr.h	Tue Dec 19 11:12:15 2006
*************** extern void AbortBufferIO(void);
*** 151,157 ****
  
  extern void BufmgrCommit(void);
  extern void BufferSync(void);
! extern void BgBufferSync(void);
  
  extern void AtProcExit_LocalBuffers(void);
  
--- 151,157 ----
  
  extern void BufmgrCommit(void);
  extern void BufferSync(void);
! extern int	BgBufferSync(int num_to_scan);
  
  extern void AtProcExit_LocalBuffers(void);
  
diff -cprN head/src/include/storage/fd.h checkpoint/src/include/storage/fd.h
*** head/src/include/storage/fd.h	Tue Oct 10 16:16:01 2006
--- checkpoint/src/include/storage/fd.h	Fri Jan  5 11:26:35 2007
*************** extern void FileUnlink(File file);
*** 67,72 ****
--- 67,73 ----
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
  extern int	FileSync(File file);
+ extern int	FileSyncRange(File file, long offset, long amount);
  extern long FileSeek(File file, long offset, int whence);
  extern int	FileTruncate(File file, long offset);
  
diff -cprN head/src/include/storage/smgr.h checkpoint/src/include/storage/smgr.h
*** head/src/include/storage/smgr.h	Fri Jan  5 15:25:19 2007
--- checkpoint/src/include/storage/smgr.h	Fri Jan  5 15:28:49 2007
*************** extern void AtSubAbort_smgr(void);
*** 82,88 ****
  extern void PostPrepare_smgr(void);
  extern void smgrcommit(void);
  extern void smgrabort(void);
! extern void smgrsync(void);
  
  extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
--- 82,88 ----
  extern void PostPrepare_smgr(void);
  extern void smgrcommit(void);
  extern void smgrabort(void);
! extern void smgrsync(bool immediate);
  
  extern void smgr_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void smgr_desc(StringInfo buf, uint8 xl_info, char *rec);
#101ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#88)
Re: Load distributed checkpoint

I wrote:

I'm thinking about generalizing your idea: adding three parameters
to control sleeps in each stage.

I sent a patch to -patches that adds 3+1 GUC parameters for checkpoints.
We can use three of them to control sleeps in each stage during checkpoints.
The last is an experimental approach that replaces fsync() for finer control.

1. checkpoint_write_duration (default=0, in seconds)
Sets the duration of the write() phase in checkpoints.
2. checkpoint_nap_duration (default=0, in seconds)
Sets the duration between the write() and fsync() phases in checkpoints.
3. checkpoint_sync_duration (default=0, in seconds)
Sets the duration of the fsync() phase in checkpoints.

The 1st parameter spreads write(). If you set checkpoint_write_duration
to 90% of checkpoint_timeout, it's just the same as the patch I sent before.

The 2nd is the naptime between the write() and fsync() phases. The kernel's
writer might do much of the work if you set it to around 30-60s, which might
be useful for some traditional UNIXes, as you say. In contrast, the 1st was
the most variable on my machine somehow (Windows and Linux).

The 3rd spreads fsync(). This parameter only works when you have several
tables or a very large table (consisting of several 1GB segment files),
because fsync() is done on a per-file basis.
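
For illustration, the three durations might be combined in postgresql.conf
like this (assuming checkpoint_timeout = 5min; the values are hypothetical
starting points, not recommendations):

checkpoint_write_duration = 180s	# spread write() over 3 minutes
checkpoint_nap_duration = 30s		# let the kernel's writer work in between
checkpoint_sync_duration = 60s		# spread fsync() over 1 minute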

Bruce Momjian <bruce@momjian.us> wrote:

To summarize, if we could have fsync() only write the dirty buffers that
happened as part of the checkpoint, we could delay the write() for the
entire time between checkpoints, but we can't do that, so we have to
make it user-tunable.

The 3rd has the above limitation, so I added another parameter.

4. checkpoint_sync_size (default=0, in KB)
Sets the synchronization unit of data files in checkpoints.

It uses sync_file_range() or mmap/msync/munmap to divide file synchronization
into the specified granularity. I think 16-64MB suits machines whose
checkpoint performance is restricted by fsync().
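
To show the technique in isolation, here is a minimal standalone C sketch
of syncing a file in fixed-size ranges with a sleep between ranges. It
assumes Linux with sync_file_range(); the helper name sync_in_chunks is
hypothetical, error handling is trimmed, and it is an illustration of the
idea, not the patch code itself:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Sync fd in chunk-byte ranges, sleeping nap_usec microseconds after
 * each range so the device does not get one big burst of I/O.
 * Note: like fdatasync(), sync_file_range() does not flush file metadata.
 */
static int
sync_in_chunks(int fd, off_t chunk, unsigned int nap_usec)
{
	struct stat st;
	off_t		off;

	if (fstat(fd, &st) < 0)
		return -1;

	for (off = 0; off < st.st_size; off += chunk)
	{
		off_t	amount = (st.st_size - off < chunk) ? st.st_size - off : chunk;

		if (sync_file_range(fd, off, amount,
							SYNC_FILE_RANGE_WAIT_BEFORE |
							SYNC_FILE_RANGE_WRITE |
							SYNC_FILE_RANGE_WAIT_AFTER) < 0)
			return -1;

		usleep(nap_usec);		/* spread the flushes over time */
	}
	return 0;
}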

The feature is incomplete. For example, sync_file_range() does not actually
flush file metadata (it's the equivalent of fdatasync), so we may lose
data under the patch. That must be fixed, but I want to measure the
advantage before that.

I'm interested in which parameter is useful for each environment.
Any comments and testing reports will be appreciated.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#102Inaam Rana
inaamrana@gmail.com
In reply to: Takayuki Tsunakawa (#99)
Re: Load distributed checkpoint

No, I've not tried yet. Inaam-san told me that Linux had a few I/O
schedulers but I'm not familiar with them. I'll find information
about them (how to change the scheduler settings) and try the same
test.

I am sorry, your response just slipped by me. The docs for RHEL (I believe
you are running RHEL, which has the 2.6.9 kernel) say that it does support
a selectable I/O scheduler.

http://www.redhat.com/rhel/details/limits/

I am not sure where else to look for the scheduler apart from /sys.

regards,
inaam

#103Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#1)
Re: Load distributed checkpoint

Thread added to TODO list:

* Reduce checkpoint performance degradation by forcing data to disk
more evenly

http://archives.postgresql.org/pgsql-hackers/2006-12/msg00337.php
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00079.php


--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#104Inaam Rana
inaamrana@gmail.com
In reply to: ITAGAKI Takahiro (#39)
Re: Load distributed checkpoint

On 12/19/06, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

I performed some simple tests, and I'll show the results below.

(1) The default case
235 80 226 77 240
(2) No write case
242 250 244 253 280
(3) No checkpoint case
229 252 256 292 276
(4) No fsync() case
236 112 215 216 221
(5) No write by PostgreSQL, but fsync() by another program case
9 223 260 283 292
(6) case (5) + O_SYNC by write_fsync
97 114 126 112 125
(7) O_SYNC case
182 103 41 50 74

I posted a patch to PATCHES. Please try it out.
It does write() smoothly, but fsync() in one burst.
I suppose the result will be between (3) and (5).

Itagaki,

Did you have a chance to look into this any further? We, at EnterpriseDB,
have done some testing on this patch (dbt2 runs), and it looks like we are
getting the desired results, particularly when we spread out both the sync
and write phases.

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

#105ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Inaam Rana (#104)
Re: Load distributed checkpoint

"Inaam Rana" <inaamrana@gmail.com> wrote:

Did you have a chance to look into this any further? We, at EnterpriseDB,
have done some testing on this patch (dbt2 runs), and it looks like we are
getting the desired results, particularly when we spread out both the sync
and write phases.

Thank you for testing! Yes, I'm cleaning up the patch. I changed the
configuration parameters that delay each phase in checkpoints from absolute
times (checkpoint_xxx_duration) to values relative to checkpoint_timeout
(checkpoint_xxx_percent). The delay factors strongly depend on the total
checkpoint length, so I think the relative method is better.
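
For example, with checkpoint_timeout = 15min (as in the test below),
checkpoint_write_percent = 50.0 budgets about 7.5 minutes for the write()
phase, checkpoint_nap_percent = 10.0 about 1.5 minutes for the nap, and
checkpoint_sync_percent = 20.0 about 3 minutes for the fsync() phase.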

I tested the patch on my machine, too. Below are the results of
pgbench with/without the patch. As you can see, the patch is not a complete
solution -- a 12s maximum response time is still not satisfactory -- but it
was better than the other possible settings we could choose now.

pgbench -s100 -c16 -t100000
on the machine with 1GB ram and one SCSI drive

| A | B | C | D |
--------------------------+--------+--------+--------+--------+
bgwriter_all_maxpages | 5 | 5 | 60 | 120 |
checkpoint_write_percent | 50.0 | 0 | 0 | 0 |
checkpoint_nap_percent | 10.0 | 0 | 0 | 0 |
checkpoint_sync_percent | 20.0 | 0 | 0 | 0 |
--------------------------+--------+--------+--------+--------+
pgbench tps | 612.23 | 517.64 | 488.90 | 378.39 |
response average | 2.50ms | 2.89ms | 3.12ms | 3.99ms |
response maximum | 12.23s |123.66s | 55.09s | 36.72s |
--------------------------+--------+--------+--------+--------+

A: Proposed configurations with patch
B: Default configurations
C: Make bgwriter aggressive
D: Make bgwriter more aggressive

Other configurations:
- full_page_writes = off
- wal_buffers = 4MB
- checkpoint_segments = 32
- checkpoint_timeout = 15min
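
For reference, setting A corresponds to something like the following in
postgresql.conf (parameter names as in the revised patch):

bgwriter_all_maxpages = 5
checkpoint_write_percent = 50.0
checkpoint_nap_percent = 10.0
checkpoint_sync_percent = 20.0
full_page_writes = off
wal_buffers = 4MB
checkpoint_segments = 32
checkpoint_timeout = 15min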

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#106Josh Berkus
josh@agliodbs.com
In reply to: ITAGAKI Takahiro (#105)
Re: Load distributed checkpoint

Itagaki,

Thank you for testing! Yes, I'm cleaning up the patch. I changed the
configuration parameters that delay each phase in checkpoints from absolute
times (checkpoint_xxx_duration) to values relative to checkpoint_timeout
(checkpoint_xxx_percent). The delay factors strongly depend on the total
checkpoint length, so I think the relative method is better.

Can I have a copy of the patch to add to the Sun testing queue?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

#107ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Josh Berkus (#106)
1 attachment(s)
Re: [HACKERS] Load distributed checkpoint

Josh Berkus <josh@agliodbs.com> wrote:

Can I have a copy of the patch to add to the Sun testing queue?

This is the revised version of the patch. Delay factors in checkpoints
can be specified by checkpoint_write_percent, checkpoint_nap_percent
and checkpoint_sync_percent. They are relative to checkpoint_timeout.

Also, archive_timeout is now checked during checkpoints, and some
error-handling routines were added.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

checkpoint_02-27.patch.gz (application/octet-stream)
#108Inaam Rana
inaamrana@gmail.com
In reply to: ITAGAKI Takahiro (#107)
Re: [HACKERS] Load distributed checkpoint

On 2/26/07, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:

Josh Berkus <josh@agliodbs.com> wrote:

Can I have a copy of the patch to add to the Sun testing queue?

This is the revised version of the patch. Delay factors in checkpoints
can be specified by checkpoint_write_percent, checkpoint_nap_percent
and checkpoint_sync_percent. They are relative to checkpoint_timeout.

Also, archive_timeout is now checked during checkpoints, and some
error-handling routines were added.

One of the issues we had during testing with the original patch was that
db stop did not work properly. I think you coded the shutdown checkpoint to
run immediately, but if a checkpoint was already in progress at that time,
it would take its own time to complete.
Does this patch resolve that issue? Also, is it based on pg82stable or HEAD?

regards,
inaam


--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

#109ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Inaam Rana (#108)
Re: [PATCHES] Load distributed checkpoint

"Inaam Rana" <inaamrana@gmail.com> wrote:

One of the issues we had during testing with the original patch was that
db stop did not work properly. I think you coded the shutdown checkpoint to
run immediately, but if a checkpoint was already in progress at that time,
it would take its own time to complete.
Does this patch resolve that issue?

Yes, I fixed the problem. If a checkpoint requested by user SQL or by
shutdown is waiting during an automatic checkpoint, the running checkpoint
completes without any delays. In the worst case you have to wait for two
checkpoints (the running automatic checkpoint and your explicit request),
but no more than that.

Also, is it based on pg82stable or HEAD?

It's based on HEAD.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center