Load distributed checkpoint
This is a proposal for load-distributed checkpoint.
(It was presented at the PostgreSQL Anniversary Summit last summer.)
We often encounter a performance gap during checkpoints. The cause is write
bursts: the storage devices are so overworked during a checkpoint that they
cannot keep up with normal transaction processing.
A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step: all dirty buffers are written out without any pause.
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file
I suggested writing pages with sleeps in the 2nd step, reusing the normal
activity of the background writer. It is something like cost-based vacuum
delay. The background writer has two pointers, 'ALL' and 'LRU', indicating
where to write out in the buffer pool. We can wait for the ALL clock-hand
to go all the way around to guarantee that all pages have been written.
Here is pseudo-code for the proposed method. The internal loop is just the
same as bgwriter's activity.
    PrepareCheckPoint();  -- do step 1
    Reset num_of_scanned_pages by ALL activity;
    do {
        BgBufferSync();   -- do a part of step 2
        sleep(bgwriter_delay);
    } while (num_of_scanned_pages < shared_buffers);
    CreateCheckPoint();   -- do steps 3 and 4
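As a rough model of the loop above (my own illustration, not code from the patch; the helper name and numbers are made up), the write phase simply stretches over many bgwriter_delay rounds:

```c
/* Illustrative model only -- not the actual patch. It counts how many
 * bgwriter_delay rounds the do/while loop above runs before the ALL
 * clock-hand has covered the whole buffer pool. */

/* Rounds needed if each BgBufferSync() call advances the ALL hand by
 * pages_per_round buffers. */
int rounds_to_cover(int shared_buffers, int pages_per_round)
{
    int scanned = 0;
    int rounds = 0;

    while (scanned < shared_buffers)
    {
        scanned += pages_per_round; /* one BgBufferSync() pass */
        rounds++;                   /* then sleep(bgwriter_delay) */
    }
    return rounds;
}
```

For example, with 60,000 shared buffers and 600 pages advanced per round, the write phase is spread over 100 rounds -- about 20 seconds at the default bgwriter_delay of 200 ms -- instead of one burst.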
Instead of this method, we could accelerate the background writer to reduce
the work left at checkpoint time, but that introduces another performance
problem: extra pressure is constantly put on the storage devices just to
keep the number of dirty pages low.
I'm working on automatically adjusting the progress of a checkpoint to the
checkpoint timeout and the WAL segment limit, to avoid the overlap of two
checkpoints. I'll post a patch sometime soon.
Comments and suggestions welcome.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Hello, Itagaki-san
This is a proposal for load distributed checkpoint.
We often encounter a performance gap during checkpoints. The cause is write
bursts: the storage devices are so overworked during a checkpoint that they
cannot keep up with normal transaction processing.
Good! You are focusing on a very important problem. System designers
don't like unsteady performance -- sudden slowdowns. Commercial
database systems have made efforts to provide steady performance. I've
seen somewhere a report that Oracle provides stable throughput even
during checkpoints. I wonder how it is implemented.
I'm working on automatically adjusting the progress of a checkpoint to the
checkpoint timeout and the WAL segment limit, to avoid the overlap of two
checkpoints.
Have you already tried your patch? What's your first impression about
the improvement? I'm very interested. On my machine, pgbench shows 210
tps at first, however, it drops to 70 tps during checkpoint.
A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step: all dirty buffers are written out without any pause.
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file
Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing backends.
As you know, PostgreSQL does not transfer data to disk when write()ing.
The actual transfer occurs when fsync()ing at checkpoints, unless the
filesystem cache runs short. So the disk is overworked at fsync()s.
What processing of bgwriter (shared locking of buffers, write(),
fsync(), flushing of log for WAL, etc.) do you consider (or did you
detect) as disturbing what processing of backends (exclusive locking
of buffers, putting log records onto WAL buffers, flushing log at
commit, writing dirty buffers when shared buffers run short)?
On Thu, Dec 7, 2006 at 12:05 AM, in message
<20061207144843.6269.ITAGAKI.TAKAHIRO@oss.ntt.co.jp>, ITAGAKI Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:
We often encounter a performance gap during checkpoints. The cause is write
bursts: the storage devices are so overworked during a checkpoint that they
cannot keep up with normal transaction processing.
When we first switched our web site to PostgreSQL, this was one of our biggest problems. Queries which normally run in a few milliseconds were hitting the 20 second limit we impose in our web application. These were happening in bursts which suggested that they were caused by checkpoints. We adjusted the background writer configuration and nearly eliminated the problem.
bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20
Between the xfs caching and the battery-backed cache in the RAID controller, the disk writes seemed to settle out pretty well.
A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step: all dirty buffers are written out without any pause.
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file
I suggested writing pages with sleeps in the 2nd step, reusing the normal
activity of the background writer. It is something like cost-based vacuum
delay. The background writer has two pointers, 'ALL' and 'LRU', indicating
where to write out in the buffer pool. We can wait for the ALL clock-hand
to go all the way around to guarantee that all pages have been written.
Here is pseudo-code for the proposed method. The internal loop is just the
same as bgwriter's activity.
    PrepareCheckPoint();  -- do step 1
    Reset num_of_scanned_pages by ALL activity;
    do {
        BgBufferSync();   -- do a part of step 2
        sleep(bgwriter_delay);
    } while (num_of_scanned_pages < shared_buffers);
    CreateCheckPoint();   -- do steps 3 and 4
Would the background writer be disabled during this extended checkpoint? How is it better to concentrate step 2 in an extended checkpoint periodically rather than consistently in the background writer?
Instead of this method, we could accelerate the background writer to reduce
the work left at checkpoint time, but that introduces another performance
problem: extra pressure is constantly put on the storage devices just to
keep the number of dirty pages low.
Doesn't the file system caching logic combined with a battery-backed cache in the controller cover this, or is your patch to help out those who don't have battery-backed controller cache? What would the impact of your patch be on environments like ours? Will there be any effect on PITR techniques, in terms of how current the copied WAL files would be?
I'm working on automatically adjusting the progress of a checkpoint to the
checkpoint timeout and the WAL segment limit, to avoid the overlap of two
checkpoints.
I'll post a patch sometime soon.
Comments and suggestions welcome.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Takayuki Tsunakawa wrote:
Hello, Itagaki-san
A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step: all dirty buffers are written out without any pause.
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file
Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing backends.
As you know, PostgreSQL does not transfer data to disk when write()ing.
The actual transfer occurs when fsync()ing at checkpoints, unless the
filesystem cache runs short. So the disk is overworked at fsync()s.
It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.
In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs
and dirty_writeback_centisecs and possibly dirty_background_ratio
would affect this. If those numbers are high, ISTM most write()s
from step 2 would wait for the flush in step 3. If I understand
correctly, if the dirty_expire_centisecs number is low, most write()s
from step 2 would happen before step 3 because of the pdflush daemons.
I expect other OS's would have different but similar knobs to tune this.
It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.
Hello,
As Mr. Mayer points out, which of step 2 or 3 actually causes I/O
depends on the VM settings, and the amount of RAM available for file
system cache.
"Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote in message
news:45786549.2000602@cheapcomplexdevices.com...
It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.
I'm afraid it is difficult for system designers to expect steady
throughput/response time, as long as PostgreSQL depends on the
flushing of file system cache. How does Oracle provide stable
performance?
Though I'm not sure, isn't it the key to use O_SYNC so that write()s
transfer data to disk? That is, PostgreSQL completely controls the
timing of data transfer. Moreover, if possible, it would be better to bypass
the file system cache, using such flags as O_DIRECT for open() on UNIX
and FILE_FLAG_NO_BUFFERING for CreateFile() on Windows. As far as
I know, SQL Server and Oracle do this. I think commercial DBMSs do
the same thing to control and anticipate the I/O activity without
being influenced by VM policy.
If PostgreSQL is to use these, writing of dirty buffers has to be
improved. To decrease the count of I/O, pages adjacent on disk that
are also adjacent on memory must be written with one write().
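A sketch of the bookkeeping this would need (my illustration only, not PostgreSQL code): given the sorted on-disk block numbers of the dirty pages, each contiguous run could be flushed with a single write():

```c
/* Illustration only: dirty pages that are adjacent on disk can be
 * combined into one write(). Given sorted block numbers, this counts
 * the contiguous runs, i.e. the number of write() calls needed. */
#include <stddef.h>

int coalesced_write_count(const int *blocks, size_t n)
{
    size_t i;
    int runs;

    if (n == 0)
        return 0;
    runs = 1;
    for (i = 1; i < n; i++)
        if (blocks[i] != blocks[i - 1] + 1) /* a gap starts a new run */
            runs++;
    return runs;
}
```

For dirty blocks {3, 4, 5, 9, 10, 20} this yields 3 writes instead of 6.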
From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
Would the background writer be disabled during this extended
checkpoint? How is it better to concentrate step 2 in an extended
checkpoint periodically rather than consistently in the background
writer?
Will there be any affect on PITR techniques, in terms of how current
the copied WAL files would be?
Extending the checkpoint can also cause extended downtime, to put it
in an extreme way. I understand that checkpoints occur during crash
recovery and PITR, so the time for those operations would get longer. The
checkpoint also occurs at server shutdown. However, a distinction among
these cases could be made, and the undesirable extension avoided.
On Thu, 7 Dec 2006, Kevin Grittner wrote:
Between the xfs caching and the batter backed cache in the RAID...
Mmmmm, battered cache. You can deep fry anything nowadays.
Would the background writer be disabled during this extended checkpoint?
The background writer is the same process that does the full buffer sweep
at checkpoint time. You wouldn't have to disable it because it would be
busy doing this extended checkpoint instead of its normal job.
How is it better to concentrate step 2 in an extended checkpoint
periodically rather than consistently in the background writer?
Right now, when the checkpoint flush is occurring, there is no background
writer active--that process is handling the checkpoint. Itagaki's
suggestion is basically to take the current checkpoint code, which runs
all in one burst, and spread it out over time. I like the concept, as
I've seen the behavior he's describing (even after tuning the background
writer like you suggest and doing Linux disk tuning as Ron describes), but
I think solving the problem is a little harder than suggested.
I have two concerns with the logic behind this approach. The first is
that if the background writer isn't keeping up with writing out all the
dirty pages, what makes you think that running the checkpoint with a
similar level of activity is going to? If your checkpoint is taking a
long time, it's because the background writer has an overwhelming load and
needs to be bailed out. Slowing down the writes with a lazier checkpoint
process introduces the possibility that you'll hit a second checkpoint
request before you're even finished cleaning up the first one, and then
you're really in trouble.
Second, the assumption here is that it's writing the dirty buffers out
that is the primary cause of the ugly slowdown. I too believe it could
just as easily be the fsync at the end that's killing you, and slowing
down the writes isn't necessarily going to make that faster.
Doesn't the file system caching logic combined with a battery backed
cache in the controller cover this, or is your patch to help out those
who don't have battery backed controller cache?
Unless your shared buffer pool is so small that you can write it all out
onto the cache, that won't help much with this problem.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote:
Though I'm not sure, isn't it the key to use O_SYNC so that write()s
transfer data to disk?
If disk writes near checkpoint time aren't happening fast enough now, I
doubt forcing a sync after every write will make that better.
If PostgreSQL is to use these, writing of dirty buffers has to be
improved. To decrease the count of I/O, pages adjacent on disk that
are also adjacent on memory must be written with one write().
Sorting out which pages are next to one another on disk is one of the jobs
the file system cache does; bypassing it will then make all that
complicated sorting logic the job of the database engine. And unlike the
operating system, the engine doesn't even really know anything about the
filesystem or physical disks involved, so what are the odds it's going to
do a better job? That's the concern with assuming direct writes are the
solution here--you have to be smarter than the OS is for that to be an
improvement, and you have a lot less information to make your decisions
with than it does.
I like Itagaki's idea of some automatic tuning of the checkpoint timeout
and wal segment parameters to help out with checkpoint behavior; that's an
idea that would help a lot of people. I'm not so sure that trying to make
PostgreSQL take over operations that it's relying on the OS to handle
intelligently right now will be as helpful, and it's a big programming job
to even try.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file
Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing backends.
It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.
If the dirty_expire_centisecs number is low, most write()s from step 2
would happen before step 3 because of the pdflush daemons.
Exactly. It depends on the OS, kernel settings, and filesystem.
I tested the patch on Linux kernel 2.6.9-39, default settings, and ext3fs.
Maybe the pdflush daemons were strong enough to write dirty buffers in the
kernel, so step 2 was the main part and step 3 was not.
There are technical issues in distributing step 3. We can write buffers
on a page basis, which is granular enough. However, fsync() is on a file
basis (1GB), so we can only control the granularity of fsync roughly.
sync_file_range (http://lwn.net/Articles/178199/) or some similar special
APIs would help, but there are portability issues...
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Greg Smith <gsmith@gregsmith.com> writes:
On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote:
Though I'm not sure, isn't it the key to use O_SYNC so that write()s
transfer data to disk?
If disk writes near checkpoint time aren't happening fast enough now, I
doubt forcing a sync after every write will make that better.
I think the idea would be to force the writes to actually occur, rather
than just being scheduled (and then forced en-masse by an fsync at
checkpoint time). Since the point of the bgwriter is to try to force
writes to occur *outside* checkpoint times, this seems to make sense.
I share your doubts about the value of slowing down checkpoints --- but
to the extent that bgwriter-issued writes are delayed by the kernel
until the next checkpoint, we are certainly not getting the desired
effect of leveling the write load.
To decrease the count of I/O, pages adjacent on disk that
are also adjacent on memory must be written with one write().
Sorting out which pages are next to one another on disk is one of the jobs
the file system cache does; bypassing it will then make all that
complicated sorting logic the job of the database engine.
Indeed --- the knowledge that we don't know the physical layout has
always been the strongest argument against using O_SYNC in this way.
But I don't think anyone's made any serious tests. A properly tuned
bgwriter should be eating only a "background" level of I/O effort
between checkpoints, so maybe it doesn't matter too much if it's not
optimally scheduled.
regards, tom lane
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:
I'm afraid it is difficult for system designers to expect steady
throughput/response time, as long as PostgreSQL depends on the
flushing of file system cache. How does Oracle provide stable
performance?
Though I'm not sure, isn't it the key to use O_SYNC so that write()s
transfer data to disk?
AFAIK, other databases use write() and fsync() in combination. They call
fsync() immediately after they write buffers in small batches. Otherwise,
they use asynchronous and direct I/O options. Either way, the number of
dirty pages in the kernel buffers is kept low at all times.
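That write-then-fsync-in-small-batches pattern might look like this minimal, portable sketch (entirely my illustration; the page count, batch size, and scratch file are arbitrary):

```c
/* Sketch: write pages in small batches and fsync() after each batch,
 * so dirty pages never pile up in the kernel. Sizes are arbitrary. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DEMO_PAGE_SIZE 8192
#define DEMO_NPAGES    32
#define DEMO_BATCH     8    /* fsync after every 8 pages */

int write_in_batches(void)
{
    FILE *f = tmpfile();            /* scratch file for the demo */
    int   fd;
    char  page[DEMO_PAGE_SIZE];
    int   i;

    if (f == NULL)
        return -1;
    fd = fileno(f);
    memset(page, 0, sizeof page);
    for (i = 0; i < DEMO_NPAGES; i++)
    {
        if (write(fd, page, sizeof page) != (ssize_t) sizeof page)
            goto fail;
        if ((i + 1) % DEMO_BATCH == 0 && fsync(fd) != 0)
            goto fail;              /* flush this batch to disk now */
    }
    fclose(f);
    return 0;
fail:
    fclose(f);
    return -1;
}
```

The point of the small batch is that each fsync() has only a bounded amount of dirty data to push, rather than one giant flush at the end.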
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
We adjusted the background writer configuration and nearly eliminated
the problem.
bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20
Between the xfs caching and the battery-backed cache in the RAID
controller, the disk writes seemed to settle out pretty well.
Yes, a higher bgwriter_all_maxpages is better for stability; I have also
done so up to now. However, if some process makes lots of dirty buffers in
a short time, e.g. VACUUM, the bgwriter begins to write many pages and
affects response time.
We will be able to set bgwriter_all_maxpages to a lower value with load
distributed checkpoint. It expands the tuning range of the bgwriter.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Thu, Dec 07, 2006 at 10:03:05AM -0600, Kevin Grittner wrote:
When we first switched our web site to PostgreSQL, this was one of our biggest problems. Queries which normally run in a few milliseconds were hitting the 20 second limit we impose in our web application. These were happening in bursts which suggested that they were caused by checkpoints. We adjusted the background writer configuration and nearly eliminated the problem.
bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20
Bear in mind that bgwriter settings should be considered in conjunction
with shared_buffer and checkpoint_timeout settings. For example, if you
have 60,000 shared buffers and a 300 second checkpoint interval, those
settings are going to be pretty aggressive.
Generally, I try and configure the all* settings so that you'll get 1
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.
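Under a deliberately simplified model (mine, not from the docs: it assumes the worst case where every scanned page is dirty, so bgwriter_all_maxpages is the binding limit each round), the sweep arithmetic looks like:

```c
/* Simplified model of ALL-scan progress, assuming the worst case where
 * every scanned page is dirty, so bgwriter_all_maxpages caps the hand's
 * advance each round. Real bgwriter behavior is more nuanced. */

/* Buffers the ALL hand can advance in one bgwriter_delay round. */
int pages_per_round(int shared_buffers, int all_percent, int all_maxpages)
{
    int scan = shared_buffers * all_percent / 100;

    return scan < all_maxpages ? scan : all_maxpages;
}

/* Full sweeps of the buffer pool per checkpoint interval. */
double sweeps_per_checkpoint(int shared_buffers, int all_percent,
                             int all_maxpages, int delay_ms, int timeout_s)
{
    double rounds = (double) timeout_s * 1000.0 / delay_ms;

    return rounds * pages_per_round(shared_buffers, all_percent,
                                    all_maxpages) / shared_buffers;
}
```

With the settings quoted above (all_maxpages 600, delay 200 ms) and the example of 60,000 buffers and a 300-second interval, that is 1,500 rounds of up to 600 pages, i.e. roughly 15 full sweeps per checkpoint under this model -- far more than one sweep, which is why those settings look aggressive.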
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Fri, Dec 08, 2006 at 02:22:14PM +0900, ITAGAKI Takahiro wrote:
AFAIK, other databases use write() and fsync() in combination. They call
fsync() immediately after they write buffers in small batches. Otherwise,
they use asynchronous and direct I/O options. Either way, the number of
dirty pages in the kernel buffers is kept low at all times.
The "easy" solution I can think of is, when a session/backend is
exiting cleanly (client sent quit command), execute fsync() on some of
the descriptors before actually closing. At this point the user isn't
waiting anymore, so it can take its time.
The problem with fsync() remains that it can cause a write spike,
although the more often you do it, the less of an effect it has.
A longer-term solution may be to create a daemon with system-specific
knowledge that monitors the load and tweaks parameters in response --
not just postgresql parameters, but also system parameters. Even if it
never becomes part of postgresql, it would provide a way to test all
these "hunches" people have about optimising the system.
BTW, has anyone ever considered having the bgwriter do a NOTIFY
whenever it starts/ends a checkpoint, so clients could monitor the
activity without reading the logs?
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On 12/7/06, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
Takayuki Tsunakawa wrote:
Hello, Itagaki-san
A checkpoint consists of the following four steps, and the major performance
problem is the 2nd step: all dirty buffers are written out without any pause.
1. Query information (REDO pointer, next XID etc.)
2. Write dirty pages in buffer pool
3. Flush all modified files
4. Update control file
Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing backends.
As you know, PostgreSQL does not transfer data to disk when write()ing.
The actual transfer occurs when fsync()ing at checkpoints, unless the
filesystem cache runs short. So the disk is overworked at fsync()s.
It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.
In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs
dirty_expire_centisecs will have little, if any, effect on a box with a
consistent workload. Under uniform load the bgwriter will keep pushing
buffers into the fs cache, which results in eviction/flushing of pages to
disk. That the pages age quickly can lower the cap on dirty pages, but it
won't/can't handle the sudden spike at checkpoint time.
and dirty_writeback_centisecs
Again, on a system that encounters IO chokes at checkpoints, pdflush is
presumably working like crazy at that time. Reducing the gap between its
wakeup calls will probably have very little impact on checkpoint
performance.
and possibly dirty_background_ratio
I have seen this put a real cap on the number of dirty pages during normal
running. As regards checkpoints, this again seems to have little effect.
The problem in dealing with checkpoints is that we are dealing with two
starkly different types of IO load. The larger the number of shared_buffers,
the greater the spike in IO activity at checkpoint. AFAICS no specific vm
tunables can smooth out checkpoint spikes by themselves. There has to be
some intelligence in the bgwriter to even the load out.
would affect this. If those numbers are high, ISTM most write()s
from step 2 would wait for the flush in step 3. If I understand
correctly, if the dirty_expire_centisecs number is low, most write()s
from step 2 would happen before step 3 because of the pdflush daemons.
I expect other OS's would have different but similar knobs to tune this.
It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.
On Fri, Dec 8, 2006 at 1:13 AM, in message
<20061208071305.GG44124@nasby.net>,
"Jim C. Nasby" <jim@nasby.net> wrote:
On Thu, Dec 07, 2006 at 10:03:05AM -0600, Kevin Grittner wrote:
We adjusted the background writer configuration
and nearly eliminated the problem.
bgwriter_all_maxpages | 600
bgwriter_all_percent | 10
bgwriter_delay | 200
bgwriter_lru_maxpages | 200
bgwriter_lru_percent | 20
Bear in mind that bgwriter settings should be considered in conjunction
with shared_buffer and checkpoint_timeout settings. For example, if you
have 60,000 shared buffers and a 300 second checkpoint interval, those
settings are going to be pretty aggressive.
Generally, I try and configure the all* settings so that you'll get 1
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.
We have 20,000 shared buffers and a 300 second checkpoint interval.
We got to these numbers somewhat scientifically. I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes per 200 ms without causing problems. I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.
Since my values were such a dramatic change from the default, I boosted
the production settings a little bit each day and looked for feedback
from our web team. Things improved with each incremental increase.
When I got to my calculated values (above) they reported that these
timeouts had dropped to an acceptable level -- a few per day on a
website with 2 million hits per day. We may benefit from further
adjustments, but since the problem is negligible with these settings,
there are bigger fish to fry at the moment.
By the way, if I remember correctly, these boxes have 256 MB of
battery-backed cache, while 20,000 buffers is 156.25 MB.
-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
"Jim C. Nasby" <jim@nasby.net> wrote:
Generally, I try and configure the all* settings so that you'll get 1
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.
We got to these numbers somewhat scientifically. I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes per 200 ms without causing problems. I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.
I like Kevin's settings better than what Jim suggests. If the bgwriter
only makes one sweep between checkpoints then it's hardly going to make
any impact at all on the number of dirty buffers the checkpoint will
have to write. The point of the bgwriter is to reduce the checkpoint
I/O spike by doing writes between checkpoints, and to have any
meaningful impact on that, you'll need it to make the cycle several times.
Another point here is that you want checkpoints to be pretty far apart
to minimize the WAL load from full-page images. So again, a bgwriter
that's only making one loop per checkpoint is not gonna be doing much.
I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches? Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later). But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints. We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().
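The "get more aggressive as the checkpoint approaches" idea could be pictured as a write quota that ramps up over the checkpoint interval (purely a sketch; no such parameters exist, and the linear shape is arbitrary):

```c
/* Sketch of a per-round write quota that ramps up linearly as the next
 * checkpoint deadline approaches. The base/max quotas and the linear
 * shape are illustrative, not anything PostgreSQL implements. */

int ramped_quota(int base_pages, int max_pages,
                 double elapsed_s, double interval_s)
{
    double frac = elapsed_s / interval_s; /* 0 right after a checkpoint */

    if (frac < 0.0)
        frac = 0.0;
    if (frac > 1.0)
        frac = 1.0;                       /* 1 at the deadline */
    return base_pages + (int) ((max_pages - base_pages) * frac);
}
```

Early in the interval, when a write is most likely to be wasted by re-dirtying, the quota stays at the lazy base rate; it only approaches the maximum near the deadline.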
regards, tom lane
On Fri, Dec 8, 2006 at 10:43 AM, in message
<6439.1165596207@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches? Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later).
But wouldn't the ones written earlier stand a much better chance of
being scheduled for a physical write before the fsync? They aren't ALL
re-dirtied. In our environment, we seem to be getting a lot written
before the fsync with our current settings.
But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints. We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().
We don't have a way to force writes before the fsync, but early writes
to the file system encourage it. After reading this thread, I'm
tempted to nudge our settings a little higher -- especially the
percentages. How much overhead is there in checking whether buffer
pages are dirty?
-Kevin
On Fri, Dec 8, 2006 at 11:01 AM, in message
<4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:
I'm going to correct a previous statement, based on having just chatted
with a member of the web team.
The remaining dribble of these was only occurring on the Windows
machines. (We were running two servers from Windows and two from Linux
for a while to see how things compared.) The settings I previously
posted have completely eliminated the problem on Linux, and we are
moving all servers to Linux, so this has become a total non-issue for
us. I really encourage people to try some much more aggressive
background writer settings, and possibly try tuning the OS dirty write
settings.
Unless PostgreSQL gets much closer to the hardware than the community
consensus seems to support, I don't understand what you could do in the
checkpoint phase that would improve on that. (That, of course, doesn't
mean I'm not missing something, just that the arguments made so far
haven't shown me that the suggested changes would do anything but move
the problem around a little bit.)
-Kevin
On Fri, 2006-12-08 at 11:55 -0600, Kevin Grittner wrote:
On Fri, Dec 8, 2006 at 11:01 AM, in message
<4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:
I'm going to correct a previous statement, based on having just chatted
with a member of the web team.
The remaining dribble of these was only occurring on the Windows
machines. (We were running two servers from Windows and two from Linux
for a while to see how things compared.) The settings I previously
posted have completely eliminated the problem on Linux, and we are
moving all servers to Linux, so this has become a total non-issue for
us. I really encourage people to try some much more aggressive
background writer settings, and possibly try tuning the OS dirty write
settings.
How much increased I/O usage have you seen in regular operation with
those settings?
--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
On Fri, Dec 8, 2006 at 12:18 PM, in message
<1165601938.10248.56.camel@dba5.int.libertyrms.com>, Brad Nicholson
<bnichols@ca.afilias.info> wrote:
How much increased I/O usage have you seen in regular operation with
those settings?
We have not experienced any increase in I/O, just a smoothing. Keep in
mind that the file system cache will collapse repeated writes to the
same location until things settle, and the controller's cache also has a
chance of doing so. If we just push dirty pages out to the OS as soon
as possible and let the file system do its job, I think we're in better
shape than if we try to micro-manage it within our buffer pages.
Your mileage may vary, of course, but I'm curious whether any real-world
production examples exist where this approach is a loser.
-Kevin