Controlling Load Distributed Checkpoints

Started by Heikki Linnakangas · 65 messages · pgsql-hackers
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

I'm again looking at the way the GUC variables work in the load
distributed checkpoints patch. We've discussed them a lot already, but I
don't think they're quite right yet.

Write-phase
-----------
I like the way the write-phase is controlled in general. Writes are
throttled so that we spend the specified percentage of checkpoint
interval doing the writes. But we always write at a specified minimum
rate to avoid spreading out the writes unnecessarily when there's little
work to do.

The original patch uses bgwriter_all_max_pages to set the minimum rate.
I think we should have a separate variable, checkpoint_write_min_rate,
in KB/s, instead.
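The throttling rule described above (spread the writes over the configured fraction of the checkpoint interval, but never drop below the minimum rate) can be sketched as a small helper. This is my own illustration, not code from the patch; the function name and units are made up:

```c
#include <assert.h>

/*
 * Hypothetical sketch: pick the write rate for the checkpoint write
 * phase.  We aim to spread the writes over write_percent of the
 * checkpoint interval, but never drop below min_rate_kb_s, so a
 * nearly-idle checkpoint still finishes promptly.
 */
static double
checkpoint_write_rate(double dirty_kb,      /* data left to write, in KB */
                      double interval_s,    /* checkpoint interval, in seconds */
                      double write_percent, /* checkpoint_write_percent */
                      double min_rate_kb_s) /* checkpoint_write_min_rate */
{
    double budget_s = interval_s * write_percent / 100.0;
    double spread_rate = dirty_kb / budget_s; /* rate that exactly uses the budget */

    /* never write slower than the configured minimum */
    return spread_rate > min_rate_kb_s ? spread_rate : min_rate_kb_s;
}
```

With a 300 s interval and write_percent = 50, a 300 MB checkpoint would be paced at 2 MB/s, while a nearly empty one falls back to the 1000 KB/s floor instead of dribbling out over the whole budget.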

Nap phase
---------
This is trickier. The purpose of the sleep between writes and fsyncs is
to give the OS a chance to flush the pages to disk at its own pace,
hopefully limiting the effect on concurrent activity. The sleep
shouldn't last too long, because any concurrent activity can be dirtying
and writing more pages, and we might end up fsyncing more than necessary,
which is bad for performance. The optimal delay depends on many factors,
but I believe it's somewhere between 0-30 seconds on any reasonable system.

In the current patch, the duration of the sleep between the write and
sync phases is controlled as a percentage of checkpoint interval. Given
that the optimal delay is in the range of seconds, and
checkpoint_timeout can be up to 60 minutes, the useful values of that
percentage would be very small, like 0.5% or even less. Furthermore, the
optimal value doesn't depend that much on the checkpoint interval, it's
more dependent on your OS and memory configuration.

We should therefore give the delay as a number of seconds instead of as
a percentage of checkpoint interval.

Sync phase
----------
This is also tricky. As with the nap phase, we don't want to spend too
much time fsyncing, because concurrent activity will write more dirty
pages and we might just end up doing more work.

And we don't know how much work an fsync performs. The patch uses the
file size as a measure of that, but as we discussed that doesn't
necessarily have anything to do with reality. fsyncing a 1GB file with
one dirty block isn't any more expensive than fsyncing a file with a
single block.

Another problem is the granularity of an fsync. If we fsync a 1GB file
that's full of dirty pages, we can't limit the effect on other activity.
The best we can do is to sleep between fsyncs, but sleeping more than a
few seconds is hardly going to be useful, no matter how bad an I/O storm
each fsync causes.

Because of the above, I'm thinking we should ditch the
checkpoint_sync_percentage variable, in favor of:
checkpoint_fsync_period # duration of the fsync phase, in seconds
checkpoint_fsync_delay # max. sleep between fsyncs, in milliseconds

In all phases, the normal bgwriter activities are performed:
lru-cleaning and switching xlog segments if archive_timeout expires. If
a new checkpoint request arrives while the previous one is still in
progress, we skip all the delays and finish the previous checkpoint as
soon as possible.

GUC summary and suggested default values
----------------------------------------
checkpoint_write_percent = 50    # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2      # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30     # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500     # max. delay between fsyncs, in milliseconds

I don't like adding that many GUC variables, but I don't really see a
way to tune them automatically. Maybe we could just hard-code the last
one, it doesn't seem that critical, but that still leaves us 4 variables.

Thoughts?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#1)
Re: Controlling Load Distributed Checkpoints

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

GUC summary and suggested default values
----------------------------------------
checkpoint_write_percent = 50 # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty
buffers at checkpoint (KB/s)

I don't understand why this is a min_rate rather than a max_rate.

checkpoint_nap_duration = 2 # delay between write and sync phase, in seconds

Not a comment on the choice of guc parameters, but don't we expect useful
values of this to be much closer to 30 than 0? I understand it might not be
exactly 30.

Actually, it's not so much whether there's any write traffic to the data files
during the nap that matters, it's whether there's more traffic during the nap
than during the 30s or so prior to the nap. As long as it's a steady-state
condition it shouldn't matter how long we wait, should it?

checkpoint_fsync_period = 30 # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500 # max. delay between fsyncs

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Controlling Load Distributed Checkpoints

Heikki Linnakangas <heikki@enterprisedb.com> writes:

> GUC summary and suggested default values
> ----------------------------------------
> checkpoint_write_percent = 50 # % of checkpoint interval to spread out writes
> checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
> checkpoint_nap_duration = 2 # delay between write and sync phase, in seconds
> checkpoint_fsync_period = 30 # duration of the sync phase, in seconds
> checkpoint_fsync_delay = 500 # max. delay between fsyncs
>
> I don't like adding that many GUC variables, but I don't really see a
> way to tune them automatically.

If we don't know how to tune them, how will the users know? Having to
add that many variables to control one feature says to me that we don't
understand the feature.

Perhaps what we need is to think about how it can auto-tune itself.

regards, tom lane

#4 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#3)
Re: Controlling Load Distributed Checkpoints

On Wed, 6 Jun 2007, Tom Lane wrote:

> If we don't know how to tune them, how will the users know?

I can tell you a good starting set for them on a Linux system, but you
first have to let me know how much memory is in the OS buffer cache, the
typical I/O rate the disks can support, how many buffers are expected to
be written out by BGW/other backends at heaviest load, and the current
setting for /proc/sys/vm/dirty_background_ratio. It's not a coincidence
that there are patches applied to 8.3 or in the queue to measure all of
the Postgres internals involved in that computation; I've been picking
away at the edges of this problem.

Getting this sort of tuning right takes that level of information about
the underlying system. If there's a way to internally auto-tune the
values this patch operates on (which I haven't found despite months of
trying), it would be in the form of some sort of measurement/feedback loop
based on how fast data is being written out. There really are way too
many things involved to try and tune it based on anything else; the
underlying OS/hardware mechanisms that determine how this will go are
complicated enough that it might as well be a black box for most people.

One of the things I've been fiddling with the design of is a testing
program that simulates database activity at checkpoint time under load.
I think running some tests like that is the most straightforward way to
generate useful values for these tunables; it's much harder to try and
determine them from within the backends because there's so much going on
to keep track of.

I view the LDC mechanism as being in the same state right now as the
background writer: there are a lot of complicated knobs to tweak, they
all do *something* useful for someone, and eliminating them will require a
data-collection process across a much wider sample of data than can be
collected quickly. If I had to make a guess how this will end up, I'd
expect there to be more knobs in LDC than everyone would like for the 8.3
release, along with fairly verbose logging of what is happening at
checkpoint time (that's why I've been nudging development in that area,
along with making logs easier to aggregate). Collect up enough of that
information, then you're in a position to talk about useful automatic
tuning--right around the 8.4 timeframe I suspect.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#5 Greg Smith
gsmith@gregsmith.com
In reply to: Heikki Linnakangas (#1)
Re: Controlling Load Distributed Checkpoints

On Wed, 6 Jun 2007, Heikki Linnakangas wrote:

> The original patch uses bgwriter_all_max_pages to set the minimum rate. I
> think we should have a separate variable, checkpoint_write_min_rate, in
> KB/s, instead.

Completely agreed. There shouldn't be any coupling with the background
writer parameters, which may be set for a completely different set of
priorities than the checkpoint has. I have to look at this code again to
see why it's a min_rate instead of a max, that seems a little weird.

> Nap phase: We should therefore give the delay as a number of seconds
> instead of as a percentage of checkpoint interval.

Again, the setting here should be completely decoupled from another GUC
like the interval. My main complaint with the original form of this patch
was how much it tried to synchronize the process with the interval; since I
don't even have a system where that value is set to something, because
it's all segment based instead, that whole idea was incompatible.

The original patch tried to spread the load out as evenly as possible over
the time available. I much prefer thinking in terms of getting it done as
quickly as possible while trying to bound the I/O storm.

> And we don't know how much work an fsync performs. The patch uses the file
> size as a measure of that, but as we discussed that doesn't necessarily have
> anything to do with reality. fsyncing a 1GB file with one dirty block isn't
> any more expensive than fsyncing a file with a single block.

On top of that, if you have a system with a write cache, the time an fsync
takes can greatly depend on how full it is at the time, which there is no
way to measure or even model easily.

Is there any way to track how many dirty blocks went into each file during
the checkpoint write? That's your best bet for guessing how long the
fsync will take.
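A minimal sketch of the per-file accounting Greg is asking about. The `DirtyStats` structure and its functions are hypothetical, not part of the patch; a real implementation would presumably hang this off the pending-fsync requests in md.c:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical bookkeeping: count how many dirty blocks the checkpoint
 * wrote into each file, so the sync phase can guess the cost of each
 * fsync.  A tiny fixed-size table keyed by an opaque file id.
 */
#define MAX_FILES 1024

typedef struct
{
    int     nfiles;
    int     file_id[MAX_FILES];
    int     dirty_blocks[MAX_FILES];
} DirtyStats;

static void
dirty_stats_init(DirtyStats *s)
{
    memset(s, 0, sizeof(*s));
}

/* Called once per buffer written during the checkpoint write phase. */
static void
count_dirty_block(DirtyStats *s, int file_id)
{
    for (int i = 0; i < s->nfiles; i++)
    {
        if (s->file_id[i] == file_id)
        {
            s->dirty_blocks[i]++;
            return;
        }
    }
    s->file_id[s->nfiles] = file_id;
    s->dirty_blocks[s->nfiles] = 1;
    s->nfiles++;
}

/* Blocks written to file_id during this checkpoint (0 if none). */
static int
dirty_blocks_in_file(const DirtyStats *s, int file_id)
{
    for (int i = 0; i < s->nfiles; i++)
        if (s->file_id[i] == file_id)
            return s->dirty_blocks[i];
    return 0;
}
```

The sync phase could then sleep longer after fsyncing files with large counts, instead of using file size as the (admittedly poor) proxy discussed above.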

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Smith (#5)
Re: Controlling Load Distributed Checkpoints

Greg Smith wrote:

> On Wed, 6 Jun 2007, Heikki Linnakangas wrote:
>> The original patch uses bgwriter_all_max_pages to set the minimum
>> rate. I think we should have a separate variable,
>> checkpoint_write_min_rate, in KB/s, instead.
>
> Completely agreed. There shouldn't be any coupling with the background
> writer parameters, which may be set for a completely different set of
> priorities than the checkpoint has. I have to look at this code again
> to see why it's a min_rate instead of a max, that seems a little weird.

It's a min rate because it never writes slower than that, and it can
write faster if the next checkpoint is due soon, so that we finish
before it's time to start the next one. (Or to be precise, before the
next checkpoint is closer than 100-(checkpoint_write_percent)% of the
checkpoint interval.)

>> Nap phase: We should therefore give the delay as a number of seconds
>> instead of as a percentage of checkpoint interval.
>
> Again, the setting here should be completely decoupled from another GUC
> like the interval. My main complaint with the original form of this
> patch was how much it tried to synchronize the process with the interval;
> since I don't even have a system where that value is set to something,
> because it's all segment based instead, that whole idea was incompatible.

checkpoint_segments is taken into account as well as checkpoint_timeout.
I used the term "checkpoint interval" to mean the real interval at which
the checkpoints occur, whether it's because of segments or timeout.

> The original patch tried to spread the load out as evenly as possible
> over the time available. I much prefer thinking in terms of getting it
> done as quickly as possible while trying to bound the I/O storm.

Yeah, the checkpoint_min_rate allows you to do that.

So there are two extreme ways you can use LDC:
1. Finish the checkpoint as soon as possible, without disturbing other
activity too much. Set checkpoint_write_percent to a high number, and
set checkpoint_min_rate to define "too much".
2. Disturb other activity as little as possible, as long as the
checkpoint finishes in a reasonable time. Set checkpoint_min_rate to a
low number, and checkpoint_write_percent to define "reasonable time"

Are both interesting use cases, or is it enough to cater for just one of
them? I think 2 is easier to tune. Defining the min_rate properly can be
difficult and depends a lot on your hardware and application, but a
default value of say 50% for checkpoint_write_percent to tune for use
case 2 should work pretty well for most people.

In any case, the checkpoint had better finish before it's time to start
another one. Or would you rather delay the next checkpoint, and let the
checkpoint take as long as it takes to finish at the min_rate?

>> And we don't know how much work an fsync performs. The patch uses the
>> file size as a measure of that, but as we discussed that doesn't
>> necessarily have anything to do with reality. fsyncing a 1GB file with
>> one dirty block isn't any more expensive than fsyncing a file with a
>> single block.
>
> On top of that, if you have a system with a write cache, the time an
> fsync takes can greatly depend on how full it is at the time, which
> there is no way to measure or even model easily.
>
> Is there any way to track how many dirty blocks went into each file
> during the checkpoint write? That's your best bet for guessing how long
> the fsync will take.

I suppose it's possible, but the OS has hopefully started flushing them
to disk almost as soon as we started the writes, so even that isn't a
very good measure.

On a Linux system, one way to model it is that the OS flushes dirty
buffers to disk at the same rate as we write them, but delayed by
dirty_expire_centisecs. That should hold if the writes are spread out
enough. Then the amount of dirty buffers in OS cache at the end of write
phase is roughly constant, as long as the write phase lasts longer than
dirty_expire_centisecs. If we take a nap of dirty_expire_centisecs after
the write phase, the fsyncs should be effectively no-ops, except that
they will flush any other writes the bgwriter lru-sweep and other
backends performed during the nap.
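Heikki's model above can be stated as a toy function. This is purely an illustration of the assumption about kernel behaviour, not measured code; the name and units are mine:

```c
#include <assert.h>

/*
 * Toy model of the argument above: we write dirty pages at a steady
 * rate, and the kernel flushes them at the same rate but starting
 * dirty_expire seconds later.  The backlog of dirty OS pages therefore
 * grows for dirty_expire seconds and stays flat afterwards, as long as
 * the write phase outlasts dirty_expire.
 */
static double
os_dirty_backlog(double write_rate_kb_s, /* our steady checkpoint write rate */
                 double dirty_expire_s,  /* dirty_expire_centisecs / 100 */
                 double elapsed_s)       /* time since the write phase began */
{
    if (elapsed_s < dirty_expire_s)
        return write_rate_kb_s * elapsed_s;  /* backlog still filling up */
    return write_rate_kb_s * dirty_expire_s; /* steady state */
}
```

With the Linux default dirty_expire_centisecs = 3000 (30 s) and a 2 MB/s write phase, the model predicts a flat ~60 MB of dirty kernel pages for the nap-then-fsync to absorb, which is the basis for sizing the nap at roughly dirty_expire_centisecs.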

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Hannu Krosing
hannu@tm.ee
In reply to: Tom Lane (#3)
Re: Controlling Load Distributed Checkpoints

On Wed, 2007-06-06 at 11:03, Tom Lane wrote:

> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> GUC summary and suggested default values
>> ----------------------------------------
>> checkpoint_write_percent = 50 # % of checkpoint interval to spread out writes
>> checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
>> checkpoint_nap_duration = 2 # delay between write and sync phase, in seconds
>> checkpoint_fsync_period = 30 # duration of the sync phase, in seconds
>> checkpoint_fsync_delay = 500 # max. delay between fsyncs
>>
>> I don't like adding that many GUC variables, but I don't really see a
>> way to tune them automatically.
>
> If we don't know how to tune them, how will the users know?

He talked about doing it _automatically_.

If the knobs are available, it will be possible to determine "good"
values even by brute-force performance testing, given enough time and
manpower.

> Having to
> add that many variables to control one feature says to me that we don't
> understand the feature.

The feature has lots of complex dependencies on things outside Postgres,
so learning to understand it takes time. Having the knobs available
helps, as more people are willing to do turn-the-knobs-and-test than
recompile-and-test.

> Perhaps what we need is to think about how it can auto-tune itself.

Sure.

-------------------
Hannu Krosing

#8 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Smith (#5)
Re: Controlling Load Distributed Checkpoints

Thinking about this whole idea a bit more, it occurred to me that the
current approach of write all, then fsync all is really a historical
artifact of the fact that we used to use the system-wide sync call
instead of fsyncs to flush the pages to disk. That might not be the best
way to do things in the new load-distributed-checkpoint world.

How about interleaving the writes with the fsyncs?

1. Scan all shared buffers, and build a list of all files with dirty
pages, and the buffers belonging to them.

2.
foreach(file in list)
{
    foreach(buffer belonging to file)
    {
        write();
        sleep();  /* to throttle the I/O rate */
    }
    sleep();      /* to give the OS a chance to flush the writes at its own pace */
    fsync();
}

This would spread out the fsyncs in a natural way, making the knob to
control the duration of the sync phase unnecessary.

At some point we'll also need to fsync all files that have been modified
since the last checkpoint, but don't have any dirty buffers in the
buffer cache. I think it's a reasonable assumption that fsyncing those
files doesn't generate a lot of I/O. Since the writes have been made
some time ago, the OS has likely already flushed them to disk.

Doing the 1st phase of just scanning the buffers to see which ones are
dirty also effectively implements the optimization of not writing
buffers that were dirtied after the checkpoint start. And grouping the
writes per file gives the OS a better chance to group the physical writes.

One problem is that currently the segmentation of relations to 1GB files
is handled at a low level inside md.c, and we don't really have any
visibility into that in the buffer manager. ISTM that some changes to
the smgr interfaces would be needed for this to work well, though just
doing it on a relation per relation basis would also be better than the
current approach.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#8)
Re: Controlling Load Distributed Checkpoints

Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Thinking about this whole idea a bit more, it occurred to me that the
> current approach of write all, then fsync all is really a historical
> artifact of the fact that we used to use the system-wide sync call
> instead of fsyncs to flush the pages to disk. That might not be the best
> way to do things in the new load-distributed-checkpoint world.
>
> How about interleaving the writes with the fsyncs?

I don't think it's a historical artifact at all: it's a valid reflection
of the fact that we don't know enough about disk layout to do low-level
I/O scheduling. Issuing more fsyncs than necessary will do little
except guarantee a less-than-optimal scheduling of the writes.

regards, tom lane

#10 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#9)
Re: Controlling Load Distributed Checkpoints

Tom Lane wrote:

> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Thinking about this whole idea a bit more, it occurred to me that the
>> current approach of write all, then fsync all is really a historical
>> artifact of the fact that we used to use the system-wide sync call
>> instead of fsyncs to flush the pages to disk. That might not be the best
>> way to do things in the new load-distributed-checkpoint world.
>>
>> How about interleaving the writes with the fsyncs?
>
> I don't think it's a historical artifact at all: it's a valid reflection
> of the fact that we don't know enough about disk layout to do low-level
> I/O scheduling. Issuing more fsyncs than necessary will do little
> except guarantee a less-than-optimal scheduling of the writes.

I'm not proposing to issue any more fsyncs. I'm proposing to change the
ordering so that instead of first writing all dirty buffers and then
fsyncing all files, we'd write all buffers belonging to a file, fsync
that file only, then write all buffers belonging to next file, fsync,
and so forth.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#10)
Re: Controlling Load Distributed Checkpoints

Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Tom Lane wrote:
>> I don't think it's a historical artifact at all: it's a valid reflection
>> of the fact that we don't know enough about disk layout to do low-level
>> I/O scheduling. Issuing more fsyncs than necessary will do little
>> except guarantee a less-than-optimal scheduling of the writes.
>
> I'm not proposing to issue any more fsyncs. I'm proposing to change the
> ordering so that instead of first writing all dirty buffers and then
> fsyncing all files, we'd write all buffers belonging to a file, fsync
> that file only, then write all buffers belonging to the next file, fsync,
> and so forth.

But that means that the I/O to different files cannot be overlapped by
the kernel, even if it would be more efficient to do so.

regards, tom lane

#12 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Controlling Load Distributed Checkpoints

Tom Lane wrote:

> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>>
>> I'm not proposing to issue any more fsyncs. I'm proposing to change the
>> ordering so that instead of first writing all dirty buffers and then
>> fsyncing all files, we'd write all buffers belonging to a file, fsync
>> that file only, then write all buffers belonging to the next file, fsync,
>> and so forth.
>
> But that means that the I/O to different files cannot be overlapped by
> the kernel, even if it would be more efficient to do so.

True. On the other hand, if we issue writes in essentially random order,
we might fill the kernel buffers with random blocks and the kernel needs
to flush them to disk as almost random I/O. If we did the writes in
groups, the kernel has a better chance of coalescing them.

I tend to agree that if the goal is to finish the checkpoint as quickly
as possible, the current approach is better. In the context of load
distributed checkpoints, however, it's unlikely the kernel can do any
significant overlapping since we're trickling the writes anyway.

Do we need both strategies?

I'm starting to feel we should give up on smoothing the fsyncs and
distribute the writes only, for 8.3. As we get more experience with that
and its shortcomings, we can enhance our checkpoints further in 8.4.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13 Greg Smith
gsmith@gregsmith.com
In reply to: Heikki Linnakangas (#6)
Re: Controlling Load Distributed Checkpoints

On Thu, 7 Jun 2007, Heikki Linnakangas wrote:

> So there are two extreme ways you can use LDC:
> 1. Finish the checkpoint as soon as possible, without disturbing other
> activity too much.
> 2. Disturb other activity as little as possible, as long as the
> checkpoint finishes in a reasonable time.
> Are both interesting use cases, or is it enough to cater for just one of
> them? I think 2 is easier to tune.

The motivation for the (1) case is that you've got a system that's
dirtying the buffer cache very fast in normal use, where even the
background writer is hard pressed to keep the buffer pool clean. The
checkpoint is the most powerful and efficient way to clean up many dirty
buffers out of such a buffer cache in a short period of time so that
you're back to having room to work in again. In that situation, since
there are many buffers to write out, you'll also be suffering greatly from
fsync pauses. Being able to synchronize writes a little better with the
underlying OS to smooth those out is a huge help.

I'm completely biased because of the workloads I've been dealing with
recently, but I consider (2) so much easier to tune for that it's barely
worth worrying about. If your system is so underloaded that you can let
the checkpoints take their own sweet time, I'd ask if you have enough
going on that you're suffering very much from checkpoint performance
issues anyway. I'm used to being in a situation where if you don't push
out checkpoint data as fast as physically possible, you end up fighting
with the client backends for write bandwidth once the LRU point moves past
where the checkpoint has written out to already. I'm not sure how much
always running the LRU background writer will improve that situation.

> On a Linux system, one way to model it is that the OS flushes dirty buffers
> to disk at the same rate as we write them, but delayed by
> dirty_expire_centisecs. That should hold if the writes are spread out enough.

If they're really spread out, sure. There is congestion avoidance code
inside the Linux kernel that makes dirty_expire_centisecs not quite work
the way it is described under load. All you can say in the general case
is that when dirty_expire_centisecs has passed, the kernel badly wants to
write the buffers out as quickly as possible; that could still be many
seconds after the expiration time on a busy system, or on one with slow
I/O.

On every system I've ever played with Postgres write performance on, I
discovered that the memory-based parameters like dirty_background_ratio
were really driving write behavior, and I almost ignore the expire timeout
now. Plotting the "Dirty:" value in /proc/meminfo as you're running tests
is extremely informative for figuring out what Linux is really doing
underneath the database writes.
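A small sketch of the kind of monitoring Greg describes. The helper below parses a /proc/meminfo-style buffer; the function name is made up, and a real monitoring loop would re-read /proc/meminfo periodically and record the values over a test run:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: pull the "Dirty:" value (in kB) out of
 * /proc/meminfo-style text.  Parsing a caller-supplied buffer keeps the
 * example self-contained; on a real Linux box you'd read the buffer
 * from /proc/meminfo and plot the results while running tests.
 */
static long
meminfo_dirty_kb(const char *meminfo_text)
{
    const char *p = strstr(meminfo_text, "Dirty:");
    long        kb;

    if (p == NULL || sscanf(p, "Dirty: %ld kB", &kb) != 1)
        return -1;              /* field not found */
    return kb;
}
```

Sampling this once a second during a checkpoint shows when the kernel's congestion thresholds (dirty_background_ratio and friends) kick in, which is the behaviour described above.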

The influence of the congestion code is why I made the comment about
watching how long writes are taking to gauge how fast you can dump data
onto the disks. When you're suffering from one of the congestion
mechanisms, the initial writes start blocking, even before the fsync.
That behavior is almost undocumented outside of the relevant kernel source
code.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#14 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#13)
Re: Controlling Load Distributed Checkpoints

"Greg Smith" <gsmith@gregsmith.com> writes:

I'm completely biased because of the workloads I've been dealing with recently,
but I consider (2) so much easier to tune for that it's barely worth worrying
about. If your system is so underloaded that you can let the checkpoints take
their own sweet time, I'd ask if you have enough going on that you're suffering
very much from checkpoint performance issues anyway. I'm used to being in a
situation where if you don't push out checkpoint data as fast as physically
possible, you end up fighting with the client backends for write bandwidth once
the LRU point moves past where the checkpoint has written out to already. I'm
not sure how much always running the LRU background writer will improve that
situation.

I think you're working from a faulty premise.

There's no relationship between the volume of writes and how important the
speed of checkpoint is. In either scenario you should assume a system that is
close to the max i/o bandwidth. The only question is which task the admin
would prefer take the hit for maxing out the bandwidth, the transactions or
the checkpoint.

You seem to have imagined that letting the checkpoint take longer will slow
down transactions. In fact that's precisely the effect we're trying to avoid.
Right now we're seeing tests where Postgres stops handling *any* transactions
for up to a minute. In virtually any real world scenario that would simply be
unacceptable.

That one-minute outage is a direct consequence of trying to finish the
checkpoint as quick as possible. If we spread it out then it might increase
the average i/o load if you sum it up over time, but then you just need a
faster i/o controller.

The only scenario where you would prefer the absolute lowest i/o rate summed
over time would be if you were close to maxing out your i/o bandwidth,
couldn't buy a faster controller, and response time was not a factor, only
sheer volume of transactions processed mattered. That's a much less common
scenario than caring about the response time.

The flip side is that if you have to worry about response time, buying a
faster controller doesn't even help. It would shorten the duration of
the checkpoint but not eliminate it. A 30-second outage every half hour
is just as unacceptable as a 1-minute outage every half hour.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#15 Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#14)
Re: Controlling Load Distributed Checkpoints

On Thu, 7 Jun 2007, Gregory Stark wrote:

> You seem to have imagined that letting the checkpoint take longer will slow
> down transactions.

And you seem to have imagined that I have so much spare time that I'm just
making stuff up to entertain myself and sow confusion.

I observed some situations where delaying checkpoints too long ends up
slowing down both transaction rate and response time, using earlier
variants of the LDC patch and code with similar principles I wrote. I'm
trying to keep the approach used here out of the worst of the corner cases
I ran into, or at least to make it possible for people in those situations to
have some ability to tune out of the bad spots. I am unfortunately not
free to disclose all those test results, and since that project is over I
can't see how the current LDC compares to what I tested at the time.

I plainly stated I had a bias here, one that's not even close to the
average case. My concern here was that Heikki would end up optimizing in
a direction where a really wide spread across the active checkpoint
interval was strongly preferred. I wanted to offer some suggestions on
the type of situation where that might not be true, but where a different
tuning of LDC would still be an improvement over the current behavior.
There are some tuning knobs there that I don't want to see go away until
there's been a wider range of tests to prove they aren't effective.

> Right now we're seeing tests where Postgres stops handling *any* transactions
> for up to a minute. In virtually any real world scenario that would simply be
> unacceptable.

No doubt; I've seen things get close to that bad myself, both on the high
and low end. I collided with the issue in a situation of "maxing out your
i/o bandwidth, couldn't buy a faster controller" at one point, which is
what kicked off my working in this area. It turned out there were still
some software tunables left that pulled the worst case down to the 2-5
second range instead. With more checkpoint_segments to decrease the
frequency, that was just enough to make the problem annoying rather than
crippling. But after that, I could easily imagine a different application
scenario where the behavior you describe is the best case.

This is really a serious issue with the current design of the database,
one that merely changes instead of going away completely if you throw more
hardware at it. I'm perversely glad to hear this is torturing more people
than just me as it improves the odds the situation will improve.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#16 Joshua D. Drake
jd@commandprompt.com
In reply to: Greg Smith (#15)
Re: Controlling Load Distributed Checkpoints

> This is really a serious issue with the current design of the database,
> one that merely changes instead of going away completely if you throw
> more hardware at it. I'm perversely glad to hear this is torturing more
> people than just me as it improves the odds the situation will improve.

It tortures pretty much any high-velocity PostgreSQL DB, of which there
are more and more every day.

Joshua D. Drake


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/

#17Josh Berkus
josh@agliodbs.com
In reply to: Bruce Momjian (#14)
Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints

All,

This brings up another point. With the increased number of .conf
options, the file is getting hard to read again. I'd like to do another
reorganization, but I don't really want to break people's diff scripts.
Should I worry about that?

--Josh

#18Joshua D. Drake
jd@commandprompt.com
In reply to: Josh Berkus (#17)
Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints

Josh Berkus wrote:

All,

This brings up another point. With the increased number of .conf
options, the file is getting hard to read again. I'd like to do another
reorganization, but I don't really want to break people's diff scripts.
Should I worry about that?

As a point of feedback, autovacuum and vacuum should be together.

Joshua D. Drake




#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#17)
Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints

Josh Berkus <josh@agliodbs.com> writes:

This brings up another point. With the increased number of .conf
options, the file is getting hard to read again. I'd like to do another
reorganization, but I don't really want to break people's diff scripts.

Do you have a better organizing principle than what's there now?

regards, tom lane

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Smith (#13)
Re: Controlling Load Distributed Checkpoints

Greg Smith wrote:

On Thu, 7 Jun 2007, Heikki Linnakangas wrote:

So there's two extreme ways you can use LDC:
1. Finish the checkpoint as soon as possible, without disturbing other
activity too much
2. Disturb other activity as little as possible, as long as the
checkpoint finishes in a reasonable time.
Are both interesting use cases, or is it enough to cater for just one
of them? I think 2 is easier to tune.

The motivation for the (1) case is that you've got a system that's
dirtying the buffer cache very fast in normal use, where even the
background writer is hard pressed to keep the buffer pool clean. The
checkpoint is the most powerful and efficient way to clean up many dirty
buffers out of such a buffer cache in a short period of time so that
you're back to having room to work in again. In that situation, since
there are many buffers to write out, you'll also be suffering greatly
from fsync pauses. Being able to synchronize writes a little better
with the underlying OS to smooth those out is a huge help.

ISTM the bgwriter just isn't working hard enough in that scenario.
Assuming we get the lru autotuning patch in 8.3, do you think there's
still merit in using the checkpoints that way?

I'm completely biased because of the workloads I've been dealing with
recently, but I consider (2) so much easier to tune for that it's barely
worth worrying about. If your system is so underloaded that you can let
the checkpoints take their own sweet time, I'd ask if you have enough
going on that you're suffering very much from checkpoint performance
issues anyway. I'm used to being in a situation where if you don't push
out checkpoint data as fast as physically possible, you end up fighting
with the client backends for write bandwidth once the LRU point moves
past where the checkpoint has written out to already. I'm not sure how
much always running the LRU background writer will improve that situation.

I'd think it eliminates the problem. Assuming we keep the LRU cleaning
running as usual, I don't see how writing faster during checkpoints
could ever be beneficial for concurrent activity. The more you write,
the less bandwidth there's available for others.

Doing the checkpoint as quickly as possible might be slightly better for
average throughput, but that's a different matter.

On every system I've ever played with Postgres write performance on, I
discovered that the memory-based parameters like dirty_background_ratio
were really driving write behavior, and I almost ignore the expire
timeout now. Plotting the "Dirty:" value in /proc/meminfo as you're
running tests is extremely informative for figuring out what Linux is
really doing underneath the database writes.

Interesting. I haven't touched any of the kernel parameters yet in my
tests. It seems we need to try different parameters and see how the
dynamics change. But we must also keep in mind that the average DBA doesn't
change any settings, and might not even be able or allowed to. That
means the defaults should work reasonably well without tweaking the OS
settings.
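As an aside, the /proc/meminfo sampling Greg describes is easy to script.
The sketch below is mine, not from anyone's test harness; the file paths
are the standard Linux ones, and the helper names are made up. It polls
the "Dirty:" figure alongside the vm writeback knobs so the two can be
correlated during a test run:

```python
import re
import time

def read_dirty_kb(meminfo="/proc/meminfo"):
    """Return the current amount of dirty page cache in kB."""
    with open(meminfo) as f:
        for line in f:
            m = re.match(r"Dirty:\s+(\d+)\s+kB", line)
            if m:
                return int(m.group(1))
    raise RuntimeError("no Dirty: line in " + meminfo)

def read_vm_setting(name, vmdir="/proc/sys/vm"):
    """Read one kernel writeback tunable, e.g. dirty_ratio or
    dirty_background_ratio."""
    with open("%s/%s" % (vmdir, name)) as f:
        return int(f.read().strip())

def sample(seconds, interval=1.0):
    """Print a timestamped series of the Dirty value, suitable for
    plotting against the checkpoint log afterwards."""
    end = time.time() + seconds
    while time.time() < end:
        print("%.0f %d" % (time.time(), read_dirty_kb()))
        time.sleep(interval)
```

Running sample() for the length of a benchmark and plotting the output is
the "extremely informative" picture Greg refers to.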

The influence of the congestion code is why I made the comment about
watching how long writes are taking to gauge how fast you can dump data
onto the disks. When you're suffering from one of the congestion
mechanisms, the initial writes start blocking, even before the fsync.
That behavior is almost undocumented outside of the relevant kernel
source code.

Yeah, that's controlled by dirty_ratio, if I've understood the
parameters correctly. If we spread out the writes enough, we shouldn't
hit that limit or congestion. That's the point of the patch.
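The spreading referred to here amounts to pacing the write loop. A minimal
sketch of that idea, with made-up parameter names (the real patch drives
this from checkpoint_timeout and the GUCs discussed upthread, and the
write callback stands in for the backend's actual buffer write):

```python
import time

def paced_checkpoint_writes(dirty_buffers, interval_s,
                            write_percent=50.0, min_rate_kb_s=1024,
                            buffer_kb=8, write_fn=None):
    """Write dirty buffers spread over write_percent of the checkpoint
    interval, but never below min_rate_kb_s, so a small checkpoint still
    finishes promptly instead of being stretched out needlessly."""
    budget_s = interval_s * write_percent / 100.0
    total_kb = len(dirty_buffers) * buffer_kb
    paced = total_kb / budget_s if budget_s > 0 else float("inf")
    rate_kb_s = max(paced, min_rate_kb_s)   # minimum-rate floor
    delay_s = buffer_kb / rate_kb_s         # sleep between buffer writes
    for buf in dirty_buffers:
        if write_fn is not None:
            write_fn(buf)                   # stands in for the real write
        time.sleep(delay_s)                 # let the OS flush at its own pace
    return rate_kb_s
```

The intent is that writes trickle into the page cache slowly enough that
the kernel's dirty_ratio congestion threshold is never reached, while the
floor keeps a nearly empty checkpoint from dawdling.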

Do you have time / resources to do testing? You've clearly spent a lot
of time on this, and I'd be very interested to see some actual numbers
from your tests with various settings.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
