Final background writer cleanup for 8.3

Started by Greg Smith · over 18 years ago · 27 messages · pgsql-hackers
#1Greg Smith
gsmith@gregsmith.com

In the interest of closing work on what's officially titled the "Automatic
adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
think this is at, what I'm working on right now, and see if feedback from
that changes how I submit my final attempt for a useful patch in this area
this week. Hopefully there are enough free eyes to stare at this now to
wrap up a plan for what to do that makes sense and still fits in the 8.3
schedule. I'd hate to see this pushed off to 8.4 without making some
forward progress here after the amount of work done already, particularly
when odds aren't good I'll still be working with this code by then.

Let me start with a summary of the conclusions I've reached based on my
own tests and the set that Heikki did last month (last results at
http://community.enterprisedb.com/bgwriter/ ); Heikki will hopefully chime
in if he disagrees with how I'm characterizing things:

1) In the current configuration, if you have a large setting for
bgwriter_lru_percent and/or a small setting for bgwriter_delay, that can
be extremely wasteful because the background writer will consume
CPU/locking resources scanning the buffer pool needlessly. This problem
should go away.

2) Having backends write their own buffers out does not significantly
degrade performance, as those turn into cached OS writes which generally
execute fast enough to not be a large drag on the backend.

3) Any attempt to scan significantly ahead of the current strategy point
will result in some amount of premature writes that decreases overall
efficiency in cases where the buffer is touched again before it gets
re-used. The more in advance you go, the worse this inefficiency is.
The most efficient way for many workloads is to just let the backends do
all the writes.

4) Tom observed that there's no reason to ever scan the same section of
the pool more than once, because anything that changes a buffer's status
will always make it un-reusable until the strategy point has passed over
it. But because of (3), this does not mean that one should drive forward
constantly trying to lap the buffer pool and catch up with the strategy
point.

5) There hasn't been any definitive proof that the background writer is
helpful at all in the context of 8.3. However, yanking it out altogether
may be premature, as there are some theorized ways that it may be helpful
in real-world situations with more intermittent workloads than are
generally encountered in a benchmarking situation. I personally feel there
is some potential for the BGW to become more useful in the context of the
8.4 release if it starts doing things like adding pages it expects to be
recycled soon onto the free list, which could improve backend efficiency
quite a bit compared to the current situation where each backend is
normally running their own scan. But that's a bit too big to fit into 8.3
I think.

What I'm aiming for here is to have the BGW do as little work as possible,
as efficiently as possible, but not remove it altogether. (2) suggests
that this approach won't decrease performance compared to the current 8.2
situation, where I've seen evidence some are over-tuning to have a very
aggressive BGW scan an enormous amount of the pool each time because they
have resources to burn. Having a generally self-tuning background writer
that errs on the lazy side stay in the codebase satisfies (5). Here is
what the patch I'm testing right now does to try and balance all this out:

A) Counters are added to pg_stat_bgwriter that show how many buffers were
written by the backends, by the background writer, how many times
bgwriter_lru_maxpages was hit, and the total number of buffers allocated.
This at least allows monitoring what's going on as people run their own
experiments. Heikki's results included data using the earlier version of
this patch I assembled (which now conflicts with HEAD; I have an
updated one).

B) bgwriter_lru_percent is removed as a tunable. This eliminates (1).
The idea of scanning a fixed percentage doesn't ever make sense given the
observations above; we scan until we accomplish the cleaning mission
instead.

C) bgwriter_lru_maxpages stays as an absolute maximum number of pages that
can be written in one sweep each bgwriter_delay. This allows easily
turning the writer off altogether by setting it to 0, or limiting how
active it tries to be in situations where (3) is a concern. Admins can
monitor the amount that the max is hit in pg_stat_bgwriter and consider
raising it (or lowering the delay) if it proves to be too limiting. I
think the default needs to be bumped to something more like 100 rather
than the current tiny one before the stock configuration can be considered
"self-tuning" at all.

D) The strategy code gets a "passes" count added to it that serves as a
sort of high-order int for how many times the buffer cache has been looked
over in its entirety.

E) When the background writer starts the LRU cleaner, it checks if the
strategy point has passed where it last cleaned up to, using the
passes+buf_id "pointer". If so, it just starts cleaning from the strategy
point as it always has. But if it's still ahead it just continues from
there, thus implementing the core of (4)'s insight. It estimates how many
buffers are probably clean in the space between the strategy point and
where it's starting at, based on how far ahead it is combined with
historical data about how many buffers are scanned on average per reusable
buffer found (the exact computation of this number is the main thing I'm
still fiddling with).
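
The mechanics of (D) and (E) can be sketched roughly as follows. This is an
illustrative model, not the actual patch code; the names and the pool size
are assumptions:

```python
# Simplified model of the (passes, buf_id) "pointer" used to decide whether
# the strategy point has lapped the LRU cleaner's last stopping point.
# NBuffers and all identifiers here are illustrative, not the patch's own.

NBuffers = 1024  # assumed buffer pool size for the example

def position(passes, buf_id):
    """Collapse (passes, buf_id) into one monotonically increasing offset;
    'passes' acts as the high-order word over the buffer index."""
    return passes * NBuffers + buf_id

def cleaner_start(strategy_passes, strategy_buf, saved_passes, saved_buf):
    """Return where the LRU cleaner should resume scanning this cycle."""
    if position(saved_passes, saved_buf) <= position(strategy_passes,
                                                     strategy_buf):
        # The strategy point has caught up with (or passed) our saved
        # location: restart cleaning from the strategy point as before.
        return strategy_passes, strategy_buf
    # Still ahead of the strategy point: continue from where we left off.
    return saved_passes, saved_buf
```

The gap between the two positions is also what feeds the estimate of how
many already-clean buffers lie between the strategy point and the resume
point.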

F) A moving average of buffer allocations is used to predict how many
clean buffers are expected to be needed in the next delay cycle. The
original patch from Itagaki doubled the recent allocations to pad this
out; (3) suggests that's too much.

G) Scan the buffer pool until either
--Enough reusable buffers have been located or written out to fill the
upcoming allocation need, taking into account the estimate from (E); this
is the normal expected way the scan will terminate.
--We've written bgwriter_lru_maxpages
--We "lap" and catch the strategy point
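
As a rough sketch of the loop in (G) (illustrative Python, not the actual
C implementation; the helper callbacks and all names are assumptions):

```python
def lru_sweep(predicted_allocs, est_already_clean, maxpages,
              buffers_ahead, is_reusable, write_buffer):
    """One bgwriter_delay cycle of the LRU cleaner, stopping on any of the
    three conditions in (G).  'buffers_ahead' is how many buffers separate
    the cleaning position from the strategy point (at most one full lap);
    is_reusable/write_buffer stand in for real buffer inspection and I/O."""
    reusable = est_already_clean   # credit for buffers estimated clean (E)
    written = 0
    scanned = 0
    while reusable < predicted_allocs:   # allocation need (F) not yet met
        if scanned >= buffers_ahead:     # "lapped" the strategy point
            break
        buf = scanned
        scanned += 1
        if is_reusable(buf):
            reusable += 1
        else:
            write_buffer(buf)            # dirty: write it out
            written += 1
            reusable += 1                # once written, it becomes reusable
            if written >= maxpages:      # hit bgwriter_lru_maxpages cap
                break
    return scanned, written, reusable
```

Note how the estimate from (E) is credited up front, so a cleaner that is
well ahead of the strategy point may not need to scan at all.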

In addition to removing a tunable and making the remaining two less
critical, one of my hopes here is that the more efficient way this scheme
operates will allow using much smaller values for bgwriter_delay than have
been practical in the current codebase, which may ultimately have its own
value.

That's what I've got working here now, still need some more tweaking and
testing before I'm done with the code but there's not much left. The main
problem I foresee is that this approach is moderately complicated, adding a
lot of new code and regular+static variables, for something that's not
really proven to be valuable. I will not be surprised if my patch is
rejected on that basis. That's why I wanted to get the big picture
painted in this message while I finish up the work necessary to submit it,
'cause if the whole idea is doomed anyway I might as well stop now.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#1)
Re: Final background writer cleanup for 8.3

Greg Smith <gsmith@gregsmith.com> writes:

In the interest of closing work on what's officially titled the "Automatic
adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
think this is at ...

2) Having backends write their own buffers out does not significantly
degrade performance, as those turn into cached OS writes which generally
execute fast enough to not be a large drag on the backend.

[ itch... ] That assumption scares the heck out of me. It is doubtless
true in a lightly loaded system, but once the kernel is under any kind
of memory pressure I think it's completely wrong. I think designing the
system around this assumption will lead to something that performs great
as long as you're not pushing it hard.

However, your actual specific proposals do not seem to rely on this
assumption extensively, so I wonder why you are emphasizing it.

The only parts of your specific proposals that I find a bit dubious are

... It estimates how many
buffers are probably clean in the space between the strategy point and
where it's starting at, based on how far ahead it is combined with
historical data about how many buffers are scanned on average per reusable
buffer found (the exact computation of this number is the main thing I'm
still fiddling with).

If you're still fiddling with it then you probably aren't going to get
it right in the next few days. Perhaps you should think about whether
this can be left out entirely for 8.3 and revisited later.

F) A moving average of buffer allocations is used to predict how many
clean buffers are expected to be needed in the next delay cycle. The
original patch from Itagaki doubled the recent allocations to pad this
out; (3) suggests that's too much.

Maybe you need to put back the eliminated tuning parameter in the form
of the scaling factor to be used here. I don't like 1.0, mainly because
I don't believe your assumption (2). I'm willing to concede that 2.0
might be too much, but I don't know where in between is the sweet spot.

Also, we might need a tuning parameter for the reaction speed of the
moving average --- what are you using for that?

regards, tom lane

#3Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#2)
Re: Final background writer cleanup for 8.3

On Thu, 23 Aug 2007, Tom Lane wrote:

It is doubtless true in a lightly loaded system, but once the kernel is
under any kind of memory pressure I think it's completely wrong.

The fact that so many tests I've done or seen get maximum throughput in
terms of straight TPS with the background writer turned completely off is
why I stated that so explicitly. I understand what you're saying in terms
of memory pressure, all I'm suggesting is that the empirical tests suggest
the current background writer even with moderate improvements doesn't
necessarily help when you get there. If writes are blocking, whether the
background writer does them slightly ahead of time or whether the backend
does them itself doesn't seem to matter very much. On a heavily loaded
system, your throughput is bottlenecked at the disk either way--and
therefore it's all the more important in those cases to never do a write
until you absolutely have to, lest it be wasted.

If you're still fiddling with it then you probably aren't going to get
it right in the next few days.

The implementation is fine most of the time, I've just found some corner
cases in testing I'd like to improve stability on (mainly how best to
handle when no buffers were allocated during the previous period, some
small concerns about the first pass over the pool). What I'm thinking of
doing is taking a couple of my assumptions/techniques and turning them
into things that can be turned on or off with #DEFINE, that way the parts
of the code that people don't like are easy to identify and pull out.
I've already done that with one section.

Maybe you need to put back the eliminated tuning parameter in the form
of the scaling factor to be used here. I don't like 1.0, mainly because
I don't believe your assumption (2). I'm willing to concede that 2.0
might be too much, but I don't know where in between is the sweet spot.

That would be easy to implement and add some flexibility, so I'll do that.
bgwriter_lru_percent becomes bgwriter_lru_multiplier, possibly to be
renamed later if someone comes up with a snappier name.

Also, we might need a tuning parameter for the reaction speed of the
moving average --- what are you using for that?

It's hard-coded at 16 samples. It seemed stable around 10-20; I picked 16
so that it might optimize usefully to a bit shift. On the reaction side,
it actually reacts faster than that--if the most recent allocation is
greater than the average, it uses that instead. The number of samples has
more of an impact on the trailing side, and accordingly isn't that
critical.
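
A minimal sketch of that smoothing behavior (illustrative code, not the
patch's actual implementation; the names are assumptions):

```python
SAMPLES = 16  # hard-coded smoothing length; a power of two so the
              # divide can become a bit shift

def smoothed_alloc(prev_avg, recent_allocs):
    """Update the allocation estimate: jump immediately when the latest
    period allocated more than the running average, but decay slowly via
    a 16-sample exponential moving average on the trailing side."""
    if recent_allocs > prev_avg:
        return recent_allocs          # react instantly on the rising side
    # trailing side: new = old + (sample - old) / 16
    return prev_avg + (recent_allocs - prev_avg) / SAMPLES
```

With the multiplier GUC discussed above, the predicted need for the next
cycle would then be something like
`bgwriter_lru_multiplier * smoothed_alloc(...)`.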

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#4Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#2)
Re: Final background writer cleanup for 8.3

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Greg Smith <gsmith@gregsmith.com> writes:

In the interest of closing work on what's officially titled the "Automatic
adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
think this is at ...

2) Having backends write their own buffers out does not significantly
degrade performance, as those turn into cached OS writes which generally
execute fast enough to not be a large drag on the backend.

[ itch... ] That assumption scares the heck out of me. It is doubtless
true in a lightly loaded system, but once the kernel is under any kind
of memory pressure I think it's completely wrong. I think designing the
system around this assumption will lead to something that performs great
as long as you're not pushing it hard.

I think Heikki's experiments showed it wasn't true for at least some kinds of
heavy loads. However I would expect it to depend heavily on just what kind of
load the machine is under. At least if it's busy writing then I would expect
it to throttle writes. Perhaps in TPCC there are enough reads to throttle the
write rate to something the kernel can buffer.

If you're still fiddling with it then you probably aren't going to get
it right in the next few days. Perhaps you should think about whether
this can be left out entirely for 8.3 and revisited later.

How does all of this relate to your epiphany that we should just have the
bgwriter run a full clock sweep ahead of the clock hand without retracing
its steps?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Gregory Stark (#4)
Re: Final background writer cleanup for 8.3

Gregory Stark wrote:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Greg Smith <gsmith@gregsmith.com> writes:

In the interest of closing work on what's officially titled the "Automatic
adjustment of bgwriter_lru_maxpages" patch, I wanted to summarize where I
think this is at ...
2) Having backends write their own buffers out does not significantly
degrade performance, as those turn into cached OS writes which generally
execute fast enough to not be a large drag on the backend.

[ itch... ] That assumption scares the heck out of me. It is doubtless
true in a lightly loaded system, but once the kernel is under any kind
of memory pressure I think it's completely wrong. I think designing the
system around this assumption will lead to something that performs great
as long as you're not pushing it hard.

I think Heikki's experiments showed it wasn't true for at least some kinds of
heavy loads. However I would expect it to depend heavily on just what kind of
load the machine is under. At least if it's busy writing then I would expect
it to throttle writes. Perhaps in TPCC there are enough reads to throttle the
write rate to something the kernel can buffer.

I ran a bunch of DBT-2 in different configurations, as well as simple
single-threaded tests like random DELETEs on a table with index, steady
rate of INSERTs to a table with no indexes, and bursts of INSERTs with
different bursts sizes and delays between them. I tried the tests with
different bgwriter settings, including turning it off and with the patch
applied, and with different shared_buffers settings.

I was not able to find a test where turning bgwriter on performed better
than turning it off.

If anyone out there has a repeatable test case where bgwriter does help,
I'm all ears. The theory of moving the writes out of the critical path
does sound reasonable, so I'm sure there is test case to demonstrate the
effect, but it seems to be pretty darn hard to find.

The cold, rational side of me says we need a test case to show the
benefit, or if one can't be found, we should remove bgwriter altogether.
The emotional side of me tells me we can't go that far. A reasonable
compromise would be to apply the autotuning patch on the grounds that it
removes a GUC variable that's next to impossible to tune right, even
though we can't show a performance benefit compared to bgwriter=off. And
it definitely makes sense not to restart the scan from the clock sweep
hand on each bgwriter round; as Tom pointed out, it's a waste of time.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#4)
Re: Final background writer cleanup for 8.3

Gregory Stark <stark@enterprisedb.com> writes:

How does all of this relate to your epiphany that we should just have the
bgwriter run a full clock sweep ahead of the clock hand without retracing
its steps?

Well, it's still clearly silly for the bgwriter to rescan buffers it's
already cleaned. But I think we've established that the "keep a lap
ahead" idea goes too far, because it writes dirty buffers speculatively,
long before they actually are needed, and there's just too much chance
of the writes being wasted due to re-dirtying. When proposing that
idea I had supposed that wasted writes wouldn't hurt much, but that's
evidently wrong.

Heikki makes a good point nearby that if you are not disk write
bottlenecked then it's perfectly OK for backends to issue writes,
as it'll just result in a transfer to kernel cache space, and no actual
wait for I/O. And if you *are* write-bottlenecked, then the last thing
you want is any wasted writes. So a fairly conservative strategy that
does bgwrites only "just in time" seems like what we need to aim at.

I think the moving-average-of-requests idea, with a user-adjustable
scaling factor, is the best we have at the moment.

regards, tom lane

#7Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#6)
Re: Final background writer cleanup for 8.3

On Fri, 24 Aug 2007, Tom Lane wrote:

Heikki makes a good point nearby that if you are not disk write
bottlenecked then it's perfectly OK for backends to issue writes, as
it'll just result in a transfer to kernel cache space, and no actual
wait for I/O. And if you *are* write-bottlenecked, then the last thing
you want is any wasted writes.

Which is the same thing I was saying in my last message, so I'm content
we're all on the same page here now--and that the contents of that page
are now clear in the archives for when this comes up again.

So a fairly conservative strategy that does bgwrites only "just in time"
seems like what we need to aim at.

And that's exactly what I've been building. Feedback and general feeling
that I'm doing the right thing appreciated, am returning to the code with
scaling factor as a new tunable but plan otherwise unchanged.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#8Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Heikki Linnakangas (#5)
Re: Final background writer cleanup for 8.3

On Fri, Aug 24, 2007 at 7:41 AM, in message
<46CED1EF.8010707@enterprisedb.com>, "Heikki Linnakangas"
<heikki@enterprisedb.com> wrote:

I was not able to find a test where turning bgwriter on performed better
than turning it off.

Any tests which focus just on throughput don't address the problems which
caused us so much grief. What we need is some sort of test which generates
a moderate write load in the background, while paying attention to the
response time of a large number of read-only queries. The total load should
not be enough to saturate the I/O bandwidth overall if applied evenly.

The problem which the background writer has solved for us is that we have
three layers of caching (PostgreSQL, OS, and RAID controller), each with its
own delay before writing; when something like fsync triggers a cascade from
one cache to the next, the write burst bottlenecks the I/O, and reads exceed
acceptable response times. The two approaches which seem to prevent this
problem are to disable all OS delays in writing dirty pages, or to minimize
the delays in PostgreSQL writing dirty pages.

Throughput is not everything. Response time matters.

If anyone out there has a repeatable test case where bgwriter does help,
I'm all ears.

All we have is a production system where PostgreSQL failed to perform at a
level acceptable to the users without it.

The cold, rational side of me says we need a test case to show the
benefit, or if one can't be found, we should remove bgwriter altogether.

I would be fine with that if I could configure the back end to always write a
dirty page to the OS when it is written to shared memory. That would allow
Linux and XFS to do their job in a timely manner, and avoid this problem.

I know we're doing more in 8.3 to move this from the OS's realm into
PostgreSQL code, but until I have a chance to test that, I want to make sure
that what has been proven to work for us is not broken.

-Kevin

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#8)
Re: Final background writer cleanup for 8.3

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Any tests which focus just on throughput don't address the problems which
caused us so much grief.

This is a good point: a steady-state load is either going to be in the
regime where you're not write-bottlenecked, or the one where you are;
and either way the bgwriter isn't going to look like it helps much.

The real use of the bgwriter, perhaps, is to smooth out a varying load
so that you don't get pushed into the write-bottlenecked mode during
spikes. We've already had to rethink the details of how we made that
happen with respect to preventing checkpoints from causing I/O spikes.
Maybe LRU buffer flushes need a rethink too.

Right at the moment I'm still comfortable with what Greg is doing, but
there's an argument here for a more aggressive scaling factor on
number-of-buffers-to-write than he thinks. Still, as long as we have a
GUC variable in there, tuning should be possible.

regards, tom lane

#10Greg Smith
gsmith@gregsmith.com
In reply to: Kevin Grittner (#8)
Re: Final background writer cleanup for 8.3

On Fri, 24 Aug 2007, Kevin Grittner wrote:

I would be fine with that if I could configure the back end to always write a
dirty page to the OS when it is written to shared memory. That would allow
Linux and XFS to do their job in a timely manner, and avoid this problem.

You should take a look at the "io storm on checkpoints" thread on the
pgsql-performance@postgresql.org started by Dmitry Potapov on 8/22 if you
aren't on that list. He was running into the same problem as you (and me
and lots of other people) and had an interesting resolution based on
tuning the Linux kernel so that it basically stopped caching writes.
What you suggest here would be particularly inefficient because of how
much extra I/O would happen on the index blocks involved in the active
tables.

I know we're doing more in 8.3 to move this from the OS's realm into
PostgreSQL code, but until I have a chance to test that, I want to make sure
that what has been proven to work for us is not broken.

The background writer code that's in 8.2 can be configured as a big
sledgehammer that happens to help in this area while doing large amounts
of collateral damage via writing things prematurely. Some of the people
involved in the 8.3 code rewrite and testing were having the same problem
as you on a similar scale--I recall Greg Stark commenting that he had a
system that was freezing for a full 30 seconds the way yours was.

I would be extremely surprised to find that the code that's already in 8.3
isn't a big improvement over what you're doing now based on how much it
has helped others running into this issue. And much of the code that
you're relying on now to help with the problem (the all-scan portion of
the BGW) has already been removed as part of that.

Switching to my Agent Smith voice: "No Kevin, your old background writer
is already dead". You'd have to produce some really unexpected and
compelling results during the beta period for it to get put back again.
The work I'm still doing here is very much fine-tuning in comparison to
what's already been committed into 8.3.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#11Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#10)
Re: Final background writer cleanup for 8.3

On Fri, Aug 24, 2007 at 5:47 PM, in message
<Pine.GSO.4.64.0708241807500.28499@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:

On Fri, 24 Aug 2007, Kevin Grittner wrote:

I would be fine with that if I could configure the back end to always write a
dirty page to the OS when it is written to shared memory. That would allow
Linux and XFS to do their job in a timely manner, and avoid this problem.

You should take a look at the "io storm on checkpoints" thread on the
pgsql-performance@postgresql.org started by Dmitry Potapov on 8/22 if you
aren't on that list. He was running into the same problem as you (and me
and lots of other people) and had an interesting resolution based on
tuning the Linux kernel so that it basically stopped caching writes.

I saw it. I think that I'd rather have a write-through cache in PostgreSQL
than give up OS caching entirely. The problem seems to be caused by the
cascade from one cache to the next, so I can easily believe that disabling
the delay on either one solves the problem.

What you suggest here would be particularly inefficient because of how
much extra I/O would happen on the index blocks involved in the active
tables.

I've certainly seen that assertion on these lists often. I don't think I've
yet seen any evidence that it's true. When I made the background writer
more aggressive, there was no discernible increase in disk writes at the OS
level (much less from controller cache to the drives). This may not be true
with some of the benchmark software, but in our environment there tends to
be a lot of activity on a single court case, and then they're done with it.
(I spent some time looking at this to tune our heuristics for generating
messages on our interfaces to business partners.)

I know we're doing more in 8.3 to move this from the OS's realm into
PostgreSQL code, but until I have a chance to test that, I want to make sure
that what has been proven to work for us is not broken.

The background writer code that's in 8.2 can be configured as a big
sledgehammer that happens to help in this area while doing large amounts
of collateral damage via writing things prematurely.

Again -- to the OS cache, where it sits and accumulates other changes until
the page settles.

I would be extremely surprised to find that the code that's already in 8.3
isn't a big improvement over what you're doing now based on how much it
has helped others running into this issue.

I'm certainly hoping that it will be. I'm not moving to it for production
until I've established that as a fact, however.

And much of the code that
you're relying on now to help with the problem (the all-scan portion of
the BGW) has already been removed as part of that.

Switching to my Agent Smith voice: "No Kevin, your old background writer
is already dead". You'd have to produce some really unexpected and
compelling results during the beta period for it to get put back again.

If I fail to get resources approved to test during beta, this could become
an issue later, when we do get around to testing it. (There's exactly zero
chance of us moving to something which so radically changes a problem area
for us without serious testing.)

For what it's worth, the background writer settings I'm using weren't
arrived at entirely randomly. I monitored I/O during episodes of the
database freezing up, and looked at how many writes per second were going
through. I then reasoned that there was no good reason NOT to push data out
from PostgreSQL to the OS at that speed. I split the writes between the LRU
and full cache aspects of the background writer, with heavier weight given
to getting all dirty pages pushed out to the OS cache so that they could
start to age through the OS timers. (While the raw numbers totaled to the
peak write load, I figured I was actually allowing some slack, since there
was the percentage limit and the two scans would often cover the same
ground, not to mention the assumption that the interval was a sleep time
from the end of one run to the start of the next.) Since it was a
production system, I made incremental changes each day, and each day the
problem became less severe. At the point where I finally set it to my
calculated numbers, we stopped seeing the problem.

I'm not entirely convinced that it's a sound assumption that we should
always try to keep some dirty buffers in the cache on the off chance that
we might be smarter than the OS/FS/RAID controller algorithms about when to
write them. That said, the 8.3 changes sound as though they are likely to
reduce the problems with I/O-related freezes.

Is it my imagination, or are we coming pretty close to the point where we
could accommodate the oft-requested feature of dealing directly with a raw
volume, rather than going through the file system at all?

-Kevin

#12Greg Smith
gsmith@gregsmith.com
In reply to: Kevin Grittner (#11)
Re: Final background writer cleanup for 8.3

On Sat, 25 Aug 2007, Kevin Grittner wrote:

in our environment there tends to be a lot of activity on a single court
case, and then they're done with it.

I submitted a patch to 8.3 that lets contrib/pg_buffercache show the
usage_count data for each of the buffers. It's actually pretty tiny; you
might consider applying just that patch to your 8.2 production system and
installing the module (as an add-in, it's easy enough to back out). See
http://archives.postgresql.org/pgsql-patches/2007-03/msg00555.php

With that patch in place, try a query like

select usagecount,count(*),isdirty from pg_buffercache group by
isdirty,usagecount order by isdirty,usagecount;

That lets you estimate how much waste would be involved for your
particular data if you wrote it out early--the more high usage_count
blocks in there cache, the worse the potential waste. With the tests I
was running, the hot index blocks were pegged at the maximum count allowed
(5) and they were taking up around 20% of the buffer cache. If those were
written out every time they were touched, it would be a bad scene.

It sounds like your system has a lot of data where the usage_count would
be much lower on average, which would explain why you've been so
successful with resolving it using the background writer. That's a
slightly easier problem to solve than the one I've been banging on.

I'm not moving to it for production until I've established that as a
fact, however.

And you'd be crazy to do otherwise.

I'm not entirely convinced that it's a sound assumption that we should
always try to keep some dirty buffers in the cache on the off chance that
we might be smarter than the OS/FS/RAID controller algorithms about when to
write them.

All I can say is that every time someone had tried to tune the code toward
writing that much more proactively, the results haven't seemed like an
improvement. I wouldn't characterize it as an assumption--it's a theory
that seems to hold every time it's tested. At least on the kind of Linux
systems people put into production right now (which often have relatively
old kernels), the OS is not as smart as everyone would like it to be in
this area.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#13Gregory Stark
stark@enterprisedb.com
In reply to: Kevin Grittner (#11)
Re: Final background writer cleanup for 8.3

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

Is it my imagination, or are we coming pretty close to the point where we
could accommodate the oft-requested feature of dealing directly with a raw
volume, rather than going through the file system at all?

Or O_DIRECT.

I think the answer is that we've built enough intelligence that it's feasible
from the memory management side.

However there's another side to that problem. a) You would either need to have
multiple bgwriters or have the bgwriter use aio, since having only one would
serialize your i/o, which would be a big hit to i/o bandwidth. b) You need some
solution for preemptively reading ahead on sequential reads.

I don't think we're terribly far off from being able to do it. The traditional
response has always been that our time is better spent doing database stuff
rather than reimplementing what the OS people are doing better. And also that
the OS has more information about the hardware and so can schedule I/O more
efficiently.

However there's also a strong counter-argument that we have more information
about what we're intending to use the data for and how urgent any given i/o
is.

I'm not sure how that balancing act ends. I have a hunch but I guess it would
take experiments to get a real answer. And the answer might be very different
on different OSes and hardware configurations.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#14Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#12)
Re: Final background writer cleanup for 8.3

On Sun, Aug 26, 2007 at 12:51 AM, in message

<Pine.GSO.4.64.0708260115400.14470@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:

On Sat, 25 Aug 2007, Kevin Grittner wrote:

in our environment there tends to be a lot of activity on a single court
case, and then they're done with it.

I submitted a patch to 8.3 that lets contrib/pg_buffercache show the
usage_count data for each of the buffers. It's actually pretty tiny; you
might consider applying just that patch to your 8.2 production system and
installing the module (as an add-in, it's easy enough to back out). See
http://archives.postgresql.org/pgsql-patches/2007-03/msg00555.php

With that patch in place, try a query like

select usagecount,count(*),isdirty from pg_buffercache group by
isdirty,usagecount order by isdirty,usagecount;

That lets you estimate how much waste would be involved for your
particular data if you wrote it out early--the more high usage_count
blocks in the cache, the worse the potential waste. With the tests I
was running, the hot index blocks were pegged at the maximum count allowed
(5) and they were taking up around 20% of the buffer cache. If those were
written out every time they were touched, it would be a bad scene.

Just to be sure that I understand, are you saying it would be a bad scene if
the physical writes happened, or that the overhead of pushing them out to
the OS would be crippling?

Anyway, I've installed this on the machine that I proposed using for the
tests. It is our older generation of central servers, soon to be put to
some less critical use as we bring the newest generation on line and the
current "new" machines fall back to secondary roles in our central server
pool. It is currently a replication target for the 72 county-based circuit
court systems, but is just there for ad hoc queries against statewide data;
there's no web load present.

Running the suggested query a few times, with the samples separated by a few
seconds each, I got the following. (The Sunday afternoon replication load
is unusual in that there will be very few users entering any data, just a
trickle of input from our law enforcement interfaces, but a lot of the
county middle tiers will have noticed that there is idle time and that it
has been more than 23 hours since the start of the last synchronization of
county data against the central copies, and so will be doing massive selects
to look for and report any "drift".) I'll check again during normal weekday
load.

usagecount | count | isdirty
------------+-------+---------
0 | 8711 | f
1 | 9394 | f
2 | 1188 | f
3 | 869 | f
4 | 160 | f
5 | 157 | f
| 1 |
(7 rows)

usagecount | count | isdirty
------------+-------+---------
0 | 9033 | f
1 | 8849 | f
2 | 1623 | f
3 | 619 | f
4 | 181 | f
5 | 175 | f
(6 rows)

usagecount | count | isdirty
------------+-------+---------
0 | 9093 | f
1 | 6702 | f
2 | 2267 | f
3 | 602 | f
4 | 428 | f
5 | 1388 | f
(6 rows)

usagecount | count | isdirty
------------+-------+---------
0 | 6556 | f
1 | 7188 | f
2 | 3648 | f
3 | 2074 | f
4 | 720 | f
5 | 293 | f
| 1 |
(7 rows)

usagecount | count | isdirty
------------+-------+---------
0 | 6569 | f
1 | 7855 | f
2 | 3942 | f
3 | 1181 | f
4 | 532 | f
5 | 401 | f
(6 rows)

I also ran the query mentioned in the cited email about 100 times, with 52
instead of 32. (I guess I have a bigger screen.) It would gradually go
from entirely -1 values to mostly -2 with a few -1, then gradually back to
all -1. Repeatedly. I never saw anything other than -1 or -2. Of course
this is with our aggressive background writer settings.

This contrib module seems pretty safe, patch and all. Does anyone think
there is significant risk to slipping it into the 8.2.4 database where we
have massive public exposure on the web site handling 2 million hits per
day?

By the way, Greg, lest my concerns about this be misinterpreted -- I do
really appreciate the effort you've put into analyzing this and tuning the
background writer. I just want to be very cautious here, and I do get
downright alarmed at some of the posts which seem to deny the reality of the
problems which many have experienced with write spikes choking off reads to
the point of significant user impact. I also think we need to somehow
develop a set of tests which report maximum response time on (what should
be) fast queries while the database is under different loads, so that those
of us for whom reliable response time is more important than maximum overall
throughput are protected from performance regressions.

-Kevin

#15Greg Smith
gsmith@gregsmith.com
In reply to: Kevin Grittner (#14)
Re: Final background writer cleanup for 8.3

On Sun, 26 Aug 2007, Kevin Grittner wrote:

usagecount | count | isdirty
------------+-------+---------
0 | 8711 | f
1 | 9394 | f
2 | 1188 | f
3 | 869 | f
4 | 160 | f
5 | 157 | f

Here's a typical sample from your set. Notice how you've got very few
buffers with a high usage count. This is a situation the background
writer is good at working with. Either the old or new work-in-progress
LRU writer can aggressively pound away at any of the buffers with a 0
usage count shortly after they get dirty, and that won't be inefficient
because there aren't large numbers of other clients using them.

Compare against this other sample:

usagecount | count | isdirty
------------+-------+---------
0 | 9093 | f
1 | 6702 | f
2 | 2267 | f
3 | 602 | f
4 | 428 | f
5 | 1388 | f

Notice that you have a much larger number of buffers where the usage count
is 4 or 5. The all-scan part of the 8.2 background writer will waste a
lot of writes when you have a profile that's more like this. If there
have been 4+ client backends touching the buffer recently, you'd be crazy
to write it out right now if you could instead be focusing on banging out
the ones where the usage count is 0. The 8.2 background writer would
write them out anyway, which meant that when you hit a checkpoint both the
OS and the controller cache were filled with such buffers before you even
started writing the checkpoint data. The new setup in 8.3 only worries
about the high usage count buffers when you hit a checkpoint, at which
point it streams them out over a longer, adjustable period (so as not to
spike the I/O more than necessary and block your readers) than the 8.2
design, which just dumped them all immediately.
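The usage_count mechanics behind all of this can be sketched in miniature. This is a toy model of the clock-sweep scheme being discussed, not the actual bufmgr code: each access bumps a buffer's count up to the cap of 5, the sweep decrements counts as it passes, and only buffers that reach zero become candidates for eviction or the LRU writer.

```python
# Toy model of the clock-sweep usage_count scheme (not the real bufmgr).
MAX_USAGE = 5

class Buffer:
    def __init__(self):
        self.usage_count = 0
        self.dirty = False

def touch(buf, dirty=False):
    """A backend uses the buffer: bump usage_count (capped) and maybe dirty it."""
    buf.usage_count = min(buf.usage_count + 1, MAX_USAGE)
    buf.dirty = buf.dirty or dirty

def sweep_for_victim(buffers, hand=0):
    """Clock sweep: decrement counts until a usage_count==0 buffer is found."""
    n = len(buffers)
    while True:
        buf = buffers[hand]
        if buf.usage_count == 0:
            return hand
        buf.usage_count -= 1
        hand = (hand + 1) % n

bufs = [Buffer() for _ in range(4)]
touch(bufs[0]); touch(bufs[0])   # hot buffer, usage_count == 2
touch(bufs[1], dirty=True)       # recently dirtied, usage_count == 1
# bufs[2] and bufs[3] are cold: the sweep reaches one of them first.
print(sweep_for_victim(bufs))    # 2
```

A hot buffer (count 4 or 5) survives several full sweeps before it can be recycled, which is exactly why writing it out eagerly tends to be wasted work.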

Just to be sure that I understand, are you saying it would be a bad scene if
the physical writes happened, or that the overhead of pushing them out to
the OS would be crippling?

If you have a lot of buffers where the usage_count data was high, it would
be problematic to write them out every time they were touched; odds are
good somebody else is going to dirty them again soon enough so why bother.
On your workload, that doesn't seem to be the case. But that is the
situation on some other test workloads, and balancing for that situation
has been central to the parts of the redesign I've been injecting
suggestions into. One of the systems I was tormented by had the
usagecount of 5 for >20% of the buffers in the cache under heavy load, and
had a physical write been executed every time one of those was touched
that would have been crippling (even if the OS was smart enough to cache
and therefore make some of the writes redundant, which is behavior I would
prefer not to rely on).

This contrib module seems pretty safe, patch and all. Does anyone think
there is significant risk to slipping it into the 8.2.4 database where we
have massive public exposure on the web site handling 2 million hits per
day?

I think it's fairly safe, and my patch was pretty small; just exposing
some data that nobody had been looking at before. Think how much easier
your life would have been when doing your earlier tuning if you were
looking at the data in these terms. Just be aware that running the query
is itself intensive and causes its own tiny hiccup in throughput every
time it executes, so you may want to consider this more of a snapshot you
run periodically to learn more about your data rather than something you
do very regularly.

I also think we need to somehow develop a set of tests which report
maximum response time on (what should be) fast queries while the
database is under different loads, so that those of us for whom reliable
response time is more important than maximum overall throughput are
protected from performance regressions.

My guess is that the DBT2 tests that Heikki has been running are more
complicated than you think they are; there are response time guarantee
requirements in there as well as the throughput numbers. The tests that I
run (which I haven't been publishing yet but will be with the final patch
soon) also report worst-case and 90-th percentile latency numbers as well
as TPS. A "regression" that improved TPS at the expense of those two
would not be considered an improvement by anyone involved here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#16Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#15)
Re: Final background writer cleanup for 8.3

On Sun, Aug 26, 2007 at 4:16 PM, in message

<Pine.GSO.4.64.0708261637030.3811@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:

On Sun, 26 Aug 2007, Kevin Grittner wrote:

usagecount | count | isdirty
------------+-------+---------
0 | 9093 | f
1 | 6702 | f
2 | 2267 | f
3 | 602 | f
4 | 428 | f
5 | 1388 | f

Notice that you have a much larger number of buffers where the usage count
is 4 or 5. The all-scan part of the 8.2 background writer will waste a
lot of writes when you have a profile that's more like this. If there
have been 4+ client backends touching the buffer recently, you'd be crazy
to write it out right now if you could instead be focusing on banging out
the ones where the usage count is 0.

Seems to me I'd be crazy to be writing out anything. Nothing's dirty.

In fact, I ran a simple query to count dirty pages once per second for a
minute and had three samples show any pages dirty. The highest count was 5.
Again, this was Sunday afternoon, which is not traditionally a busy time for
the courts. I'll try to get some more meaningful numbers tomorrow.

One of the systems I was tormented by had the
usagecount of 5 for >20% of the buffers in the cache under heavy load, and
had a physical write been executed every time one of those was touched
that would have been crippling (even if the OS was smart enough to cache
and therefore make some of the writes redundant, which is behavior I would
prefer not to rely on).

Why is that?

The tests that I
run (which I haven't been publishing yet but will be with the final patch
soon) also report worst-case and 90-th percentile latency numbers as well
as TPS. A "regression" that improved TPS at the expense of those two
would not be considered an improvement by anyone involved here.

Have you been able to create a test case which exposes the write-spike
problem under 8.2.4?

By the way, the 90th percentile metric isn't one I'll care a lot about.
In our environment any single instance of a "fast" query running slow is
considered a problem, and my job is to keep those users happy.

-Kevin

#17Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#15)
Re: Final background writer cleanup for 8.3

"Greg Smith" <gsmith@gregsmith.com> writes:

On Sun, 26 Aug 2007, Kevin Grittner wrote:

I also think we need to somehow develop a set of tests which report maximum
response time on (what should be) fast queries while the database is under
different loads, so that those of us for whom reliable response time is more
important than maximum overall throughput are protected from performance
regressions.

My guess is that the DBT2 tests that Heikki has been running are more
complicated than you think they are; there are response time guarantee
requirements in there as well as the throughput numbers. The tests that I run
(which I haven't been publishing yet but will be with the final patch soon)
also report worst-case and 90-th percentile latency numbers as well as TPS. A
"regression" that improved TPS at the expense of those two would not be
considered an improvement by anyone involved here.

TPCC requires that the 90th percentile response time be under 5s for most
transactions. It also requires that the average be less than the 90th
percentile which helps rule out circumstances where the longest 10% response
times are *much* longer than 5s.

However in practice neither of those requirements really rule out some pretty
bad behaviour as long as it's rare enough. Before the distributed checkpoint
patch went in we were finding 60s of zero activity at every checkpoint. But
there were so few transactions affected that in the big picture it didn't
impact the 90th percentile. It didn't even affect the 95th percentile. I think
you had to look at the 99th percentile before it even began to impact the
results.
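The point about percentiles hiding rare stalls is easy to verify numerically. A hypothetical illustration: with 1.5% of transactions stalled at 60 s and the rest fast, the 90th and 95th percentiles look fine and only the 99th begins to move.

```python
import math

def percentile(data, p):
    """Nearest-rank percentile of an already-sorted list."""
    k = math.ceil(p / 100 * len(data)) - 1
    return data[k]

# Hypothetical latency distribution: 98.5% of transactions take 0.1 s,
# 1.5% hit a 60 s checkpoint stall.
latencies = sorted([0.1] * 985 + [60.0] * 15)

print(percentile(latencies, 90))  # 0.1  -- the stall is invisible
print(percentile(latencies, 95))  # 0.1  -- still invisible
print(percentile(latencies, 99))  # 60.0 -- only here does it show up
```

This is why a 90th-percentile requirement alone can pass a system whose worst-case behavior is unacceptable to interactive users.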

I can't really imagine a web site operator being happy if he was told that
only 1% of users' clicks resulted in a browser timeout...

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#18Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#16)
Re: Final background writer cleanup for 8.3

On Sun, Aug 26, 2007 at 7:35 PM, in message

<46D1D601.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:

On Sun, Aug 26, 2007 at 4:16 PM, in message

<Pine.GSO.4.64.0708261637030.3811@westnet.com>, Greg Smith
<gsmith@gregsmith.com> wrote:
I'll try to get some more meaningful numbers tomorrow.

Well, I ran the query against the production web server 40 times, and the highest number I got for usagecount 5 dirty pages was in this sample:

usagecount | count | isdirty
------------+-------+---------
0 | 7358 | f
1 | 7428 | f
2 | 1938 | f
3 | 1311 | f
4 | 1066 | f
5 | 1097 | f
1 | 87 | t
2 | 62 | t
3 | 31 | t
4 | 11 | t
5 | 86 | t
| 5 |
(12 rows)

Most samples looked something like this:

usagecount | count | isdirty
------------+-------+---------
0 | 7981 | f
1 | 6584 | f
2 | 1975 | f
3 | 1063 | f
4 | 1366 | f
5 | 1294 | f
0 | 5 | t
1 | 83 | t
2 | 60 | t
3 | 19 | t
4 | 21 | t
5 | 28 | t
| 1 |
(13 rows)

The system can comfortably write out about 4,000 pages per second as long as the write cache doesn't get swamped, so in the worst case I caught it had 69 ms worth of work to do, if they were all physical writes (which, of course, is highly unlikely).
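The 69 ms figure follows directly from the worst-case sample above. A quick back-of-the-envelope check, using the dirty counts from that sample and the 4,000 pages/second sustained write rate:

```python
# Worst-case sample from the production box: dirty buffer counts by
# usage_count (the isdirty = t rows above).
dirty_counts = {1: 87, 2: 62, 3: 31, 4: 11, 5: 86}
pages_per_second = 4000  # the box's comfortable sustained write rate

total_dirty = sum(dirty_counts.values())          # 277 dirty pages
work_ms = total_dirty / pages_per_second * 1000   # time to flush them all
print(total_dirty, round(work_ms, 1))  # 277 69.2
```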

From shortly afterwards, possibly of interest:

postgres@ATHENA:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 3 20 402248 0 10538028 0 0 0 1 1 2 21 4 55 19
2 4 20 403116 0 10538028 0 0 5180 384 2233 9599 24 5 50 21
3 6 20 402868 0 10532888 0 0 4844 512 2841 14054 44 6 31 19
7 10 20 397908 0 10534944 0 0 6768 465 2674 11995 40 6 26 28
4 15 20 398016 0 10534944 0 0 3344 4703 2297 10578 34 7 13 46
0 22 20 405456 0 10534944 0 0 2464 4192 1785 6167 20 3 21 56
14 19 20 401852 0 10538028 0 0 3680 4704 2474 11779 29 5 12 54
17 13 20 401728 0 10532888 0 0 5504 1945 2554 21490 35 8 10 47
3 10 20 408176 0 10530832 0 0 11380 553 3907 15463 67 13 5 15
4 4 20 405572 0 10535972 0 0 8708 981 2904 12051 26 7 34 33
1 5 20 403588 0 10535972 0 0 5924 464 2589 12194 26 5 45 23
4 7 20 410780 0 10529804 0 0 6284 1163 2674 11830 33 8 35 24
3 13 20 402596 0 10526720 0 0 2424 6598 2441 10332 40 7 11 42
7 16 20 400736 0 10528776 0 0 3928 6784 2453 9852 26 6 26 42
19 14 20 405308 0 10524664 0 0 2272 4708 2208 8583 27 5 19 49
9 17 20 404580 0 10527748 0 0 7156 3560 3185 13203 55 11 3 32
1 11 20 406192 0 10531860 0 0 5112 3647 2758 11362 31 6 26 37
3 13 20 404464 0 10531860 0 0 4856 3426 2342 11077 24 5 35 36
2 13 20 403968 0 10530832 0 0 5308 4634 2762 15778 34 7 22 36
4 12 20 403472 0 10534944 0 0 2996 3766 2090 9331 20 4 34 42
0 5 20 412648 0 10522608 0 0 2364 5187 1816 5194 18 5 56 22
4 13 20 415376 0 10519524 0 0 2836 6172 1929 5075 25 6 26 43
27 16 20 413880 0 10522608 0 0 7892 2340 3325 19769 52 8 10 30
7 7 20 402340 0 10530832 0 0 7600 712 3511 16486 45 8 20 26
4 9 20 403704 0 10531860 0 0 7708 830 3133 16164 43 11 22 24
5 6 20 408416 0 10529804 0 0 6900 814 2703 10806 31 7 39 24
8 6 20 401844 0 10532888 0 0 6884 632 2993 13792 37 7 29 27
13 3 20 398868 0 10534944 0 0 7732 744 3443 14580 63 9 8 19
5 6 20 403580 0 10533916 0 0 6724 623 2905 11937 37 7 34 22
3 7 20 400728 0 10529804 0 0 6924 712 2746 12085 35 7 37 21
0 7 20 408664 0 10526720 0 0 6536 344 2562 10555 27 6 44 24
5 1 20 407796 0 10527748 0 0 4628 1000 2653 13092 41 7 37 15
7 9 20 400480 0 10529804 0 0 3364 744 2326 11198 35 7 40 18
3 4 20 406384 0 10531860 0 0 4044 904 2998 14055 60 9 16 14
18 5 20 397976 0 10525692 0 0 6000 671 3082 14058 55 10 15 20
11 6 20 410996 0 10528776 0 0 4828 3498 2768 13027 38 7 28 27
1 3 20 406416 0 10531860 0 0 4140 616 2496 11980 33 6 43 17

This box is a little beefier than the proposed test box, with eight 3 GHz Xeon MP CPUs and 12 GB of RAM. Other than telling PostgreSQL about the extra RAM in the effective cache size GUC, this box has the same postgresql.conf.

Other than cranking up the background writer settings this is the same box and configuration that stalled so badly that we were bombarded with user complaints.

-Kevin

#19Jan Wieck
JanWieck@Yahoo.com
In reply to: Greg Smith (#3)
Re: Final background writer cleanup for 8.3

On 8/24/2007 1:17 AM, Greg Smith wrote:

On Thu, 23 Aug 2007, Tom Lane wrote:

It is doubtless true in a lightly loaded system, but once the kernel is
under any kind of memory pressure I think it's completely wrong.

The fact that so many tests I've done or seen get maximum throughput in
terms of straight TPS with the background writer turned completely off is
why I stated that so explicitly. I understand what you're saying in terms
of memory pressure, all I'm suggesting is that the empirical tests suggest
the current background writer even with moderate improvements doesn't
necessarily help when you get there. If writes are blocking, whether the
background writer does them slightly ahead of time or whether the backend
does them itself doesn't seem to matter very much. On a heavily loaded
system, your throughput is bottlenecked at the disk either way--and
therefore it's all the more important in those cases to never do a write
until you absolutely have to, lest it be wasted.

Have you used something that, like a properly implemented TPC benchmark,
simulates users going through cycles of think time instead of hammering
SUT interactions at the maximum possible rate allowed by the
network latency? And do your tests consider any completed transaction a
good transaction, or are they like TPC benchmarks, which require the
majority of transactions to complete in a certain maximum response time?

Those tests will show you that inflicting an IO storm at checkpoint time
will delay processing enough to get a significant increase in the number
of concurrent transactions by giving the "users" time enough to come out
of their thinking time. That spike in active transactions increases
pressure on CPU, memory and IO ... and eventually leads to the situation
where users submit new transactions at a higher rate than you currently
can commit ... which is where you enter the spiral of death.
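The spiral described here can be shown with a toy queueing model. This is a hypothetical sketch, not from any benchmark kit: users submit a transaction and then "think" for a while; if the server stalls at a checkpoint, think times keep expiring in the meantime, so the pending queue spikes when service resumes.

```python
# Toy model of the checkpoint "spiral of death": N users with a fixed
# think time submit work; the server stalls for `stall` seconds and
# completes nothing in that window, so pending transactions pile up.
def backlog_after_stall(users, think_time, stall):
    """Transactions queued when service resumes, assuming each user whose
    think time expires during the stall submits once and then blocks."""
    fraction_awakened = min(stall / think_time, 1.0)  # each user queues at most once
    return int(users * fraction_awakened)

print(backlog_after_stall(users=500, think_time=10.0, stall=60.0))  # 500
print(backlog_after_stall(users=500, think_time=10.0, stall=2.0))   # 100
```

A 60-second stall wakes the entire user population at once, while a short hiccup only queues a small fraction of it; the spike in concurrent transactions is what then overloads CPU, memory, and I/O.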

Observing that very symptom during my TPC-W tests several years ago was
what led to developing the background writer in the first place. Can
your tests demonstrate improvements for this kind of (typical web
application) load profile?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#20Jan Wieck
JanWieck@Yahoo.com
In reply to: Heikki Linnakangas (#5)
Re: Final background writer cleanup for 8.3

On 8/24/2007 8:41 AM, Heikki Linnakangas wrote:

If anyone out there has a repeatable test case where bgwriter does help,
I'm all ears. The theory of moving the writes out of the critical path
does sound reasonable, so I'm sure there is test case to demonstrate the
effect, but it seems to be pretty darn hard to find.

One could try to dust off this TPC-W benchmark.

http://pgfoundry.org/projects/tpc-w-php/

Again, the original theory for the bgwriter wasn't moving writes out of
the critical path, but smoothing response times that tended to go
completely down the toilet during checkpointing, causing all the users
to wake up and overload the system entirely.

It is well known that any kind of bgwriter configuration other than OFF
does increase the total IO cost. But you will find that everyone who has
SLAs that define maximum response times will happily increase the IO
bandwidth to give an aggressively configured bgwriter room to work.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#21Greg Smith
gsmith@gregsmith.com
In reply to: Jan Wieck (#20)
#22Josh Berkus
josh@agliodbs.com
In reply to: Greg Smith (#21)
#23Greg Smith
gsmith@gregsmith.com
In reply to: Josh Berkus (#22)
#24Josh Berkus
josh@agliodbs.com
In reply to: Greg Smith (#23)
#25Greg Smith
gsmith@gregsmith.com
In reply to: Josh Berkus (#24)
#26Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#25)
#27Greg Smith
gsmith@gregsmith.com
In reply to: Kevin Grittner (#26)