RC2 and open issues
We are now packaging RC2. If nothing comes up after RC2 is released, we
can move to final release.
The open items list is attached. The doc changes can be easily
completed before final. The only code issue left is with bgwriter. We
always knew we needed to find better defaults for its parameters, but we
are only now finding more fundamental issues.
I think the summary I have seen recently pegs it right --- our use of %
of dirty buffers requires a scan of the entire buffer cache, and the
current delay of bgwriter is too high, but we can't lower it because the
buffer cache scan will become too expensive if done too frequently.
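To make the cost concrete, here is a toy model (plain Python, not PostgreSQL code; all names are mine) of why a percent-of-dirty-buffers target forces a full scan no matter how few pages end up written:

```python
# With bgwriter_percent defined as a percent of *dirty* buffers, the
# bgwriter must first walk the whole buffer list to count the dirty
# ones, so scan cost is O(shared_buffers) per round even when almost
# nothing needs writing.

def dirty_percent_pass(buffers, percent):
    """buffers: list of booleans, True = dirty.
    Returns (buffers examined, list of buffer indexes to write)."""
    dirty = [i for i, d in enumerate(buffers) if d]   # full scan, always
    n_write = len(dirty) * percent // 100
    return len(buffers), dirty[:n_write]

bufs = [False] * 9990 + [True] * 10     # 10000 buffers, only 10 dirty
examined, written = dirty_percent_pass(bufs, 50)
print(examined, len(written))            # 10000 5
```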
I think the ideal solution would be to remove bgwriter_percent or change
it to be a percentage of all buffers, not just dirty buffers, so we
don't have to scan the entire list. If we set the new value to 10% with
a delay of 1 second, and the bgwriter remembers the place it stopped
scanning the buffer cache, you will clean out the buffer cache
completely every 10 seconds.
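A toy simulation of that arithmetic (illustrative Python, not PostgreSQL source; the scanner names are mine): scan a fixed fraction of the whole buffer array per round, remembering where the previous round stopped, and every buffer is visited once per 1/fraction rounds.

```python
# Model of the proposal: 10% of all buffers per round, remembered
# position. At a 1-second delay, full coverage takes 10 seconds.

def make_scanner(n_buffers, fraction):
    chunk = max(1, int(n_buffers * fraction))
    pos = 0
    def scan_round(dirty):
        """Visit the next chunk of buffer slots, cleaning dirty ones."""
        nonlocal pos
        written = 0
        for _ in range(chunk):
            if pos in dirty:
                dirty.discard(pos)
                written += 1
            pos = (pos + 1) % n_buffers   # remembered position, wraps
        return written
    return scan_round

dirty = set(range(1000))           # pretend every buffer starts dirty
scan = make_scanner(1000, 0.10)    # visit 10% of all buffers per round
rounds = 0
while dirty:
    scan(dirty)
    rounds += 1
print(rounds)   # with a 1 s delay, full coverage every `rounds` seconds
```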
Right now it seems no one can find proper values. We were clear that
this was an issue, but it is bad news that we are only addressing it
during RC.
The 8.1 solution is to have some feedback system so writes by individual
backends cause the bgwriter to work more frequently.
The big question is what to do for RC2: do we just leave it as
suboptimal, knowing we will revisit it in 8.1, or try an incremental
solution for 8.0 that might work better?
We have to decide now.
---------------------------------------------------------------------------
PostgreSQL 8.0 Open Items
=========================
Current version at http://candle.pha.pa.us/cgi-bin/pgopenitems.
Changes
-------
* change bgwriter buffer scan behavior?
* adjust bgwriter defaults
Documentation
-------------
* synchronize supported encodings and docs
* improve external interfaces documentation section
* manual pages
Fixed Since Last Beta
---------------------
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think the ideal solution would be to remove bgwriter_percent or change
it to be a percentage of all buffers, not just dirty buffers, so we
don't have to scan the entire list. If we set the new value to 10% with
a delay of 1 second, and the bgwriter remembers the place it stopped
scanning the buffer cache, you will clean out the buffer cache
completely every 10 seconds.
But we don't *want* it to clean out the buffer cache completely.
There's no point in writing a "hot" page every few seconds. So I don't
think I believe in remembering where we stopped anyway.
I think there's a reasonable case to be made for redefining
bgwriter_percent as the max percent of the total buffer list to scan
(not the max percent of the list to return --- Jan correctly pointed out
that the latter is useless). Then we could modify
StrategyDirtyBufferList so that the percent and maxpages parameters are
passed in, so it can stop as soon as either one is satisfied. This
would be a fairly small/safe code change and I wouldn't have a problem
doing it even at this late stage of the cycle.
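A minimal model of that stopping rule, assuming the simplified semantics described above (toy Python, not the real StrategyDirtyBufferList() C code; names are mine):

```python
# Scan the buffer list from the LRU end and stop as soon as EITHER the
# scan has covered `percent` of the total list OR `maxpages` dirty
# buffers have been collected. Both limits cap the work done per round.

def collect_dirty(buffers, percent, maxpages):
    """buffers: list of booleans, True = dirty, index 0 = LRU end."""
    scan_limit = len(buffers) * percent // 100
    to_write = []
    for i in range(scan_limit):            # cap on scanning work
        if buffers[i]:
            to_write.append(i)
            if len(to_write) >= maxpages:  # cap on write work
                break
    return to_write

# 100 buffers, every other one dirty, scan at most 50%, write at most 10.
bufs = [i % 2 == 0 for i in range(100)]
pages = collect_dirty(bufs, percent=50, maxpages=10)
print(len(pages))   # stops at 10 dirty pages, before hitting 50% scanned
```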
However ... we would have to crank up the default bgwriter_percent,
and I don't know if we have any better idea what to set it to after
such a change than we do now ...
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I think the ideal solution would be to remove bgwriter_percent or change
it to be a percentage of all buffers, not just dirty buffers, so we
don't have to scan the entire list. If we set the new value to 10% with
a delay of 1 second, and the bgwriter remembers the place it stopped
scanning the buffer cache, you will clean out the buffer cache
completely every 10 seconds.
But we don't *want* it to clean out the buffer cache completely.
You are only cleaning it out in pieces over a 10-second period, so it
keeps getting dirty in between; you are not scanning the entire buffer
cache at one time.
There's no point in writing a "hot" page every few seconds. So I don't
think I believe in remembering where we stopped anyway.
I was thinking that if you are doing this scanning every X milliseconds,
then after a while the front of the buffer cache will be mostly clean and
the end will be dirty, so you will always be going over the same early
ones to get to the later dirty ones. Remembering the location gives the
scan more uniform coverage of the buffer cache.
You need a "clock sweep" like BSD uses (and probably others).
I think there's a reasonable case to be made for redefining
bgwriter_percent as the max percent of the total buffer list to scan
(not the max percent of the list to return --- Jan correctly pointed out
that the latter is useless). Then we could modify
StrategyDirtyBufferList so that the percent and maxpages parameters are
passed in, so it can stop as soon as either one is satisfied. This
would be a fairly small/safe code change and I wouldn't have a problem
doing it even at this late stage of the cycle.
However ... we would have to crank up the default bgwriter_percent,
and I don't know if we have any better idea what to set it to after
such a change than we do now ...
Once we make the change we will have to get our testers working on it.
We need those figures to change over time based on backends doing writes,
but that isn't going to happen for 8.0.
Bruce Momjian <pgman@candle.pha.pa.us> writes:
You need a "clock sweep" like BSD uses (and probably others).
No, that's *fundamentally* wrong.
The reason we are going to the trouble of maintaining a complicated
cache algorithm like ARC is so that we can tell the heavily used pages
from the lesser used ones. To throw away that knowledge in favor of
doing I/O with a plain clock sweep algorithm is just wrong.
What's more, I don't even understand what clock sweep would mean given
that the ordering of the list is constantly changing.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I am confused. If we change the percentage to be X% of the entire
buffer cache, and we set it to 1%, and we exit when either the dirty
pages or % are reached, don't we end up just scanning the first 1% of
the cache over and over again?
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
I am confused. If we change the percentage to be X% of the entire
buffer cache, and we set it to 1%, and we exit when either the dirty
pages or % are reached, don't we end up just scanning the first 1% of
the cache over and over again?
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
So we are not scanning by buffer address but using the LRU list? Are we
sure they are mostly dirty?
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom Lane wrote:
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
So we are not scanning by buffer address but using the LRU list? Are we
sure they are mostly dirty?
No. The entire point is to keep the LRU end of the list mostly clean.
Now that you mention it, it might be interesting to try the approach of
doing a clock scan on the buffer array and ignoring the ARC lists
entirely. That would be a fundamentally different way of envisioning
what the bgwriter is supposed to do, though. I think the main reason
Jan didn't try that was he wanted to be sure the LRU page was usually
clean so that backends would seldom end up doing writes for themselves
when they needed to get a free buffer.
Maybe we need a hybrid approach: clean a few percent of the LRU end of
the ARC list in order to keep backends from blocking on writes, plus run
a clock scan to keep checkpoints from having to do much. But that's way
beyond what we have time for in the 8.0 cycle.
regards, tom lane
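A rough sketch of the hybrid idea above, purely illustrative (the helper names and numbers are invented, not PostgreSQL code): each round first cleans a small slice at the LRU end of the replacement list, so backends rarely hit a dirty victim page, then advances an independent clock hand over the raw buffer array, so checkpoints find less to do.

```python
# Hybrid bgwriter round: goal 1 is a small LRU-tail clean, goal 2 is a
# slow clock sweep over the buffer array. Toy model only.

def hybrid_round(lru_order, dirty, clock_state, lru_slice=2, clock_step=4):
    written = []
    for buf in lru_order[:lru_slice]:        # goal 1: clean LRU tail
        if buf in dirty:
            dirty.discard(buf)
            written.append(buf)
    hand, n = clock_state
    for _ in range(clock_step):              # goal 2: slow clock sweep
        if hand in dirty:
            dirty.discard(hand)
            written.append(hand)
        hand = (hand + 1) % n
    clock_state[0] = hand                    # remember the clock hand
    return written

dirty = set(range(8))             # all 8 toy buffers start dirty
lru = [7, 6, 5, 4, 3, 2, 1, 0]    # index 0 of this list = LRU end
state = [0, 8]                    # [clock hand position, buffer count]
hybrid_round(lru, dirty, state)
print(sorted(dirty))              # buffers cleaned from both ends
```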
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom Lane wrote:
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
So we are not scanning by buffer address but using the LRU list? Are we
sure they are mostly dirty?
No. The entire point is to keep the LRU end of the list mostly clean.
Now that you mention it, it might be interesting to try the approach of
doing a clock scan on the buffer array and ignoring the ARC lists
entirely. That would be a fundamentally different way of envisioning
what the bgwriter is supposed to do, though. I think the main reason
Jan didn't try that was he wanted to be sure the LRU page was usually
clean so that backends would seldom end up doing writes for themselves
when they needed to get a free buffer.
Maybe we need a hybrid approach: clean a few percent of the LRU end of
the ARC list in order to keep backends from blocking on writes, plus run
a clock scan to keep checkpoints from having to do much. But that's way
beyond what we have time for in the 8.0 cycle.
OK, so we scan from the end of the LRU. If we scan X% and find _no_
dirty buffers perhaps we should start where we left off last time.
If we don't start where we left off, I am thinking if you do a lot of
writes then do nothing, the next checkpoint would be huge because a lot
of the LRU will be dirty because the bgwriter never got to it.
On Mon, 20 Dec 2004, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom Lane wrote:
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
So we are not scanning by buffer address but using the LRU list? Are we
sure they are mostly dirty?
No. The entire point is to keep the LRU end of the list mostly clean.
Now that you mention it, it might be interesting to try the approach of
doing a clock scan on the buffer array and ignoring the ARC lists
entirely. That would be a fundamentally different way of envisioning
what the bgwriter is supposed to do, though. I think the main reason
Jan didn't try that was he wanted to be sure the LRU page was usually
clean so that backends would seldom end up doing writes for themselves
when they needed to get a free buffer.
Neil and I spoke with Jan briefly last week and he mentioned a few
different approaches he'd been tossing over. Firstly, for alternative
runs, start X% on from the LRU, so that we aren't scanning clean buffers
all the time. Secondly, follow something like the approach you've
mentioned above but remember the offset. So, if we're scanning 10%, after
10 runs we will have written out all buffers.
I was also thinking of benchmarking the effect of changing the algorithm
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
T1 first then T2 and vice versa would have on performance. I haven't
thought about this too hard, though, so it might be wrong headed.
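The ordering question can be seen in a toy model (invented names, not the real ARC queues): with a cap on pages collected per round, interleaving T1 and T2 spreads the writes across both lists, while visiting one list first can exhaust the budget before the other list is touched.

```python
# Collect at most `maxpages` buffers from two queues, either
# alternating between them or draining one first. Toy lists of equal
# length; the real code must also handle lists of different lengths.

def interleaved(t1, t2, maxpages):
    out = []
    for a, b in zip(t1, t2):         # one from each list per iteration
        for buf in (a, b):
            if len(out) < maxpages:
                out.append(buf)
    return out

def t1_first(t1, t2, maxpages):
    return (t1 + t2)[:maxpages]

t1 = ["t1a", "t1b", "t1c"]
t2 = ["t2a", "t2b", "t2c"]
print(interleaved(t1, t2, 4))   # writes spread across both lists
print(t1_first(t1, t2, 4))      # budget mostly spent on T1
```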
Maybe we need a hybrid approach: clean a few percent of the LRU end of
the ARC list in order to keep backends from blocking on writes, plus run
a clock scan to keep checkpoints from having to do much. But that's way
beyond what we have time for in the 8.0 cycle.
Definitely.
regards, tom lane
Thanks,
Gavin
Gavin Sherry wrote:
Neil and I spoke with Jan briefly last week and he mentioned a few
different approaches he'd been tossing over. Firstly, for alternative
runs, start X% on from the LRU, so that we aren't scanning clean buffers
all the time. Secondly, follow something like the approach you've
mentioned above but remember the offset. So, if we're scanning 10%, after
10 runs we will have written out all buffers.
I was also thinking of benchmarking the effect of changing the algorithm
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
T1 first then T2 and vice versa would have on performance. I haven't
thought about this too hard, though, so it might be wrong headed.
So we are all thinking in the same direction. We might have only a few
days to finalize this before final release.
Gavin Sherry <swm@linuxworld.com.au> writes:
I was also thinking of benchmarking the effect of changing the algorithm
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
T1 first then T2 and vice versa would have on performance.
Looking at StrategyGetBuffer, it definitely seems like a good idea to
try to keep the bottom end of both T1 and T2 lists clean. But we should
work at T1 a bit harder.
The insight I take away from today's discussion is that there are two
separate goals here: try to keep backends that acquire a buffer via
StrategyGetBuffer from being fed a dirty buffer they have to write,
and try to keep the next upcoming checkpoint from having too much work
to do. Those are both laudable goals but I hadn't really seen before
that they may require different strategies to achieve. I'm liking the
idea that bgwriter should alternate between doing writes in pursuit of
the one goal and doing writes in pursuit of the other.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote on 21.12.2004, 07:32:52:
Gavin Sherry writes:
I was also thinking of benchmarking the effect of changing the algorithm
"changing the algorithm" is a phrase that sends shivers up my spine. My
own preference is towards some change, but as minimal as possible.
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
T1 first then T2 and vice versa would have on performance.
Looking at StrategyGetBuffer, it definitely seems like a good idea to
try to keep the bottom end of both T1 and T2 lists clean. But we should
work at T1 a bit harder.
The insight I take away from today's discussion is that there are two
separate goals here: try to keep backends that acquire a buffer via
StrategyGetBuffer from being fed a dirty buffer they have to write,
and try to keep the next upcoming checkpoint from having too much work
to do. Those are both laudable goals but I hadn't really seen before
that they may require different strategies to achieve. I'm liking the
idea that bgwriter should alternate between doing writes in pursuit of
the one goal and doing writes in pursuit of the other.
Agreed: there are two different goals for buffer list management.
I like the way the current algorithm searches both T1 and T2 in
parallel, since that works no matter how long each list is. Always
cleaning one list in preference to the other would not work well since
ARC fluctuates. At any point in time, cleaning one list will have more
benefit than cleaning the other, but which one is best switches
backwards and forwards as ARC fluctuates.
Perhaps the best way would be to concentrate on the list that, at this
point in time, is the one that needs to be cleanest. I *think* that
means we should concentrate on the LRU of the *longest* list, since
that is the direction in which ARC is trying to move (I agree that
seems counter-intuitive: but a few pairs of eyes should confirm which
way round it is).
By observation, DBT2 ends up with T2 >> T1, but that is a result of its
fairly static nature. i.e. DBT2 would benefit from T2 LRU cleaning.
ISTM it would be good to have:
1) very frequent, but small cleaning action on the lists, say every 50ms
to avoid backends having to write a buffer
2) less frequent, deeper cleaning actions, to minimise the effect of
checkpoints, which could be done every 10th cycle e.g. 500ms
(numbers would vary according to workload...)
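The two-tier cadence in miniature (numbers are the examples above, not tuned values; the function is mine):

```python
# Every cycle (~50 ms) gets a shallow LRU-end clean so backends rarely
# write; every 10th cycle (~500 ms) gets a deeper pass to blunt
# checkpoint spikes.

def plan_cycles(n_cycles, deep_every=10):
    plan = []
    for cycle in range(1, n_cycles + 1):
        if cycle % deep_every == 0:
            plan.append("deep")     # larger scan, checkpoint smoothing
        else:
            plan.append("shallow")  # small LRU-end clean
    return plan

plan = plan_cycles(20)
print(plan.count("shallow"), plan.count("deep"))  # 18 2
```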
But, like I said: change, but minimal change seems best to me for now.
Best Regards, Simon Riggs
Tom Lane <tgl@sss.pgh.pa.us> wrote on 21.12.2004, 05:05:36:
Bruce Momjian writes:
I am confused. If we change the percentage to be X% of the entire
buffer cache, and we set it to 1%, and we exit when either the dirty
pages or % are reached, don't we end up just scanning the first 1% of
the cache over and over again?
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
I see the buffer list as a conveyor belt that carries unneeded blocks
away from the MRU. Cleaning near the LRU (I agree: How near?) should be
all that is sufficient to keep the list clean.
Cleaning the first 1% "over and over again" makes it sound like it is
the same list of blocks that are being cleaned. It may be the same
linked list data structure, but that is dynamically changing to contain
completely different blocks from the last time you looked.
Best Regards, Simon Riggs
If we don't start where we left off, I am thinking if you do a lot of
writes then do nothing, the next checkpoint would be huge because a lot
of the LRU will be dirty because the bgwriter never got to it.
I think the problem is that we can't tell whether a "read hot"
page is also "write hot". We would want to write dirty "read hot" pages,
but not "write hot" pages. It does not make sense to write a "write hot"
page since it will be dirty again when the checkpoint comes.
Andreas
simon@2ndquadrant.com wrote:
Tom Lane <tgl@sss.pgh.pa.us> wrote on 21.12.2004, 05:05:36:
Bruce Momjian writes:
I am confused. If we change the percentage to be X% of the entire
buffer cache, and we set it to 1%, and we exit when either the dirty
pages or % are reached, don't we end up just scanning the first 1% of
the cache over and over again?
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
I see the buffer list as a conveyor belt that carries unneeded blocks
away from the MRU. Cleaning near the LRU (I agree: How near?) should be
all that is sufficient to keep the list clean.
Cleaning the first 1% "over and over again" makes it sound like it is
the same list of blocks that are being cleaned. It may be the same
linked list data structure, but that is dynamically changing to contain
completely different blocks from the last time you looked.
However, one thing you can say is that if block B hasn't been written to
since you last checked, then any blocks older than that haven't been
written to either. Of course, the problem is in finding block B again
without re-scanning from the LRU end.
Is there any non-intrusive way we could add a "bookmark" into the
conveyor-belt? (mixing my metaphors again :-) Any blocks written to
would move up the cache, effectively moving the bookmark lower. Enough
activity would cause the bookmark to drop off the end. If that isn't the
case though, we know we can safely skip any blocks older than the bookmark.
--
Richard Huxton
Archonet Ltd
On Mon, Dec 20, 2004 at 11:20:46PM -0500, Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom Lane wrote:
Exactly. But 1% would be uselessly small with this definition. Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
So we are not scanning by buffer address but using the LRU list? Are we
sure they are mostly dirty?
No. The entire point is to keep the LRU end of the list mostly clean.
Now that you mention it, it might be interesting to try the approach of
doing a clock scan on the buffer array and ignoring the ARC lists
entirely. That would be a fundamentally different way of envisioning
what the bgwriter is supposed to do, though. I think the main reason
Jan didn't try that was he wanted to be sure the LRU page was usually
clean so that backends would seldom end up doing writes for themselves
when they needed to get a free buffer.
Maybe we need a hybrid approach: clean a few percent of the LRU end of
the ARC list in order to keep backends from blocking on writes, plus run
a clock scan to keep checkpoints from having to do much. But that's way
beyond what we have time for in the 8.0 cycle.
regards, tom lane
I have not had a chance to investigate, but there is a modification of
the ARC cache strategy called CAR that replaces the LRU linked lists
with the clock approximation to the LRU lists. This algorithm is virtually
identical to the current ARC but reduces the contention at the MRU end
of the lists. This may dovetail nicely into your idea of a "clock" bgwriter
functionality as well as help with the cache-line performance problem.
Yours,
Ken Marshall
Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
I was also thinking of benchmarking the effect of changing the algorithm
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
T1 first then T2 and vice versa would have on performance.
Looking at StrategyGetBuffer, it definitely seems like a good idea to
try to keep the bottom end of both T1 and T2 lists clean. But we should
work at T1 a bit harder.
The insight I take away from today's discussion is that there are two
separate goals here: try to keep backends that acquire a buffer via
StrategyGetBuffer from being fed a dirty buffer they have to write,
and try to keep the next upcoming checkpoint from having too much work
to do. Those are both laudable goals but I hadn't really seen before
that they may require different strategies to achieve. I'm liking the
idea that bgwriter should alternate between doing writes in pursuit of
the one goal and doing writes in pursuit of the other.
It seems we have added a new limitation to bgwriter by not doing a full
scan. With a full scan we could easily grab the first X pages starting
from the end of the LRU list and write them. By not scanning the full
list we are opening the possibility of not seeing some of the front-most
LRU dirty pages. And the full scan was removed so we can run bgwriter
more frequently, but we might end up with other problems.
I have a new proposal. The idea is to cause bgwriter to increase its
frequency based on how quickly it finds dirty pages.
First, we remove the GUC bgwriter_maxpages because I don't see a good
way to set a default for that. A default value needs to be based on a
percentage of the full buffer cache size. Second, we make
bgwriter_percent cause the bgwriter to stop its scan once it has found a
number of dirty buffers that matches X% of the buffer cache size. So,
if it is set to 5%, the bgwriter scan stops once it finds enough dirty
buffers to equal 5% of the buffer cache size.
Bgwriter continues to scan starting from the end of the LRU list, just
like it does now.
Now, to control the bgwriter frequency, we multiply the percent of the
list it had to scan by the bgwriter_delay value to determine when to run
bgwriter next. For example, if you find enough dirty pages by looking
at only 10% of the buffer cache, you multiply 10% (0.10) * bgwriter_delay
and that is when you run next. If you have to scan 50%, bgwriter runs
next at 50% (0.50) * bgwriter_delay, and if it has to scan the entire
list it is 100% (1.00) * bgwriter_delay.
What this does is to cause bgwriter to run more frequently when there
are a lot of dirty buffers on the end of the LRU _and_ when the bgwriter
scan will be quick. When there are few writes, bgwriter will run less
frequently but will write dirty buffers nearer to the head of the LRU.
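The proposed scheduling rule reduces to one multiplication; a sketch (the helper name is hypothetical, and 200 ms is just an example delay value):

```python
# Next sleep = fraction of the buffer list scanned this round times
# bgwriter_delay: a quick scan (lots of dirty pages near the LRU end)
# means the bgwriter runs again soon; a full scan means a full delay.

def next_delay_ms(buffers_scanned, total_buffers, bgwriter_delay_ms=200):
    fraction = buffers_scanned / total_buffers
    return fraction * bgwriter_delay_ms

# Enough dirty pages found after scanning 10% of the cache:
print(next_delay_ms(100, 1000))    # 20.0  -> run again soon
# Had to scan the whole list to find them:
print(next_delay_ms(1000, 1000))   # 200.0 -> full delay
```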
Richard Huxton <dev@archonet.com> writes:
However, one thing you can say is that if block B hasn't been written to
since you last checked, then any blocks older than that haven't been
written to either.
[ itch... ] Can you? I don't recall exactly when a block gets pushed
up the ARC list during a ReadBuffer/WriteBuffer cycle, but at the very
least I'd have to say that this assumption is vulnerable to race
conditions.
Also, the cntxDirty mechanism allows a block to be dirtied without
changing the ARC state at all. I am not very clear on whether Vadim
added that mechanism just for performance or because there were
fundamental deadlock issues without it; but in either case we'd have
to think long and hard about taking it out for the bgwriter's benefit.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
First, we remove the GUC bgwriter_maxpages because I don't see a good
way to set a default for that. A default value needs to be based on a
percentage of the full buffer cache size.
This is nonsense. The admin knows what he set shared_buffers to, and so
maxpages and percent of shared buffers are not really distinct ways of
specifying things. The cases that make a percent spec useful are if
(a) it is a percent of a non-constant number (eg, percent of total dirty
pages as in the current code), or (b) it is defined in a way that lets
it limit the amount of scanning work done (which it isn't useful for in
the current code). But a maxpages spec is useful for (b) too. More to
the point, maxpages is useful to set a hard limit on the amount of I/O
generated by the bgwriter, and I think people will want to be able to do
that.
Now, to control the bgwriter frequency we multiply the percent of the
list it had to span by the bgwriter_delay value to determine when to run
bgwriter next.
I'm less than enthused about this. The idea of the bgwriter is to
trickle out writes in a way that doesn't affect overall performance too
much. Not to write everything in sight at any cost.
I like the hybrid "keep the bottom of the ARC list clean, plus do a slow
clock scan on the main buffer array" approach better. I can see that
that directly impacts both of the goals that the bgwriter has. I don't
see how a variable I/O rate really improves life on either score; it
just makes things harder to predict.
regards, tom lane
A quick $0.02 on how DB2 does this (at least in 7.x).
They used a combination of everything that's been discussed. The first
priority of their background writer was to keep the LRU end of the cache
free so individual backends would never have to wait to get a page.
Then, they would look to pages that had been dirty for 'a long time',
which was user configurable. Pages older than this setting were
candidates to be written out even if they weren't close to LRU. Finally,
I believe there were also settings for how often the writer would fire
up, and how much work it would do at once.
I agree that the first priority should be to keep clean pages near LRU,
but that you also don't want to get hammered at checkpoint time. I think
what might be interesting to consider is keeping a list of dirty pages,
which would remove the need to scan a very large buffer. Of course, in
an environment with a heavy update load, it could be better to just
scan the buffers, especially if you don't do a clock-sweep but instead
look at where the last page you wrote out has ended up in the LRU list
since you last ran, and start scanning from there (by definition
everything after that page would have to be clean). Of course this is
just conjecture on my part and would need testing to verify, and it's
obviously beyond the scope of 8.0.
As for 8.0, I suspect at this point it's probably best to just go with
whatever method has the smallest amount of code impact unless it's
inherently broken.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"