our buffer replacement strategy is kind of lame
While I was poking around at the index-only scans patch, I couldn't
help noticing that our buffer replacement algorithm leaves something
to be desired. Here's the query again:
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
As sample_data is not indexed, it sequential scans that table and does
an index-only scan of pgbench_accounts for each aid. When this is
done, here's what pg_buffercache has to say:
rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
usagecount | sum
------------+-------
1 | 136
2 | 12
3 | 72
4 | 7
5 | 13755
| 37218
(6 rows)
Only 96 of the 14286 buffers in sample_data are in shared_buffers,
despite the fact that we have 37,218 *completely unused* buffers lying
around. That sucks, because it means that the sample query did a
whole lot of buffer eviction that was completely needless. The ring
buffer is there to prevent trashing the shared buffer arena, but here
it would have been fine to trash the arena: there wasn't anything
there we would have minded losing (or, indeed, anything at all). On
the other hand, the buffer manager has *no problem at all* trashing
the buffer arena if we're faulting in pages for an index scan rather
than a sequential scan. If you manage to get all of sample_data into
memory (by running many copies of the above query in parallel, you can
get each one to allocate its own ring buffer, and eventually pull in
all the pages), and then run some query that probes an index which is
too large to fit in shared_buffers, it cheerfully blows the whole
sample_data table out without a second thought. Had you sequentially
scanned a big table, of course, there would be some protection, but an
index scan can stomp all over everything with complete impunity.
The general problem here is that we are not very smart about handling
workloads with weak locality - i.e. the working set is larger than
shared buffers. If the working set fits in shared_buffers, we will
keep it there, and it will be strongly protected against eviction.
But if the working set *doesn't* fit in shared buffers, then we fall
back on some not-very-smart heuristics to decide what to do: keep
pages involved in index scans, ditch those involved in sequential
scans.
It seems to me that we need a smarter algorithm. It seems that NetBSD
has adopted the use of Clock-Pro, and there are some discussions of it
being used in Linux as well, though I'm not sure whether that's
actually (still?) the case. Paper is here, although I find the
description of the algorithm to be approximately clear as mud:
http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf
One of the important notions which many of the algorithms in the
literature seem to hit on in one way or another is that you mustn't go
and decide that all of your buffers - or nearly all your buffers - are
hot. That's really equivalent to saying that none of your buffers are
hot; and it's also pretty easy to see that such a scenario would be
disastrous for our current algorithm, which - if everything in
shared_buffers has usage-count 5 all the time - will have to clock
sweep a long way each time it needs to allocate a buffer. In fact,
the more of your buffers want to be hot (and thus hard to evict), the
fewer of them should actually be allowed to acquire that status.
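To put numbers on that worst case, here is a toy model of the sweep (invented names, not the actual freelist.c code): with every buffer at the maximum usage count, the hand has to make five full revolutions before anything becomes evictable.

```c
#define NBUFFERS 8
#define MAX_USAGE_COUNT 5

static int usage_count[NBUFFERS];
static int next_victim = 0;

/* Miniature clock sweep: decrement usage_count as the hand passes and
 * evict the first buffer found at zero.  *visited counts how many
 * buffers the hand passed over. */
static int clock_sweep(int *visited)
{
    *visited = 0;
    for (;;)
    {
        int buf = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;
        (*visited)++;
        if (usage_count[buf] == 0)
            return buf;
        usage_count[buf]--;     /* gets one more trip around the clock */
    }
}

/* Worst case: everything is "hot", so the sweep must visit
 * MAX_USAGE_COUNT * NBUFFERS + 1 buffers before evicting anything. */
int worst_case_visits(void)
{
    int visited;

    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = MAX_USAGE_COUNT;
    next_victim = 0;
    clock_sweep(&visited);
    return visited;
}
```

With 8 buffers all pinned up to 5, the sweep touches 41 buffers to find one victim; scale NBUFFERS up to shared_buffers and the problem is obvious.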
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
usagecount | sum
------------+-------
1 | 136
2 | 12
3 | 72
4 | 7
5 | 13755
| 37218
(6 rows)
Only 96 of the 14286 buffers in sample_data are in shared_buffers,
despite the fact that we have 37,218 *completely unused* buffers lying
around. That sucks, because it means that the sample query did a
whole lot of buffer eviction that was completely needless. The ring
buffer is there to prevent trashing the shared buffer arena, but here
it would have been fine to trash the arena: there wasn't anything
there we would have minded losing (or, indeed, anything at all).
You're missing an important point. The SeqScan is measurably faster
when using the ring buffer because of the effects of L2 caching on
the buffers.
Also, your logic is a little off, since you're doing the scan on an
otherwise quiet machine soon after startup and there are some
completely unused buffers. That isn't the typical state of the buffer
cache and so spoiling the cache is not acceptable in the general case.
The above case doesn't suck because it carefully reserved parts of
shared buffers for other users when spoiling the cache would be of no
benefit to the user, so it worked great from my perspective.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On
the other hand, the buffer manager has *no problem at all* trashing
the buffer arena if we're faulting in pages for an index scan rather
than a sequential scan. If you manage to get all of sample_data into
memory (by running many copies of the above query in parallel, you can
get each one to allocate its own ring buffer, and eventually pull in
all the pages), and then run some query that probes an index which is
too large to fit in shared_buffers, it cheerfully blows the whole
sample_data table out without a second thought. Had you sequentially
scanned a big table, of course, there would be some protection, but an
index scan can stomp all over everything with complete impunity.
That's a good observation and I think we should do this
* Make an IndexScan use a ring buffer once it has used 32 blocks. The
vast majority won't do that, so we avoid overhead on the common path.
* Make a BitmapIndexScan use a ring buffer when we know that the
index is larger than 32 blocks. (Ignore upper parts of tree for that
calc).
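The first rule might be sketched along these lines (names invented for illustration, not actual executor code): track blocks touched per scan and flip to the ring strategy only past the threshold, so short index scans pay nothing.

```c
#include <stdbool.h>

#define INDEX_SCAN_RING_THRESHOLD 32

typedef struct IndexScanBufState
{
    int  blocks_used;
    bool use_ring;      /* allocate from a small ring buffer? */
} IndexScanBufState;

/* Called once per block the index scan reads. */
static void index_scan_note_block(IndexScanBufState *state)
{
    state->blocks_used++;
    if (!state->use_ring && state->blocks_used > INDEX_SCAN_RING_THRESHOLD)
        state->use_ring = true;     /* switch strategies from here on */
}

/* Does a scan that reads nblocks end up on the ring strategy? */
bool uses_ring_after(int nblocks)
{
    IndexScanBufState state = {0, false};

    for (int i = 0; i < nblocks; i++)
        index_scan_note_block(&state);
    return state.use_ring;
}
```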
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
The general problem here is that we are not very smart about handling
workloads with weak locality - i.e. the working set is larger than
shared buffers. If the working set fits in shared_buffers, we will
keep it there, and it will be strongly protected against eviction.
But if the working set *doesn't* fit in shared buffers, then we fall
back on some not-very-smart heuristics to decide what to do: keep
pages involved in index scans, ditch those involved in sequential
scans.
It seems to me that we need a smarter algorithm. It seems that NetBSD
has adopted the use of Clock-Pro, and there are some discussions of it
being used in Linux as well, though I'm not sure whether that's
actually (still?) the case. Paper is here, although I find the
description of the algorithm to be approximately clear as mud:
http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf
One of the important notions which many of the algorithms in the
literature seem to hit on in one way or another is that you mustn't go
and decide that all of your buffers - or nearly all your buffers - are
hot. That's really equivalent to saying that none of your buffers are
hot; and it's also pretty easy to see that such a scenario would be
disastrous for our current algorithm, which - if everything in
shared_buffers has usage-count 5 all the time - will have to clock
sweep a long way each time it needs to allocate a buffer. In fact,
the more of your buffers want to be hot (and thus hard to evict), the
fewer of them should actually be allowed to acquire that status.
This is something I've been actively working on. I was on the trail of
possible reasons that increasing the size of shared buffers would slow
down PostgreSQL.
The worst case behaviour of the current freelist code is that it can
take up to 5 * shared_buffers checks before identifying a victim
buffer. That occurs when we have a working set exactly matching size
of shared buffers. There are problems when large portions of shared
buffers are very popular, so that the free-ish buffers are a small
proportion of the total buffers - this case gets worse if the
distribution of buffer allocations is non-random so you get say 10
free-ish blocks together, followed by N-10 very popular ones. That
makes every 11th freelist request take ~O(shared_buffers). These
problems will show themselves as sporadic BufFreelistLock contention.
The sporadic nature is the problem here, since it may make the contention
hard to measure in aggregate yet still have a real impact on performance.
For us to benefit from increased shared_buffers we need to have an
algorithm that is O(k) in its worst case behaviour and O(1) in its
best case behaviour - the latter aspect is provided by a correctly
functioning bgwriter. At the moment, we do nothing to enforce O(k)
behaviour.
The following patch implements O(k) behaviour using a heuristic limit.
My first attempt was a fixed heuristic limit, but that was less
suitable because there are actually 2 requirements. The value of k
must be small to have a measurable impact on the smoothness of
StrategyGetBuffer(), but k must also be fairly large to ensure the
choice of victim is a relatively good one. So the way to balance those
conflicting objectives is to have a progressively increasing search
window. And just to repeat, k is not a % of shared buffers, which would
still give O(N) behaviour.
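The patch itself isn't reproduced here, but the shape of such a bounded sweep might look like this (purely illustrative details, not the actual patch): examine at most k buffers, and if no count-zero buffer turns up inside the window, settle for the least-used buffer seen rather than sweeping on.

```c
#define NBUFFERS 16

static int usage_count[NBUFFERS];
static int next_victim = 0;

/* Examine at most k buffers; evict the first usage_count-zero buffer,
 * or failing that the least-used buffer inside the window.  Worst case
 * is O(k), not O(shared_buffers). */
static int bounded_sweep(int k)
{
    int best = -1;
    int best_count = 0;

    for (int seen = 0; seen < k; seen++)
    {
        int buf = next_victim;

        next_victim = (next_victim + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;                 /* ideal victim in the window */
        if (best < 0 || usage_count[buf] < best_count)
        {
            best = buf;
            best_count = usage_count[buf];
        }
        usage_count[buf]--;
    }
    return best;    /* settle for the coldest buffer in the window */
}

/* Demo: all buffers hot except one lukewarm one; a window of 8 finds
 * the lukewarm buffer instead of sweeping all 16 five times over. */
int demo_bounded_sweep(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = 5;
    usage_count[3] = 1;
    next_victim = 0;
    return bounded_sweep(8);
}
```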
I've not been reading "the literature", given the problems we had in
2004/5 regarding patents in this area. I also think that since we rely
on the underlying filesystem for caching that we don't have exactly
the same problem as other systems.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
freelist_ok.v2.patch (application/octet-stream, +25 −6)
On Fri, Aug 12, 2011 at 11:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
I've not been reading "the literature", given the problems we had in
2004/5 regarding patents in this area. I also think that since we rely
on the underlying filesystem for caching that we don't have exactly
the same problem as other systems.
Reading the link you gave, I'm unimpressed by ClockPro. The
measurements made are all in low numbers of MB and the results show
that Clock is roughly same or better for large caches, but one might
read them multiple ways.
The zone of interest for us is above 1GB and the issues we care about
are contention, so even if we think ClockPro has a chance of improving
things we really don't have any measurements that support the cases we
care about.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
You're missing an important point. The SeqScan is measurably faster
when using the ring buffer because of the effects of L2 caching on
the buffers.
I hadn't thought of that, but I think that's only true if the relation
won't fit in shared_buffers (or whatever portion of shared_buffers is
reasonably available, given the other activity on the machine). In
this particular case, it's almost 20% faster if the relation is all in
shared_buffers; I tested it. I think what's going on here is that
initscan() has a heuristic that tries to use a BufferAccessStrategy if
the relation is larger than 1/4 of shared_buffers. That's probably a
pretty good heuristic in general, but in this case I have a relation
which just so happens to be 27.9% of shared_buffers but will still
fit. As you say below, that may not be typical in real life, although
there are probably data warehousing systems where it's normal to have
only one big query running at a time.
Also, your logic is a little off, since you're doing the scan on an
otherwise quiet machine soon after startup and there are some
completely unused buffers. That isn't the typical state of the buffer
cache and so spoiling the cache is not acceptable in the general case.
The above case doesn't suck because it carefully reserved parts of
shared buffers for other users when spoiling the cache would be of no
benefit to the user, so it worked great from my perspective.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 4:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On
the other hand, the buffer manager has *no problem at all* trashing
the buffer arena if we're faulting in pages for an index scan rather
than a sequential scan. If you manage to get all of sample_data into
memory (by running many copies of the above query in parallel, you can
get each one to allocate its own ring buffer, and eventually pull in
all the pages), and then run some query that probes an index which is
too large to fit in shared_buffers, it cheerfully blows the whole
sample_data table out without a second thought. Had you sequentially
scanned a big table, of course, there would be some protection, but an
index scan can stomp all over everything with complete impunity.
That's a good observation and I think we should do this
* Make an IndexScan use a ring buffer once it has used 32 blocks. The
vast majority won't do that, so we avoid overhead on the common path.
* Make a BitmapIndexScan use a ring buffer when we know that the
index is larger than 32 blocks. (Ignore upper parts of tree for that
calc).
We'd need to think about what happens to the internal pages of the
btree, the leaf pages, and then the heap pages from the underlying
relation; those probably shouldn't all be treated the same. Also, I
think the tricky part is figuring out when to apply the optimization
in the first place. Once we decide we need a ring buffer, a very
small one (like 32 blocks) is probably the way to go. But it will be
a loser to apply the optimization to data sets that would otherwise
have fit in shared_buffers. This is a classic case of the LRU/MRU
problem. You want to evict buffers in LRU fashion until the working
set gets larger than you can cache; and then you want to switch to MRU
to avoid uselessly caching pages that you'll never manage to revisit
before they're evicted.
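The MRU half of that trade-off is what the ring strategy gives us; a toy model (not the real BufferAccessStrategy code) makes the point that no matter how long the scan, it only ever cycles through ring-many buffers.

```c
#define RING_SIZE 3

typedef struct Ring
{
    int slots[RING_SIZE];   /* buffer ids owned by this strategy */
    int current;
} Ring;

/* Hand out ring slots round-robin: the scan keeps reusing (and
 * evicting from) the same few buffers, which is effectively MRU
 * eviction for the pages being scanned. */
static int ring_next_slot(Ring *ring)
{
    int slot = ring->slots[ring->current];

    ring->current = (ring->current + 1) % RING_SIZE;
    return slot;
}

/* However many pages the scan reads, it touches only RING_SIZE
 * distinct buffers. */
int distinct_buffers_touched(int npages)
{
    Ring ring = {{0, 1, 2}, 0};
    int  touched[RING_SIZE] = {0, 0, 0};
    int  total = 0;

    for (int p = 0; p < npages; p++)
        touched[ring_next_slot(&ring)] = 1;
    for (int i = 0; i < RING_SIZE; i++)
        total += touched[i];
    return total;
}
```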
The point of algorithms like Clock-Pro is to try to have the system
work that out on the fly, based on the actual workload, rather than
using heuristics. I agree with you there's no convincing evidence
that Clock-Pro would be better for us; I mostly thought it was
interesting because it seems that the NetBSD and Linux guys find it
interesting, and they're managing much larger caches than the ones
we're dealing with.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
You're missing an important point. The SeqScan is measurably faster
when using the ring buffer because of the effects of L2 caching on
the buffers.
I hadn't thought of that, but I think that's only true if the relation
won't fit in shared_buffers (or whatever portion of shared_buffers is
reasonably available, given the other activity on the machine). In
this particular case, it's almost 20% faster if the relation is all in
shared_buffers; I tested it. I think what's going on here is that
initscan() has a heuristic that tries to use a BufferAccessStrategy if
the relation is larger than 1/4 of shared_buffers. That's probably a
pretty good heuristic in general, but in this case I have a relation
which just so happens to be 27.9% of shared_buffers but will still
fit. As you say below, that may not be typical in real life, although
there are probably data warehousing systems where it's normal to have
only one big query running at a time.
I think there are reasonable arguments to make
* prefer_cache = off (default) | on a table level storage parameter,
=on will disable the use of BufferAccessStrategy
* make cache_spoil_threshold a parameter, with default 0.25
Considering the world of very large RAMs in which we now live, some
control of the above makes sense.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But it will be
a loser to apply the optimization to data sets that would otherwise
have fit in shared_buffers.
Spoiling the cache is a bad plan, even if it makes the current query faster.
I think we should make the optimisation stronger still and use
POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the
OS cache is a problem as well.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 6:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
The worst case behaviour of the current freelist code is that it can
take up to 5 * shared_buffers checks before identifying a victim
buffer. That occurs when we have a working set exactly matching size
of shared buffers. There are problems when large portions of shared
buffers are very popular, so that the free-ish buffers are a small
proportion of the total buffers - this case gets worse if the
distribution of buffer allocations is non-random so you get say 10
free-ish blocks together, followed by N-10 very popular ones. That
makes every 11th freelist request take ~O(shared_buffers). These
problems will show themselves as sporadic BufFreelistLock contention.
Sporadic is the problem here since it may make the contention hard to
measure in aggregate, yet with real impact on performance.
I've been thinking about this exact same set of problems.
For us to benefit from increased shared_buffers we need to have an
algorithm that is O(k) in its worst case behaviour and O(1) in its
best case behaviour - the latter aspect is provided by a correctly
functioning bgwriter. At the moment, we do nothing to enforce O(k)
behaviour.
Check.
The following patch implements O(k) behaviour using a heuristic limit.
My first attempt was a fixed heuristic limit, but that was less
suitable because there are actually 2 requirements. The value of k
must be small to have a measurable impact on the smoothness of
StrategyGetBuffer(), but k must also be fairly large to ensure the
choice of victim is a relatively good one. So the way to balance those
conflicting objectives is to have a progressively increasing search
window. An just to repeat, k is not a % of shared buffers, which would
still give O(N) behaviour.
This is a very interesting idea, and I think it has a lot of potential.
One additional thought I had is that right now I think it's far too
easy for buffers to get pushed all the way up to usage count 5,
because we bump the reference count every time the buffer is pinned,
which can easily happen many times per clock sweep. I think we would
be better off having a "used" flag on each buffer. Each pin sets the
used bit, but does nothing if it is already set. The usage count is
only changed by the clock sweep, which does the following on each
buffer:
- If the "used" flag is set, then the flag is cleared and the usage
count increases by one to a maximum of 5.
- If the "used" flag is not set, then the usage count decreases by one.
That way, buffers can't get hot because of a fast flurry of access
followed by nothing - to get up to usage_count 5, they've actually got
to stay interesting for a period of time. As a side effect, I believe
we could get rid of the spinlock that we currently take and release
for every buffer the clock sweep hits, because we could regard the
usage_count as protected by the BufFreelistLock rather than by the
buffer header spinlock; and the "used" flag doesn't require perfect
synchronization. We'd only pin the buffer we actually plan to evict
(and could recheck the "used" flag before doing so, in case a memory
ordering effect made us miss the fact that it had been recently set).
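A sketch of the scheme (hypothetical structure, not actual bufmgr code): pins only set the flag, and the sweep alone moves the count, so a burst of pins between two sweep passes counts as a single reference.

```c
#include <stdbool.h>

#define MAX_USAGE_COUNT 5

typedef struct BufDesc
{
    bool used;          /* pinned since the sweep last came by? */
    int  usage_count;   /* maintained only by the clock sweep */
} BufDesc;

/* Pins just set the flag; repeated pins between sweep passes are
 * indistinguishable from a single pin. */
static void pin_buffer(BufDesc *buf)
{
    buf->used = true;
}

/* The clock sweep applies the two rules above. */
static void sweep_pass(BufDesc *buf)
{
    if (buf->used)
    {
        buf->used = false;
        if (buf->usage_count < MAX_USAGE_COUNT)
            buf->usage_count++;
    }
    else if (buf->usage_count > 0)
        buf->usage_count--;
}

/* A fast flurry of 100 pins followed by one sweep pass still leaves the
 * buffer at usage_count 1: it has to stay interesting across several
 * sweeps to become "hot". */
int count_after_burst(void)
{
    BufDesc buf = {false, 0};

    for (int i = 0; i < 100; i++)
        pin_buffer(&buf);
    sweep_pass(&buf);
    return buf.usage_count;
}
```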
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 8:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
You're missing an important point. The SeqScan is measurably faster
when using the ring buffer because of the effects of L2 caching on
the buffers.
I hadn't thought of that, but I think that's only true if the relation
won't fit in shared_buffers (or whatever portion of shared_buffers is
reasonably available, given the other activity on the machine). In
this particular case, it's almost 20% faster if the relation is all in
shared_buffers; I tested it. I think what's going on here is that
initscan() has a heuristic that tries to use a BufferAccessStrategy if
the relation is larger than 1/4 of shared_buffers. That's probably a
pretty good heuristic in general, but in this case I have a relation
which just so happens to be 27.9% of shared_buffers but will still
fit. As you say below, that may not be typical in real life, although
there are probably data warehousing systems where it's normal to have
only one big query running at a time.
I think there are reasonable arguments to make
* prefer_cache = off (default) | on a table level storage parameter,
=on will disable the use of BufferAccessStrategy
* make cache_spoil_threshold a parameter, with default 0.25
Yeah, something like that might make sense. Of course, a completely
self-tuning system would be better, but I'm not sure we're going to
find one of those.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 8:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
But it will be
a loser to apply the optimization to data sets that would otherwise
have fit in shared_buffers.
Spoiling the cache is a bad plan, even if it makes the current query faster.
I think we should make the optimisation stronger still and use
POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the
OS cache is a problem as well.
That would probably be better for really big tables, but it will be
worse for those where the entire table fits in the OS cache.
Cache spoiling means you're putting things into the cache which
won't still be there the next time they're needed. If the data in
question fits in cache (and I don't think we can regard that as
uncommon, particularly for the OS cache, which can be half a terabyte)
then you don't want the anti-spoiling logic to kick in.
One thing we could consider for large sequential scans is doing
POSIX_FADV_SEQUENTIAL before beginning the scan (and maybe one more
time if the scan wraps). That's basically throwing the problem of
whether or not to go LRU or MRU back on the OS, but the OS may well
have a better idea whether there's enough cache available to hold the
whole than we do.
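For reference, the hint would look something like this; a minimal standalone demo against a throwaway temp file, not PostgreSQL code:

```c
#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Advise the kernel that fd will be read sequentially; len 0 means
 * "to the end of the file".  Returns 0 on success. */
static int advise_sequential(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/* Demo: issue the advice against a freshly created temp file. */
int demo(void)
{
    char template[] = "/tmp/fadv_demo_XXXXXX";
    int  fd = mkstemp(template);
    int  rc;

    if (fd < 0)
        return -1;
    rc = advise_sequential(fd);
    close(fd);
    unlink(template);
    return rc;
}
```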
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 01:28:49PM +0100, Simon Riggs wrote:
I think there are reasonable arguments to make
* prefer_cache = off (default) | on a table level storage parameter,
=on will disable the use of BufferAccessStrategy
* make cache_spoil_threshold a parameter, with default 0.25
Considering the world of very large RAMs in which we now live, some
control of the above makes sense.
As long as we are discussion cache settings for tables, I have a client
who would like to be able to lock specific tables and indexes into cache
as they have strict response time requirements for particular queries.
At the moment they are running postgres with a tablespace on ramfs and
taking frequent backups, but this is not optimal.
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Only 96 of the 14286 buffers in sample_data are in shared_buffers,
despite the fact that we have 37,218 *completely unused* buffers lying
around. That sucks, because it means that the sample query did a
whole lot of buffer eviction that was completely needless. The ring
buffer is there to prevent trashing the shared buffer arena, but here
it would have been fine to trash the arena: there wasn't anything
there we would have minded losing (or, indeed, anything at all).
I don't disagree with the general thrust of your point, but I just
wanted to point out one case where not using free buffers even though
they're available might make sense:
If you execute a large batch delete or update or even just set lots of
hint bits you'll dirty a lot of buffers. The ring buffer forces the
query that is actually dirtying all these buffers to also do the i/o
to write them out. Otherwise you leave them behind to slow down other
queries. This was one of the problems with the old vacuum code which
the ring buffer replaced. It left behind lots of dirtied buffers for
other queries to do i/o on.
--
greg
On Fri, Aug 12, 2011 at 10:51 PM, Greg Stark <stark@mit.edu> wrote:
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Only 96 of the 14286 buffers in sample_data are in shared_buffers,
despite the fact that we have 37,218 *completely unused* buffers lying
around. That sucks, because it means that the sample query did a
whole lot of buffer eviction that was completely needless. The ring
buffer is there to prevent trashing the shared buffer arena, but here
it would have been fine to trash the arena: there wasn't anything
there we would have minded losing (or, indeed, anything at all).
I don't disagree with the general thrust of your point, but I just
wanted to point out one case where not using free buffers even though
they're available might make sense:
If you execute a large batch delete or update or even just set lots of
hint bits you'll dirty a lot of buffers. The ring buffer forces the
query that is actually dirtying all these buffers to also do the i/o
to write them out. Otherwise you leave them behind to slow down other
queries. This was one of the problems with the old vacuum code which
the ring buffer replaced. It left behind lots of dirtied buffers for
other queries to do i/o on.
Interesting point.
After thinking about this some more, I'm coming around to the idea
that we need to distinguish between:
1. Ensuring a sufficient supply of evictable buffers, and
2. Evicting a buffer.
The second obviously needs to be done only when needed, but the first
one should really be done as background work. Currently, the clock
sweep serves both functions, and that's not good. We shouldn't ever
let ourselves get to the point where there are no buffers at all with
reference count zero, so that the next guy who needs a buffer has to
spin the clock hand around until the reference counts get low enough.
Maintaining a sufficient supply of refcount-zero buffers should be
done as a background task; and possibly we ought to put them all in a
linked list so that the next guy who needs a buffer can just pop one
off.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
and possibly we ought to put them all in a
linked list so that the next guy who needs a buffer can just pop one
The whole point of the clock sweep algorithm is to approximate an LRU
without needing to maintain a linked list. The problem with a linked
list is that you need to serialize access to it so every time you
reference a buffer you need to wait on a lock for the list so you can
move that buffer around in the list.
It does kind of seem like your numbers indicate we're missing part of
the picture though. The idea with the clock sweep algorithm is that
you keep approximately 1/nth of the buffers with each of the n values.
If we're allowing nearly all the buffers to reach a reference count of
5 then you're right that we've lost any information about which
buffers have been referenced most recently.
I wonder if we're supposed to be doing something like advancing the
clock hand each time we increment a reference count as well?
--
greg
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
and possibly we ought to put them all in a
linked list so that the next guy who needs a buffer can just pop one
The whole point of the clock sweep algorithm is to approximate an LRU
without needing to maintain a linked list. The problem with a linked
list is that you need to serialize access to it so every time you
reference a buffer you need to wait on a lock for the list so you can
move that buffer around in the list.
Right, but the same doesn't apply to what I'm talking about. You just
put each guy into the linked list when his reference count goes down
to 0. When you want to allocate a buffer, you pop somebody off. If
his reference count is still 0, you forget about him and pop the next
guy, and keep going until you find a suitable victim.
The problem with just running the clock sweep every time you need a
victim buffer is that you may have to pass over a large number of
unevictable buffers before you find someone you can actually kick out.
That's work that should really be getting done in the background, not
at the absolute last moment when we discover, hey, we need a buffer.
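That push/pop/revalidate scheme might look roughly like this; a toy single-threaded model, since the real thing would need the serialization Greg describes or something lock-free:

```c
#define NBUFFERS 8

static int usage_count[NBUFFERS];
static int freelist[NBUFFERS];
static int freelist_len = 0;

/* Background task: list every buffer currently at usage_count zero. */
static void background_refill(void)
{
    freelist_len = 0;
    for (int buf = 0; buf < NBUFFERS; buf++)
        if (usage_count[buf] == 0)
            freelist[freelist_len++] = buf;
}

/* Allocation path: pop candidates, rechecking that each is still
 * evictable, since it may have been pinned after it was listed. */
static int get_victim(void)
{
    while (freelist_len > 0)
    {
        int buf = freelist[--freelist_len];

        if (usage_count[buf] == 0)
            return buf;
        /* touched since it was listed; forget about it */
    }
    return -1;      /* list exhausted: fall back to a clock sweep */
}

/* Buffer 6 is listed but then pinned again; the pop skips it and
 * returns buffer 2, which is still at zero. */
int demo_get_victim(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = 3;
    usage_count[2] = 0;
    usage_count[6] = 0;
    background_refill();
    usage_count[6] = 1;
    return get_victim();
}
```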
It does kind of seem like your numbers indicate we're missing part of
the picture though. The idea with the clock sweep algorithm is that
you keep approximately 1/nth of the buffers with each of the n values.
If we're allowing nearly all the buffers to reach a reference count of
5 then you're right that we've lost any information about which
buffers have been referenced most recently.
It doesn't seem right to have 1/nth of the buffers at each of n values
because that might not match the actual workload. For example, you
might have lightning-hot buffers that fill 50% of your available
cache. Trying to artificially force some of those down to usage count
4 or 3 doesn't actually get you anywhere. In fact, AFAICS, there's no
direct reason to care about how many buffers have usage count 1
vs. usage count 5. What IS important for performance is that there
are enough with usage count 0. Any other distinction is only useful
to the extent that it allows us to make a better decision about which
buffers we should push down to 0 next.
I wonder if we're supposed to be doing something like advancing the
clock hand each time we increment a reference count as well?
That precise thing seems far too expensive, but I agree that
something's missing. Right now we decrease usage counts when the
clock hand moves (which is driven by buffer allocation), and we
increase them when buffers are pinned (which is driven by access to
already-resident buffers). I feel like that's comparing apples and
oranges. If some subset of shared_buffers is very hot relative to the
uncached pages, then everything's going to converge to 5 (or whatever
arbitrary maximum we care to set in lieu of 5).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
and possibly we ought to put them all in a
linked list so that the next guy who needs a buffer can just pop one
The whole point of the clock sweep algorithm is to approximate an LRU
without needing to maintain a linked list. The problem with a linked
list is that you need to serialize access to it so every time you
reference a buffer you need to wait on a lock for the list so you can
move that buffer around in the list.Right, but the same doesn't apply to what I'm talking about. You just
put each guy into the linked list when his reference count goes down
to 0. When you want to allocate a buffer, you pop somebody off. If
his reference count is still 0, you forget about him and pop the next
guy, and keep going until you find a suitable victim.The problem with just running the clock sweep every time you need a
victim buffer is that you may have to pass over a large number of
unevictable buffers before you find someone you can actually kick out.
That's work that should really be getting done in the background, not
at the absolute last moment when we discover, hey, we need a buffer.
I think Greg is right here. Those suggested changes just bring back the LRU.
If you keep a separate list then you need to serialize access to it.
usage_count == 0 and "unevictable" are dynamic states, so if the
bgwriter sees those states they can change before anyone uses the
buffer.
The only state which is unchangeable is a recently filled block, such
as from a COPY, which is why I originally suggested a
filled-block-list - though I don't think we need it now because of the
buffer cycle strategy.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I agree that
something's missing.
I'm quoting you completely out of context here, but yes, something is missing.
We can't credibly do one test on usage count in shared buffers and
then start talking about how buffer management is all wrong.
My own patch requires more test evidence before we can commit it,
which is why I hadn't published it before now. I'll endeavour to do
that now it's on the table.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
and possibly we ought to put them all in a
linked list so that the next guy who needs a buffer can just pop one

The whole point of the clock sweep algorithm is to approximate an LRU
without needing to maintain a linked list. The problem with a linked
list is that you need to serialize access to it so every time you
reference a buffer you need to wait on a lock for the list so you can
move that buffer around in the list.

Right, but the same doesn't apply to what I'm talking about. You just
put each guy into the linked list when his reference count goes down
to 0. When you want to allocate a buffer, you pop somebody off. If
his reference count is still 0, you forget about him and pop the next
guy, and keep going until you find a suitable victim.

The problem with just running the clock sweep every time you need a
victim buffer is that you may have to pass over a large number of
unevictable buffers before you find someone you can actually kick out.
That's work that should really be getting done in the background, not
at the absolute last moment when we discover, hey, we need a buffer.

I think Greg is right here. Those suggested changes just bring back the LRU.
Since the problem with LRU is that it requires moving each buffer to
the front of the list every time it's touched, and since the idea that
I proposed does not require that, I don't know what you mean by this.
If you keep a separate list then you need to serialize access to it.
That is true. However, there are some reasons to think it would be
better than what we're doing right now. Right now, we acquire the
BufFreelistLock, and then take and release some number of spinlocks.
It seems fairly easy to construct cases where that number is routinely
over 100; even on my laptop, I could construct cases where it was
routinely in the 50-100 range. A linked list that we can just dequeue things
from could be protected by a single spinlock, which would hopefully be
somewhat faster. (In the interest of credit where credit is due, this
is pretty much the point Jim Nasby has been making every time this
argument comes up, and I've been dismissing it, but now that I
understand how much work gets done in that clock sweep code, I'm
starting to think he might be right.)
However, if it turns out that there's still too much contention
over that spinlock, we can probably fix that by maintaining N lists
instead of one, distributing buffers between them in round-robin
fashion, and having each backend choose a list based on its backend ID
modulo N. The testing I've done so far seems to indicate that
spinlock contention doesn't really become noticeable until you have
32-40 CPUs pounding hard on the same spinlock, so even N = 4 is
probably enough to make the problem go away. But I don't even think
we're going to have to go that far right at the outset, because
there's another major source of contention in the buffer eviction
path: the buffer mapping locks.
usage_count == 0 and "unevictable" are dynamic states, so if the
bgwriter sees those states they can change before anyone uses the
buffer.
Yep. That's already a problem. When we pin the buffer to be evicted,
we're not yet holding the relevant buffer mapping locks. By the time
we do, someone else can have pinned the page. So there's already
retry logic here. In theory, you can imagine a backend getting
starved for an arbitrarily long period of time, because it keeps
picking a victim buffer that someone else wrenches away before we
actually manage to kick it out. I haven't yet seen any evidence that
that's a problem in practice, but if it is, this idea will make it
worse. It seems hard to know without testing it, though.
The big problem with this idea is that it pretty much requires the
work you mentioned in one of your other emails - separating the
background writer and checkpoint machinery into two separate processes
- to happen first. So I'm probably going to have to code that up to
see whether this works. If you're planning to post that patch soon
I'll wait for it. Otherwise, I'm going to have to cobble together
something that is at least good enough for testing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company