Clock sweep not caching enough B-Tree leaf pages?
I have some theories about the PostgreSQL buffer manager/clock sweep.
To motivate the reader to get through the material presented here, I
present up-front a benchmark of a proof-of-concept patch of mine:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/
Test Set 4 represents the patch's performance here.
This shows some considerable improvements for a tpc-b workload, with
15 minute runs, where the buffer manager struggles with moderately
intense cache pressure. shared_buffers is 8GiB, with 32GiB of system
memory in total. The scale factor is 5,000 here, so that puts the
primary index of the accounts table at a size that makes it impossible
to cache entirely within shared_buffers, by a margin of a couple of
GiB. pgbench_accounts_pkey is ~10 GB, and pgbench_accounts is ~63 GB.
Obviously the heap is much larger, since for that table heap
tuples are several times the size of index tuples (the ratio here is
probably well below the mean, if I can be permitted to make a vast
generalization).
PostgreSQL implements a clock sweep algorithm, which gets us something
approaching an LRU for the buffer manager in trade-off for less
contention on core structures. Buffers have a usage_count/"popularity"
that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
algorithm only has one bit for what approximates our "usage_count" (so
it's either 0 or 1). I think that at its core CLOCK is an algorithm
that has some very desirable properties that I am sure must be
preserved. Actually, I think it's more accurate to say we use a
variant of CLOCK-Pro, a refinement of the original CLOCK.
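To make the mechanism concrete, here is a toy sketch of victim
selection under clock sweep (illustrative types and names of my own,
not the actual bufmgr.c code): the hand advances, decrementing
usage_count as it passes, until it finds an unpinned buffer whose
count has reached zero.

```c
#include <assert.h>

#define NBUFFERS 8
#define BM_MAX_USAGE_COUNT 5

/* Toy buffer descriptor: just the fields the sweep cares about. */
typedef struct
{
    int  usage_count;   /* "popularity", saturates at BM_MAX_USAGE_COUNT */
    int  pinned;        /* nonzero if some backend holds a pin */
} ToyBufferDesc;

static ToyBufferDesc buffers[NBUFFERS];
static int nextVictimBuffer = 0;    /* the clock hand */

/*
 * Advance the hand until we find an unpinned buffer with
 * usage_count == 0, decrementing usage_count as we pass.  When every
 * buffer sits at the maximum, this takes multiple full revolutions --
 * exactly the pathology discussed below.
 */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        ToyBufferDesc *buf = &buffers[nextVictimBuffer];

        nextVictimBuffer = (nextVictimBuffer + 1) % NBUFFERS;

        if (buf->pinned)
            continue;           /* can't evict a pinned buffer */

        if (buf->usage_count == 0)
            return (int) (buf - buffers);   /* found our victim */

        buf->usage_count--;     /* one more revolution of grace */
    }
}
```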
In the past, various hackers have noted problems they've observed with
this scheme. A common pathology is to see frantic searching for a
victim buffer only to find all buffer usage_count values at 5. It may
take multiple revolutions of the clock hand before a victim buffer is
found, as usage_count is decremented for each and every buffer. Also,
BufFreelistLock contention is considered a serious bottleneck [1],
which is related.
I think a very real problem here may be that approximating an LRU is
bad because an actual LRU is bad, though not due to the problems that
CLOCK/our clock sweep algorithm, to its great credit, ameliorates. I
don't think that we need to trade-off between an LRU and MRU as Atri
once suggested [2]; rather, I think we need a trade-off between an LRU
and a LFU (very loosely speaking). Something like a pure LRU can make
very bad choices for database workloads. This is described extensively
in the literature.
As I recently remarked upon for unrelated reasons [3], B+Trees perform
excellently when partially cached. There is a remarkable asymmetry
between how many pages we must have cached relative to how much I/O is
avoided, particularly the random I/O that is characteristic of fully
uncached B-Trees when scanned. Approximately 99% of pages in our
nbtree structures can be expected to be leaf pages in many common
cases. There is a very wide fan-out of B-Tree structures that makes
this possible. A lot of material on the subject doesn't emphasize this
basic aspect strongly enough in my view. But it also bears mentioning
that B-Tree pages are, in an important sense, far denser than heap
pages.
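A quick back-of-the-envelope calculation makes the fan-out point; the
fan-out figure used below is an illustrative guess for 8KiB pages of
small index tuples, not a measurement:

```c
#include <assert.h>

/*
 * Rough arithmetic behind the "~99% leaf pages" claim.  Given nleaf
 * leaf pages and a per-page fan-out, each level of the tree above the
 * leaves is ~fanout times smaller, so internal pages are a tiny
 * fraction of the whole structure.
 */
static double
leaf_fraction(long nleaf, long fanout)
{
    long internal = 0;
    long level = nleaf;

    while (level > 1)
    {
        level = (level + fanout - 1) / fanout;  /* pages one level up */
        internal += level;
    }
    return (double) nleaf / (double) (nleaf + internal);
}
```

With a million leaf pages and a fan-out of a few hundred, well under
1% of the structure is internal pages.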
Let's leave aside inner/root pages though, because they're so
dramatically useful when in a primary index on a tpc-b table that
they'll always be cached by any non-terrible algorithm. It beggars
belief that the still relatively dense (and consequently *popular*)
B+Tree leaf pages get so little credit for being of such long-term
utility (in the view of our clock sweep algorithm as implemented). The
algorithm has what could be loosely described as an excessively
short-term perspective. There is clearly a better balance to be had
here. I don't think the answer is that we have the B-Tree code give
its pages some special treatment that makes them harder to evict,
although I will admit that I briefly entertained the idea.
I am aware of the false start that we had with ARC back in 2005. I
believe that effort tried to address some of these problems. I am
mindful of the pitfalls there. I'm inclined to think that the decision
to create a CLOCK variant in preference to ARC was the right one at
the time, because clock sweep really is at a considerable advantage
with regard to lock contention.
The 1993 paper "The LRU-K Page Replacement Algorithm For Database Disk
Buffering" [4] proposes what is (roughly speaking) an algorithm that
is an adaptive hybrid of LRU and LFU. Consider Example 1.1. from that
paper; it describes a very simple scenario in which it's possible to
have slightly more heap pages cached than B-Tree pages. This scenario
is essentially a "pgbench -S" workload, comparable to the simplistic
tpc-a; it's nothing more than that. The authors aren't clear on this,
but I assumed a uniform distribution (IIRC, the 2Q paper, which was
published a year or two later, also explicitly assumes this of the
very same earlier LRU-K/LRU-2 example). It is argued within that
example, quite cogently in my opinion, that it's very bad that a
simple LRU cache will see just under 50% of buffers used to cache
B-Tree leaf pages, while over 50% are used for "data" (heap) pages. In
actual fact, it is preferable to almost exclusively cache B-Tree
pages. Using 50% of the cache on B-Tree pages when it is optimal to
cache a number approaching 100% of the cache is a pretty large
disparity, particularly for such a simple workload. This is literally
the furthest possible thing from a contrived corner case.
I believe that it isn't hard to get clock sweep to do just the same as
predicted in the '93 paper, even with a "usage_count". It is nothing
more than an LRU approximation, or at least that's what the relevant
comments say. I did some experiments here, but it probably isn't worth
sharing the raw data, given what I already have here. The example
surrounding caching B-Tree leaf pages in the paper draws attention to
a particularly acute pathology, but there are probably others. Leaf
pages are in many representative cases far denser, and therefore can
be expected to be accessed far more frequently to service queries. Our
current failure to credit them in a way that weighs the frequency with
which they're accessed is something I suggest thinking long and hard
about.
Has anyone thought about this in the last few years? I know that Tom
examined the LRU-K paper back in 2000 [5], but was discouraged by some
kind of contention or CPU overhead (although he did say he intended to
revisit the question). Obviously a lot has changed in the past 14
years, particularly with regard to CPU characteristics.
Anyway, having gone through the requisite background information, I'll
get back to the subject of the pgbench-tools tpc-b benchmark that I
started out with, and what I've actually done in the attached patch to
improve matters. Let me preface this by saying: this is a rough
prototype. The way I add a field to the buffer descriptor struct
clearly isn't going to fly, since we have every reason to believe that
that is itself performance critical in other scenarios (just look at
Andres' recent work on the alignment of these structures in memory).
Actually, it probably matters a lot in this scenario too, just not
enough to mask the general outcome. I've arrived at this prototype
through plenty of principled theorizing, and a bit of unprincipled
frobbing. I'm a big fan of building simple prototypes to test
theories. In the interest of reproducibility I have not attempted to
clean up what I have here just yet, in case I accidentally invalidate
the benchmark presented.
The prototype patch:
1) Throttles incrementation of usage_count temporally. It becomes
impossible to increment usage_count for any given buffer more
frequently than every 3 seconds, while decrementing usage_count is
totally unaffected. This is thought to give the algorithm a short to
medium term "appreciation" of the frequency of access. It also
prevents very close increments in usage_count due to pinning a buffer
twice in the same query or transaction, just because the code in the
executor happens to be accidentally structured such that that happens
(as when inserting two tuples within a single INSERT DML query), or
whatever. Clearly the tpc-b accounts table is accessed twice in
succession in the same transaction by our tpc-b script, creating a sort of
misrepresentation of buffer popularity that is likely in and of itself
undesirable.
2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not
5 as before. This gives us a more granular representation of the
popularity/usefulness of each buffer, which I believe spans a
dramatically large spectrum (i.e. my guess is that testing will show I
didn't go far enough). This step on its own would be assumed extremely
counter-productive by those in the know, but I believe that other
measures ameliorate the downsides. I could be wrong about how true
that is in other cases, but then the case helped here isn't what you'd
call a narrow benchmark. It's the pgbench default (although I do use
unlogged tables, in part because the I/O subsystem on this server is
quite under-powered, even though it has plenty of memory).
3) Has buffer usage_count start at 3. I also tried 4 (and 1, the
current value within master), but settled on 3 for now. Seemingly this
makes a very large difference compared to only doing 1) and 2), since
it gives each page a fair shot at proving its usefulness. Presumably
the throttling in 1) makes frantic buffer scanning much less
prevalent, since clock sweep has the upper hand, so to speak. A tug of
war between clock sweep and backends pinning buffers can be
disastrous, but that seems much less likely than before. Semi-magical
constant amounts of time are something you see surprisingly often in
the research. Probably most notably, the "five-minute rule" [6] argues
that one should only cache randomly accessed pages that are re-used
every 5 minutes or less (it's actually a bit more complicated these
days, but not much). This rule is a mainstay of database caching
research theory. Anyway, I'm not sure whether or not this delay should
be exposed as a GUC. I lean towards "no".
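Putting the three changes together, here is a toy sketch of the
modified usage_count rules (the names and the standalone timestamp
parameter are mine for illustration; the real prototype modifies the
buffer descriptor and bufmgr.c directly):

```c
#include <assert.h>
#include <time.h>

#define PROTO_MAX_USAGE_COUNT 10   /* change 2: saturate at 10, not 5 */
#define PROTO_INITIAL_USAGE    3   /* change 3: new buffers start at 3 */
#define USAGE_INCREMENT_DELAY  3   /* change 1: min seconds between bumps */

typedef struct
{
    int    usage_count;
    time_t last_increment;  /* the extra field; in the real prototype this
                             * bloats the buffer descriptor, the known
                             * problem acknowledged above */
} ProtoBufferDesc;

static void
proto_init_buffer(ProtoBufferDesc *buf, time_t now)
{
    buf->usage_count = PROTO_INITIAL_USAGE;
    buf->last_increment = now;
}

/* Called on pin: bump popularity, but at most once every 3 seconds. */
static void
proto_bump_usage(ProtoBufferDesc *buf, time_t now)
{
    if (now - buf->last_increment < USAGE_INCREMENT_DELAY)
        return;                 /* throttled: too soon since last bump */
    if (buf->usage_count < PROTO_MAX_USAGE_COUNT)
        buf->usage_count++;
    buf->last_increment = now;
}

/* Decrementing by the clock hand is deliberately left unthrottled. */
static void
proto_decay_usage(ProtoBufferDesc *buf)
{
    if (buf->usage_count > 0)
        buf->usage_count--;
}
```

So pinning the accounts-table buffer twice within one transaction
bumps the count once at most, while the sweep can still decay it
freely on every revolution.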
The amount of evidence it takes to validate something like this is
generally enormous. Validating the just-in-time bgwriter strategy was
the original reason that Greg Smith wrote pgbench-tools. Still, I'm
confident that I've identified a real problem. The big picture here is
that pages have a fair go at proving their worth, while allowing for
certain pages to be recognized as being of dramatically higher
long-term utility. Generally speaking, as a caching algorithm, a pure
LFU is inappropriate for almost all use-cases, because it never forgets
anything about a page until, upon eviction, it forgets everything.
There is a balance to be had here, in terms of the extent to which we
allow pages to "rest on their laurels". I wouldn't be surprised to
learn I have the balance wrong right now.
pgbench-tools benchmark interpretation
======================================
I also attach a LibreOffice spreadsheet comparing hit rates for each
table (and its indexes) for each run that you see in the pgbench-tools
report (except the final do-over of master baseline). I built this
with the help of a small pgbench-tools hack, resetting pg_statio*
views, measuring hit rates per pgbench invocation. The hit rate shown
would probably be a lot more interesting if I'd used a rate limit, so
the amount of work performed is consistent across test sets (that is,
variants of the patch, and master). Greg Smith is very enthusiastic
about how much further insight can be had while using that pgbench
feature, and I'd say he's probably right to be. Right now, in terms of
hit rate you mostly see a slightly smaller one for indexes, and a
considerably larger one for heap buffers as compared to the master
baseline.
If you run down the individual runs in the pgbench-tools report, and
consider what actually happens and when, you tend to see a degradation
in performance over successive runs for different client counts for
certain cases tested. It's hard to be sure, but I think that might be
because earlier runs have a big advantage due to the index being well
cached (pgbench-tools performs vacuuming at the start of each run,
excluding the first. So VACUUM reverses the situation initially
seen, since we kill heap tuples last when vacuuming, rather than
creating the index last). In contrast, the impressive performance of
the patch (where all 3 of the above measures are implemented) shows
more consistent throughput even with that noise, which seems to
support my theories about what is going on here. Leaving aside cache
effectiveness as such, I suspect that in general we currently pay a
high price for all too frequently having buffers fall out of
shared_buffers into the OS cache, and fall back in again.
Having arrived at the conclusion that these optimizations were worth
sharing here, I decided to "do over" the original master baseline
(Test Set 1). I ended up with a result (Test set 5) that is actually
quite different from the original result, without it being all that
clear that it was better or worse than the first time I took a
baseline (it was probably worse). This might call into question the
accuracy of my results. However, to see what's really going on here,
it is necessary to drill down to the individual TPS/latency figures
for individual pgbench invocations. That's the real story here.
In general, as I said, the I/O system here is quite under-specified
for this workload. This is dedicated physical hardware, but it happens
to be what was immediately available to me. There are a couple of SATA
7200 rpm disks in RAID 1. The system specifications, as advertised by
my hosting provider, are detailed here:
https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 . As always
with pgbench-tools, you can drill down to see more about each test
(including kernel vm settings, etc).
Inevitably, with this disk setup (which I imagine is particularly bad
with random I/O), and with so much memory, it's only a matter of time
before things get ugly, even if there is an initial burst of high
performance (even on master) before we have to face the music.
It seems like the pain is felt at slightly different
times between Test Set 1 and Test Set 5 (i.e. for each test of
master). With either of those 2 test sets, if you drill down to see
the moment-to-moment throughput and latency, you'll find tests from
each test set where both latency and throughput were on the floor for
as long as a couple of minutes at a time, after which there is a
sudden recovery. We're subject to the vagaries of kernel I/O
scheduling to a much greater extent than with the good patched test
sets, it seems. What is seen with master looks very much like sync
phase checkpoint spikes. Once or twice, things are consistently very
poor for almost an entire master invocation of pgbench. In general
things are very inconsistent, although to be fair it's possible that
at its best master has greater throughput than patched for brief
moments (of course, the buffer descriptor bloating which I have yet to
deal with may well be all that is to blame here).
If you look at the test sets that this patch covers (with all the
tricks applied), there are pretty good figures throughout. You can
kind of see the pain towards the end, but there are no dramatic falls
in responsiveness for minutes at a time. There are latency spikes, but
they're *far* shorter, and much better hidden. Without looking at
individual multiple minute spikes, at the macro level (all client
counts for all runs) average latency is about half of what is seen on
master.
Does anyone have any high-end hardware that they could use to test this out?
[1]: /messages/by-id/CA+TgmobJm0GHk58nUPRQHCGwY25n1DCkU4ku9aQeczZEjiz9mQ@mail.gmail.com
[2]: /messages/by-id/CAOeZVic4HikhmzVD=ZP4JY9g8PgpyiQQOXOELWP=kR+=H1Frgg@mail.gmail.com
[3]: /messages/by-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com
[4]: http://www.cs.cmu.edu/~christos/courses/721-resources/p297-o_neil.pdf
[5]: /messages/by-id/1601.967421129@sss.pgh.pa.us
[6]: http://en.wikipedia.org/wiki/Five-minute_rule
--
Peter Geoghegan
On 4/14/14, 12:11 PM, Peter Geoghegan wrote:
> PostgreSQL implements a clock sweep algorithm, which gets us something
> approaching an LRU for the buffer manager in trade-off for less
> contention on core structures. Buffers have a usage_count/"popularity"
> that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
> algorithm only has one bit for what approximates our "usage_count" (so
> it's either 0 or 1). I think that at its core CLOCK is an algorithm
> that has some very desirable properties that I am sure must be
> preserved. Actually, I think it's more accurate to say we use a
> variant of CLOCK-Pro, a refinement of the original CLOCK.
I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of which has its own clock. IIRC one of the arguments for us supporting a count > 1 was that we could get the benefits of multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page pool in an OS *run at different rates* based on system demands.
I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this difference any time we look at what OSes do.
> If you look at the test sets that this patch covers (with all the
> tricks applied), there are pretty good figures throughout. You can
> kind of see the pain towards the end, but there are no dramatic falls
> in responsiveness for minutes at a time. There are latency spikes, but
> they're *far* shorter, and much better hidden. Without looking at
> individual multiple minute spikes, at the macro level (all client
> counts for all runs) average latency is about half of what is seen on
> master.
My guess would be that those latency spikes are caused by a need to run the clock for an extended period. IIRC there's code floating around that makes it possible to measure that.
I suspect it would be very interesting to see what happens if your patch is combined with the work that (Greg?) did to reduce the odds of individual backends needing to run the clock. (I know part of that work looked at proactively keeping pages on the free list, but I think there was more to it than that).
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Apr 14, 2014 at 10:11:53AM -0700, Peter Geoghegan wrote:
> Has anyone thought about this in the last few years? I know that Tom
> examined the LRU-K paper back in 2000 [5], but was discouraged by some
> kind of contention or CPU overhead (although he did say he intended to
> revisit the question). Obviously a lot has changed in the past 14
> years, particularly with regard to CPU characteristics.
I am glad you are looking at this. You are right that it requires a
huge amount of testing, but clearly our code needs improvement in this
area.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
* Jim Nasby (jim@nasby.net) wrote:
> I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of which has its own clock. IIRC one of the arguments for us supporting a count > 1 was that we could get the benefits of multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page pool in an OS *run at different rates* based on system demands.
They're also maintained in *parallel*, no? That's something that I've
been talking over with a few folks at various conferences- that we
should consider breaking up shared buffers and then have new backend
processes which work through each pool independently and in parallel.
> I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this difference any time we look at what OSes do.
It's my suspicion that the one-big-pool is exactly why we see many cases
where PG performs worse when the pool is more than a few gigs. Of
course, this is all speculation and proper testing needs to be done..
Thanks,
Stephen
On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I am glad you are looking at this. You are right that it requires a
> huge amount of testing, but clearly our code needs improvement in this
> area.
Thanks.
Does anyone recall the original justification for the recommendation
that shared_buffers never exceed 8GiB? I'd like to revisit the test
case, if such a thing exists.
--
Peter Geoghegan
On Mon, Apr 14, 2014 at 05:45:56PM -0700, Peter Geoghegan wrote:
> On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> I am glad you are looking at this. You are right that it requires a
>> huge amount of testing, but clearly our code needs improvement in this
>> area.
>
> Thanks.
>
> Does anyone recall the original justification for the recommendation
> that shared_buffers never exceed 8GiB? I'd like to revisit the test
> case, if such a thing exists.
I have understood it be that the overhead of managing over 1 million
buffers is too large if you aren't accessing more than 8GB of data in a
five-minute period. If you are accessing that much, it might be possible to
have a win over 8GB.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 4/14/14, 7:43 PM, Stephen Frost wrote:
> * Jim Nasby (jim@nasby.net) wrote:
>> I think it's important to mention that OS implementations (at least all I know of) have multiple page pools, each of which has its own clock. IIRC one of the arguments for us supporting a count > 1 was that we could get the benefits of multiple page pools without the overhead. In reality I believe that argument is false, because the clocks for each page pool in an OS *run at different rates* based on system demands.
>
> They're also maintained in *parallel*, no? That's something that I've
> been talking over with a few folks at various conferences- that we
> should consider breaking up shared buffers and then have new backend
> processes which work through each pool independently and in parallel.
I suspect that varies based on the OS, but it certainly happens in a separate process from user processes. The expectation is that there should always be pages on the free list so requests for memory can happen quickly.
http://www.freebsd.org/doc/en/articles/vm-design/freeing-pages.html contains a good overview of what FreeBSD does. See http://www.freebsd.org/doc/en/articles/vm-design/allen-briggs-qa.html#idp62990256 as well.
>> I don't know if multiple buffer pools would be good or bad for Postgres, but I do think it's important to remember this difference any time we look at what OSes do.
>
> It's my suspicion that the one-big-pool is exactly why we see many cases
> where PG performs worse when the pool is more than a few gigs. Of
> course, this is all speculation and proper testing needs to be done..
I think there are some critical take-aways from FreeBSD that apply here (in no particular order):
1: The system is driven by memory pressure. No pressure means no processing.
2: It sounds like the active list is LFU, not LRU. The cache list is LRU.
3: *The use counter is maintained by a clock.* Because the clock only runs so often this means there is no run-away incrementing like we see in Postgres.
4: Once a page is determined to not be active it goes onto a separate list depending on whether it's clean or dirty.
5: Dirty pages are only written to maintain a certain clean/dirty ratio and again, only when there's actual memory pressure.
6: The system maintains a list of free pages to serve memory requests quickly. In fact, lower level functions (ie: http://www.leidinger.net/FreeBSD/dox/vm/html/d4/d65/vm__phys_8c_source.html#l00862) simply return NULL if they can't find pages on the free list.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Mon, Apr 14, 2014 at 7:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Apr 14, 2014 at 5:30 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> I am glad you are looking at this. You are right that it requires a
>> huge amount of testing, but clearly our code needs improvement in this
>> area.
>
> Thanks.
>
> Does anyone recall the original justification for the recommendation
> that shared_buffers never exceed 8GiB? I'd like to revisit the test
> case, if such a thing exists.
There are many reports of improvement from lowering shared_buffers.
The problem is that it tends to show up on complex production
workloads and that there is no clear evidence pointing to problems
with the clock sweep; it could be higher up in the partition locks or
something else entirely (like the O/S). pgbench is also not the
greatest tool for sniffing out these cases: it's too random and for
large database optimization is generally an exercise in de-randomizing
i/o patterns. We really, really need a broader testing suite that
covers more usage patterns.
I was suspicious for a while that spinlock contention inside the
clocksweep was causing stalls and posted a couple of different patches
to try and reduce the chance of that. I basically gave up when I
couldn't demonstrate that case in simulated testing.
I still think there is no good reason for the clock to pedantically
adjust usage_count on contended buffers...better to throw a single
TTAS and bail to the next buffer if either 'T' signals a lock.
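Concretely, something like this toy sketch (C11 atomics standing in
for our spinlock; the types and names are made up for illustration):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy buffer header with a spinlock-style flag standing in for the
 * real buffer header spinlock. */
typedef struct
{
    atomic_bool locked;
    int         usage_count;
} SpinBufferDesc;

/*
 * Test-and-test-and-set with no retry loop: a plain load first, then a
 * single compare-exchange.  If either "T" sees the lock held, we give
 * up immediately so the clock hand can simply move on to the next
 * buffer instead of spinning on a contended header.
 */
static bool
try_decrement_usage(SpinBufferDesc *buf)
{
    bool expected = false;

    /* first T: cheap read; bail if the lock merely looks held */
    if (atomic_load_explicit(&buf->locked, memory_order_relaxed))
        return false;

    /* second T-A-S: one atomic attempt, no spinning */
    if (!atomic_compare_exchange_strong(&buf->locked, &expected, true))
        return false;

    if (buf->usage_count > 0)
        buf->usage_count--;

    atomic_store(&buf->locked, false);
    return true;
}
```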
merlin
On Tue, Apr 15, 2014 at 9:30 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> There are many reports of improvement from lowering shared_buffers.
> The problem is that it tends to show up on complex production
> workloads and that there is no clear evidence pointing to problems
> with the clock sweep; it could be higher up in the partition locks or
> something else entirely (like the O/S). pgbench is also not the
> greatest tool for sniffing out these cases: it's too random and for
> large database optimization is generally an exercise in de-randomizing
> i/o patterns. We really, really need a broader testing suite that
> covers more usage patterns.
I find it quite dissatisfying that we know so little about this.
I'm finding that my patch helps much less when shared_buffers is sized
large enough to fit the index entirely (although there are still some
localized stalls on master, where there are none with patched).
shared_buffers is still far too small to fit the entire heap. With
shared_buffers=24GB (which still leaves just under 8GB of memory for
the OS to use as cache, since this system has 32GB of main memory),
the numbers are much less impressive relative to master with the same
configuration. Both sets of numbers are still better than what you've
already seen with shared_buffers=8GB, since of course the "no more
than 8GB" recommendation is not an absolute, and as you say its
efficacy seemingly cannot be demonstrated with pgbench.
My guess is that the patch doesn't help because once there is more
than enough room to cache the entire index (slightly over twice as
many buffers as would be required to do so), even on master it becomes
virtually impossible to evict those relatively popular index pages,
since they still have an early advantage. It doesn't matter that
master's clock sweep has what I've called an excessively short-term
perspective, because there is always enough pressure relative to the
number of leaf pages being pinned to prefer to evict heap pages. There
are still enough buffers left to fit some moderate proportion of all
heap pages even after buffering the entire index (something like
~13GB).
You might say that with this new shared_buffers setting, clock sweep
doesn't need to have a "good memory", because it can immediately
observe the usefulness of B-Tree leaf pages.
There is no need to limit myself to speculation here, of course. I'll
check it out using pg_buffercache.
--
Peter Geoghegan
On Mon, Apr 14, 2014 at 8:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> PostgreSQL implements a clock sweep algorithm, which gets us something
> approaching an LRU for the buffer manager in trade-off for less
> contention on core structures. Buffers have a usage_count/"popularity"
> that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
> algorithm only has one bit for what approximates our "usage_count" (so
> it's either 0 or 1). I think that at its core CLOCK is an algorithm
> that has some very desirable properties that I am sure must be
> preserved. Actually, I think it's more accurate to say we use a
> variant of CLOCK-Pro, a refinement of the original CLOCK.
PostgreSQL's replacement algorithm is more similar to Generalized CLOCK
or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm
that approximates LIRS [3]. LIRS is what MySQL implements [4], CLOCK-Pro
is implemented by NetBSD [5], and there has been some work on trying it
on Linux [6]. Both LIRS and CLOCK-Pro work by keeping metadata entries
for double the cache size, to detect pages that have been recently
referenced. Basically, they provide an adaptive tradeoff between LRU
and LFU.
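To make the mechanism under discussion concrete, here is a minimal sketch of a GCLOCK sweep with a saturating usage count, as described above. This is deliberately not the actual bufmgr.c code: it ignores pinning semantics, locking, and the freelist, and all names are illustrative.

```c
#include <assert.h>

#define N_BUFFERS 8
#define MAX_USAGE_COUNT 5   /* stands in for BM_MAX_USAGE_COUNT */

typedef struct
{
    int usage_count;        /* "popularity", saturates at the cap */
    int pinned;             /* pinned buffers are never evicted */
} Buffer;

static Buffer buffers[N_BUFFERS];
static int clock_hand = 0;

/* On access: bump popularity, saturating at MAX_USAGE_COUNT. */
void
buffer_accessed(int i)
{
    if (buffers[i].usage_count < MAX_USAGE_COUNT)
        buffers[i].usage_count++;
}

/*
 * GCLOCK victim search: decrement usage_count as the hand passes each
 * buffer; evict the first unpinned buffer found with usage_count == 0.
 * If everything is popular, this takes multiple revolutions -- the
 * pathology described in the thread.  (The real code also bails out if
 * every buffer is pinned; omitted here.)
 */
int
clock_sweep_victim(void)
{
    for (;;)
    {
        int i = clock_hand;

        clock_hand = (clock_hand + 1) % N_BUFFERS;
        if (buffers[i].pinned)
            continue;
        if (buffers[i].usage_count == 0)
            return i;
        buffers[i].usage_count--;   /* lose one "life" per pass */
    }
}
```

With one bit instead of an integer this degenerates to classic CLOCK; the generalized counter is what lets a buffer survive several hand passes.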
In the past, various hackers have noted problems they've observed with
this scheme. A common pathology is to see frantic searching for a
victim buffer only to find all buffer usage_count values at 5. It may
take multiple revolutions of the clock hand before a victim buffer is
found, as usage_count is decremented for each and every buffer. Also,
BufFreelistLock contention is considered a serious bottleneck [1],
which is related.
There's a paper on a non-blocking GCLOCK algorithm that does lock-free
clock sweep and buffer pinning [7]. If we decide to stay with GCLOCK it
may be interesting, although I still believe that some variant of buffer
nailing [8] is a better idea; my experience shows that most of the
locking overhead is cache line bouncing, ignoring the extreme cases
where our naive spinlock implementation blows up.
Let's leave aside inner/root pages though, because they're so
dramatically useful when in a primary index on a tpc-b table that
they'll always be cached by any non-terrible algorithm. It beggars
belief that the still relatively dense (and consequently *popular*)
B+Tree leaf pages get so little credit for being of such long-term
utility (in the view of our clock sweep algorithm as implemented). The
algorithm has what could be loosely described as an excessively
short-term perspective. There is clearly a better balance to be had
here. I don't think the answer is that we have the B-Tree code give
its pages some special treatment that makes them harder to evict,
although I will admit that I briefly entertained the idea.
There has been some research that indicates that for TPC-A workloads
giving index pages higher weights increases hitrates [1].
I think the hardest hurdle for any changes in this area will be
showing that we don't have any nasty regressions. I think the best way
to do that would be to study separately the performance overhead of
the replacement algorithm and optimality of the replacement choices.
If we capture a bunch of buffer reference traces by instrumenting
PinBuffer, we can pretty accurately simulate the behavior of different
algorithm and tuning choices with different shared buffer sizes.
Obviously full scale tests are still needed due to interactions with
OS, controller and disk caches and other miscellaneous influences. But
even so, simulation would get us much better coverage of various
workloads and at least some confidence that it's a good change
overall. It will be very hard and time consuming to gather equivalent
evidence with full scale tests.
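The trace-driven evaluation Ants proposes could be as simple as replaying a captured sequence of page references against a toy cache model and counting hits. A rough illustration, with an invented slot structure and names (nothing here is the instrumented PinBuffer output format, which would need to be defined):

```c
#include <assert.h>
#include <string.h>

#define CACHE_SIZE 4
#define MAX_UC 5

typedef struct
{
    long page;      /* which page this slot caches */
    int  usage;     /* GCLOCK usage count */
    int  valid;
} Slot;

typedef struct
{
    Slot slots[CACHE_SIZE];
    int  hand;
    long hits;
    long misses;
} Sim;

void
sim_init(Sim *s)
{
    memset(s, 0, sizeof(*s));
}

/* Replay one reference from a buffer-access trace against a GCLOCK cache. */
void
sim_reference(Sim *s, long page)
{
    for (int i = 0; i < CACHE_SIZE; i++)
    {
        if (s->slots[i].valid && s->slots[i].page == page)
        {
            if (s->slots[i].usage < MAX_UC)
                s->slots[i].usage++;
            s->hits++;
            return;
        }
    }
    s->misses++;
    for (;;)            /* GCLOCK victim search, as in the real sweep */
    {
        Slot *sl = &s->slots[s->hand];

        s->hand = (s->hand + 1) % CACHE_SIZE;
        if (!sl->valid || sl->usage == 0)
        {
            sl->valid = 1;
            sl->page = page;
            sl->usage = 1;
            return;
        }
        sl->usage--;
    }
}
```

Swapping in an alternative replacement policy and re-running the same trace gives directly comparable hit rates, which is the point of the exercise.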
[1]: http://www.csd.uoc.gr/~hy460/pdf/p35-nicola.pdf
[2]: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf
[3]: http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang02_LIRS.pdf
[4]: http://lists.mysql.com/commits/28601
[5]: http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD
[6]: http://lwn.net/Articles/147879/
[7]: http://derby-nb.googlecode.com/svn-history/r41/trunk/derby-nb/ICDE10_conf_full_409.pdf
[8]: /messages/by-id/CA+TgmoZYPeYHWAUeJVYy9A5aNDoULcF33WTnprfR9SYcw30vAg@mail.gmail.com
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote:
PostgreSQL's replacement algorithm is more similar to Generalized CLOCK
or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm
that approximates LIRS [3]. LIRS is what MySQL implements [4], CLOCK-Pro
is implemented by NetBSD [5], and there has been some work on trying it
on Linux [6]. Both LIRS and CLOCK-Pro work by keeping metadata entries
for double the cache size, to detect pages that have been recently
referenced. Basically, they provide an adaptive tradeoff between LRU
and LFU.
That's good to know.
There's a paper on a non-blocking GCLOCK algorithm that does lock-free
clock sweep and buffer pinning [7]. If we decide to stay with GCLOCK it
may be interesting, although I still believe that some variant of buffer
nailing [8] is a better idea; my experience shows that most of the
locking overhead is cache line bouncing, ignoring the extreme cases
where our naive spinlock implementation blows up.
You might be right about that, but let's handle one problem at a time.
Who knows what the bottleneck will end up being if and when we address
the naivety around frequency? I want to better characterize that
problem first.
There has been some research that indicates that for TPC-A workloads
giving index pages higher weights increases hitrates[1].
Frankly, there doesn't need to be any research on this, because it's
just common sense that probabilistically, leaf pages are much more
useful than heap pages in servicing index scan queries if we assume a
uniform distribution. If we don't assume that, then they're still more
useful on average.
I think the hardest hurdle for any changes in this area will be
showing that we don't have any nasty regressions. I think the best way
to do that would be to study separately the performance overhead of
the replacement algorithm and optimality of the replacement choices.
If we capture a bunch of buffer reference traces by instrumenting
PinBuffer, we can pretty accurately simulate the behavior of different
algorithm and tuning choices with different shared buffer sizes.
Obviously full scale tests are still needed due to interactions with
OS, controller and disk caches and other miscellaneous influences. But
even so, simulation would get us much better coverage of various
workloads and at least some confidence that it's a good change
overall. It will be very hard and time consuming to gather equivalent
evidence with full scale tests.
I think I agree with all of that. The fact that we as a community
don't appear to have too much to say about what workloads to
prioritize somewhat frustrates this. The other problem is that sizing
shared_buffers appropriately involves a surprising amount of deference
to rules of thumb that in practice no one is quite prepared to
rigorously defend - who is to say what apportionment of memory to
Postgres is appropriate here? I too was hopeful that we could evaluate
this work purely in terms of observed improvements to hit rate (at
least initially), but now I doubt even that. It would be great to be
able to say "here are the parameters of this discussion", and have
everyone immediately agree with that, but in this instance that's
legitimately not possible.
--
Peter Geoghegan
On Wed, Apr 16, 2014 at 5:00 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Apr 15, 2014 at 3:59 PM, Ants Aasma <ants@cybertec.at> wrote:
There's a paper on a non-blocking GCLOCK algorithm that does lock-free
clock sweep and buffer pinning [7]. If we decide to stay with GCLOCK it
may be interesting, although I still believe that some variant of buffer
nailing [8] is a better idea; my experience shows that most of the
locking overhead is cache line bouncing, ignoring the extreme cases
where our naive spinlock implementation blows up.
You might be right about that, but let's handle one problem at a time.
Who knows what the bottleneck will end up being if and when we address
the naivety around frequency? I want to better characterize that
problem first.
Just to summarize the previous discussion: the improvements that we
decided to pursue in this area based on feedback are as follows:
1. Bgwriter needs to be improved so that it can help in reducing
usage count and finding next victim buffer (run the clock sweep
and add buffers to the free list).
2. SetLatch for bgwriter (wake up bgwriter) when the number of
elements in the freelist is low.
3. Split the global BufFreelistLock in StrategyGetBuffer
(a spinlock for the freelist, and an lwlock for the clock sweep).
Here we can try to make it lock-free based on atomic ops as
well.
4. Bgwriter needs to be more aggressive; the logic by which it
calculates how many buffers it needs to process needs to be
improved.
5. Contention around buffer mapping locks.
6. Cacheline bouncing around the buffer header spinlocks; is there
anything we can do to reduce this?
7. Choose Optimistically used buffer in StrategyGetBuffer().
8. Don't bump the usage count every time buffer is pinned.
I have already addressed some of these improvements in a patch [1]
and for the others, I plan to work on them for 9.5.
I think here you want to address the improvements related to usage
count and see if it can get us a win in some commonly used scenarios,
without affecting any other commonly used scenario. I feel this is a
good idea to pursue and see if we can get good benefits from it.
In fact, a few days back I ran some tests manually to see the
problems around BufFreelistLock (currently I don't have the script
ready), and more recently Jason Petersen has done some benchmarking
in this area, which you can refer to here [2].
I wonder if we can work together to improve things in this area.
[1]: /messages/by-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com
[2]: https://googledrive.com/host/0Bx33JCTmOADOeTIwaE9KX21yWEk/Concurrency%20Limits%20with%20Large%20Working%20Sets
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
In the past, various hackers have noted problems they've observed with
this scheme. A common pathology is to see frantic searching for a
victim buffer only to find all buffer usage_count values at 5. It may
take multiple revolutions of the clock hand before a victim buffer is
found, as usage_count is decremented for each and every buffer. Also,
BufFreelistLock contention is considered a serious bottleneck [1],
which is related.
I think that the basic problem here is that usage counts increase when
buffers are referenced, but they decrease when buffers are evicted,
and those two things are not in any necessary way connected to each
other. In particular, if no eviction is happening, reference counts
will converge to the maximum value. I've read a few papers about
algorithms that attempt to segregate the list of buffers into "hot"
and "cold" lists, and an important property of such algorithms is that
they mustn't be allowed to make everything hot. It's easy to be too
simplistic, here: an algorithm that requires that no more than half
the list be hot will fall over badly on a workload where the working
set exceeds the available cache and the really hot portion of the
working set is 60% of the available cache. So you need a more
sophisticated algorithm than that. But that core property that not
all buffers can be hot must somehow be preserved, and our algorithm
doesn't.
This isn't a fundamental property of the usage-count idea; it's an
artifact of the fact that usage count decreases are tied to eviction
pressure rather than access pressure. For example, suppose we made a
rule that if the total usage counts of all buffers exceed 3 *
NBuffers, then every time you bump the usage count of a buffer from N
to N+1, you're required to advance the clock sweep far enough to
decrease the reference count of a buffer by one. When you want to
reclaim a buffer, you advance a separate clock sweep until you find a
buffer with a zero usage count; if you circle the whole ring without
finding one, then you reclaim the buffer you saw with the lowest usage
count. There are obvious scalability problems here (everyone fighting
over the right to advance the clock sweep) but ignore that for the
sake of the thought experiment: now you have an algorithm where not
all buffers can be hot. If some buffers are hotter than others, then
whenever their usage count is decreased it will immediately get pushed
back up again, but some other buffer then has to absorb the decrease.
Only the buffers that are really hot can maintain high usage counts,
because *somebody* has to have a low usage count.
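Robert's thought experiment can be sketched as follows. This is a hedged illustration of the rule as described, not proposed bufmgr code; the budget constant (3 * NBuffers) and all names come straight from the description above, and the scalability problems he mentions are ignored.

```c
#include <assert.h>

#define N_BUFFERS 8
#define MAX_UC 5

static int  usage[N_BUFFERS];
static long total_usage = 0;        /* sum of all usage counts */
static int  cool_hand = 0;          /* separate "cooling" clock hand */

/* Advance the cooling hand until exactly one usage count is decremented. */
static void
cool_one(void)
{
    for (;;)
    {
        int i = cool_hand;

        cool_hand = (cool_hand + 1) % N_BUFFERS;
        if (usage[i] > 0)
        {
            usage[i]--;
            total_usage--;
            return;
        }
    }
}

/*
 * Heating and cooling driven by the same phenomenon (buffer access):
 * once the total exceeds the budget of 3 * N_BUFFERS, every bump pays
 * for one decrement somewhere else.  Not all buffers can be hot.
 */
void
bump(int i)
{
    if (usage[i] >= MAX_UC)
        return;
    usage[i]++;
    total_usage++;
    if (total_usage > 3 * N_BUFFERS)
        cool_one();
}
```

The invariant is that the total never stays above the budget, so a genuinely hot buffer keeps its count only by forcing some colder buffer's count down.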
Even ignoring scalability concerns, this might not be (and probably
isn't) exactly what we want to implement, but I think it illustrates
an important control principle all the same: buffer "cooling" needs to
be driven by the same underlying phenomenon - probably buffer access -
as buffer "heating". If they're driven by unrelated phenomena, then
the rates may be wildly incomparable, and you'll end up with
everything hot or everything cold. If that happens, you lose, because
with everything the same, there's no principled way to decide which
things are actually best to evict.
If we come up with some good solution for shared buffers, we should
also consider applying it to SLRU eviction. I believe that the
current situation around CLOG eviction is none too pretty.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 15, 2014 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Apr 14, 2014 at 1:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
In the past, various hackers have noted problems they've observed with
this scheme. A common pathology is to see frantic searching for a
victim buffer only to find all buffer usage_count values at 5. It may
take multiple revolutions of the clock hand before a victim buffer is
found, as usage_count is decremented for each and every buffer. Also,
BufFreelistLock contention is considered a serious bottleneck [1],
which is related.
I think that the basic problem here is that usage counts increase when
buffers are referenced, but they decrease when buffers are evicted,
and those two things are not in any necessary way connected to each
other. In particular, if no eviction is happening, reference counts
will converge to the maximum value. I've read a few papers about
algorithms that attempt to segregate the list of buffers into "hot"
and "cold" lists, and an important property of such algorithms is that
they mustn't be allowed to make everything hot.
It's possible that I've misunderstood what you mean here, but do you
really think it's likely that everything will be hot, in the event of
using something like what I've sketched here? I think it's an
important measure against this general problem that buffers really
earn the right to be considered hot, so to speak. With my prototype,
in order for a buffer to become as hard to evict as possible, at a
minimum it must be *continually* pinned for at least 30 seconds.
That's actually a pretty tall order. Although, as I said, I wouldn't
be surprised if it was worth making it possible for buffers to be even
more difficult to evict than that. It should be extremely difficult to
evict a root B-Tree page, and to a lesser extent inner pages even
under a lot of cache pressure, for example. There are lots of
workloads in which that can happen, and I have a hard time believing
that it's worth it to evict them, given the extraordinary difference in
their utility as compared to a lot of other things. I can imagine a
huge barrier against evicting what is actually a relatively tiny
number of pages being worth it.
I don't want to dismiss what you're saying about heating and cooling
being unrelated, but I don't find the conclusion that not everything
can be hot obvious. Maybe "heat" should be relative rather than
absolute, and maybe that's actually what you meant. There is surely
some workload where buffer access actually is perfectly uniform, and
what do you do there? What "temperature" are those buffers?
It occurs to me that within the prototype patch, even though
usage_count is incremented in a vastly slower fashion (in a wall time
sense), clock sweep doesn't take advantage of that. I should probably
investigate having clock sweep become more aggressive in decrementing
in response to realizing that it won't get some buffer's usage_count
down to zero on the next revolution either. There are certainly
problems with that, but they might be fixable. Within the patch, in
order for it to be possible for the usage_count to be incremented in
the interim, an average of 1.5 seconds must pass, so if clock sweep
were to anticipate another no-set-to-zero revolution, it seems pretty
likely that it would be exactly right, or if not then close enough,
since it can only really fail to correct for some buffers getting
incremented once more in the interim. Conceptually, it would be like
multiple logical revolutions were merged into one actual one,
sufficient to have the next revolution find a victim buffer.
--
Peter Geoghegan
Hi,
It's good to see focus on this - some improvements around s_b are sorely
needed.
On 2014-04-14 10:11:53 -0700, Peter Geoghegan wrote:
1) Throttles incrementation of usage_count temporally. It becomes
impossible to increment usage_count for any given buffer more
frequently than every 3 seconds, while decrementing usage_count is
totally unaffected.
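The throttling rule quoted in point 1 could be sketched roughly like this. To be clear, this is an illustration of the idea only, not the prototype patch's code; the extra timestamp field in the descriptor is exactly the kind of bloat (and the gettimeofday() cost exactly the overhead) debated below.

```c
#include <assert.h>
#include <time.h>

#define MAX_UC 10               /* the prototype's raised cap */
#define MIN_BUMP_INTERVAL 3     /* seconds between increments */

typedef struct
{
    int    usage_count;
    time_t last_bump;           /* illustrative extra descriptor field */
} BufDesc;

/*
 * Bump usage_count at most once per MIN_BUMP_INTERVAL for a given
 * buffer.  Decrements by the clock sweep would remain unthrottled --
 * the asymmetry Andres objects to downthread.  "now" would come from
 * gettimeofday() or similar in a real implementation.
 */
void
bump_throttled(BufDesc *buf, time_t now)
{
    if (now - buf->last_bump < MIN_BUMP_INTERVAL)
        return;                 /* too soon since the last increment */
    if (buf->usage_count < MAX_UC)
        buf->usage_count++;
    buf->last_bump = now;
}
```

Under this rule a buffer must stay interesting for tens of seconds of wall time to reach the cap, rather than being pinned a handful of times in one instant.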
I think this is unfortunately completely out of the question. For one,
a gettimeofday() for every buffer pin will become a significant
performance problem. Even the computation of the xact/stmt start/stop
timestamps shows up pretty heavily in profiles today - and they are far
less frequent than buffer pins. And that's on x86 Linux, where
gettimeofday() is implemented as something more lightweight than a full
syscall.
The other significant problem I see with this is that it's not adaptive
to the actual throughput of buffers in s_b. In many cases there are
hundreds of clock cycles through shared buffers in 3 seconds. By only
increasing the usagecount that often, you've destroyed the little
semblance of a working LRU there is right now.
It also wouldn't work well for situations with a fast changing
workload >> s_b. If you have frequent queries that take a second or so
and access some data repeatedly (index nodes or whatnot) only increasing
the usagecount once will mean they'll continually fall back to disk access.
2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not
5 as before. ... . This step on its own would be assumed extremely
counter-productive by those in the know, but I believe that other
measures ameliorate the downsides. I could be wrong about how true
that is in other cases, but then the case helped here isn't what you'd
call a narrow benchmark.
I don't see which mechanisms you have suggested to counter this.
I think having more granular usagecount is a good idea, but I don't
think it can realistically be implemented with the current method of
choosing victim buffers. The amount of cacheline misses around that is
already a major scalability limit; we surely can't make this even
worse. I think it'd be possible to get back to this if we had a better
bgwriter implementation.
Greetings,
Andres Freund
On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I think this is unfortunately completely out of the question. For one,
a gettimeofday() for every buffer pin will become a significant
performance problem. Even the computation of the xact/stmt start/stop
timestamps shows up pretty heavily in profiles today - and they are far
less frequent than buffer pins. And that's on x86 Linux, where
gettimeofday() is implemented as something more lightweight than a full
syscall.
Come on, Andres. Of course exactly what I've done here is completely
out of the question as a patch that we can go and commit right now.
I've noted numerous caveats about bloating the buffer descriptors, and
about it being a proof of concept. I'm pretty sure we can come up with a
scheme to significantly cut down on the number of gettimeofday() calls
if it comes down to it. In any case, I'm interested in advancing our
understanding of the problem right now. Let's leave the minutiae to
one side for the time being.
The other significant problem I see with this is that it's not adaptive
to the actual throughput of buffers in s_b. In many cases there are
hundreds of clock cycles through shared buffers in 3 seconds. By only
increasing the usagecount that often, you've destroyed the little
semblance of a working LRU there is right now.
If a usage_count can get to BM_MAX_USAGE_COUNT from its initial
allocation within an instant, that's bad. It's that simple. Consider
all the ways in which that can happen almost by accident.
You could probably reasonably argue that the trade-off or lack of
adaption (between an LRU and an LFU) that this particular sketch of
mine represents is inappropriate or sub-optimal, but I don't
understand why you're criticizing the patch for doing what I expressly
set out to do. I wrote "I think a very real problem that may be that
approximating an LRU is bad because an actual LRU is bad".
It also wouldn't work well for situations with a fast changing
workload >> s_b. If you have frequent queries that take a second or so
and access some data repeatedly (index nodes or whatnot) only increasing
the usagecount once will mean they'll continually fall back to disk access.
No, it shouldn't, because there is a notion of buffers getting a fair
chance to prove themselves. Now, it might well be the case that there
are workloads where what I've done to make that happen in this
prototype doesn't work out too well - I've already said so. But should
a buffer get a usage count of 5 just because the user inserted 5
tuples within a single DML command, for example? If so, why?
--
Peter Geoghegan
On 2014-04-16 01:58:23 -0700, Peter Geoghegan wrote:
On Wed, Apr 16, 2014 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:
I think this is unfortunately completely out of the question. For one,
a gettimeofday() for every buffer pin will become a significant
performance problem. Even the computation of the xact/stmt start/stop
timestamps shows up pretty heavily in profiles today - and they are far
less frequent than buffer pins. And that's on x86 Linux, where
gettimeofday() is implemented as something more lightweight than a full
syscall.
Come on, Andres. Of course exactly what I've done here is completely
out of the question as a patch that we can go and commit right now.
I've noted numerous caveats about bloating the buffer descriptors, and
about it being a proof of concept. I'm pretty sure we can come up with a
scheme to significantly cut down on the number of gettimeofday() calls
if it comes down to it. In any case, I'm interested in advancing our
understanding of the problem right now. Let's leave the minutiae to
one side for the time being.
*I* don't think any scheme that involves measuring the time around
buffer pins is going to be acceptable. It's better that I say that now
rather than when you've invested significant time into the approach, no?
The other significant problem I see with this is that it's not adaptive
to the actual throughput of buffers in s_b. In many cases there are
hundreds of clock cycles through shared buffers in 3 seconds. By only
increasing the usagecount that often, you've destroyed the little
semblance of a working LRU there is right now.
If a usage_count can get to BM_MAX_USAGE_COUNT from its initial
allocation within an instant, that's bad. It's that simple. Consider
all the ways in which that can happen almost by accident.
Yes, I agree that that's a problem. It immediately going down to zero is
a problem as well though. And that's what will happen in many scenarios,
because you have time limits on increasing the usagecount, but not when
decreasing.
It also wouldn't work well for situations with a fast changing
workload >> s_b. If you have frequent queries that take a second or so
and access some data repeatedly (index nodes or whatnot) only increasing
the usagecount once will mean they'll continually fall back to disk access.
No, it shouldn't, because there is a notion of buffers getting a fair
chance to prove themselves.
If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock
cycles/second, how does *any* buffer have a chance to prove itself?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr 16, 2014 at 2:18 AM, Andres Freund <andres@2ndquadrant.com> wrote:
*I* don't think any scheme that involves measuring the time around
buffer pins is going to be acceptable. It's better that I say that now
rather than when you've invested significant time into the approach, no?
Well, I do think that it will be possible to make something like that
work. LRU-K/LRU-2 involves remembering the last two access times (not
the last one). Researchers considered preeminent authorities on
caching algorithms thought that was a good idea in 1993. There are
plenty of other examples of similar schemes too.
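For concreteness, the LRU-2 idea mentioned above can be sketched as a toy cache (illustrative C, not anything from the patch): each entry remembers its two most recent access times, and eviction picks the entry whose second-most-recent access is oldest, so pages referenced only once are evicted before pages with two spaced references.

```c
#include <assert.h>

#define CACHE_SIZE 3
#define NEVER (-1L)     /* "no penultimate access yet" sorts oldest */

typedef struct
{
    long page;
    long last;          /* most recent access time */
    long prev;          /* second-most-recent access time (LRU-2 history) */
    int  valid;
} Entry;

static Entry cache[CACHE_SIZE];

/* Record an access to "page" at logical time t; returns 1 on hit, 0 on miss. */
int
lru2_access(long page, long t)
{
    int victim = 0;

    for (int i = 0; i < CACHE_SIZE; i++)
    {
        if (cache[i].valid && cache[i].page == page)
        {
            cache[i].prev = cache[i].last;  /* shift the history down */
            cache[i].last = t;
            return 1;
        }
        if (!cache[i].valid)
            victim = i;                     /* free slot available */
    }

    /*
     * Miss with a full cache: evict the entry with the oldest penultimate
     * access.  Entries touched only once (prev == NEVER) go first.
     */
    if (cache[victim].valid)
    {
        for (int i = 0; i < CACHE_SIZE; i++)
            if (cache[i].prev < cache[victim].prev)
                victim = i;
    }
    cache[victim] = (Entry) {page, t, NEVER, 1};
    return 0;
}
```

The relevant property for this thread is that a single burst of accesses at one instant does not make a page hard to evict; only accesses spread over time do.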
No, it shouldn't, because there is a notion of buffers getting a fair
chance to prove themselves.
If you have a workload with > (BM_MAX_USAGE_COUNT + 1) clock
cycles/second, how does *any* buffer have a chance to prove itself?
There could be lots of ways. I thought about representing that more
directly. I don't think that it's useful to have a large number of
revolutions in search of a victim under any circumstances.
Fundamentally, you're asking "what if any scheme here leans too
heavily towards frequency?". That could certainly be a problem, as
I've said, and we could think about adaptation over heuristics, as
I've said, but it is very obviously a big problem that clock sweep
doesn't really care about frequency one bit right now.
Why should I be the one with all the answers? Aren't you interested in
the significance of the patch, and the test case?
--
Peter Geoghegan
Hi,
On 2014-04-16 02:57:54 -0700, Peter Geoghegan wrote:
Why should I be the one with all the answers?
Who said you need to be? The only thing I am saying is that I don't
agree with some of your suggestions.
I only responded to the thread now because downthread (in
CAM3SWZQa2OAVUrfPL-df=we1sMozKBR392SW_NoVuKZEPXhu9w@mail.gmail.com) you
further argued using the timestamp - which I think is a flawed
concept. So I thought it'd be fair to argue against it now, rather than
later.
Aren't you interested in the significance of the patch, and the test case?
Not particularly in the specifics to be honest. The tradeoffs of the
techniques you used in there seem prohibitive to me. It's easy to make
individual cases faster by sacrificing others.
Sometimes it's useful to prototype solutions while narrowing the scope
for evaluation to get faster feedback, but as I don't see the solutions
to be applicable in the general case...
I think it's very important to improve upon the current state. It's imo
one of postgres' biggest issues. But it's also far from trivial,
otherwise it'd be done already.
I *personally* don't think it's very likely that we can improve
significantly upon the current state as long as every process regularly
participates in the clock sweep. ISTM that prevents many more elaborate
techniques from being used (cache misses/bus traffic, locking). But that's
just gut feeling.
I also think there are bigger issues than the actual LRU/whatever
behaviour, namely the scalability issues around shared buffers making
both small and big s_b settings major bottlenecks. But that's just where
I have seen more issues personally.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Apr 16, 2014 at 3:22 AM, Peter Geoghegan <pg@heroku.com> wrote:
It's possible that I've misunderstood what you mean here, but do you
really think it's likely that everything will be hot, in the event of
using something like what I've sketched here? I think it's an
important measure against this general problem that buffers really
earn the right to be considered hot, so to speak. With my prototype,
in order for a buffer to become as hard to evict as possible, at a
minimum it must be *continually* pinned for at least 30 seconds.
That's actually a pretty tall order. Although, as I said, I wouldn't
be surprised if it was worth making it possible for buffers to be even
more difficult to evict than that. It should be extremely difficult to
evict a root B-Tree page, and to a lesser extent inner pages even
under a lot of cache pressure, for example. There are lots of
workloads in which that can happen, and I have a hard time believing
that it's worth it to evict them, given the extraordinary difference in
their utility as compared to a lot of other things. I can imagine a
huge barrier against evicting what is actually a relatively tiny
number of pages being worth it.
I'm making a general statement about a property that I think a buffer
eviction algorithm ought to have. I actually didn't say anything
about the algorithm you've chosen one way or the other. Obviously,
you've built in some protections against everything becoming hot, and
that's a good thing as far as it goes. But you also have a greatly
increased risk of everything becoming cold. All you need is a rate of
buffer eviction that circles shared_buffers more often than once every
3 seconds, and everything will gradually cool down until you once
again can't distinguish which stuff is hot from which stuff isn't.
I don't want to dismiss what you're saying about heating and cooling
being unrelated, but I don't find the conclusion that not everything
can be hot obvious. Maybe "heat" should be relative rather than
absolute, and maybe that's actually what you meant. There is surely
some workload where buffer access actually is perfectly uniform, and
what do you do there? What "temperature" are those buffers?
Obviously, some value lower than the maximum and higher than the
minimum. If they're all at max temperature and then a new buffer (a
btree root or vm page, for example) comes along and is much hotter,
there's no room on the scale left to express that. If they're all at
min temperature and then a new buffer comes along that is just used
once and thrown out, there's no room left on the scale for that buffer
to emerge as a good candidate for eviction.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company