Clock with Adaptive Replacement

Started by Konstantin Knizhnik · about 10 years ago · 63 messages · pgsql-hackers
#1 Konstantin Knizhnik
k.knizhnik@postgrespro.ru

Hi hackers,

What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article:

http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there any well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads?
I found this ten-year-old thread:

/messages/by-id/d2jkde$6bg$1@sea.gmane.org

but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm).
As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches.
Maybe CAR can address some of them?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: Clock with Adaptive Replacement

On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

What do you think about improving the cache replacement clock-sweep algorithm in
PostgreSQL with the adaptive version proposed in this article:

http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there any well-known drawbacks of this approach, or would it be
interesting to adapt this algorithm to PostgreSQL and measure its impact on
performance under different workloads?
I found this ten-year-old thread:

/messages/by-id/d2jkde$6bg$1@sea.gmane.org

but it mostly discusses possible patent issues with another algorithm, ARC (CAR
is inspired by ARC, but it is a different algorithm).
As far as I know there are several problems with the current clock-sweep
algorithm in PostgreSQL, especially for very large caches.
Maybe CAR can address some of them?

Maybe, but the proof of the pudding is in the eating. Just because an
algorithm is smarter, newer, and better in general than our current
algorithm - and really, it wouldn't be hard - doesn't mean that it
will actually solve the problems we care about. A few of my
EnterpriseDB colleagues spent a lot of time benchmarking various
tweaks to our current algorithm last year and were unable to construct
a test case where it sped anything up. If they tried the same tweaks
against the 9.4 source base, they could get a speedup. But 9.5 had
locking improvements around buffer eviction, and with those
improvements committed there was no longer any measurable benefit to
improving the quality of buffer eviction decisions. That's a
surprising result, to me anyway, and somebody else might well find a
test case where a benefit can be shown - but our research was not
successful.

I think it's important to spend time and energy figuring out exactly
what the problems with our current algorithm are. We know in general
terms that usage counts tend to converge to either 5 or 0 and
therefore sometimes evict buffers both at great cost and almost
randomly. But what's a lot less clear is how much that actually hurts
us given that we are relying on the OS cache anyway. It may be that
we need to fix some other things before or after improving the buffer
eviction algorithm before we actually get a performance benefit. I
suspect, for example, that a lot of the problems with large
shared_buffers settings have to do with the bgwriter and checkpointer
behavior rather than with the buffer eviction algorithm; and that
others have to do with cache duplication between PostgreSQL and the
operating system. So, I would suggest (although of course it's up to
you) that you might want to focus on experiments that will help you
understand where the problems are before you plunge into writing code
to fix them.
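
The usage-count dynamics described here can be reproduced with a toy model of the clock sweep. This is an illustrative sketch, not the actual freelist.c implementation; buffer slots and page keys are simplified stand-ins:

```python
# Toy model of a PostgreSQL-style clock sweep with usage counts (GCLOCK).
# Hot pages saturate at MAX_USAGE while cold pages drain to 0, giving
# the bimodal 5-or-0 distribution described above.

MAX_USAGE = 5  # corresponds to PostgreSQL's cap of 5

class ClockSweep:
    def __init__(self, nbuffers):
        self.pages = [None] * nbuffers  # page held by each buffer slot
        self.usage = [0] * nbuffers
        self.hand = 0
        self.index = {}                 # page -> slot

    def access(self, page):
        """Return True on a cache hit; on a miss, evict a victim and load."""
        slot = self.index.get(page)
        if slot is not None:
            self.usage[slot] = min(self.usage[slot] + 1, MAX_USAGE)
            return True
        slot = self._evict()
        if self.pages[slot] is not None:
            del self.index[self.pages[slot]]
        self.pages[slot] = page
        self.usage[slot] = 1
        self.index[page] = slot
        return False

    def _evict(self):
        # Sweep the hand, decrementing counts, until one reaches zero.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % len(self.pages)
            if self.usage[slot] == 0:
                return slot
            self.usage[slot] -= 1
```

A frequently hit page pins its counter at 5, while each miss decrements everything the hand passes over, so rarely hit pages sit at 0 and get evicted almost at random.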

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3 Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Robert Haas (#2)
Re: Clock with Adaptive Replacement

Thank you very much for response.

I am not sure that CART can significantly improve PostgreSQL
performance - I just want to know the community's opinion about
CAR/CART and other possible alternatives to the GCLOCK algorithm.

It looks like CAR really does provide a better cache hit ratio, and so
should increase Postgres performance under some workloads.
But nowadays servers have enough memory to keep most typical databases
entirely in cache.
So the time to locate a buffer in the cache is more critical than the
time for buffer eviction.
And here CART doesn't provide any benefit compared with the GCLOCK algorithm.

One of the problems with the GCLOCK algorithm, from my point of view, is
that for large caches containing a large number of pages, locating a victim
page can take a substantial amount of time, because we may have to perform
several turnovers before some count becomes zero. In theory CART can address
this problem because there are no counters - just a single bit per page.
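
The turnover cost can be made concrete with a small sketch (illustrative only, not the real bufmgr code): if every usage count is saturated at 5, the hand must decrement each of the N buffers five times before the first zero appears, whereas a one-bit scheme needs at most one full turnover.

```python
# Count how many buffer visits the clock hand needs to find a victim
# when all usage counts start at the same value. With a cap of 5 this
# is 5*N + 1 visits for an N-buffer cache; with a single reference bit
# (as in CLOCK/CAR) it is N + 1.

def steps_to_find_victim(nbuffers, initial_count):
    usage = [initial_count] * nbuffers
    hand = 0
    steps = 0
    while True:
        steps += 1
        if usage[hand] == 0:
            return steps
        usage[hand] -= 1
        hand = (hand + 1) % nbuffers
```

For a cache of a million pages that is five million buffer visits in the worst case, which is the "several turnovers" cost described above.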

On 12.02.2016 18:55, Robert Haas wrote:

On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

What do you think about improving the cache replacement clock-sweep algorithm in
PostgreSQL with the adaptive version proposed in this article:

http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there any well-known drawbacks of this approach, or would it be
interesting to adapt this algorithm to PostgreSQL and measure its impact on
performance under different workloads?
I found this ten-year-old thread:

/messages/by-id/d2jkde$6bg$1@sea.gmane.org

but it mostly discusses possible patent issues with another algorithm, ARC (CAR
is inspired by ARC, but it is a different algorithm).
As far as I know there are several problems with the current clock-sweep
algorithm in PostgreSQL, especially for very large caches.
Maybe CAR can address some of them?

Maybe, but the proof of the pudding is in the eating. Just because an
algorithm is smarter, newer, and better in general than our current
algorithm - and really, it wouldn't be hard - doesn't mean that it
will actually solve the problems we care about. A few of my
EnterpriseDB colleagues spent a lot of time benchmarking various
tweaks to our current algorithm last year and were unable to construct
a test case where it sped anything up. If they tried the same tweaks
against the 9.4 source base, they could get a speedup. But 9.5 had
locking improvements around buffer eviction, and with those
improvements committed there was no longer any measurable benefit to
improving the quality of buffer eviction decisions. That's a
surprising result, to me anyway, and somebody else might well find a
test case where a benefit can be shown - but our research was not
successful.

I think it's important to spend time and energy figuring out exactly
what the problems with our current algorithm are. We know in general
terms that usage counts tend to converge to either 5 or 0 and
therefore sometimes evict buffers both at great cost and almost
randomly. But what's a lot less clear is how much that actually hurts
us given that we are relying on the OS cache anyway. It may be that
we need to fix some other things before or after improving the buffer
eviction algorithm before we actually get a performance benefit. I
suspect, for example, that a lot of the problems with large
shared_buffers settings have to do with the bgwriter and checkpointer
behavior rather than with the buffer eviction algorithm; and that
others have to do with cache duplication between PostgreSQL and the
operating system. So, I would suggest (although of course it's up to
you) that you might want to focus on experiments that will help you
understand where the problems are before you plunge into writing code
to fix them.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#4 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#2)
Re: Clock with Adaptive Replacement

On 2/12/16 9:55 AM, Robert Haas wrote:

I think it's important to spend time and energy figuring out exactly
what the problems with our current algorithm are. We know in general
terms that usage counts tend to converge to either 5 or 0 and
therefore sometimes evict buffers both at great cost and almost

Has anyone done testing on the best cap on usage count? IIRC 5 was
pulled out of thin air. Actually, I don't recall ever seeing a clock
sweep that supported more than a single bit, though often there are
multiple 'pools' a buffer could be in (i.e., active vs. inactive in most
Unix VMs).

If you have a reasonable amount of 1 or 0 count buffers then this
probably doesn't matter too much, but if your working set is
significantly higher than shared buffers then you're probably doing a
LOT of full sweeps to try and get something decremented down to 0.

randomly. But what's a lot less clear is how much that actually hurts
us given that we are relying on the OS cache anyway. It may be that
we need to fix some other things before or after improving the buffer
eviction algorithm before we actually get a performance benefit. I
suspect, for example, that a lot of the problems with large
shared_buffers settings have to do with the bgwriter and checkpointer
behavior rather than with the buffer eviction algorithm; and that
others have to do with cache duplication between PostgreSQL and the
operating system. So, I would suggest (although of course it's up to

It would be nice if there were at least an option to instrument how long
an OS read request took, so that you could gauge how many requests were
coming from the OS cache vs. disk. (Obviously direct knowledge from the OS
is even better, but I don't think those APIs exist.)

you) that you might want to focus on experiments that will help you
understand where the problems are before you plunge into writing code
to fix them.

+1
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#4)
Re: Clock with Adaptive Replacement

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

On 2/12/16 9:55 AM, Robert Haas wrote:

I think it's important to spend time and energy figuring out exactly
what the problems with our current algorithm are. We know in general
terms that usage counts tend to converge to either 5 or 0 and
therefore sometimes evict buffers both at great cost and almost

Has anyone done testing on the best cap on usage count? IIRC 5 was
pulled out of thin air.

My recollection is that there was some testing behind it ... but that
was back around 2005 so it seems safe to assume that that testing
is no longer terribly relevant. In particular, I'm sure it was tested
with shared_buffer counts far smaller than what we now consider sane.

regards, tom lane


#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Konstantin Knizhnik (#1)
Re: [HACKERS] Clock with Adaptive Replacement

On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Hi hackers,

What do you think about improving cache replacement clock-sweep algorithm in
PostgreSQL with adaptive version proposed in this article:

http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there some well known drawbacks of this approach or it will be
interesting to adopt this algorithm to PostgreSQL and measure it impact om
performance under different workloads?

I'm not currently planning to work in this area and have done no real
investigation, so please consider the following to be "water cooler
talk".

I also came across that paper while reading about buffering as general
background. Yeah, CAR[T] looks pretty interesting at first glance.
Automatic scan resistance seems like something we'd want, and its approach
to frequency sounds smarter than GCLOCK's usage counter with its arbitrary
parameter of 5. Here's a nice slide deck from the authors:

http://www.cse.iitd.ernet.in/~sbansal/pubs/fast04ppt.pdf

There doesn't seem to be as much written about GCLOCK beyond the 1992
TPC-A paper[1], possibly because as the CAR paper says "[a]
fundamental disadvantage of GCLOCK is that it requires counter
increment on every page hit which makes it infeasible for virtual
memory", making it less interesting to the OS guys. That is, we
actually have slightly more information, because we have an
opportunity to bump the usage counter when pinning and the VM guys
don't, probably explaining why the paper compares with CLOCK and not
GCLOCK.

One question is whether SPC-1 looks anything like a typical database
workload. Say someone wrote a prototype CAR[T] patch for PostgreSQL,
how would you validate it?

I'm also curious about how the algorithms at different levels interact
when using buffered IO. While other databases typically do direct IO,
you can still find a trail of papers about this topic since disk
arrays and network-attached storage systems also have caches and page
replacement problems[2][3], and of course multi-level caching problems
apply to non-database applications too. I wonder to what extent
Linux's use of "a mash of a number of different algorithms with a
number of modifications for catching corner cases and various
optimisations"[4] affects the performance of different clock-based
algorithms operating higher up.

If we had a page replacement algorithm with CART's magical claimed
properties and it really does perform better than GCLOCK + our scan
resistance special cases, I wonder if it would be tempting to stick
SLRU contents into shared buffers. slru.c could gain a
smgr-compatible interface so that bufmgr.c could treat those buffers
like any other, using smgr_which to select the appropriate storage
manager. (I'm going to be proposing something similar for undo log
storage, more on that later.)

[1]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.9699&rep=rep1&type=pdf
[2]: https://www.usenix.org/legacy/event/usenix01/full_papers/zhou/zhou.pdf
[3]: https://www.usenix.org/legacy/event/usenix02/full_papers/wong/wong_html/
[4]: http://www.spinics.net/lists/linux-mm/msg13385.html

--
Thomas Munro
http://www.enterprisedb.com

#7 Andrey Borodin
amborodin@acm.org
In reply to: Thomas Munro (#6)
Re: [HACKERS] Clock with Adaptive Replacement

Hi, Thomas!

On 24 Apr 2018, at 2:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Are there any well-known drawbacks of this approach, or would it be
interesting to adapt this algorithm to PostgreSQL and measure its impact on
performance under different workloads?

I'm not currently planning to work in this area and have done no real
investigation, so please consider the following to be "water cooler
talk".

I intend to make some prototypes in this area, but I still haven't allocated time chunks sufficient to make anything useful.

I think that replacement of the current CS5 will:
1. allow use of big shared buffers
2. make DIRECT_IO a realistic possibility
3. improve BgWriter
4. unify different buffering strategies into a single buffer manager (there will be no need to place VACUUM into a special buffer ring)
5. finally allow AIO and more efficient prefetching like [0]

Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data blocks and is expected to be less efficient.

I'm not sure CART is the best possibility, though.
I think that the right way is to implement many prototypes with LRU, ARC, CAR, CART, and 2Q, and then benchmark them well. Or even make this algorithm pluggable? But currently we have a lot of dependent parts in the system; I do not even know where to start.

Best regards, Andrey Borodin.

[0]: http://diku.dk/forskning/Publikationer/tekniske_rapporter/2004/04-03.pdf
[1]: https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png
[2]: http://blog.dataegret.com/2018/01/postgresql-performance-meltdown.html

#8 Юрий Соколов
funny.falcon@gmail.com
In reply to: Andrey Borodin (#7)
Re: [HACKERS] Clock with Adaptive Replacement

On Tue, 24 Apr 2018 at 8:04, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi, Thomas!

On 24 Apr 2018, at 2:41, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:

On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Are there any well-known drawbacks of this approach, or would it be
interesting to adapt this algorithm to PostgreSQL and measure its impact on
performance under different workloads?

I'm not currently planning to work in this area and have done no real
investigation, so please consider the following to be "water cooler
talk".

I intend to make some prototypes in this area, but I still haven't
allocated time chunks sufficient to make anything useful.

I think that replacement of the current CS5 will:
1. allow use of big shared buffers
2. make DIRECT_IO a realistic possibility
3. improve BgWriter
4. unify different buffering strategies into a single buffer manager (there
will be no need to place VACUUM into a special buffer ring)
5. finally allow AIO and more efficient prefetching like [0]

Here's what we have about the size of shared buffers now [1] (taken from [2]).
I believe the right hill must be big enough to reduce the central pit to
zero and make the function monotonic: the OS page cache knows less about
data blocks and is expected to be less efficient.

I'm not sure CART is the best possibility, though.
I think that the right way is to implement many prototypes with LRU, ARC,
CAR, CART, and 2Q, and then benchmark them well. Or even make this
algorithm pluggable? But currently we have a lot of dependent parts in the
system; I do not even know where to start.

Best regards, Andrey Borodin.

[0]
http://diku.dk/forskning/Publikationer/tekniske_rapporter/2004/04-03.pdf
[1]
https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png
[2] http://blog.dataegret.com/2018/01/postgresql-performance-meltdown.html

Before implementing algorithms within PostgreSQL it would be great to test
them outside with real traces.

I think there should be a mechanism to collect traces from real-world
PostgreSQL installations: every read and write access. It should be
extremely efficient so that it can be enabled in the real world - something
like a circular buffer in shared memory, with a separate worker to dump it
to disk.
Instead of the full block address, a 64-bit hash could be used, or even
63 bits plus 1 bit to designate read/write access.

Using these traces, it would be easy to choose a couple of "theoretically"
best algorithms, and then attempt to implement them.
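
The compact record suggested here could look something like the following sketch. The hash function and the (relfilenode, blocknum) key are hypothetical choices for illustration, not an existing PostgreSQL interface:

```python
# Pack one trace record into a single 64-bit word: a 63-bit hash of the
# block address in the high bits, plus 1 low bit marking read vs write.
# Fixed-width words make a shared-memory ring buffer trivial to index.

import hashlib

def pack_trace_record(relfilenode, blocknum, is_write):
    # Hypothetical key and hash choice, purely for illustration.
    key = f"{relfilenode}:{blocknum}".encode()
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "little")
    return (h & ~1) | (1 if is_write else 0)  # low bit = access type

def unpack_trace_record(word):
    return word >> 1, word & 1  # (63-bit block hash, write flag)
```

Two accesses to the same block yield the same 63-bit hash regardless of the read/write flag, so a replayer can reconstruct the reference string without knowing the real block addresses.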

With regards,
Yura

#9 Andrey Borodin
amborodin@acm.org
In reply to: Юрий Соколов (#8)
Re: [HACKERS] Clock with Adaptive Replacement

On 24 Apr 2018, at 11:31, Юрий Соколов <funny.falcon@gmail.com> wrote:

Before implementing algorithms within PostgreSQL it would be great to test them outside with real traces.

I think there should be a mechanism to collect traces from real-world PostgreSQL installations: every read and write access. It should be extremely efficient so that it can be enabled in the real world - something like a circular buffer in shared memory, with a separate worker to dump it to disk.
Instead of the full block address, a 64-bit hash could be used, or even 63 bits plus 1 bit to designate read/write access.

Yes, it is a good idea to track the pattern of block usage.
But I think that the cost of developing the actual page eviction strategy itself is negligibly small compared to the infrastructure changes needed by any non-CS5 strategy.

Best regards, Andrey Borodin.

#10 Юрий Соколов
funny.falcon@gmail.com
In reply to: Andrey Borodin (#9)
Re: [HACKERS] Clock with Adaptive Replacement

On Tue, 24 Apr 2018 at 15:16, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 24 Apr 2018, at 11:31, Юрий Соколов <funny.falcon@gmail.com> wrote:

Before implementing algorithms within PostgreSQL it would be great to
test them outside with real traces.

I think there should be a mechanism to collect traces from real-world
PostgreSQL installations: every read and write access. It should be
extremely efficient so that it can be enabled in the real world -
something like a circular buffer in shared memory, with a separate
worker to dump it to disk.

Instead of the full block address, a 64-bit hash could be used, or even
63 bits plus 1 bit to designate read/write access.

Yes, it is a good idea to track the pattern of block usage.
But I think that the cost of developing the actual page eviction
strategy itself is negligibly small compared to the infrastructure
changes needed by any non-CS5 strategy.

Best regards, Andrey Borodin.

Exactly. And it would be a pity if the implemented strategy were not "the best".

Different strategies may need different infrastructure, and implementing
five infrastructures is certainly more expensive than choosing the two best
strategies using traces and implementing them.

#11 Andres Freund
andres@anarazel.de
In reply to: Andrey Borodin (#9)
Re: [HACKERS] Clock with Adaptive Replacement

Hi,

On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote:

But I think that the cost of developing the actual page eviction
strategy itself is negligibly small compared to the infrastructure
changes needed by any non-CS5 strategy.

What problems are you seeing? This isn't a lot of code?

Greetings,

Andres Freund

#12 Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#11)
Re: [HACKERS] Clock with Adaptive Replacement

On 24 Apr 2018, at 23:14, Andres Freund <andres@anarazel.de> wrote:

On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote:

But I think that the cost of developing the actual page eviction
strategy itself is negligibly small compared to the infrastructure
changes needed by any non-CS5 strategy.

What problems are you seeing? This isn't a lot of code?

1. Teaching BgWriter to use data from the eviction strategy to aggressively flush data to disk (instead of ++next_to_clean)
2. Implementing the strategies as lock-free algorithms for the freelist
These parts seem most important for benchmarking.
Also:
3. Converting all rings to a single buffer manager where possible
4. Using O_DIRECT while writing data files
5. Using AIO and scheduling of writes
These parts are not necessary but are the most challenging, though not impossible.

Best regards, Andrey Borodin.

#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Andrey Borodin (#7)
Re: [HACKERS] Clock with Adaptive Replacement

On Tue, Apr 24, 2018 at 5:04 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data blocks and is expected to be less efficient.

[1] https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png

Interesting curve. Can you explain it? Here is my guess, assuming
that pgbench is generating uniformly distributed random access, and
assuming 26GB dataset and 32GB physical RAM:

1. In the range 128MB to 6GB, only a small fraction of the dataset
fits into shared_buffers, but it doesn't matter much: you get a 100%
hit ratio from the kernel's page cache.

2. In the range 6GB to 16GB, the dataset doesn't fit in shared_buffer
OR the kernel's page cache, so you begin to risk misses in both
caches. Every byte you give to one you take away from the other, but
it might be worse than zero sum: at the lowest point in your graph, x
= 16GB, I suppose you have a ~61% hit ratio in shared buffers
(16GB/26GB), but when you miss you're also likely to miss in the OS
page cache too because that probably holds the most recently read 16GB
(= other half of physical RAM) of data... in other words the same
data, assuming uniform random access. (If it were not random, then
some pages would have a greater chance of staying in PG's cache and
falling out of the kernel's cache, but essentially random replacement
prevents that (?))

3. In the range 16GB to 26GB, the shared_buffers hit ratio increases
from 50% to 100%.

I guess if you had 48GB of RAM the dip between 6GB and 26GB wouldn't happen?

Am I close?
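
The reasoning in steps 1-3 can be written down as a back-of-envelope formula. This is only a sketch of the argument (uniform random access, worst-case assumption that the kernel cache duplicates shared_buffers), not a measurement:

```python
# Combined in-memory hit ratio for uniform random access when the OS
# page cache and shared_buffers may hold overlapping data. Assumes the
# OS cache duplicates shared_buffers, so only its excess capacity helps
# on a shared_buffers miss -- the worst case sketched above.

def memory_hit_ratio(dataset_gb, ram_gb, shared_buffers_gb):
    if dataset_gb <= shared_buffers_gb:
        return 1.0  # everything fits in shared_buffers
    os_cache_gb = max(ram_gb - shared_buffers_gb, 0.0)
    pg_hit = min(shared_buffers_gb / dataset_gb, 1.0)
    extra_os = max(os_cache_gb - shared_buffers_gb, 0.0)
    os_hit = min(extra_os / (dataset_gb - shared_buffers_gb), 1.0)
    return pg_hit + (1.0 - pg_hit) * os_hit
```

With a 26GB dataset and 32GB RAM this model bottoms out near shared_buffers = 16GB, reproducing the dip in the graph; with 48GB RAM the excess OS cache stays large across the whole range and the dip disappears, matching the guess above.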

I'm not sure CART is the best possibility, though.

I don't know either, but LRU and ARC don't seem promising for various
reasons. Aside from the scan resistance claim, it sounds like CART
should avoid long searches for a buffer. In my primitive attempts to
understand that a bit better, I tried instrumenting freelist.c to tell
dtrace how far it had to move the clock hand to find a buffer, and I
got histograms like this (this example from a 1GB dataset, 512MB
shared buffers, pgbench select-only, and the most frequent numbers
were in the 4-7 bucket but there were 3 freak cases in the 16384-32767
bucket):

value ------------- Distribution ------------- count
-1 | 0
0 |@@@@ 65535
1 |@@@@@ 89190
2 |@@@@@@@@@ 139114
4 |@@@@@@@@@@@ 170881
8 |@@@@@@@@ 132579
16 |@@@ 47731
32 | 5530
64 | 137
128 | 1
256 | 0
512 | 0
1024 | 0
2048 | 0
4096 | 0
8192 | 1
16384 | 3
32768 | 0

You can see a just a few outliers there. With smaller shared buffers
the counts are bigger but the average hand movement distance gets
smaller (here 128MB shared buffers):

value ------------- Distribution ------------- count
-1 | 0
0 | 16383
1 |@@@@@@@@@@@@@@@ 1289265
2 |@@@@@@@@@@@@@@@ 1270531
4 |@@@@@@@@ 659389
8 |@ 111091
16 | 2815
32 | 19
64 | 1
128 | 0
256 | 0
512 | 0
1024 | 0
2048 | 2
4096 | 2
8192 | 0

I think pgbench isn't a very interesting workload though. I don't
have time right now but it would be fun to dig further and try to
construct workloads that generate pathological searches for buffers (I
believe that such workloads exist, from anecdotal reports).

--
Thomas Munro
http://www.enterprisedb.com

#14 Peter Geoghegan
pg@bowt.ie
In reply to: Thomas Munro (#13)
Re: [HACKERS] Clock with Adaptive Replacement

On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I think pgbench isn't a very interesting workload though. I don't
have time right now but it would be fun to dig further and try to
construct workloads that generate pathological searches for buffers (I
believe that such workloads exist, from anecdotal reports).

While pgbench might not be the most interesting workload, I think even
its super-synthetic, uniformly distributed point lookups can benefit
from better caching decisions. The LRU-K paper's example 1.1 explains
why this might be very cogently: Point lookups from TPC-A (or TPC-C)
consist of descending the index, and then accessing the record in the
main table from a TID, which actually implies quite a non-uniform
access pattern for individual pages. Despite accessing table records
uniformly.

There are a negligible number of internal pages involved with pgbench
(and they should always be cached), so they can almost be ignored.
It's really just a matter of what proportion of shared_buffers should
be leaf pages versus heap pages. The pgbench accounts table is
approximately 6x the size of its primary key, so the leaf blocks have
6x the utility of heap blocks for this workload. It actually is pretty
much that simple. Or it would be, if we didn't have the OS cache as a
confounding factor. The optimal strategy for the pgbench workload as
far as maximizing hits in shared_buffers goes is to *only* cache leaf
blocks, at least until you have all leaf blocks cached; we should
evict table blocks *immediately*. This isn't even remotely close to
the behavior of an LRU, of course. Actually, as the LRU-K paper also
points out, an LRU has a tendency to perversely somewhat *favor* the
table blocks, since presumably each point lookup's table block is
accessed after its leaf block is pinned/accessed.
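
The 6x argument can be checked with toy numbers. Illustrative sketch only: one leaf request plus one heap request per lookup, internal pages assumed always cached and ignored:

```python
# Expected cache hit ratio for uniform pgbench-style point lookups,
# where each lookup touches one index leaf page and one heap page and
# the heap is ~6x the size of the index.

def hit_ratio(cache_pages, leaf_pages, heap_pages, leaf_fraction):
    leaf_cached = min(cache_pages * leaf_fraction, leaf_pages)
    heap_cached = min(cache_pages - leaf_cached, heap_pages)
    # Half of all page requests are leaf requests, half are heap requests.
    return 0.5 * (leaf_cached / leaf_pages) + 0.5 * (heap_cached / heap_pages)

leaf, heap = 1_000, 6_000   # heap ~6x the index, per the 6x utility claim
cache = 1_000               # budget smaller than the working set

leaf_only = hit_ratio(cache, leaf, heap, 1.0)                  # cache leaves first
blind_mix = hit_ratio(cache, leaf, heap, leaf / (leaf + heap)) # LRU-ish split
```

Caching only leaves gives a 50% hit ratio here versus roughly 14% for a blind proportional mix, which is the "evict table blocks immediately" point above.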

Before some of the big shared_buffers bottlenecks were alleviated
several years ago, it was possible to observe shared_buffers evictions
occurring essentially at random. I have no idea if that's still true,
but it could be. A random replacement strategy is actually not as bad
as it sounds, but it is still pretty bad, especially given that we
could pay a high price in contention in return for such a bad
strategy. This convinces me that we will not be able to assess any
algorithm well in isolation, through simulations, because lock
contention is too important overall, and can actually affect the cache
hit ratio of a clock-based approach. I believe that it's generally
true that our clock sweep algorithm approximates an LRU, but I have
also seen it totally fail to do that, which it sounds like you've
heard about, too. Andrey is probably correct in his suspicion that
we'll end up prototyping a number of approaches.

I'm glad that you're thinking about this, in any case. Seems like a
promising area to work on.

--
Peter Geoghegan

#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#14)
Re: [HACKERS] Clock with Adaptive Replacement

On Thu, Apr 26, 2018 at 10:54 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I think pgbench isn't a very interesting workload though. I don't
have time right now but it would be fun to dig further and try to
construct workloads that generate pathological searches for buffers (I
believe that such workloads exist, from anecdotal reports).

While pgbench might not be the most interesting workload, I think even
its super-synthetic, uniformly distributed point lookups can benefit
from better caching decisions. The LRU-K paper's example 1.1 explains
why this might be very cogently: Point lookups from TPC-A (or TPC-C)
consist of descending the index, and then accessing the record in the
main table from a TID, which actually implies quite a non-uniform
access pattern for individual pages. Despite accessing table records
uniformly.

There are a negligible number of internal pages involved with pgbench
(and they should always be cached), so they can almost be ignored.
It's really just a matter of what proportion of shared_buffers should
be leaf pages versus heap pages. The pgbench accounts table is
approximately 6x the size of its primary key, so the leaf blocks have
6x the utility of heap blocks for this workload. It actually is pretty
much that simple. Or it would be, if we didn't have the OS cache as a
confounding factor. The optimal strategy for the pgbench workload as
far as maximizing hits in shared_buffers goes is to *only* cache leaf
blocks, at least until you have all leaf blocks cached; we should
evict table blocks *immediately*. This isn't even remotely close to
the behavior of an LRU, of course. Actually, as the LRU-K paper also
points out, an LRU has a tendency to perversely somewhat *favor* the
table blocks, since presumably each point lookup's table block is
accessed after its leaf block is pinned/accessed.

Huh. Right. So it's not truly uniform. I wonder if any of these
algorithms are sensitive to the higher value of the leaf pages than
the heap pages. It seems quite subtle: even though we can see that
the individual leaf pages must be 6x more likely to be accessed again
next time than individual heap pages due to their higher tuple
density, they're still very unlikely to be accessed again soon, and
the question is whether any of these algorithms can track that for
long enough to see the difference between two very low access
frequencies, in a huge field of unlikely-to-be-accessed pages. LRU,
by not even attempting to model frequency, is I guess uniquely
affected by the heap-after-index-leaf thing. (I haven't read the
LRU-K paper... I have a lot to learn before I could contribute much
here. Added to pile. Thanks.)

A thought experiment about the U-shaped performance when your dataset
fits in neither PG nor kernel cache, but would fit entirely in
physical memory if you made either of the two caches big enough: I
suppose when you read a page in, you could tell the kernel that you
POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
advance somehow), on the theory that the coldest stuff in the PG cache
should now become the hottest stuff in the OS cache. Of course that
sucks, because the best the kernel can do then is go and read it from
disk, and the goal is to avoid IO. Given a hypothetical way to
"write" "clean" data to the kernel (so it wouldn't mark it dirty and
generate IO, but it would let you read it back without generating IO
if you're lucky), then perhaps you could actually achieve exclusive
caching at the two levels, and use all your physical RAM without
duplication. Then perhaps the U shape would go away, and the curve
would instead go up all the way until shared buffers is big enough to
hold the whole dataset?

I'm glad that you're thinking about this, in any case. Seems like a
promising area to work on.

It's a promising area that I'm interested in learning more about, but
I'm not promising to work on it :-)

--
Thomas Munro
http://www.enterprisedb.com

#16Peter Geoghegan
pg@bowt.ie
In reply to: Thomas Munro (#15)
Re: [HACKERS] Clock with Adaptive Replacement

On Wed, Apr 25, 2018 at 6:31 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Huh. Right. So it's not truly uniform. I wonder if any of these
algorithms are sensitive to the higher value of the leaf pages than
the heap pages. It seems quite subtle: even though we can see that
the individual leaf pages must be 6x more likely to be accessed again
next time than individual heap pages due to their higher tuple
density, they're still very unlikely to be accessed again soon, and
the question is whether any of these algorithms can track that for
long enough to see the difference between two very low access
frequencies, in a huge field of unlikely-to-be-accessed pages. LRU,
by not even attempting to model frequency, is I guess uniquely
affected by the heap-after-index-leaf thing.

Right. Another insight is that it's worth considering weighing the
type of page involved, to artificially favor indexes to some degree.
I'm not saying that it's a good idea, and it's pretty inelegant. But
I'm pretty sure it's been done before, with satisfactory results.

A thought experiment about the U-shaped performance when your dataset
fits in neither PG nor kernel cache, but would fit entirely in
physical memory if you made either of the two caches big enough: I
suppose when you read a page in, you could tell the kernel that you
POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
advance somehow), on the theory that the coldest stuff in the PG cache
should now become the hottest stuff in the OS cache. Of course that
sucks, because the best the kernel can do then is go and read it from
disk, and the goal is to avoid IO.

Not sure about that. I will say that the intuition that this is a good
area to work on is based on the challenges that we have with
shared_buffers in particular. The fact that shared buffers is
typically sized no larger than 16GB, say, and yet main memory sizes
can now easily be far larger is clearly a problem, and one that we're
going to have to get around to addressing directly.

ISTM that the whole shared_buffers issue should have little influence
on how much we're willing to remember about block popularity; why
should we be willing to track information about only the exact number
of blocks that will fit in shared_buffers at any one time? As you
know, ARC/CAR explicitly models some blocks that are not cache
resident as being within two ghost lists. Sounds like something that
might have a larger-than-expected benefit for us.

How much of a problem is it that we waste memory bandwidth copying to
and from OS cache, particularly on large systems? Might that be the
bigger problem, but also one that can be addressed incrementally?

I think that I may be repeating some of what Andrey said, in another
way (not sure of that).

--
Peter Geoghegan

#17Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#14)
Re: [HACKERS] Clock with Adaptive Replacement

On Wed, Apr 25, 2018 at 6:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:

Before some of the big shared_buffers bottlenecks were alleviated
several years ago, it was possible to observe shared_buffers evictions
occurring essentially at random. I have no idea if that's still true,
but it could be.

I think it is. We haven't done anything to address it. I think if we
want to move to direct I/O -- which may be something we need or want
to do -- we're going to have to work a lot harder at making good page
eviction decisions. Your patch to change the page eviction algorithm
didn't help noticeably once we eliminated the contention around buffer
eviction, but that's just because the cost of a bad eviction went
down, not because we stopped doing bad evictions. I think it would be
interesting to insert a usleep() call into mdread() and then test
various buffer eviction algorithms with that in place.

I'm personally not very excited about making rules like "index pages
are more valuable than heap pages". Such rules will in many cases be
true, but it's easy to come up with cases where they don't hold: for
example, we might run pgbench for a while and then stop running
pgbench and start running big sequential scans for reporting purposes.
We don't want to artificially pin the index buffers in shared_buffers
just because they're index pages; we want to figure out which pages
really matter. Now, I realize that you weren't proposing (and
wouldn't propose) a rule that index pages never get evicted, but I
think that favoring index pages even in some milder way is basically a
hack. Index pages aren't *intrinsically* more valuable; they are more
valuable because they will, in many workloads, be accessed more often
than heap pages. A good algorithm ought to be able to figure that out
based on the access pattern, without being explicitly given a hint, I
think.

I believe the root of the problem here is that the usage count we have
today does a very poor job distinguishing what's hot from what's not.
There have been previous experiments around making usage_count use
some kind of a log scale: we make the maximum, say, 32, and the clock
hand divides by 2 instead of subtracting 1. I don't think those
experiments were enormously successful and I suspect that a big part
of the reason is that it's still pretty easy to get to a state where
the counters are maxed out for a large number of buffers, and at that
point you can't distinguish between those buffers any more: they all
look equally hot. We need something better. If a system like this is
working properly, things like interior index pages and visibility map
pages ought to show up as fiery hot on workloads where the index or
visibility map, as the case may be, is heavily used.
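The log-scale variant described above is easy to state in code. A minimal sketch, with an illustrative struct standing in for bufmgr.c's real BufferDesc:

```c
#include <stdint.h>

/* Log-scale usage_count: cap the counter at 32 and have the clock
 * hand halve it rather than subtract one. */
#define USAGE_MAX 32

typedef struct
{
    uint8_t usage_count;
    /* ... buffer tag, state flags, etc. elided ... */
} FakeBufferDesc;

/* On access: saturating increment toward the cap. */
static void
buffer_touched(FakeBufferDesc *buf)
{
    if (buf->usage_count < USAGE_MAX)
        buf->usage_count++;
}

/* On each clock pass: halve instead of decrementing, so a maxed-out
 * buffer decays to zero in six passes rather than 32. Returns 1 when
 * the buffer has become an eviction candidate. */
static int
clock_sweep_visit(FakeBufferDesc *buf)
{
    if (buf->usage_count == 0)
        return 1;
    buf->usage_count /= 2;
    return 0;
}
```

Note that the failure mode described above survives this change: once many buffers sit at the cap, they are still indistinguishable from one another.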

A related problem is that user-connected backends end up doing a lot
of buffer eviction themselves on many workloads. Maybe the
bgreclaimer patch Amit wrote a few years ago could help somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#17)
Re: [HACKERS] Clock with Adaptive Replacement

On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think it is. We haven't done anything to address it. I think if we
want to move to direct I/O -- which may be something we need or want
to do -- we're going to have to work a lot harder at making good page
eviction decisions.

+1

Your patch to change the page eviction algorithm
didn't help noticeably once we eliminated the contention around buffer
eviction, but that's just because the cost of a bad eviction went
down, not because we stopped doing bad evictions.

The goal of that patch, which was literally written over a wet
weekend, was to make people realize that we were missing something
there. I think that it accomplished that much.

I think it would be
interesting to insert a usleep() call into mdread() and then test
various buffer eviction algorithms with that in place.

That could be a useful strategy.

I'm personally not very excited about making rules like "index pages
are more valuable than heap pages". Such rules will in many cases be
true, but it's easy to come up with cases where they don't hold: for
example, we might run pgbench for a while and then stop running
pgbench and start running big sequential scans for reporting purposes.

I am not in favor of this general approach. Actually, I'm willing to
come out against it strongly right now: let's not teach the buffer
manager to distinguish between the AM of a block.

We don't want to artificially pin the index buffers in shared_buffers
just because they're index pages; we want to figure out which pages
really matter. Now, I realize that you weren't proposing (and
wouldn't propose) a rule that index pages never get evicted, but I
think that favoring index pages even in some milder way is basically a
hack. Index pages aren't *intrinsically* more valuable; they are more
valuable because they will, in many workloads, be accessed more often
than heap pages. A good algorithm ought to be able to figure that out
based on the access pattern, without being explicitly given a hint, I
think.

I won't argue against any of that, but I think that it's rather
nuanced, and that the nuance probably matters a lot.

First of all, we must have a sense of proportion about what is likely
to be true in the average case. I mentioned to Thomas that the pgbench
accounts table is 6x the size of its primary key, and that with a uniform
distribution + point lookups the leaf pages literally have 6x the
utility of the heap pages. It really is that simple there. Also,
whatever the distribution of the point lookups is, "cache more leaf
pages" is probably going to be the effect that a better caching
strategy would have. Even 6x is probably an underestimate of the
utility of leaf pages compared to heap pages in the average case. I
bet it's more like 10x for a typical OLTP app.

Second of all, we must have a sense of perspective about what we're
trying to do here, which is to predict the future based on the past.
The best outcome we can hope for is to lose less on average without
hurting anything that already works well enough.

I believe the root of the problem here is that the usage count we have
today does a very poor job distinguishing what's hot from what's not.

I think that the root of the problem is that we don't remember
anything about evicted buffers. The same block can bounce in and out
of shared_buffers repeatedly, and we don't recognize that. There are
probably second-order effects from sizing shared_buffers. We
needlessly tie the number of buffers to our capacity to remember
things about block popularity. That would be bad if we had no FS
cache, but with the FS cache it seems awful.

I share the intuition that we need something adaptive, that can be
analysed and understood relatively easily, and has little to no magic.
That's why I'm willing to say now that we shouldn't do anything based
on the type of the blocks involved. However, I think that that model
might work about as well, and that this matters because it provides
motivating examples. Surely you're right when you say that index
blocks are not intrinsically more useful than any other type of block,
but if they're 10x more useful on average, and 20x more useful at
other times, then a person could be forgiven for modelling them as
intrinsically more useful.

It's also useful to examine why a random replacement strategy is
merely mediocre to bad, and not abysmal. That's what every replacement
strategy ends up being with enough cache pressure anyway. This makes
it more reasonable to focus on the average case than it is in other
places.

There have been previous experiments around making usage_count use
some kind of a log scale: we make the maximum, say, 32, and the clock
hand divides by 2 instead of subtracting 1. I don't think those
experiments were enormously successful and I suspect that a big part
of the reason is that it's still pretty easy to get to a state where
the counters are maxed out for a large number of buffers, and at that
point you can't distinguish between those buffers any more: they all
look equally hot. We need something better. If a system like this is
working properly, things like interior index pages and visibility map
pages ought to show up as fiery hot on workloads where the index or
visibility map, as the case may be, is heavily used.

I don't think that internal index pages and the visibility map are the
problem, since there are so few of them, and they are accessed at least
hundreds of times more frequently than the leaf pages.

--
Peter Geoghegan

#19Stephen Frost
sfrost@snowman.net
In reply to: Peter Geoghegan (#18)
Re: [HACKERS] Clock with Adaptive Replacement

Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:

On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm personally not very excited about making rules like "index pages
are more valuable than heap pages". Such rules will in many cases be
true, but it's easy to come up with cases where they don't hold: for
example, we might run pgbench for a while and then stop running
pgbench and start running big sequential scans for reporting purposes.

I am not in favor of this general approach. Actually, I'm willing to
come out against it strongly right now: let's not teach the buffer
manager to distinguish between the AM of a block.

Perhaps I'm misunderstanding the case here, but with the ring buffer we
use for sequential access, aren't we already discouraging keeping heap
pages around, particularly when they're coming from a sequential scan?

I'm not suggesting that's a bad thing either, in fact it seems like a
generally good thing and I don't think we get many complaints about it,
I'm just trying to point out that we already have a distinction between
heap-pages-from-seq-scans and index pages (and heap pages from using
those indexes) and therefore I'm not convinced by an argument that we
shouldn't ever treat pages from the heap differently than pages from the
indexes.

This could change when considering an environment where we aren't trying
to take advantage of the kernel's buffer cache and things like its
read-ahead, and maybe that's what you're getting at here, but I would
also point out that we have more inherent understanding of how pages are
likely to be accessed than the kernel does and we shouldn't be
explicitly trying to avoid taking advantage of that knowledge.

We don't want to artificially pin the index buffers in shared_buffers
just because they're index pages; we want to figure out which pages
really matter. Now, I realize that you weren't proposing (and
wouldn't propose) a rule that index pages never get evicted, but I
think that favoring index pages even in some milder way is basically a
hack. Index pages aren't *intrinsically* more valuable; they are more
valuable because they will, in many workloads, be accessed more often
than heap pages. A good algorithm ought to be able to figure that out
based on the access pattern, without being explicitly given a hint, I
think.

I won't argue against any of that, but I think that it's rather
nuanced, and that the nuance probably matters a lot.

First of all, we must have a sense of proportion about what is likely
to be true in the average case. I mentioned to Thomas that the pgbench
accounts table is 6x the size of its primary key, and that with a uniform
distribution + point lookups the leaf pages literally have 6x the
utility of the heap pages. It really is that simple there. Also,
whatever the distribution of the point lookups is, "cache more leaf
pages" is probably going to be the effect that a better caching
strategy would have. Even 6x is probably an underestimate of the
utility of leaf pages compared to heap pages in the average case. I
bet it's more like 10x for a typical OLTP app.

I agree that we should definitely be thinking about data density here-
and that's another case where we have more information about the data
than the kernel does (a fair comparison, since the kernel is likewise
trying to predict page value, just with less information). Another
consideration here
to think about would be internal pages rather than just leaf pages-
those have an even higher likelihood of being needed again quickly as
they need to be traversed much more than leaf pages.

Second of all, we must have a sense of perspective about what we're
trying to do here, which is to predict the future based on the past.
The best outcome we can hope for is to lose less on average without
hurting anything that already works well enough.

Agreed.

I believe the root of the problem here is that the usage count we have
today does a very poor job distinguishing what's hot from what's not.

I think that the root of the problem is that we don't remember
anything about evicted buffers. The same block can bounce in and out
of shared_buffers repeatedly, and we don't recognize that. There are
probably second-order effects from sizing shared_buffers. We
needlessly tie the number of buffers to our capacity to remember
things about block popularity. That would be bad if we had no FS
cache, but with the FS cache it seems awful.

This is definitely an interesting and valuable point. Having pages
bouncing in and out of shared buffers means that we aren't properly
tracking their value. Just brainstorming, but I wonder if there's a way
we could track their value even when we've evicted them, without slowing
down the entire system. Not having any great ideas off-hand for that
but some kind of "we last accessed this block an hour ago" and "we last
accessed this block half a millisecond ago" might be interesting to try
to avoid such ping-ponging. I wonder if there are known strategies for
addressing this, perhaps by having a data structure which effectively
tracks hit rates for all potential blocks...

I share the intuition that we need something adaptive, that can be
analysed and understood relatively easily, and has little to no magic.
That's why I'm willing to say now that we shouldn't do anything based
on the type of the blocks involved. However, I think that that model
might work about as well, and that this matters because it provides
motivating examples. Surely you're right when you say that index
blocks are not intrinsically more useful than any other type of block,
but if they're 10x more useful on average, and 20x more useful at
other times, then a person could be forgiven for modelling them as
intrinsically more useful.

If we can understand that they're, on average, 10x-20x more useful, then
ignoring that seems like we're throwing away information that we
effectively know about the future.

It's also useful to examine why a random replacement strategy is
merely mediocre to bad, and not abysmal. That's what every replacement
strategy ends up being with enough cache pressure anyway. This makes
it more reasonable to focus on the average case than it is in other
places.

Every replacement strategy which doesn't care about the kind of block
that's being worked with will end up performing random replacement under
enough pressure. We don't actually have to allow that to happen,
though: since we do have information about the kind of block, it seems
like we should be able to keep things mediocre and avoid letting
things get bad under such pressure.

There have been previous experiments around making usage_count use
some kind of a log scale: we make the maximum, say, 32, and the clock
hand divides by 2 instead of subtracting 1. I don't think those
experiments were enormously successful and I suspect that a big part
of the reason is that it's still pretty easy to get to a state where
the counters are maxed out for a large number of buffers, and at that
point you can't distinguish between those buffers any more: they all
look equally hot. We need something better. If a system like this is
working properly, things like interior index pages and visibility map
pages ought to show up as fiery hot on workloads where the index or
visibility map, as the case may be, is heavily used.

I don't think that internal index pages and the visibility map are the
problem, since there are so few of them, and they are accessed at least
hundreds of times more frequently than the leaf pages.

Yet they still end up getting evicted under a random replacement
strategy under sufficient pressure, because we don't distinguish them
from any other buffer, and that leads to high latency on queries which
should always be extremely fast. As long as we're talking about a
system which can devolve into random replacement, we're going to have
those cases and we aren't going to have any answer for them. I'm not
really thrilled with any approach that allows that kind of degradation.

Thanks!

Stephen

#20Thomas Munro
thomas.munro@gmail.com
In reply to: Stephen Frost (#19)
Re: [HACKERS] Clock with Adaptive Replacement

On Sun, Apr 29, 2018 at 2:39 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Peter Geoghegan (pg@bowt.ie) wrote:

On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm personally not very excited about making rules like "index pages
are more valuable than heap pages". Such rules will in many cases be
true, but it's easy to come up with cases where they don't hold: for
example, we might run pgbench for a while and then stop running
pgbench and start running big sequential scans for reporting purposes.

I am not in favor of this general approach. Actually, I'm willing to
come out against it strongly right now: let's not teach the buffer
manager to distinguish between the AM of a block.

Perhaps I'm misunderstanding the case here, but with the ring buffer we
use for sequential access, aren't we already discouraging keeping heap
pages around, particularly when they're coming from a sequential scan?

I'm not suggesting that's a bad thing either, in fact it seems like a
generally good thing and I don't think we get many complaints about it,
I'm just trying to point out that we already have a distinction between
heap-pages-from-seq-scans and index pages (and heap pages from using
those indexes) and therefore I'm not convinced by an argument that we
shouldn't ever treat pages from the heap differently than pages from the
indexes.

I think the idea is that CAR might be even better than our hard coded
BAS_BULKREAD, BAS_BULKWRITE, BAS_VACUUM rings because it has a kind of
dynamic self-tuning scan resistance. It works by having one clock (=
a circular list, not just a rotating hand in an array like our current
scheme) T1 that holds buffers that were touched once or twice, to
absorb scans. It'll reclaim T1 buffers that were only accessed once
(but keep the descriptor for longer -- see below). It will move
things to T2, a separate clock for valuable pages, after they've been
accessed twice in recent memory. It dynamically adjusts the sizes of
T1 and T2 in a way that is supposedly beneficial.
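The bookkeeping described here can be sketched roughly as follows, following the shape of Bansal and Modha's CAR paper (the FAST '04 paper linked at the top of the thread). The field names come from the paper; the code is an illustrative sketch, not a PostgreSQL patch:

```c
#include <stdint.h>

typedef struct CarEntry
{
    uint64_t page;           /* block identifier */
    int      reference_bit;  /* set on hit; cleared by the clock hand */
    struct CarEntry *next;   /* link in a circular clock or ghost FIFO */
} CarEntry;

typedef struct CarState
{
    CarEntry *t1;        /* clock of recently-seen pages (absorbs scans) */
    int       t1_len;
    CarEntry *t2;        /* clock of pages seen twice in recent memory */
    int       t2_len;
    CarEntry *b1;        /* ghost FIFO: evicted from T1, descriptor only */
    int       b1_len;
    CarEntry *b2;        /* ghost FIFO: evicted from T2, descriptor only */
    int       b2_len;
    int       target_t1; /* adaptive target size "p" for T1 */
    int       capacity;  /* c: number of real buffers */
} CarState;

/* A ghost hit in B1 means recency is paying off: grow T1's target.
 * A ghost hit in B2 means frequency is paying off: shrink it. These
 * are CAR's adaptation rules; a hit in a ghost list implies that
 * list is non-empty, so the divisions are safe. */
static void
car_adapt_on_b1_hit(CarState *s)
{
    int step = s->b1_len >= s->b2_len ? 1 : s->b2_len / s->b1_len;

    s->target_t1 += step;
    if (s->target_t1 > s->capacity)
        s->target_t1 = s->capacity;
}

static void
car_adapt_on_b2_hit(CarState *s)
{
    int step = s->b2_len >= s->b1_len ? 1 : s->b1_len / s->b2_len;

    s->target_t1 -= step;
    if (s->target_t1 < 0)
        s->target_t1 = 0;
}
```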

[...]

I believe the root of the problem here is that the usage count we have
today does a very poor job distinguishing what's hot from what's not.

I think that the root of the problem is that we don't remember
anything about evicted buffers. The same block can bounce in and out
of shared_buffers repeatedly, and we don't recognize that. There are
probably second-order effects from sizing shared_buffers. We
needlessly tie the number of buffers to our capacity to remember
things about block popularity. That would be bad if we had no FS
cache, but with the FS cache it seems awful.

This is definitely an interesting and valuable point. Having pages
bouncing in and out of shared buffers means that we aren't properly
tracking their value. Just brainstorming, but I wonder if there's a way
we could track their value even when we've evicted them, without slowing
down the entire system. Not having any great ideas off-hand for that
but some kind of "we last accessed this block an hour ago" and "we last
accessed this block half a millisecond ago" might be interesting to try
to avoid such ping-ponging. I wonder if there are known strategies for
addressing this, perhaps by having a data structure which effectively
tracks hit rates for all potential blocks...

That's what the CAR algorithm does. It has twice as many
BufferDescriptors as buffers. As well as the T1 and T2 clocks that
hold buffers, it also has doubly linked lists B1 and B2 which hold
BufferDescriptors with no buffers for things that recently fell out of
T1 and T2 to extend the algorithm's memory. For example, if there is
a "miss" in B1 (meaning we found the page we're looking for in there
but it currently has no buffer, and it was recently evicted from T1),
then we can immediately find a buffer for it in T2 (it qualifies as
having been accessed twice in recent memory, and doesn't have to go
through T1 again where it risks early eviction). Though of course it
too is finite.
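The lookup cases described here reduce to a small decision table. A sketch with made-up enum names (not from any real patch): a page is either resident (T1/T2), remembered but not resident (ghost lists B1/B2), or cold.

```c
typedef enum { IN_T1, IN_T2, IN_B1, IN_B2, NOT_SEEN } CarWhere;
typedef enum { CAR_HIT, PLACE_IN_T1, PLACE_IN_T2 } CarAction;

static CarAction
car_on_lookup(CarWhere where)
{
    switch (where)
    {
        case IN_T1:
        case IN_T2:
            return CAR_HIT;         /* just set the reference bit */
        case IN_B1:
        case IN_B2:
            /* Ghost hit: the page was recently resident, so it counts
             * as "accessed twice in recent memory" and gets a buffer
             * directly in T2, skipping the risky trip through T1. */
            return PLACE_IN_T2;
        case NOT_SEEN:
        default:
            return PLACE_IN_T1;     /* cold page: enter via T1 */
    }
}
```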

--
Thomas Munro
http://www.enterprisedb.com

#21Peter Geoghegan
pg@bowt.ie
In reply to: Stephen Frost (#19)
#22Юрий Соколов
funny.falcon@gmail.com
In reply to: Peter Geoghegan (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Andrey Borodin (#12)
#24Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Andrey Borodin (#24)
#26Thomas Munro
thomas.munro@gmail.com
In reply to: Andrey Borodin (#24)
#27Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#15)
#28Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#27)
#29Юрий Соколов
funny.falcon@gmail.com
In reply to: Thomas Munro (#26)
#30Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#25)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#21)
#32Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#31)
#33Юрий Соколов
funny.falcon@gmail.com
In reply to: Peter Geoghegan (#32)
#34Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#32)
#35Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Robert Haas (#34)
#36Robert Haas
robertmhaas@gmail.com
In reply to: Vladimir Sitnikov (#35)
#37Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#36)
#38Andrey Borodin
amborodin@acm.org
In reply to: Robert Haas (#36)
#39Юрий Соколов
funny.falcon@gmail.com
In reply to: Andrey Borodin (#38)
#40Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Peter Geoghegan (#37)
#41Andrey Borodin
amborodin@acm.org
In reply to: Юрий Соколов (#39)
#42Юрий Соколов
funny.falcon@gmail.com
In reply to: Andrey Borodin (#41)
#43Andrey Borodin
amborodin@acm.org
In reply to: Юрий Соколов (#42)
#44Юрий Соколов
funny.falcon@gmail.com
In reply to: Andrey Borodin (#43)
#45Andrey Borodin
amborodin@acm.org
In reply to: Юрий Соколов (#44)
#46Юрий Соколов
funny.falcon@gmail.com
In reply to: Andrey Borodin (#45)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Andrey Borodin (#41)
#48Юрий Соколов
funny.falcon@gmail.com
In reply to: Robert Haas (#47)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Юрий Соколов (#48)
#50Юрий Соколов
funny.falcon@gmail.com
In reply to: Robert Haas (#49)
#51Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Юрий Соколов (#50)
#52Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Vladimir Sitnikov (#51)
#53Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Jesper Pedersen (#52)
#54Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Vladimir Sitnikov (#53)
#55Robert Haas
robertmhaas@gmail.com
In reply to: Юрий Соколов (#50)
#56Merlin Moncure
mmoncure@gmail.com
In reply to: Robert Haas (#55)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Merlin Moncure (#56)
#58Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#37)
#59Thomas Munro
thomas.munro@gmail.com
In reply to: Andrey Borodin (#12)
#60Benjamin Manes
ben.manes@gmail.com
In reply to: Thomas Munro (#59)
#61Andrey Borodin
amborodin@acm.org
In reply to: Benjamin Manes (#60)
#62Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#34)
#63Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#31)