Page replacement algorithm in buffer cache

Started by Atri Sharma · about 13 years ago · 82 messages · pgsql-hackers
#1Atri Sharma
atri.jiit@gmail.com

Hello all,

Sorry if this is a naive question.

I was going through Greg Smith's slides on the buffer
cache (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
When going through the page replacement algorithm that we use, i.e. the
clocksweep algorithm, I felt there was a potential problem in our current
system.

Specifically, when a new entry is allocated in the buffer, its
USAGE_COUNT is set to 1. On each sweep of the algorithm, the
USAGE_COUNT is decremented and an entry whose USAGE_COUNT becomes
zero is replaced.

I feel that this could lead to a bias towards replacement of
relatively younger pages in the cache over older pages. An entry
which has just entered the cache with USAGE_COUNT=1 could be replaced
soon, but it may be needed frequently in the near future, which would
result in it being repeatedly brought into the cache, leading to
replacement overheads.

I think this is the well-known face-off between LRU and MRU algorithms.

How do we work around this problem?
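For reference, the clock-sweep scheme described above can be sketched like this (a toy model, not the actual PostgreSQL freelist.c code; pinning is reduced to a flag and all locking is ignored):

```c
#include <assert.h>

#define NBUFFERS 8
#define MAX_USAGE_COUNT 5

typedef struct { int usage_count; int pinned; } Buffer;

static Buffer buffers[NBUFFERS];
static int clock_hand = 0;

/* On access, bump the usage count (saturating at the maximum). */
void access_buffer(int i)
{
    if (buffers[i].usage_count < MAX_USAGE_COUNT)
        buffers[i].usage_count++;
}

/* Sweep: decrement counts until an unpinned buffer with count 0 is found. */
int find_victim(void)
{
    for (;;)
    {
        Buffer *b = &buffers[clock_hand];
        int victim = clock_hand;
        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (b->pinned)
            continue;
        if (b->usage_count == 0)
            return victim;       /* evict this one */
        b->usage_count--;        /* give it another chance */
    }
}
```

A freshly loaded page (count 1) survives the first visit of the hand only because the visit decrements it to zero; the next visit evicts it unless someone accessed it in between.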

Regards,

Atri

--
Regards,

Atri
l'apprenant

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Atri Sharma (#1)
Re: Page replacement algorithm in buffer cache

On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:

Hello all,

Sorry if this is a naive question.

I was going through Greg Smith's slides on buffer
cache (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
When going through the page replacement algorithm that we use i.e.
clocksweep algorithm, I felt a potential problem in our current
system.

Specifically, when a new entry is allocated in the buffer, its
USAGE_COUNT is set to 1. On each sweep of the algorithm, the
USAGE_COUNT is decremented and an entry whose USAGE_COUNT becomes
zero is replaced.

Yes, it is replaced, but in the next clock sweep pass, not immediately
after its count reaches 0.
So, until the next pass, if nobody accesses the buffer and all other
buffers have higher counts, it can be replaced.
Also, the buffer that was just returned, whose usage count became 1,
will only have its usage count reduced in the next pass.
So in all, I think it takes two passes for a freshly returned buffer to be
re-used, in case no one uses it again.

With Regards,
Amit Kapila.


#3Atri Sharma
atri.jiit@gmail.com
In reply to: Atri Sharma (#1)
Re: Page replacement algorithm in buffer cache

Sent from my iPad

On 22-Mar-2013, at 11:28, Amit Kapila <amit.kapila@huawei.com> wrote:

On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:

Hello all,

Sorry if this is a naive question.

I was going through Greg Smith's slides on buffer
cache (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
When going through the page replacement algorithm that we use i.e.
clocksweep algorithm, I felt a potential problem in our current
system.

Specifically, when a new entry is allocated in the buffer, its
USAGE_COUNT is set to 1. On each sweep of the algorithm, the
USAGE_COUNT is decremented and an entry whose USAGE_COUNT becomes
zero is replaced.

Yes, it is replaced, but in the next clock sweep pass, not immediately
after its count reaches 0.
So, until the next pass, if nobody accesses the buffer and all other
buffers have higher counts, it can be replaced.
Also, the buffer that was just returned, whose usage count became 1,
will only have its usage count reduced in the next pass.
So in all, I think it takes two passes for a freshly returned buffer to be
re-used, in case no one uses it again.

With Regards,
Amit Kapila.

Hmm, so in the second pass, it gets replaced, right?

I think that if the initialization of USAGE_COUNT starts at the maximum allowed value instead of one, we can have a better solution to this problem.

Another, more complex solution could be to introduce an aging factor as well.

Regards,

Atri


#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Atri Sharma (#3)
Re: Page replacement algorithm in buffer cache

On Friday, March 22, 2013 12:00 PM Atri Sharma wrote:

Sent from my iPad

On 22-Mar-2013, at 11:28, Amit Kapila <amit.kapila@huawei.com> wrote:

On Friday, March 22, 2013 10:22 AM Atri Sharma wrote:

Hello all,

Sorry if this is a naive question.

I was going through Greg Smith's slides on buffer
cache (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
When going through the page replacement algorithm that we use i.e.
clocksweep algorithm, I felt a potential problem in our current
system.

Specifically, when a new entry is allocated in the buffer, its
USAGE_COUNT is set to 1. On each sweep of the algorithm, the
USAGE_COUNT is decremented and an entry whose USAGE_COUNT becomes
zero is replaced.

Yes, it is replaced, but in the next clock sweep pass, not immediately
after its count reaches 0.
So, until the next pass, if nobody accesses the buffer and all other
buffers have higher counts, it can be replaced.
Also, the buffer that was just returned, whose usage count became 1,
will only have its usage count reduced in the next pass.
So in all, I think it takes two passes for a freshly returned buffer to be
re-used, in case no one uses it again.

With Regards,
Amit Kapila.

Hmm, so in the second pass, it gets replaced, right?

Yes.

I think that if the initialization of USAGE_COUNT starts at the maximum
allowed value instead of one, we can have a better solution to this
problem.

So what is your idea? If you start at the maximum, what will we do for
further accesses to it?
Why do you want to give more priority to a just-loaded page?

Another, more complex solution could be to introduce an aging factor as
well.

With Regards,
Amit Kapila.


#5Atri Sharma
atri.jiit@gmail.com
In reply to: Atri Sharma (#1)
Re: Page replacement algorithm in buffer cache

I think that if the initialization of USAGE_COUNT starts at the maximum
allowed value instead of one, we can have a better solution to this
problem.

So what is your idea? If you start at the maximum, what will we do for
further accesses to it?

I haven't chalked out a detailed plan yet, but I think the idea of
initializing USAGE_COUNT to the maximum value is not good at all. I was
just thinking off the top of my head.

Why do you want to give more priority to just loaded page?

I just want it to have more chances to stay, rather than being
replaced pretty early. This is because, as I said earlier, a new page
could be in high demand in the near future, which would lead to the page
being repeatedly replaced and brought back in, and hence cause overheads.

Another, more complex solution could be to introduce an aging factor

This is the one I think would work out best: add an age factor,
reflecting the time an entry has spent in the cache, along with its
usage count.

So, what I am proposing here is to add another factor in the
clocksweep algorithm when it selects victim pages for replacement.
Specifically, the selection of victim pages should be done using the
usage_count AND the time spent by the entry in the cache. This would
give priority to pages with high access counts while not ignoring
relatively young pages. If a page is not accessed for a long time after
it was allocated, it would be the ideal victim for replacement both in
terms of USAGE_COUNT as well as age.
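To make the proposal concrete, a hypothetical victim score (lower means more evictable) might combine the two factors like this. The weighting and the grace window are invented illustration values, not a tested design and not anything in PostgreSQL:

```c
#include <assert.h>

/* Hypothetical victim score: lower is more evictable.  usage_count is
 * combined with the page's age (in sweep "ticks"), so a young page with
 * a low count is not immediately preferred over an old page with the
 * same count. */
int victim_score(int usage_count, int age_ticks)
{
    /* A young page gets a grace bonus that decays as it ages. */
    int grace = age_ticks < 4 ? (4 - age_ticks) : 0;
    return usage_count + grace;
}
```

Under this scoring, an old idle page is evicted before a brand-new one, but a heavily used old page still stays ahead of both.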

Regards,

Atri

--
Regards,

Atri
l'apprenant


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Atri Sharma (#5)
Re: Page replacement algorithm in buffer cache

On Friday, March 22, 2013 4:16 PM Atri Sharma wrote:

I think that if the initialization of USAGE_COUNT starts at the maximum
allowed value instead of one, we can have a better solution to this
problem.

So what is your idea? If you start at the maximum, what will we do for
further accesses to it?

I haven't chalked out a detailed plan yet, but I think the idea of
initializing USAGE_COUNT to the maximum value is not good at all. I was
just thinking off the top of my head.

Why do you want to give more priority to a just-loaded page?

I just want it to have more chances to stay, rather than being
replaced pretty early. This is because, as I said earlier, a new page
could be in high demand in the near future, which would lead to the page
being repeatedly replaced and brought back in, and hence cause overheads.

Another, more complex solution could be to introduce an aging factor

This is the one I think would work out best: add an age factor,
reflecting the time an entry has spent in the cache, along with its
usage count.

So, what I am proposing here is to add another factor in the
clocksweep algorithm when it selects victim pages for replacement.
Specifically, the selection of victim pages should be done using the
usage_count AND the time spent by the entry in the cache. This would
give priority to pages with high access counts while not ignoring
relatively young pages. If a page is not accessed for a long time after
it was allocated, it would be the ideal victim for replacement both in
terms of USAGE_COUNT as well as age.

What would you do if the only young page has usage count zero during the
second sweep?
I don't think introducing another factor along with the usage count would
help much.
Have you encountered any workload where relatively young pages are
getting victimized and causing performance issues?

With Regards,
Amit Kapila.


#7Atri Sharma
atri.jiit@gmail.com
In reply to: Atri Sharma (#1)
Re: Page replacement algorithm in buffer cache

What would you do if the only young page has usage count zero during the
second sweep?

Umm.... The same approach we take when there is no page with usage
count zero in a sweep in the current algorithm?

I don't think introducing another factor along with the usage count would
help much.

What other approach can we take here?

Have you encountered any workload where relatively young pages are
getting victimized and causing performance issues?

Not yet. I figured this might be a problem and am designing test cases
for it. I would be glad for some help there.

Regards,

Atri

--
Regards,

Atri
l'apprenant


#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Atri Sharma (#7)
Re: Page replacement algorithm in buffer cache

On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:

What would you do if the only young page has usage count zero during the
second sweep?

Umm.... The same approach we take when there is no page with usage
count zero in a sweep in the current algorithm?

It would give more priority to a young page as compared to a more-used
page. I don't know if that would be the correct thing to do.

With Regards,
Amit Kapila.


#9Atri Sharma
atri.jiit@gmail.com
In reply to: Atri Sharma (#1)
Re: Page replacement algorithm in buffer cache

On Fri, Mar 22, 2013 at 4:53 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

On Friday, March 22, 2013 4:36 PM Atri Sharma wrote:

What would you do if the only young page has usage count zero during the
second sweep?

Umm.... The same approach we take when there is no page with usage
count zero in a sweep in the current algorithm?

It would give more priority to a young page as compared to a more-used
page. I don't know if that would be the correct thing to do.

This is my idea: give equal priority to new pages when they enter the
cache, so that they all have an equal chance of being replaced initially.
With time, usage_count shall become the deciding factor in victim
selection.

Regards,

Atri

--
Regards,

Atri
l'apprenant


#10Ants Aasma
ants.aasma@cybertec.at
In reply to: Atri Sharma (#5)
Re: Page replacement algorithm in buffer cache

On Mar 22, 2013 12:46 PM, "Atri Sharma" <atri.jiit@gmail.com> wrote:

This is the one I think would work out best, add an age factor as to
the time duration which an entry has spent in the cache along with its
usage count.

You might want to check out the LIRS cache replacement algorithm [1].
That algorithm tries to estimate the least frequently used pages instead
of the least recently used. MySQL uses it for its buffer replacement
policy. There is also a clock sweep based approximation called
CLOCK-Pro. Papers describing and evaluating both are available on the
net. The evaluations in the papers showed significantly better
performance for both of those compared to regular clock sweep or even
ARC.

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general. My idea was to create
a patch to capture page pinning traffic from PostgreSQL (maybe stream
out into a per backend file), run it with some production workloads
and use that to generate testing workloads for the cache replacement
policies. Haven't gotten round to actually doing that though.

[1]: http://en.wikipedia.org/wiki/LIRS_caching_algorithm

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#11Atri Sharma
atri.jiit@gmail.com
In reply to: Ants Aasma (#10)
Re: Page replacement algorithm in buffer cache

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general. My idea was to create
a patch to capture page pinning traffic from PostgreSQL (maybe stream
out into a per backend file), run it with some production workloads
and use that to generate testing workloads for the cache replacement
policies. Haven't gotten round to actually doing that though.

[1] http://en.wikipedia.org/wiki/LIRS_caching_algorithm

Thanks for the link. I think LIRS can indeed be helpful in our case.

We should indeed build some test cases for testing this theory. I am
all for capturing page replacement and usage data and analyzing it.

Atri

--
Regards,

Atri
l'apprenant


#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ants Aasma (#10)
Re: Page replacement algorithm in buffer cache

Ants Aasma <ants@cybertec.at> writes:

You might want to check out the LIRS cache replacement algorithm [1].
That algorithm tries to estimate the least frequently used pages instead
of the least recently used. MySQL uses it for its buffer replacement
policy. There is also a clock sweep based approximation called
CLOCK-Pro. Papers describing and evaluating both are available on the
net. The evaluations in the papers showed significantly better
performance for both of those compared to regular clock sweep or even
ARC.

I seem to recall that CLOCK-Pro, or something named similarly to that,
was one of the alternatives discussed when we went over to the current
clock-sweep approach. And we definitely looked at ARC. It might be
worth checking the archives from back then to see what's already been
considered.

However, I think the main issue isn't finding new algorithms that are
better in some specific circumstances. The hard part is figuring out
whether their performance is better in general.

Yeah. You can prove almost anything with the right set of test cases :-(

regards, tom lane


#13Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#12)
Re: Page replacement algorithm in buffer cache

On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

And we definitely looked at ARC

We didn't just look at it. At least one release used it. Then patent
issues were raised (and I think the implementation had some contention
problems).

--
greg


#14Atri Sharma
atri.jiit@gmail.com
In reply to: Bruce Momjian (#13)
Re: Page replacement algorithm in buffer cache

On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

And we definitely looked at ARC

We didn't just look at it. At least one release used it. Then patent
issues were raised (and I think the implementation had some contention
problems).

--
greg

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

Regards,

Atri

--
Regards,

Atri
l'apprenant


#15Merlin Moncure
mmoncure@gmail.com
In reply to: Atri Sharma (#14)
Re: Page replacement algorithm in buffer cache

On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma <atri.jiit@gmail.com> wrote:

On Fri, Mar 22, 2013 at 11:36 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Mar 22, 2013 at 2:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

And we definitely looked at ARC

We didn't just look at it. At least one release used it. Then patent
issues were raised (and I think the implementation had some contention
problems).

--
greg

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

well, what problem are you trying to solve exactly? the main problems
I see today are not so much in terms of page replacement but spinlock
and lwlock contention.

merlin


#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#15)
Re: Page replacement algorithm in buffer cache

Merlin Moncure <mmoncure@gmail.com> writes:

On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma <atri.jiit@gmail.com> wrote:

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

well, what problem are you trying to solve exactly? the main problems
I see today are not so much in terms of page replacement but spinlock
and lwlock contention.

Even back when we last hacked on that algorithm, the concerns were not
so much about which pages it replaced as how much overhead and
contention was created by the management algorithm. I haven't seen any
reason to think we have a problem with the quality of the replacement
choices. The proposal to increase the initial usage count would
definitely lead to more overhead/contention, though, because it would
result in having to circle around all the buffers more times (on
average) to get a free buffer.
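Tom's point can be put in back-of-the-envelope terms, consistent with Amit's two-pass observation earlier in the thread: a buffer that is never re-accessed is decremented once per full pass of the hand, so an initial count of k costs roughly k+1 passes before that buffer is reclaimable. A toy model (my own illustration, not code from freelist.c):

```c
#include <assert.h>

/* A never-reaccessed buffer with initial count c is decremented once per
 * full pass of the clock hand, so it takes c passes to reach zero plus
 * one more pass for the hand to come around and reclaim it. */
int passes_until_evictable(int initial_usage_count)
{
    int passes = 0, count = initial_usage_count;
    while (count > 0)
    {
        count--;      /* one decrement per full sweep */
        passes++;
    }
    return passes + 1;   /* the pass that actually evicts it */
}
```

So starting fresh buffers at the maximum count (5) would roughly triple the sweeping work per eviction compared to starting at 1.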

regards, tom lane


#17Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#16)
Re: Page replacement algorithm in buffer cache

On Fri, Mar 22, 2013 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Fri, Mar 22, 2013 at 1:13 PM, Atri Sharma <atri.jiit@gmail.com> wrote:

What is the general thinking? Is it time to start testing again and
thinking about improvements to the current algorithm?

well, what problem are you trying to solve exactly? the main problems
I see today are not so much in terms of page replacement but spinlock
and lwlock contention.

Even back when we last hacked on that algorithm, the concerns were not
so much about which pages it replaced as how much overhead and
contention was created by the management algorithm. I haven't seen any
reason to think we have a problem with the quality of the replacement
choices. The proposal to increase the initial usage count would
definitely lead to more overhead/contention, though, because it would
result in having to circle around all the buffers more times (on
average) to get a free buffer.

yup...absolutely. I have a hunch that the occasional gripes we see
about server stalls under high load with read only (or mostly read
only) loads are coming from spinlock contention under the lwlock
hitting a critical point and shutting the server down effectively
until by chance the backend with the lwlock gets lucky and lands the
spinlock.

I think there is some very low hanging optimization fruit in the clock
sweep loop. first and foremost, I see no good reason why when
scanning pages we have to spin and wait on a buffer in order to
pedantically adjust usage_count. some simple refactoring there could
set it up so that a simple TAS (or even a TTAS with the first test in
front of the cache line lock, as is done automatically on x86 IIRC)
could guard the buffer and, in the event of any lock detected, simply
move on to the next candidate without messing around with that buffer
at all. This could be construed as a 'trylock' variant of a spinlock
and might help out with cases where an especially hot buffer is
locking up the sweep. This is exploiting the fact that from
StrategyGetBuffer we don't need a *particular* buffer, just *a*
buffer.
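The trylock idea can be sketched with C11 atomics. This is a hypothetical shape for illustration only, not the actual s_lock.c interface, and `buf_trylock`/`sweep_try_touch` are invented names:

```c
#include <stdatomic.h>
#include <assert.h>

typedef atomic_flag buf_spinlock;

/* Conditional acquire: a single TAS, no spinning.  Returns 1 on success. */
int buf_trylock(buf_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

void buf_unlock(buf_spinlock *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* In the sweep: if the header is contended, just move on -- we need
 * *a* buffer, not this particular one.  Returns 1 if the count was
 * examined, 0 if the buffer was skipped due to contention. */
int sweep_try_touch(buf_spinlock *lock, int *usage_count)
{
    if (!buf_trylock(lock))
        return 0;            /* contended: treat as "in use", skip it */
    if (*usage_count > 0)
        (*usage_count)--;
    buf_unlock(lock);
    return 1;
}
```

A contended header is, as Tom notes below, itself evidence the buffer is busy, so skipping it without decrementing loses little.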

I also wonder if we shouldn't (perhaps in addition to the above)
resuscitate Jeff Janes's idea to get rid of the lwlock completely and
manage everything with spinlocks.

Naturally, all of this would have to be confirmed with some very robust testing.

merlin


#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#17)
Re: Page replacement algorithm in buffer cache

Merlin Moncure <mmoncure@gmail.com> writes:

I think there is some very low hanging optimization fruit in the clock
sweep loop. first and foremost, I see no good reason why when
scanning pages we have to spin and wait on a buffer in order to
pedantically adjust usage_count. some simple refactoring there could
set it up so that a simple TAS (or even a TTAS with the first test in
front of the cache line lock, as is done automatically on x86 IIRC)
could guard the buffer and, in the event of any lock detected, simply
move on to the next candidate without messing around with that buffer
at all. This could be construed as a 'trylock' variant of a spinlock
and might help out with cases where an especially hot buffer is
locking up the sweep. This is exploiting the fact that from
StrategyGetBuffer we don't need a *particular* buffer, just *a*
buffer.

Hm. You could argue in fact that if there's contention for the buffer
header, that's proof that it's busy and shouldn't have its usage count
decremented. So this seems okay from a logical standpoint.

However, I'm not real sure that it's possible to do a conditional
spinlock acquire that doesn't create just as much hardware-level
contention as a full acquire (ie, TAS is about as bad whether it
gets the lock or not). So the actual benefit is a bit less clear.

regards, tom lane


#19Merlin Moncure
mmoncure@gmail.com
In reply to: Tom Lane (#18)
Re: Page replacement algorithm in buffer cache

On Fri, Mar 22, 2013 at 3:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

I think there is some very low hanging optimization fruit in the clock
sweep loop. first and foremost, I see no good reason why when
scanning pages we have to spin and wait on a buffer in order to
pedantically adjust usage_count. some simple refactoring there could
set it up so that a simple TAS (or even a TTAS with the first test in
front of the cache line lock, as is done automatically on x86 IIRC)
could guard the buffer and, in the event of any lock detected, simply
move on to the next candidate without messing around with that buffer
at all. This could be construed as a 'trylock' variant of a spinlock
and might help out with cases where an especially hot buffer is
locking up the sweep. This is exploiting the fact that from
StrategyGetBuffer we don't need a *particular* buffer, just *a*
buffer.

Hm. You could argue in fact that if there's contention for the buffer
header, that's proof that it's busy and shouldn't have its usage count
decremented. So this seems okay from a logical standpoint.

However, I'm not real sure that it's possible to do a conditional
spinlock acquire that doesn't create just as much hardware-level
contention as a full acquire (ie, TAS is about as bad whether it
gets the lock or not). So the actual benefit is a bit less clear.

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately. if the non locking test comes good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount. so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non locking test first.
this would of course require some amusing adjustments to various
logical checks (usage_count <= 0, heh).

merlin


#20Ants Aasma
ants.aasma@cybertec.at
In reply to: Merlin Moncure (#19)
Re: Page replacement algorithm in buffer cache

On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

well if you do a non-locking test first you could at least avoid some
cases (and, if you get the answer wrong, so what?) by jumping to the
next buffer immediately. if the non locking test comes good, only
then do you do a hardware TAS.

you could in fact go further and dispense with all locking in front of
usage_count, on the premise that it's only advisory and not a real
refcount. so you only then lock if/when it's time to select a
candidate buffer, and only then when you did a non locking test first.
this would of course require some amusing adjustments to various
logical checks (usage_count <= 0, heh).

Moreover, if the buffer happens to miss a decrement due to a data
race, there's a good chance that the buffer is heavily used and
wouldn't need to be evicted soon anyway. (If you arrange it to be a
read-test-inc/dec-store operation then you will never go out of
bounds.) However, clocksweep and usage_count maintenance are not what is
causing contention, because that workload is distributed. The issue is
pinning and unpinning. There we need an accurate count and there are
some pages like index roots that get hit very heavily. Things to do
there would be, in my opinion, to convert to a futex based spinlock so
that when there is contention it doesn't completely kill performance, and
then try to get rid of the contention. Converting to lock-free pinning
won't help much here, as what is killing us is the cacheline
bouncing.

One way to get rid of contention is the buffer nailing idea that
Robert came up with. If some buffer gets so hot that maintaining
refcount on the buffer header leads to contention, promote that buffer
to a nailed status, let everyone keep their pin counts locally and
sometime later revisit the nailing decision and if necessary convert
pins back to the buffer header.

One other interesting idea I have seen is closeable scalable nonzero
indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
use a tree structure to dynamically stripe access to the shared lock
counter when contention is detected. Downside is that considerable
amount of shared memory is needed so there needs to be some way to
limit the resource usage. This is actually somewhat isomorphic to the
nailing idea.

The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers. I think the
improvements should concentrate on finding out what is the problem
there and figuring out how to fix it. A simple idea to test would be
to just partition shared buffers along with the whole clock sweep
machinery into smaller ones, like the buffer mapping hash tables
already are. This should at the very least reduce contention for the
clock sweep even if it doesn't reduce work done per page miss.

[1]: http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf
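The partitioning idea from the last paragraph could look roughly like this: each partition gets its own buffers and its own clock hand, so concurrent sweeps touch disjoint state. A toy sketch (invented structure, not the actual buffer manager; the partition-selection policy is an assumption):

```c
#include <assert.h>

#define NPARTITIONS 4
#define BUFS_PER_PARTITION 4

typedef struct { int usage_count; } Buffer;

typedef struct {
    Buffer bufs[BUFS_PER_PARTITION];
    int clock_hand;              /* each partition sweeps independently */
} Partition;

static Partition partitions[NPARTITIONS];

/* A backend picks a partition (e.g. hashed from its backend id) and
 * sweeps only within it; returns a global buffer number. */
int find_victim_in(int p)
{
    Partition *part = &partitions[p % NPARTITIONS];
    for (;;)
    {
        int i = part->clock_hand;
        part->clock_hand = (part->clock_hand + 1) % BUFS_PER_PARTITION;
        if (part->bufs[i].usage_count == 0)
            return (p % NPARTITIONS) * BUFS_PER_PARTITION + i;
        part->bufs[i].usage_count--;
    }
}
```

The trade-off, as noted above, is that this reduces contention on the sweep state without reducing the per-miss work, and a hot partition can still be swept harder than the others.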

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#21Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Ants Aasma (#20)
#22Atri Sharma
atri.jiit@gmail.com
In reply to: Ants Aasma (#20)
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Jim Nasby (#21)
#24Ants Aasma
ants.aasma@cybertec.at
In reply to: Atri Sharma (#22)
#25Ants Aasma
ants.aasma@cybertec.at
In reply to: Jim Nasby (#21)
#26Atri Sharma
atri.jiit@gmail.com
In reply to: Jim Nasby (#21)
#27Jeff Janes
jeff.janes@gmail.com
In reply to: Atri Sharma (#1)
#28Merlin Moncure
mmoncure@gmail.com
In reply to: Ants Aasma (#20)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#27)
#30Jeff Janes
jeff.janes@gmail.com
In reply to: Atri Sharma (#7)
#31Atri Sharma
atri.jiit@gmail.com
In reply to: Jeff Janes (#30)
#32Atri Sharma
atri.jiit@gmail.com
In reply to: Merlin Moncure (#28)
#33Ants Aasma
ants.aasma@cybertec.at
In reply to: Atri Sharma (#32)
#34Greg Smith
gsmith@gregsmith.com
In reply to: Ants Aasma (#10)
#35Atri Sharma
atri.jiit@gmail.com
In reply to: Greg Smith (#34)
#36Atri Sharma
atri.jiit@gmail.com
In reply to: Ants Aasma (#33)
#37Ants Aasma
ants.aasma@cybertec.at
In reply to: Atri Sharma (#36)
#38Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#13)
#39Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#18)
#40Merlin Moncure
mmoncure@gmail.com
In reply to: Bruce Momjian (#39)
#41Jeff Janes
jeff.janes@gmail.com
In reply to: Ants Aasma (#20)
#42Merlin Moncure
mmoncure@gmail.com
In reply to: Jeff Janes (#41)
#43Robert Haas
robertmhaas@gmail.com
In reply to: Merlin Moncure (#42)
#44Merlin Moncure
mmoncure@gmail.com
In reply to: Robert Haas (#43)
#45Bruce Momjian
bruce@momjian.us
In reply to: Merlin Moncure (#44)
#46Merlin Moncure
mmoncure@gmail.com
In reply to: Bruce Momjian (#45)
#47Andres Freund
andres@anarazel.de
In reply to: Merlin Moncure (#42)
#48Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#47)
#49Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Ants Aasma (#25)
#50Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Merlin Moncure (#48)
#51Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#23)
#52Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Greg Smith (#34)
#53Atri Sharma
atri.jiit@gmail.com
In reply to: Merlin Moncure (#42)
#54Merlin Moncure
mmoncure@gmail.com
In reply to: Jim Nasby (#50)
#55Amit Kapila
amit.kapila16@gmail.com
In reply to: Jim Nasby (#51)
#56Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#49)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#56)
#58Robert Haas
robertmhaas@gmail.com
In reply to: Merlin Moncure (#54)
#59Merlin Moncure
mmoncure@gmail.com
In reply to: Robert Haas (#58)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Merlin Moncure (#59)
#61Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#60)
#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#60)
#63Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#62)
#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#63)
#65Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#64)
#66Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#64)
#67Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#66)
#68Atri Sharma
atri.jiit@gmail.com
In reply to: Robert Haas (#60)
#69Merlin Moncure
mmoncure@gmail.com
In reply to: Atri Sharma (#68)
#70Atri Sharma
atri.jiit@gmail.com
In reply to: Merlin Moncure (#69)
#71Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#65)
#72Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#71)
#73Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#72)
#74Ants Aasma
ants.aasma@cybertec.at
In reply to: Tom Lane (#73)
#75Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#60)
#76Amit Kapila
amit.kapila16@gmail.com
In reply to: Greg Smith (#75)
#77Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#75)
#78Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#77)
#79Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#13)
#80Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#79)
#81Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#13)
#82Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#81)