2nd Level Buffer Cache
Hi,
I have implemented an initial concept of a 2nd level cache. The idea is to
keep some segments of shared memory for special buffers (e.g. indexes) to
prevent them from being overwritten by other operations. I added this
functionality to the nbtree index scan.
I tested this by doing an index scan, a sequential read, dropping the system
buffers, and doing the index scan again. In a few places I saw performance
improvements, but actually I'm not sure whether this was just "random" or a
real improvement.
There are a few places to optimize in the code as well, and the patch needs a
lot of work, but could you look at it and give your opinions?
Regards,
Radek
Attachments:
2nd_lvl_cache.diff (text/x-patch; charset=UTF-8) +335 −51
Radosław Smogura <rsmogura@softperience.eu> wrote:
I have implemented an initial concept of a 2nd level cache. The idea
is to keep some segments of shared memory for special buffers (e.g.
indexes) to prevent them from being overwritten by other operations.
I added this functionality to the nbtree index scan.
I tested this by doing an index scan, a sequential read, dropping the
system buffers, and doing the index scan again. In a few places I saw
performance improvements, but actually I'm not sure whether this was
just "random" or a real improvement.
I've often wondered about this. In a database I developed back in
the '80s it was clearly a win to have a special cache for index
entries and other special pages closer to the database than the
general cache. A couple things have changed since the '80s (I mean,
besides my waistline and hair color), and PostgreSQL has many
differences from that other database, so I haven't been sure it
would help as much, but I have wondered.
I can't really look at this for a couple weeks, but I'm definitely
interested. I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.
https://commitfest.postgresql.org/action/commitfest_view/open
There is few places to optimize code as well, and patch need many
work, but may you see it and give opinions?
For something like this it makes perfect sense to show "proof of
concept" before trying to cover everything.
-Kevin
On Thu, 17 Mar 2011 16:02:18 -0500, Kevin Grittner wrote:
Radosław Smogura <rsmogura@softperience.eu> wrote:
[...]
I've often wondered about this. In a database I developed back in
the '80s it was clearly a win to have a special cache for index
entries and other special pages closer to the database than the
general cache. A couple things have changed since the '80s (I mean,
besides my waistline and hair color), and PostgreSQL has many
differences from that other database, so I haven't been sure it
would help as much, but I have wondered.
I can't really look at this for a couple weeks, but I'm definitely
interested. I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.
https://commitfest.postgresql.org/action/commitfest_view/open
[...]
For something like this it makes perfect sense to show "proof of
concept" before trying to cover everything.
-Kevin
Yes, there is some change, and I looked at this more carefully, as my
performance results weren't what I expected. I found that PG uses a
BufferAccessStrategy for sequential scans, so my test query took only 32
buffers from the pool and didn't overwrite the index pool too much. This
BAS was a real surprise. In any case, when I finish polishing I will
send a good patch, with proof.
Actually, the idea of this patch was this:
Some operations require many buffers, and PG uses the "clock sweep" to
get the next free buffer, so it may overwrite an index buffer. From the
point of view of good database design we should use indexes, so purging
an index from the cache will hurt performance.
As a side effect I saw that this 2nd level cache keeps the pg_* indexes
in memory too, so I am thinking of including a 3rd level cache for some
pg_* tables.
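For readers following the thread, the "clock sweep" being discussed can be sketched roughly as below. This is a simplified stand-in for the logic in src/backend/storage/buffer/freelist.c; the struct and field names are invented for illustration, not the actual PostgreSQL definitions:

```c
#include <assert.h>

#define NBUFFERS 8
#define MAX_USAGE_COUNT 5

/* Simplified stand-in for a buffer descriptor. */
typedef struct {
    int usage_count;   /* bumped on access, decayed by the sweep */
    int pinned;        /* nonzero while a backend is using the page */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int next_victim = 0;    /* the clock hand */

/* Called whenever a page in the buffer is accessed. */
void touch_buffer(int id)
{
    if (buffers[id].usage_count < MAX_USAGE_COUNT)
        buffers[id].usage_count++;
}

/* Advance the clock hand until a buffer with usage_count == 0 is
 * found, decrementing counts as we pass.  Frequently used pages
 * survive several rotations; cold pages are evicted quickly. */
int clock_sweep_victim(void)
{
    for (;;) {
        BufDesc *buf = &buffers[next_victim];
        int id = next_victim;
        next_victim = (next_victim + 1) % NBUFFERS;
        if (buf->pinned)
            continue;
        if (buf->usage_count == 0)
            return id;
        buf->usage_count--;
    }
}
```

This is the mechanism Radek's pool reservation works around: a large scan touching many cold pages can still drive the hand past rarely re-touched index pages and evict them.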
Regards,
Radek
rsmogura <rsmogura@softperience.eu> wrote:
Yes, there is some change, and I looked at this more carefully, as
my performance results weren't what I expected. I found that PG uses
a BufferAccessStrategy for sequential scans, so my test query took
only 32 buffers from the pool and didn't overwrite the index pool
too much. This BAS was a real surprise. In any case, when I finish
polishing I will send a good patch, with proof.
Yeah, that heuristic makes this less critical, for sure.
Actually, the idea of this patch was this:
Some operations require many buffers, and PG uses the "clock sweep"
to get the next free buffer, so it may overwrite an index buffer.
From the point of view of good database design we should use
indexes, so purging an index from the cache will hurt performance.
As a side effect I saw that this 2nd level cache keeps the pg_*
indexes in memory too, so I am thinking of including a 3rd level
cache for some pg_* tables.
Well, the more complex you make it the more overhead there is, which
makes it harder to come out ahead. FWIW, in musing about it (as
recently as this week), my idea was to add another field which would
factor into the clock sweep calculations. For indexes, it might be
"levels above leaf pages". I haven't reviewed the code in depth to
know how to use it; this was just idle daydreaming based on that
prior experience. It's far from certain that the concept will
actually prove beneficial in PostgreSQL.
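Purely as a thought experiment, Kevin's idea might look something like this: give each page an initial eviction "credit" based on its height above the btree leaf level, so upper index pages survive more sweep passes. The field name and formula below are invented for illustration; nothing like this exists in any patch:

```c
#include <assert.h>

#define BASE_USAGE 1
#define MAX_USAGE  5

/* Hypothetical: when a page is first read into a buffer, seed its
 * usage count from its "levels above leaf pages".  A heap or leaf
 * page (level 0) starts at the base; a root several levels up gets
 * extra credit, so the clock sweep must pass it more times before
 * it becomes an eviction candidate. */
int initial_usage_count(int levels_above_leaf)
{
    int uc = BASE_USAGE + levels_above_leaf;
    return uc > MAX_USAGE ? MAX_USAGE : uc;
}
```

The appeal of this shape is that it reuses the existing sweep machinery instead of adding a separate pool, at the cost of one extra field per buffer.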
Maybe the thing to focus on first is the oft-discussed "benchmark
farm" (similar to the "build farm"), with a good mix of loads, so
that the impact of changes can be better tracked for multiple
workloads on a variety of platforms and configurations. Without
something like that it is very hard to justify the added complexity
of an idea like this in terms of the performance benefit gained.
-Kevin
On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
Maybe the thing to focus on first is the oft-discussed "benchmark
farm" (similar to the "build farm"), with a good mix of loads, so
that the impact of changes can be better tracked for multiple
workloads on a variety of platforms and configurations. Without
something like that it is very hard to justify the added complexity
of an idea like this in terms of the performance benefit gained.
A related area that could use some looking at is why performance tops
out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can
apparently handle much larger buffer pools without a performance
drop-off. There are some advantages to our reliance on the OS buffer
cache, to be sure, but as RAM continues to grow this might start to
get annoying. On a 4GB system you might have shared_buffers set to
25% of memory, but on a 64GB system it'll be a smaller percentage, and
as memory capacities continue to climb it'll be smaller still.
Unfortunately I don't have the hardware to investigate this, but it's
worth thinking about, especially if we're thinking of doing things
that add more caching.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Excerpts from rsmogura's message of Fri Mar 18 11:57:48 -0300 2011:
Actually, the idea of this patch was this:
Some operations require many buffers, and PG uses the "clock sweep" to
get the next free buffer, so it may overwrite an index buffer. From the
point of view of good database design we should use indexes, so purging
an index from the cache will hurt performance.
The BufferAccessStrategy stuff was written to solve this problem.
As a side effect I saw that this 2nd level cache keeps the pg_* indexes
in memory too, so I am thinking of including a 3rd level cache for some
pg_* tables.
Keep in mind that there's already another layer of caching (see
syscache.c) for system catalogs on top of the buffer cache.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Radek,
I have implemented an initial concept of a 2nd level cache. The idea is to
keep some segments of shared memory for special buffers (e.g. indexes) to
prevent them from being overwritten by other operations. I added this
functionality to the nbtree index scan.
The problem with any "special" buffering of database objects (other than
maybe the system catalogs) is that it improves one use case at the
expense of others. For example, special buffering of indexes would have a negative
effect on use cases which are primarily seq scans. Also, how would your
index buffer work for really huge indexes, like GiST and GIN indexes?
In general, I think that improving the efficiency/scalability of our
existing buffer system is probably going to bear a lot more fruit than
adding extra levels of buffering.
That being said, one may argue that the root pages of btree indexes are a
legitimate special case. However, it seems like clock-sweep would end
up keeping those in shared buffers all the time regardless.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mar 18, 2011, at 11:19 AM, Robert Haas wrote:
On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
A related area that could use some looking at is why performance tops
out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can
apparently handle much larger buffer pools without a performance
drop-off. There are some advantages to our reliance on the OS buffer
cache, to be sure, but as RAM continues to grow this might start to
get annoying. On a 4GB system you might have shared_buffers set to
25% of memory, but on a 64GB system it'll be a smaller percentage, and
as memory capacities continue to climb it'll be smaller still.
Unfortunately I don't have the hardware to investigate this, but it's
worth thinking about, especially if we're thinking of doing things
that add more caching.
+1
To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Fri, Mar 18, 2011 at 2:15 PM, Jim Nasby <jim@nasby.net> wrote:
+1
To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...
The problem is that the OS doesn't offer any mechanism that would
allow us to obey the WAL-before-data rule.
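To make that rule concrete: with ordinary writes the server enforces the ordering itself, as in this sketch (the function, record format, and file descriptors are illustrative, not PostgreSQL's actual xlog code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* WAL-before-data: a modified data page may reach disk only after
 * the WAL record describing the change is durably stored.  With
 * write() the server controls the order.  If the data page instead
 * lived in a writable mmap() of the data file, the kernel could
 * flush the dirty page at any moment, before step 1 completes --
 * which is exactly the guarantee the OS does not offer. */
int flush_change(int wal_fd, int data_fd,
                 const char *wal_rec, const char *page, size_t page_len)
{
    /* 1. Append the WAL record and force it to disk. */
    if (write(wal_fd, wal_rec, strlen(wal_rec)) < 0)
        return -1;
    if (fsync(wal_fd) != 0)
        return -1;

    /* 2. Only now is the data page allowed to be written out. */
    if (write(data_fd, page, page_len) < 0)
        return -1;
    return 0;
}
```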
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thursday 17 March 2011 22:02:18, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
Radosław Smogura <rsmogura@softperience.eu> wrote:
[...]
I can't really look at this for a couple weeks, but I'm definitely
interested. I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.
https://commitfest.postgresql.org/action/commitfest_view/open
[...]
-Kevin
Here I attach the latest version of the patch with a few performance
improvements (the code is still dirty), and some reports from tests, as
well as my simple tests.
Actually there is a small improvement without dropping the system caches,
and a bigger one when dropping them. I see a small performance decrease
(if we can talk about measurement based on these tests) relative to the
original PG version with the same configuration, but there is an increase
with the 2nd level buffers... or maybe I compared the reports badly.
In the tests I tried to choose typical, simple queries.
Regards,
Radek
On 3/18/11 11:15 AM, Jim Nasby wrote:
To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...
As far as I know, no OS has a more sophisticated approach to eviction
than LRU. And clock-sweep is a significant improvement on performance
over LRU for frequently accessed database objects ... plus our
optimizations around not overwriting the whole cache for things like VACUUM.
2-level caches work well for a variety of applications.
Now, what would be *really* useful is some way to avoid all the data
copying we do between shared_buffers and the FS cache.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh@agliodbs.com> wrote:
To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here...
A lot of people have talked about it. You can find references to mmap
going at least as far back as 2001 or so. The problem is that it would
depend on the OS implementing things in a certain way and guaranteeing
things we don't think can be portably assumed. We would need to mlock
large amounts of address space which most OS's don't allow, and we
would need to at least mlock and munlock lots of small bits of memory
all over the place which would create lots and lots of mappings which
the kernel and hardware implementations would generally not
appreciate.
As far as I know, no OS has a more sophisticated approach to eviction
than LRU. And clock-sweep is a significant improvement on performance
over LRU for frequently accessed database objects ... plus our
optimizations around not overwriting the whole cache for things like VACUUM.
The clock-sweep algorithm was standard OS design before you or I knew
how to type. I would expect any half-decent OS to have something at
least as good -- perhaps better because it can rely on hardware
features to handle things.
However the second point is the crux of the issue and of all similar
issues on where to draw the line between the OS and Postgres. The OS
knows better about the hardware characteristics and can better
optimize the overall system behaviour, but Postgres understands better
its own access patterns and can better optimize its behaviour whereas
the OS is stuck reverse-engineering what Postgres needs, usually from
simple heuristics.
2-level caches work well for a variety of applications.
I think 2-level caches with simple heuristics like "pin all the
indexes" is unlikely to be helpful. At least it won't optimize the
average case and I think that's been proven. It might be helpful for
optimizing the worst-case which would reduce the standard deviation.
Perhaps we're at the point now where that matters.
Where it might be helpful is as a more refined version of the
"sequential scans use limited set of buffers" patch. Instead of having
each sequential scan use a hard coded number of buffers, perhaps all
sequential scans should share a fraction of the global buffer pool
managed separately from the main pool. Though in my thought
experiments I don't see any real win here. In the current scheme if
there's any sign the buffer is useful it gets thrown from the
sequential scan's set of buffers to reuse anyways.
Now, what would be *really* useful is some way to avoid all the data
copying we do between shared_buffers and the FS cache.
Well the two options are mmap/mlock or directio. The former might be a
fun experiment but I expect any OS to fall over pretty quickly when
faced with thousands (or millions) of 8kB mappings. The latter would
need Postgres to do async i/o and hopefully a global view of its i/o
access patterns so it could do prefetching in a lot more cases.
--
greg
On Mon, 21 Mar 2011 10:24:22 +0000, Greg Stark wrote:
On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <josh@agliodbs.com> wrote:
[...]
A lot of people have talked about it. You can find references to mmap
going at least as far back as 2001 or so. The problem is that it would
depend on the OS implementing things in a certain way and guaranteeing
things we don't think can be portably assumed. We would need to mlock
large amounts of address space which most OS's don't allow, and we
would need to at least mlock and munlock lots of small bits of memory
all over the place which would create lots and lots of mappings which
the kernel and hardware implementations would generally not
appreciate.
Actually, just out of curiosity, I did a test with mmap, and I got a 2%
boost on data reading, maybe because of skipping the memcpy in fread. I'm
really curious how fast it will be, if at all, once I add some good and
needed stuff, and how e.g. vacuum will work.
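The copy being skipped can be seen in a minimal comparison like the following; the two functions illustrate the access paths in general, they are not code from the patch:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum a file's bytes through read(): every block is copied from the
 * kernel page cache into a private buffer before use. */
long sum_via_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    unsigned char buf[8192];
    long sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    close(fd);
    return sum;
}

/* Sum through mmap(): the bytes are accessed directly in the page
 * cache, skipping the copy -- the effect the ~2% improvement above
 * is attributed to. */
long sum_via_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];
    munmap(p, st.st_size);
    return sum;
}
```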
<snip>
2-level caches work well for a variety of applications.
I think 2-level caches with simple heuristics like "pin all the
indexes" is unlikely to be helpful. At least it won't optimize the
average case and I think that's been proven. It might be helpful for
optimizing the worst-case which would reduce the standard deviation.
Perhaps we're at the point now where that matters.
Actually, the 2nd level cache does not pin index buffers. It is just, in
simple words, a set of reserved buffer ids to be used for index pages;
all the logic with pinning etc. stays the same, the difference is that
default-level operations will not touch the 2nd level. I posted some
reports from my simple tests. When I was experimenting with the 2nd
level cache I saw that some operations may swap out system table
buffers, too.
<snip>
Regards,
Radek
On Mon, Mar 21, 2011 at 5:24 AM, Greg Stark <gsstark@mit.edu> wrote:
[...]
Now, what would be *really* useful is some way to avoid all the data
copying we do between shared_buffers and the FS cache.
Well the two options are mmap/mlock or directio. The former might be a
fun experiment but I expect any OS to fall over pretty quickly when
faced with thousands (or millions) of 8kB mappings. The latter would
need Postgres to do async i/o and hopefully a global view of its i/o
access patterns so it could do prefetching in a lot more cases.
Can't you make just one large mapping and lock it in 8k regions? I
thought the problem with mmap was not being able to detect other
processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html),
compatibility issues (possibly obsolete), etc.
merlin
On 21.03.2011 17:54, Merlin Moncure wrote:
Can't you make just one large mapping and lock it in 8k regions? I
thought the problem with mmap was not being able to detect other
processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html)
compatibility issues (possibly obsolete), etc.
That mail is about replacing SysV shared memory with mmap(). Detecting
other processes is a problem in that use, but that's not an issue with
using mmap() to replace shared buffers.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 3/21/11 3:24 AM, Greg Stark wrote:
2-level caches work well for a variety of applications.
I think 2-level caches with simple heuristics like "pin all the
indexes" is unlikely to be helpful. At least it won't optimize the
average case and I think that's been proven. It might be helpful for
optimizing the worst-case which would reduce the standard deviation.
Perhaps we're at the point now where that matters.
You're missing my point ... Postgres already *has* a 2-level cache:
shared_buffers and the FS cache. Anything we add to that will be adding
levels.
We already did that, actually, when we implemented ARC: effectively gave
PostgreSQL a 3-level cache. The results were not very good, although
the algorithm could be at fault there.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
Can't you make just one large mapping and lock it in 8k regions? I
thought the problem with mmap was not being able to detect other
processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html),
compatibility issues (possibly obsolete), etc.
I was assuming that locking part of a mapping would force the kernel
to split the mapping. It has to record the locked state somewhere so
it needs a data structure that represents the size of the locked
section and that would, I assume, be the mapping.
It's possible the kernel would not in fact fall over too badly doing
this. At some point I'll go ahead and do experiments on it. It's a bit
fraught though, as the performance may depend on the memory
management features of the chipset.
That said, that's only part of the battle. On 32bit you can't map the
whole database as your database could easily be larger than your
address space. I have some ideas on how to tackle that but the
simplest test would be to just mmap 8kB chunks everywhere.
But it's worse than that. Since you're not responsible for flushing
blocks to disk any longer you need some way to *unlock* a block when
it's possible to be flushed. That means when you flush the xlog you
have to somehow find all the blocks that might no longer need to be
locked and atomically unlock them. That would require new
infrastructure we don't have though it might not be too hard.
What would be nice is a mlock_until() where you eventually issue a
call to tell the kernel what point in time you've reached and it
unlocks everything older than that time.
--
greg
On Mon, Mar 21, 2011 at 4:47 PM, Josh Berkus <josh@agliodbs.com> wrote:
You're missing my point ... Postgres already *has* a 2-level cache:
shared_buffers and the FS cache. Anything we add to that will be adding
levels.
I don't think those two levels are interesting -- they don't interact
cleverly at all.
I was assuming the two levels were segments of the shared buffers that
didn't interoperate at all. If you kick buffers from the higher level
cache into the lower level one then why not just increase the number
of clock sweeps before you flush a buffer and insert non-index pages
into a lower clock level instead of writing code for two levels?
I don't think it will outperform in general because LRU is provably
within some margin from optimal and the clock sweep is an approximate
LRU. The only place you're going to find wins is when you know
something extra about the *future* access pattern that the lru/clock
doesn't know based on the past behaviour. Just saying "indexes are
heavily used" or "system tables are heavily used" isn't really extra
information since the LRU can figure that out. Something like
"sequential scans of tables larger than shared buffers don't go back
and read old pages before they age out" is.
The other place you might win is if you have some queries that you
want to always be fast at the expense of slower queries. So your short
web queries that only need to touch a few small tables and system
tables can tag buffers that are higher priority and shouldn't be
swapped out to achieve a slightly higher hit rate on the global cache.
--
greg
Excerpts from Josh Berkus's message of Mon Mar 21 13:47:21 -0300 2011:
We already did that, actually, when we implemented ARC: effectively gave
PostgreSQL a 3-level cache. The results were not very good, although
the algorithm could be at fault there.
Was it really all that bad? IIRC we replaced ARC with the current clock
sweep due to patent concerns. (Maybe there were performance concerns as
well, I don't remember).
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Was it really all that bad? IIRC we replaced ARC with the current clock
sweep due to patent concerns. (Maybe there were performance concerns as
well, I don't remember).
Yeah, that was why the patent was frustrating. Performance was poor and
we were planning on replacing ARC in 8.2 anyway. Instead we had to
backport it.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com