First set of OSDL Shared Mem scalability results, some weirdness ...
Folks,
I'm hoping that some of you can shed some light on this.
I've been trying to peg the "sweet spot" for shared memory using OSDL's
equipment. With Jan's new ARC patch, I was expecting that the desired
amount of shared_buffers would be greatly increased. This has not turned
out to be the case.
The first test series used OSDL's DBT2 (OLTP) test, with 150
"warehouses". All tests were run on a 4-way 700MHz Pentium III system
with 3.8GB RAM, hooked up to a rather high-end storage device (14
spindles). Tests were on PostgreSQL 8.0b3, Linux 2.6.7.
Here's a top-level summary:
shared_buffers    % RAM    NOTPM20*
          1000     0.2%       1287
         23000       5%       1507
         46000      10%       1481
         69000      15%       1382
         92000      20%       1375
        115000      25%       1380
        138000      30%       1344

* = New Order Transactions Per Minute, last 20 minutes
Higher is better. The maximum possible is 1800.
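For reference, the "% RAM" column can be reproduced from the buffer counts. This little sketch assumes the default 8 KB buffer page size and reads "3.8GB" as 3.8e9 bytes (decimal gigabytes), which is what makes the listed percentages come out right:

```python
# Sanity-check the "% RAM" column: shared_buffers is counted in pages.
# Assumptions: 8 KB page size (PostgreSQL default) and 3.8 GB = 3.8e9 bytes.
PAGE_SIZE = 8192   # bytes per shared buffer page
RAM_BYTES = 3.8e9  # RAM on the test machine

def pct_of_ram(shared_buffers: int) -> float:
    """Return a shared_buffers setting as a percentage of total RAM."""
    return shared_buffers * PAGE_SIZE / RAM_BYTES * 100

for n in (1000, 23000, 46000, 69000, 92000, 115000, 138000):
    print(f"{n:>6} buffers = {pct_of_ram(n):4.1f}% of RAM")
```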
As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
which is if anything *lower* than recommendations for 7.4!
This result is so surprising that I want people to take a look at it and tell
me if there's something wrong with the tests or some bottlenecking factor
that I've not seen.
In the order listed above:
http://khack.osdl.org/stp/297959/
http://khack.osdl.org/stp/297960/
http://khack.osdl.org/stp/297961/
http://khack.osdl.org/stp/297962/
http://khack.osdl.org/stp/297963/
http://khack.osdl.org/stp/297964/
http://khack.osdl.org/stp/297965/
Please note that many of the graphs in these reports are broken. Some
metrics weren't recorded at all (flat lines), and the CPU usage graph has
mislabeled lines.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
I have an idea that makes some assumptions about internals that I think
are correct.
When you have a huge number of buffers in a list that has to be
traversed to look for things in cache, e.g. 100k, you will generate an
almost equivalent number of cache line misses on the processor to jump
through all those buffers. As I understand it (and I haven't looked so
I could be wrong), the buffer cache is searched by traversing it
sequentially. OTOH, it seems reasonable to me that the OS disk cache
may actually be using a tree structure that would generate vastly fewer
cache misses by comparison to find a buffer. This could mean a
substantial linear search cost as a function of the number of buffers,
big enough to rise above the noise floor when you have hundreds of
thousands of buffers.
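The hypothesis can be put in a toy model (illustrative only, and as Tom points out downthread, the real main-line lookup path is hashed, not linear): count how many buffer headers must be touched to find a page under a linear scan versus a hashed lookup.

```python
# Toy model: headers touched per lookup when a cache is searched
# linearly vs. via a hash table. Hypothetical illustration only,
# NOT the actual PostgreSQL buffer manager.

def linear_lookup_cost(buffers: list, target) -> int:
    """Touch headers one by one until the target page is found."""
    touched = 0
    for page in buffers:
        touched += 1
        if page == target:
            break
    return touched

def hashed_lookup_cost(index: dict, target) -> int:
    """A hash probe touches O(1) entries regardless of cache size."""
    _ = index.get(target)
    return 1

n = 100_000
buffers = list(range(n))
index = {page: i for i, page in enumerate(buffers)}

# Worst case for the linear scan: the wanted page is at the far end,
# so every one of the 100k headers is touched (and potentially misses
# the CPU cache); the hashed lookup touches one entry either way.
print(linear_lookup_cost(buffers, n - 1))
print(hashed_lookup_cost(index, n - 1))
```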
Cache misses start to really add up when a code path generates many,
many thousands of them, and differences in the access path between the
buffer cache and disk cache would be reflected when you have that many
buffers. I've seen these types of unexpected performance anomalies
before that got traced back to code patterns and cache efficiency and
gotten integer factors improvements by making some seemingly irrelevant
code changes.
So I guess my questions would be: 1) are my assumptions about the
internals correct, and 2) if they are, is there a way to optimize
searching the buffer cache so that a search doesn't iterate over a
really long buffer list that is bottlenecked on cache line replacement?
My random thought of the day,
j. andrew rogers
Josh Berkus <josh@agliodbs.com> writes:
Here's a top-level summary:
shared_buffers    % RAM    NOTPM20*
          1000     0.2%       1287
         23000       5%       1507
         46000      10%       1481
         69000      15%       1382
         92000      20%       1375
        115000      25%       1380
        138000      30%       1344
As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
which is if anything *lower* than recommendations for 7.4!
This doesn't actually surprise me a lot. There are a number of aspects
of Postgres that will get slower the more buffers there are.
One thing that I hadn't focused on till just now, which is a new
overhead in 8.0, is that StrategyDirtyBufferList() scans the *entire*
buffer list *every time it's called*, which is to say once per bgwriter
loop. And to add insult to injury, it's doing that with the BufMgrLock
held (not that it's got any choice).
We could alleviate this by changing the API between this function and
BufferSync, such that StrategyDirtyBufferList can stop as soon as it's
found all the buffers that are going to be written in this bgwriter
cycle ... but AFAICS that means abandoning the "bgwriter_percent" knob
since you'd never really know how many dirty pages there were
altogether.
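The trade-off can be sketched in a few lines (a hypothetical model with invented names, not the actual bufmgr code): stop the scan once this cycle's quota of dirty buffers is collected, at the price of never learning the total dirty count that bgwriter_percent would need.

```python
# Hypothetical model of the StrategyDirtyBufferList trade-off: collect
# dirty buffers for the bgwriter either by scanning everything (the
# 8.0 behaviour described above) or by stopping early at a quota.

def scan_all(buffers):
    """Scan the whole list; returns dirty buffers AND count examined."""
    dirty = [b for b in buffers if b["dirty"]]
    return dirty, len(buffers)

def scan_until_quota(buffers, quota):
    """Stop as soon as `quota` dirty buffers are found. Cheaper, but
    the total number of dirty pages (what bgwriter_percent needs) is
    never discovered."""
    dirty, examined = [], 0
    for b in buffers:
        examined += 1
        if b["dirty"]:
            dirty.append(b)
            if len(dirty) >= quota:
                break
    return dirty, examined

# 100k buffers, every tenth one dirty.
bufs = [{"id": i, "dirty": i % 10 == 0} for i in range(100_000)]

_, full_cost = scan_all(bufs)
_, early_cost = scan_until_quota(bufs, quota=100)
print(full_cost, early_cost)  # the early stop examines far fewer buffers
```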
BTW, what is the actual size of the test database (disk footprint wise)
and how much of that do you think is heavily accessed during the run?
It's possible that the test conditions are such that adjusting
shared_buffers isn't going to mean anything anyway.
regards, tom lane
"J. Andrew Rogers" <jrogers@neopolitan.com> writes:
As I understand it (and I haven't looked so I could be wrong), the
buffer cache is searched by traversing it sequentially.
You really should look first.
The main-line code paths use hashed lookups. There are some cases that
do linear searches through the buffer headers or the CDB lists; in
theory those are supposed to be non-performance-critical cases, though
I am suspicious that some are not (see other response). In any case,
those structures are considerably more compact than the buffers proper,
and I doubt that cache misses per se are the killer factor.
This does raise a question for Josh though, which is "where's the
oprofile results?" If we do have major problems at the level of cache
misses then oprofile would be able to prove it.
regards, tom lane
On Fri, Oct 08, 2004 at 06:32:32PM -0400, Tom Lane wrote:
This does raise a question for Josh though, which is "where's the
oprofile results?" If we do have major problems at the level of cache
misses then oprofile would be able to prove it.
Or cachegrind. I've found it to be really effective at pinpointing cache
misses in the past (one CPU-intensive routine was sped up by 30% just by
avoiding a memory clear). :-)
/* Steinar */
--
Homepage: http://www.sesse.net/
Tom,
This does raise a question for Josh though, which is "where's the
oprofile results?" If we do have major problems at the level of cache
misses then oprofile would be able to prove it.
Missing, I'm afraid. OSDL has been having technical issues with STP all week.
Hopefully the next test run will have them.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Tom,
BTW, what is the actual size of the test database (disk footprint wise)
and how much of that do you think is heavily accessed during the run?
It's possible that the test conditions are such that adjusting
shared_buffers isn't going to mean anything anyway.
The raw data is 32GB, but a lot of the activity is incremental, that is,
inserts and updates to recent inserts. Still, according to Mark, most of
the data does get queried in the course of filling orders.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
josh@agliodbs.com (Josh Berkus) wrote:
I've been trying to peg the "sweet spot" for shared memory using
OSDL's equipment. With Jan's new ARC patch, I was expecting that
the desired amount of shared_buffers would be greatly increased. This
has not turned out to be the case.
That doesn't surprise me.
My primary expectation would be that ARC would be able to make small
buffers much more effective alongside vacuums and seq scans than they
used to be. That does not establish anything about the value of
increasing the size of the buffer cache...
This result is so surprising that I want people to take a look at it
and tell me if there's something wrong with the tests or some
bottlenecking factor that I've not seen.
I'm aware of two conspicuous scenarios where ARC would be expected to
_substantially_ improve performance:
1. When it allows a VACUUM not to throw useful data out of
   the shared cache, because VACUUM now only 'chews' on one
   page of the cache;
2. When it allows a Seq Scan to not push useful data out of
   the shared cache, for much the same reason.
I don't imagine either scenario is prominent in the OSDL tests.
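Scenario 2 can be shown with a toy model (plain Python, and a deliberate oversimplification: the "scan-resistant" policy here just recycles one slot for scan pages, which is not what ARC actually does, but it illustrates why scan resistance matters):

```python
from collections import OrderedDict

# Toy model of cache eviction under a sequential scan. NOT ARC itself,
# just an illustration of the scan-resistance idea.

class LRUCache:
    def __init__(self, size):
        self.size, self.pages = size, OrderedDict()

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)   # mark as recently used
        else:
            if len(self.pages) >= self.size:
                self.pages.popitem(last=False)  # evict least recently used
            self.pages[page] = True

CACHE_SIZE = 100
hot_pages = [f"hot{i}" for i in range(CACHE_SIZE)]
scan_pages = [f"scan{i}" for i in range(10_000)]

# Plain LRU: the seq scan pushes every hot page out of the cache.
lru = LRUCache(CACHE_SIZE)
for p in hot_pages + scan_pages:
    lru.access(p)
print(sum(p in lru.pages for p in hot_pages))  # 0 hot pages survive

# Scan-resistant policy: scan pages only 'chew' on one recycled buffer,
# so the cached working set is left alone entirely.
resistant = LRUCache(CACHE_SIZE)
for p in hot_pages:
    resistant.access(p)
for p in scan_pages:
    pass  # each scan page reuses a single dedicated slot; cache untouched
print(sum(p in resistant.pages for p in hot_pages))  # all 100 survive
```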
Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:
- Data that passes through the cache also passes through kernel
cache, so it's recorded twice, and read twice...
- The more cache pages there are, the more work is needed for
PostgreSQL to manage them. That will notably happen anywhere
that there is a need to scan the cache.
- If there are any inefficiencies in how the OS kernel manages shared
memory, as their size scales, well, that will obviously cause a
slowdown.
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www.ntlug.org/~cbbrowne/internet.html
"One World. One Web. One Program." -- MICROS~1 hype
"Ein Volk, ein Reich, ein Fuehrer" -- Nazi hype
(One people, one country, one leader)
Christopher Browne wrote:
Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:

- Data that passes through the cache also passes through kernel
cache, so it's recorded twice, and read twice...
Even worse, memory that's used for the PG cache is memory that's not
available to the kernel's page cache. Even if the overall memory
usage in the system isn't enough to cause some paging to disk, most
modern kernels will adjust the page/disk cache size dynamically to fit
the memory demands of the system, which in this case means it'll be
smaller if running programs need more memory for their own use.
This is why I sometimes wonder whether or not it would be a win to use
mmap() to access the data and index files -- doing so under a truly
modern OS would surely at the very least save a buffer copy (from the
page/disk cache to program memory) because the OS could instead map
the buffer cache pages directly into the program's memory space.
Since PG often has to have multiple files open at the same time, and
in a production database many of those files will be rather large, PG
would have to limit the size of the mmap()ed region on 32-bit
platforms, which means that things like the order of mmap() operations
to access various parts of the file can become just as important in
the mmap()ed case as it is in the read()/write() case (if not more
so!). I would imagine that the use of mmap() on a 64-bit platform
would be a much, much larger win because PG would most likely be able
to mmap() entire files and let the OS work out how to order disk reads
and writes.
The biggest problem as I see it is that (I think) mmap() would have to
be made to cooperate with malloc() for virtual address space. I
suspect issues like this have already been worked out by others,
however...
--
Kevin Brown kevin@sysexperts.com
Christopher Browne wrote:
josh@agliodbs.com (Josh Berkus) wrote:
This result is so surprising that I want people to take a look at it
and tell me if there's something wrong with the tests or some
bottlenecking factor that I've not seen.

I'm aware of two conspicuous scenarios where ARC would be expected to
_substantially_ improve performance:

1. When it allows a VACUUM not to throw useful data out of
   the shared cache in that VACUUM now only 'chews' on one
   page of the cache;
Right, Josh, I assume you didn't run these test with pg_autovacuum
running, which might be interesting.
Also, how do these numbers compare to 7.4? They may not be what you
expected, but they might still be an improvement.
Matthew
Kevin Brown <kevin@sysexperts.com> writes:
This is why I sometimes wonder whether or not it would be a win to use
mmap() to access the data and index files --
mmap() is Right Out because it does not afford us sufficient control
over when changes to the in-memory data will propagate to disk. The
address-space-management problems you describe are also a nasty
headache, but that one is the showstopper.
regards, tom lane
Tom Lane wrote:
Kevin Brown <kevin@sysexperts.com> writes:
This is why I sometimes wonder whether or not it would be a win to use
mmap() to access the data and index files --

mmap() is Right Out because it does not afford us sufficient control
over when changes to the in-memory data will propagate to disk. The
address-space-management problems you describe are also a nasty
headache, but that one is the showstopper.
Huh? Surely fsync() or fdatasync() of the file descriptor associated
with the mmap()ed region at the appropriate times would accomplish
much of this? I'm particularly confused since PG's entire approach to
disk I/O is predicated on the notion that the OS, and not PG, is the
best arbiter of when data hits the disk. Otherwise it would be using
raw partitions for the highest-speed data store, yes?
Also, there isn't any particular requirement to use mmap() for
everything -- you can use traditional open/write/close calls for the
WAL and mmap() for the data/index files (but it wouldn't surprise me
if this would require some extensive code changes).
That said, if it's typical for many changes to be made to a page
internally before PG needs to commit that page to disk, then your
argument makes sense, and that's especially true if we simply cannot
have the page written to disk in a partially-modified state (something
I can easily see being an issue for the WAL -- would the same hold
true of the index/data files?).
--
Kevin Brown kevin@sysexperts.com
I wrote:
That said, if it's typical for many changes to be made to a page
internally before PG needs to commit that page to disk, then your
argument makes sense, and that's especially true if we simply cannot
have the page written to disk in a partially-modified state (something
I can easily see being an issue for the WAL -- would the same hold
true of the index/data files?).
Also, even if multiple changes would be made to the page, with the
page being valid for a disk write only after all such changes are
made, the use of mmap() (in conjunction with an internal buffer that
would then be copied to the mmap()ed memory space at the appropriate
time) would potentially save a system call over the use of write()
(even if write() were used to write out multiple pages). However,
there is so much lower-hanging fruit than this that an mmap()
implementation almost certainly isn't worth pursuing for this alone.
So: it seems to me that mmap() is worth pursuing only if most internal
buffers tend to be written to only once or if it's acceptable for a
partially modified data/index page to be written to disk (which I
suppose could be true for data/index pages in the face of a rock-solid
WAL).
--
Kevin Brown kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
Tom Lane wrote:
mmap() is Right Out because it does not afford us sufficient control
over when changes to the in-memory data will propagate to disk.
... that's especially true if we simply cannot
have the page written to disk in a partially-modified state (something
I can easily see being an issue for the WAL -- would the same hold
true of the index/data files?).
You're almost there. Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe. That means that we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet. In the existing
implementation we get that by just not issuing write() for a given page
until we know that the relevant WAL log entries are fsync'd down to
disk. (BTW, this is what the LSN field on every page is for: it tells
the buffer manager the latest WAL offset that has to be flushed before
it can safely write the page.)
mmap provides msync which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon. This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of write-and-
wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction
is bad enough, once per atomic action is intolerable.
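The interlock described above can be put in a little model (hypothetical, with made-up names; the real logic lives in the C buffer manager): every page carries the LSN of the last WAL record that touched it, and a page write is only issued after WAL has been flushed at least that far. One flush then covers arbitrarily many updates to the page.

```python
# Toy model of the WAL-before-data rule. Names are invented for
# illustration; this is NOT the actual PostgreSQL implementation.

class WAL:
    def __init__(self):
        self.next_lsn = 0       # next log sequence number to hand out
        self.flushed_lsn = -1   # highest LSN known to be safely on disk
        self.fsync_count = 0

    def append(self, record) -> int:
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush_up_to(self, lsn):
        if lsn > self.flushed_lsn:   # only fsync when actually needed
            self.flushed_lsn = lsn
            self.fsync_count += 1

class Page:
    def __init__(self):
        self.lsn = -1   # LSN of the last WAL record describing this page

    def modify(self, wal: WAL, record):
        self.lsn = wal.append(record)   # log first; no flush required yet

def write_page(page: Page, wal: WAL):
    """The rule: flush WAL through the page's LSN, then write the page."""
    wal.flush_up_to(page.lsn)
    # ... here the actual write() of the page would now be safe ...

wal, page = WAL(), Page()
for i in range(10):             # ten updates to the same page
    page.modify(wal, f"update {i}")
write_page(page, wal)
print(wal.fsync_count)          # 1: a single flush covers all ten updates
```

Under an mmap()ed buffer, any of those ten modifications could reach disk at any moment, so WAL would have to be flushed before every modify() call: ten fsyncs instead of one.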
There is another reason for doing things this way. Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data on
disk. It's not perfect of course, since another backend might have been
in process of issuing a write() when the disaster happens, but it's
pretty good; and I think that that isolation has a lot to do with PG's
good reputation for not corrupting data in crashes. If we had a large
fraction of the address space mmap'd then this sort of crash would be
just about guaranteed to propagate corruption into the on-disk files.
regards, tom lane
Josh Berkus wrote:
Folks,
I'm hoping that some of you can shed some light on this.
I've been trying to peg the "sweet spot" for shared memory using OSDL's
equipment. With Jan's new ARC patch, I was expecting that the desired
amount of shared_buffers would be greatly increased. This has not turned
out to be the case.

The first test series used OSDL's DBT2 (OLTP) test, with 150
"warehouses". All tests were run on a 4-way 700MHz Pentium III system
with 3.8GB RAM, hooked up to a rather high-end storage device (14
spindles). Tests were on PostgreSQL 8.0b3, Linux 2.6.7.
I'd like to see these tests run using the CPU affinity capability, in
order to force a backend to stay on the same CPU for its whole lifetime;
this could drastically increase the cache hit rate.
Regards
Gaetano Mendola
On Fri, 8 Oct 2004, Josh Berkus wrote:
As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
which is if anything *lower* than recommendations for 7.4!
What recommendation is that? Having shared buffers at about 10% of RAM
sounds familiar to me. What was recommended for 7.4? In the past we
used to say that the worst value is 50%, since then the same things might
be cached both by PG and the OS disk cache.
Why do we expect the shared buffer size sweet spot to change because of
the new ARC stuff? And why would it make it better to have bigger shared
mem?

Wouldn't it be the opposite: now that we don't invalidate as much of the
cache for vacuums and seq. scans, we can cache as well as before but with
fewer shared buffers?
That said, testing and getting some numbers of good sizes for shared mem
is good.
--
/Dennis Björklund
On 10/8/2004 10:10 PM, Christopher Browne wrote:
josh@agliodbs.com (Josh Berkus) wrote:
I've been trying to peg the "sweet spot" for shared memory using
OSDL's equipment. With Jan's new ARC patch, I was expecting that
the desired amount of shared_buffers would be greatly increased. This
has not turned out to be the case.

That doesn't surprise me.
Neither does it surprise me.
My primary expectation would be that ARC would be able to make small
buffers much more effective alongside vacuums and seq scans than they
used to be. That does not establish anything about the value of
increasing the size of the buffer cache...
The primary goal of ARC is to prevent total cache eviction caused by
sequential scans. Which means it is designed to avoid the catastrophic
impact of a pg_dump or other, similar access running in parallel to the
OLTP traffic. It would be much more interesting to see how a pg_dump
started halfway into a 2-hour measurement interval affects the response
times.
One also has to take a closer look at the data of the DBT2. What amount
of that 32GB is frequently accessed, and therefore a good thing to
keep in the PG shared cache? A cache significantly larger than that
doesn't make sense to me, under any cache strategy.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
On 10/9/2004 7:20 AM, Kevin Brown wrote:
Christopher Browne wrote:
Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:

- Data that passes through the cache also passes through kernel
cache, so it's recorded twice, and read twice...

Even worse, memory that's used for the PG cache is memory that's not
available to the kernel's page cache.

Which underlines my previous statement, that a PG shared cache much
larger than the frequently accessed data portion of the DB is
counterproductive. Double buffering (kernel disk buffer plus shared
buffer) only makes sense for data that would otherwise cause excessive
memory copies in and out of the shared buffer. After that, it only
lowers the memory available for disk buffers.

Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
On 10/8/2004 10:10 PM, Christopher Browne wrote:
josh@agliodbs.com (Josh Berkus) wrote:
I've been trying to peg the "sweet spot" for shared memory using
OSDL's equipment. With Jan's new ARC patch, I was expecting that
the desired amount of shared_buffers would be greatly increased. This
has not turned out to be the case.

That doesn't surprise me.
Neither does it surprise me.
There's been some speculation that having shared buffers at about 50%
of your RAM is pessimal, as it guarantees the OS cache is merely doubling
up on all the buffers postgres is keeping. I wonder whether there's a
second sweet spot where the postgres cache is closer to the total amount
of RAM.
That configuration would have disadvantages for servers running other
jobs besides postgres. And I was led to believe earlier that postgres
starts each backend with a fairly fresh slate as far as the ARC algorithm
goes, so it wouldn't work well for a postgres server that had lots of
short- to moderate-life sessions.
But if it were even close it could be interesting. Reading the data with
O_DIRECT and having a single global cache could be interesting experiments. I
know there are arguments against each of these, but ...
I'm still pulling for an mmap approach to eliminate postgres's buffer cache
entirely in the long term, but it seems like slim odds now. But one way or the
other having two layers of buffering seems like a waste.
--
greg
On 10/13/2004 11:52 PM, Greg Stark wrote:
Jan Wieck <JanWieck@Yahoo.com> writes:
On 10/8/2004 10:10 PM, Christopher Browne wrote:
josh@agliodbs.com (Josh Berkus) wrote:
I've been trying to peg the "sweet spot" for shared memory using
OSDL's equipment. With Jan's new ARC patch, I was expecting that
the desired amount of shared_buffers would be greatly increased. This
has not turned out to be the case.

That doesn't surprise me.
Neither does it surprise me.
There's been some speculation that having a large shared buffers be about 50%
of your RAM is pessimal as it guarantees the OS cache is merely doubling up on
all the buffers postgres is keeping. I wonder whether there's a second sweet
spot where the postgres cache is closer to the total amount of RAM.
That would require that shared memory cannot be swapped out; swapping
it out is allowed in Linux by default, IIRC. Otherwise the entire test
would be completely distorted.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #