mosbench revisited

Started by Robert Haas · over 14 years ago · 29 messages · pgsql-hackers
#1Robert Haas
robertmhaas@gmail.com

About nine months ago, we had a discussion of some benchmarking that
was done by the mosbench folks at MIT:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

Although the authors used PostgreSQL as a test harness for driving
load, it's pretty clear from reading the paper that their primary goal
was to stress the Linux kernel, so the applicability of the paper to
real-world PostgreSQL performance improvement is less than it might
be. Still, having now actually investigated in some detail many of
the same performance issues that they were struggling with, I have a
much clearer understanding of what's really going on here. In
PostgreSQL terms, here are the bottlenecks they ran into:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches." This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads. In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter. I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.

2. "PostgreSQL implements row- and table-level locks atop user-level
mutexes; as a result, even a non-conflicting row- or table-level lock
acquisition requires exclusively locking one of only 16 global
mutexes." I think that the reference to row-level locks here is a red
herring; or at least, I haven't seen any evidence that row-level
locking is a meaningful source of contention on any workload I've
tested. Table-level locks clearly are, and this is the problem that
the now-committed fastlock patch addressed. So, fixed!

3. "Our workload creates one PostgreSQL connection per server core and
sends queries (selects or updates) in batches of 256, aggregating
successive read-only transactions into single transactions. This
workload is intended to minimize application-level contention within
PostgreSQL in order to maximize the stress PostgreSQL places on the
kernel." I had no idea what this was talking about at the time, but
it's now obvious in retrospect that they were working around the
overhead imposed by acquiring and releasing relation and virtualxid
locks. My pending "lazy vxids" patch will address the remaining issue
here.

4. "With modified PostgreSQL on stock Linux, throughput for both
workloads collapses at 36 cores ... The main reason is the kernel's
lseek implementation." With the fastlock, sinval-hasmessages, and
lazy-vxid patches applied (the first two are committed now), it's now
much easier to run headlong into this bottleneck. Prior to those
patches, for this to be an issue, you would need to batch your queries
together in big groups to avoid getting whacked by the lock manager
and/or sinval overhead first. With those problems and the recently
discovered bottleneck in glibc's random() implementation fixed, good
old pgbench -S is enough to hit this problem if you have enough
clients and enough cores. And it turns out that the word "collapse"
is not an exaggeration. On a 64-core Intel box running RHEL 6.1,
performance ramped up from 24k TPS at 4 clients to 175k TPS at 32
clients and then to 207k TPS at 44 clients. After that it fell off a
cliff, dropping to 93k TPS at 52 clients and 26k TPS at 64 clients,
consuming truly horrifying amounts of system time in the process. A
somewhat tedious investigation revealed that the problem is, in fact,
contention on the inode mutex caused by lseek(). Results are much
better with -M prepared (310k TPS at 48 clients, 294k TPS at 64
clients). All one-minute tests with scale factor 100, fitting inside
8GB of shared_buffers (clearly not enough for serious benchmarking,
but enough to demonstrate this issue).

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will. For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c. On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock. A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected. That now
seems unfortunate. Barring a kernel-level fix, we could try to
implement our own cache to work around this problem. However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow avoid the same sort of mutex contention that's killing the
kernel in this workload.
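
For concreteness, the call pattern at issue is one lseek(SEEK_END) per
size check. A minimal sketch of what a relation-size probe boils down
to for a single segment file (the helper name here is hypothetical,
not the actual md.c code):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* PostgreSQL's default block size */

/* What a planner size check boils down to for one segment file: an
 * lseek(SEEK_END), i.e. one trip through generic_file_llseek() and its
 * inode lock per call. Hypothetical helper, not the actual md.c code. */
static long
segment_nblocks(int fd)
{
    off_t len = lseek(fd, 0, SEEK_END);

    if (len < 0)
        return -1;
    return (long) (len / BLCKSZ);
}
```

Every backend running this check on the same file serializes on the
same inode lock, which is why throughput falls apart as cores are
added.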

5. With all of the above problems fixed or worked around, the authors
write, "PostgreSQL's overall scalability is primarily limited by
contention for the spinlock protecting the buffer cache page for the
root of the table index". This is the only problem on their list that
I haven't yet encountered in testing. I'm kind of interested by the
result, actually, as I had feared that the spinlock protecting
ProcArrayLock was going to be a bigger problem sooner. But maybe not.
I'm also concerned about the spinlock protecting the buffer mapping
lock that covers the root index page. I'll investigate further if and
when I come up with a way to dodge the lseek() contention problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Robert Haas (#1)
Re: mosbench revisited

On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will. For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c. On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.

Interesting. There's this thread from 2003 suggesting the use of pread
instead; it was rejected on the argument that lseek is cheap and so not
a problem.

http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

Perhaps we now have a benchmark where the effect can be measured.

There's the issue about whether it screws up the readahead mechanism...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#2)
Re: mosbench revisited

Martijn van Oosterhout <kleptog@svana.org> writes:

On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will. For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c. On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.

Interesting. There's this thread from 2003 suggesting the use of pread
instead; it was rejected on the argument that lseek is cheap and so not
a problem.

http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

That seems rather unrelated. The point here is our use of lseek to find
out the current file size --- or at least, I would hope they're not
trying to read the inode's file size in a SEEK_CUR call.

The reason "-M prepared" helps is presumably that it eliminates most of
the RelationGetNumberOfBlocks calls the planner does to check current
table size. While we could certainly consider using a cheaper (possibly
more stale) value there, it's a bit astonishing to think that that's the
main cost in a parse/plan/execute cycle. Perhaps there are more hotspot
calls than that one?

regards, tom lane

#4Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 2:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

On Wed, Aug 03, 2011 at 02:21:25PM -0400, Robert Haas wrote:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.

Interesting. There's this thread from 2003 suggesting the use of pread
instead; it was rejected on the argument that lseek is cheap and so not
a problem.

http://archives.postgresql.org/pgsql-patches/2003-02/msg00197.php

That seems rather unrelated.  The point here is our use of lseek to find
out the current file size --- or at least, I would hope they're not
trying to read the inode's file size in a SEEK_CUR call.

Correct.

The reason "-M prepared" helps is presumably that it eliminates most of
the RelationGetNumberOfBlocks calls the planner does to check current
table size.  While we could certainly consider using a cheaper (possibly
more stale) value there, it's a bit astonishing to think that that's the
main cost in a parse/plan/execute cycle.  Perhaps there are more hotspot
calls than that one?

Nope.

On a straight pgbench -S test, you get four system calls per query:
recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates
the two lseeks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#4)
Re: mosbench revisited

Robert Haas <robertmhaas@gmail.com> writes:

On a straight pgbench -S test, you get four system calls per query:
recvfrom(), lseek(), lseek(), sendto(). Adding -M prepared eliminates
the two lseeks.

[ scratches head... ] Two? Is that one for the table and one for its
lone index, or are we being redundant there?

(If the query ended up being a seqscan, I'd expect a second
lseek(SEEK_END) when the executor starts up, but I gather from the other
complaints that the mosbench people were only testing simple indexscan
queries.)

regards, tom lane

#6Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On a straight pgbench -S test, you get four system calls per query:
recvfrom(), lseek(), lseek(), sendto().  Adding -M prepared eliminates
the two lseeks.

[ scratches head... ]  Two?

Yep.

Is that one for the table and one for its
lone index, or are we being redundant there?

The former. Specifically, it appears we're smart enough to only test
the last segment (in this case, the table is large enough that there
is a .1 file, and that's what we're lseeking).

(If the query ended up being a seqscan, I'd expect a second
lseek(SEEK_END) when the executor starts up, but I gather from the other
complaints that the mosbench people were only testing simple indexscan
queries.)

Yeah, it seems that for a sequential scan we lseek the heap, then the
index, then the heap again; but for index scans we just hit the heap
and the index.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: mosbench revisited

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(If the query ended up being a seqscan, I'd expect a second
lseek(SEEK_END) when the executor starts up, but I gather from the other
complaints that the mosbench people were only testing simple indexscan
queries.)

Yeah, it seems that for a sequential scan we lseek the heap, then the
index, then the heap again; but for index scans we just hit the heap
and the index.

Sure. The first two come from the planner getting the table and index
sizes for estimation purposes (look in plancat.c). The last is done in
heapam.c's initscan(). We could possibly accept stale values for the
planner estimates, but I think heapam's number had better be accurate.

regards, tom lane

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#7)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 3, 2011 at 3:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(If the query ended up being a seqscan, I'd expect a second
lseek(SEEK_END) when the executor starts up, but I gather from the other
complaints that the mosbench people were only testing simple indexscan
queries.)

Yeah, it seems that for a sequential scan we lseek the heap, then the
index, then the heap again; but for index scans we just hit the heap
and the index.

Sure.  The first two come from the planner getting the table and index
sizes for estimation purposes (look in plancat.c).  The last is done in
heapam.c's initscan().  We could possibly accept stale values for the
planner estimates, but I think heapam's number had better be accurate.

I think the exact requirement is that, if the relation turns out to be
larger than the size we read, the extra blocks had better not contain
any tuples our snapshot can see. There's actually no interlock
between smgrnblocks() and smgrextend() right now, so presumably we
don't need to add one. However, a value cached from a few seconds ago
is clearly not going to cut it.

I don't really think there's any sensible way to implement a
per-backend cache, because that would require invalidation events of
some kind to be sent on relation extension, and that seems utterly
insane from a performance standpoint, even if we invented something
less expensive than sinval. I guess it might work for planning
purposes if you only sent out invalidation events on every N'th
extension or something, but penalizing the accuracy of planning to
work around a Linux kernel bug that only manifests itself on machines
with >32 cores doesn't seem very appealing.

A shared cache seems like it could work, but the locking is tricky.
Normally we'd just use a hash table protected by LWLocks, one
LWLock per partition, but here that's clearly not going to work. The
kernel is using a spinlock per file, and that's still too
heavy-weight. I think that if we could prepopulate the cache with all
the keys (i.e. relfilenodes) we care about and then never add or evict
any, we could run it completely unlocked. I believe that the existing
memory barriers in things like LWLockAcquire() would be sufficient to
prevent us from reading a too-old value (i.e. block count). In
particular, you couldn't read a value that predated your snapshot, if
you got your snapshot by holding ProcArrayLock. But the races
involved with adding and removing items from the cache are hard to
deal with without using locks, especially because the keys are 12
bytes or more and therefore can't be read or written atomically.
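
The prepopulated, never-resized variant could look roughly like this
(illustrative only: the names are made up, and C11 atomics stand in
for PostgreSQL's shared memory and barrier primitives):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCACHE 64

/* Keys are fixed before any lookups happen, so readers need no lock;
 * only the 8-byte block count ever changes, via an atomic store. */
typedef struct
{
    uint32_t        relfilenode; /* immutable once the cache is populated */
    _Atomic int64_t nblocks;     /* updated on relation extension */
} RelSizeEnt;

static RelSizeEnt relsize_cache[NCACHE];

/* Reader side: completely unlocked.  Returns -1 if the relation is not
 * cached, meaning the caller falls back to lseek(). */
static int64_t
relsize_lookup(uint32_t relfilenode)
{
    for (int i = 0; i < NCACHE; i++)
        if (relsize_cache[i].relfilenode == relfilenode)
            return atomic_load(&relsize_cache[i].nblocks);
    return -1;
}

/* Extender side: an 8-byte atomic store, so no reader sees a torn value. */
static void
relsize_update(uint32_t relfilenode, int64_t nblocks)
{
    for (int i = 0; i < NCACHE; i++)
        if (relsize_cache[i].relfilenode == relfilenode)
            atomic_store(&relsize_cache[i].nblocks, nblocks);
}
```

The hard part, as noted above, is exactly what this sketch dodges:
adding and evicting entries while readers are running lock-free.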

I've been mulling over how we might deal with this and actually coded
up an implementation, but it turns out (surprise, surprise) to have
problems with insufficient locking. So I'm thinking it over some
more. And hoping that the Linux guys decide to do something about it.
This isn't really our bug - lseek is quite cheap in the uncontended
case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#8)
Re: mosbench revisited

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Aug 3, 2011 at 4:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

... We could possibly accept stale values for the
planner estimates, but I think heapam's number had better be accurate.

I think the exact requirement is that, if the relation turns out to be
larger than the size we read, the extra blocks had better not contain
any tuples our snapshot can see. There's actually no interlock
between smgrnblocks() and smgrextend() right now, so presumably we
don't need to add one.

No interlock in userspace, you mean. We're relying on the kernel to do
it, ie, give us a number that is not older than the time of our (already
taken at this point) snapshot.

I don't really think there's any sensible way to implement a
per-backend cache, because that would require invalidation events of
some kind to be sent on relation extension, and that seems utterly
insane from a performance standpoint, even if we invented something
less expensive than sinval.

Yeah, that's the issue. But "relation extension" is not actually a
cheap operation, since it requires a minimum of one kernel call that is
presumably doing something nontrivial in the filesystem. I'm not
entirely convinced that we couldn't make this work --- especially since
we could certainly derate the duty cycle by a factor of ten or more
without giving up anything remotely meaningful in planning accuracy.
(I'd be inclined to make it send an inval only once the relation size
had changed at least, say, 10%.)

A shared cache seems like it could work, but the locking is tricky.
Normally we'd just use a hash table protected by LWLocks, one
LWLock per partition, but here that's clearly not going to work.  The
kernel is using a spinlock per file, and that's still too
heavy-weight.

That still seems utterly astonishing to me. We're touching each of
those files once per query cycle; a cycle that contains two message
sends, who knows how many internal spinlock/lwlock/heavyweightlock
acquisitions inside Postgres (some of which *do* contend with each
other), and a not insignificant amount of plain old computing.
Meanwhile, this particular spinlock inside the kernel is protecting
what, a single doubleword fetch? How is that the bottleneck?

I am wondering whether kernel spinlocks are broken.

regards, tom lane

#10Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#1)
Re: mosbench revisited

On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches." This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads. In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter. I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.

This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).

The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.
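
A minimal sketch of that separation (all names hypothetical; a pthread
mutex stands in for a PostgreSQL spinlock in shared memory):

```c
#include <assert.h>
#include <pthread.h>

#define FREELIST_MAX 256

/* The free list gets its own lock, separate from the clock sweep, and a
 * background process keeps it topped up so backends rarely sweep. */
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static int freelist[FREELIST_MAX];
static int freelist_len;

/* Called by the background process: push a clean, evictable buffer id. */
static int
freelist_push(int buf)
{
    int ok = 0;

    pthread_mutex_lock(&freelist_lock);
    if (freelist_len < FREELIST_MAX)
    {
        freelist[freelist_len++] = buf;
        ok = 1;
    }
    pthread_mutex_unlock(&freelist_lock);
    return ok;
}

/* Called by a backend: pop a buffer, or -1 meaning "run the clock sweep". */
static int
freelist_pop(void)
{
    int buf = -1;

    pthread_mutex_lock(&freelist_lock);
    if (freelist_len > 0)
        buf = freelist[--freelist_len];
    pthread_mutex_unlock(&freelist_lock);
    return buf;
}
```

The critical section is just an index update, so the lock is held for a
handful of instructions rather than for a whole clock-sweep pass.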
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#11Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That still seems utterly astonishing to me.  We're touching each of
those files once per query cycle; a cycle that contains two message
sends, who knows how many internal spinlock/lwlock/heavyweightlock
acquisitions inside Postgres (some of which *do* contend with each
other), and a not insignificant amount of plain old computing.
Meanwhile, this particular spinlock inside the kernel is protecting
what, a single doubleword fetch?  How is that the bottleneck?

Spinlocks seem to have a very ugly "tipping point". When I tested
pgbench -S on a 64-core system with the lazy vxid patch applied and a
patch to use random_r() in lieu of random, the amount of system time
used per SELECT-only transaction at 48 clients was 3.59 times as much
as it was at 4 clients. And the amount used per transaction at 52
clients was 3.63 times the amount used per transaction at 48 clients.
And the amount used at 56 clients was 3.25 times the amount used at 52
clients. You can see the throughput graph starting to flatten out in
the 32-44 client range, but it's not particularly alarming. However,
once you pass that point things rapidly get totally out of control in
a real hurry. A few more clients and the machine is basically doing
nothing but spin.

I am wondering whether kernel spinlocks are broken.

I don't think so. Stefan Kaltenbrunner had one profile where he
showed something like sixty or eighty percent of the usermode CPU time
in s_lock. I didn't have access to that particular hardware, but the
testing I've done strongly suggests that most of that was the
SInvalReadLock spinlock. And before I patched pgbench to avoid
calling random(), that was doing the same thing - literally flattening
a 64-core box fighting over a single futex that normally costs almost
nothing. (That one wasn't quite as bad because the futex actually
deschedules the waiters, but it was still bad.) I'm actually not
really sure why it shakes out this way (birthday paradox?) but having
seen the effect several times now, I'm disinclined to believe it's an
artifact.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#10)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 6:21 PM, Jim Nasby <jim@nasby.net> wrote:

On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches."  This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads.  In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter.  I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.

This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).

The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.

Based on recent benchmarking, I'm going to say "no". It doesn't seem
to matter how short you make the critical section: a single
program-wide mutex is a loser. Furthermore, the "free list" is a
joke, because it's nearly always going to be completely empty. We
could probably just rip that out and use the clock sweep and never
miss it, but I doubt it would improve performance much.

I think what we probably need to do is have multiple clock sweeps in
progress at the same time. So, for example, if you have 8GB of
shared_buffers, you might have 8 mutexes, one for each GB. When a
process wants a buffer, it locks one of the mutexes and sweeps through
that 1GB partition. If it finds a buffer before returning to the
point at which it started the scan, it's done. Otherwise, it releases
its mutex, grabs the next one, and continues on until it finds a free
buffer.

The trick with any modification in this area is that pretty much any
degree of increased parallelism is potentially going to reduce the
quality of buffer replacement to some degree. So the trick will be to
try to squeeze out as much concurrency as possible while minimizing
degradation in the quality of buffer replacements.
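
A sketch of the partitioned sweep (all names hypothetical; pthread
mutexes stand in for the buffer-strategy spinlock):

```c
#include <assert.h>
#include <pthread.h>

#define NPARTS        8
#define BUFS_PER_PART 1024

typedef struct
{
    pthread_mutex_t lock;                 /* protects hand and usage[] */
    int             hand;                 /* per-partition clock hand */
    int             usage[BUFS_PER_PART]; /* stand-in for buffer usage_count */
} SweepPartition;

static SweepPartition parts[NPARTS];

static void
sweep_init(void)
{
    for (int p = 0; p < NPARTS; p++)
        pthread_mutex_init(&parts[p].lock, NULL);
}

/* One full lap around a single partition; returns a buffer id or -1. */
static int
sweep_partition(int p)
{
    SweepPartition *part = &parts[p];

    pthread_mutex_lock(&part->lock);
    for (int n = 0; n < BUFS_PER_PART; n++)
    {
        int b = part->hand;

        part->hand = (part->hand + 1) % BUFS_PER_PART;
        if (part->usage[b] == 0)
        {
            pthread_mutex_unlock(&part->lock);
            return p * BUFS_PER_PART + b; /* found a victim */
        }
        part->usage[b]--;                 /* age the buffer, clock-sweep style */
    }
    pthread_mutex_unlock(&part->lock);
    return -1;                            /* caller tries the next partition */
}

/* A backend holds only one partition mutex at a time, so concurrent
 * allocations usually contend on different locks. */
static int
get_victim_buffer(int start)
{
    for (int i = 0; i < NPARTS; i++)
    {
        int victim = sweep_partition((start + i) % NPARTS);

        if (victim >= 0)
            return victim;
    }
    return -1;                            /* everything hot: caller must wait */
}
```

Staggering each backend's starting partition spreads the contention;
the cost is that each partition ages its buffers independently, which
is one form of the replacement-quality degradation mentioned above.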

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#11)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Spinlocks seem to have a very ugly "tipping point".

And on that note, here are oprofile results from "pgbench -n -T 300 -S
-c 64 -j 64 -M prepared" on the latest master branch, compiled with
"-O2 -fno-omit-frame-pointer". shared_buffers=8GB, 64-core machine,
RHEL 6.1. By running with "-M prepared", it dodges the lseek()
problem.

960576  23.7580  postgres    postgres    s_lock
562821  13.9203  no-vmlinux  no-vmlinux  /no-vmlinux
321191   7.9440  postgres    postgres    LWLockRelease
317653   7.8565  postgres    postgres    LWLockAcquire
224812   5.5603  postgres    postgres    GetSnapshotData
 81156   2.0072  postgres    postgres    _bt_compare
 78744   1.9476  postgres    postgres    PinBuffer
 58101   1.4370  postgres    postgres    hash_search_with_hash_value
 43865   1.0849  postgres    postgres    AllocSetAlloc
 25832   0.6389  postgres    postgres    PostgresMain

Since SpinLockAcquire() is an in-line macro that only calls s_lock()
if the initial TAS fails, not only the time directly attributed to
s_lock but also a good chunk of the CPU time attributable to
LWLockAcquire and LWLockRelease() is likely time spent fighting over
spinlocks. Since I compiled with frame pointers, it's pretty easy to
see where those s_lock calls are coming from. Here's an excerpt from
opreport -c:

     5  5.0e-04  postgres  postgres  _bt_getbuf
     6  6.0e-04  postgres  postgres  _bt_relandgetbuf
    14   0.0014  postgres  postgres  ReleaseAndReadBuffer
    85   0.0085  postgres  postgres  ReadBuffer_common
   206   0.0207  postgres  postgres  GetSnapshotData
 18344   1.8437  postgres  postgres  UnpinBuffer
 24977   2.5103  postgres  postgres  PinBuffer
406948  40.9009  postgres  postgres  LWLockRelease
544376  54.7133  postgres  postgres  LWLockAcquire
994947  23.5746  postgres  postgres  s_lock

It's also fairly easy to track down who is calling LWLockAcquire and
LWLockRelease. Nearly all of the calls are from just two
contributors:

241655  27.6830  postgres  postgres  ReadBuffer_common
566434  64.8885  postgres  postgres  GetSnapshotData
328548   7.7847  postgres  postgres  LWLockAcquire

176629  23.8917  postgres  postgres  ReadBuffer_common
524348  70.9259  postgres  postgres  GetSnapshotData
332333   7.8744  postgres  postgres  LWLockRelease

So, most of the s_lock calls come from LWLockAcquire, and most of the
LWLockAcquire calls come from GetSnapshotData. That's not quite
enough to prove that all the spinning going on here is coming from
contention over the spinlock protecting ProcArrayLock, because it
needn't be the case that all calls to LWLockAcquire are equally likely
to end up in s_lock. You could speculate that ProcArrayLock isn't
actually responsible for many of those s_lock calls and that some
other lock, like maybe the buffer mapping locks, is disproportionately
responsible for the s_lock calls. But in fact I think it's exactly
the other way around: the buffer mapping locks are partitioned 16
ways, while there's only one ProcArrayLock. I'm willing to bet that's
where nearly all of the spinning is happening, and I'll further bet
that that spinning accounts for AT LEAST a third of the total CPU time
on this workload. And maybe closer to half.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Aidan Van Dyk
aidan@highrise.ca
In reply to: Robert Haas (#8)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

 And hoping that the Linux guys decide to do something about it.
 This isn't really our bug - lseek is quite cheap in the uncontended
case.

Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where
they've finally removed the BKL from VFS/inode?

I mean, complaining about scalability in linux 2.6.18 is like
complaining about scalability in postgresql 8.2 ;-)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

#15Robert Haas
robertmhaas@gmail.com
In reply to: Aidan Van Dyk (#14)
Re: mosbench revisited

On Thu, Aug 4, 2011 at 5:09 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:

On Wed, Aug 3, 2011 at 5:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

     And hoping that the Linux guys decide to do something about it.
 This isn't really our bug - lseek is quite cheap in the uncontended
case.

Has anyone tried this on a recent kernel (i.e. 2.6.39 or later), where
they've finally remove the BKL out of VFS/inode?

I mean, complaining about scalability in linux 2.6.18 is like
complaining about scalability in postgresql 8.2 ;-)

Hmm. This machine is running 2.6.32-131.6.1.el6.x86_64, not 2.6.18.
Not sure how much the code has changed since then, but the spinlock is
there in the master branch of Linus's repository.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#1)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 11:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:

About nine months ago, we had a discussion of some benchmarking that
was done by the mosbench folks at MIT:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00160.php

Although the authors used PostgreSQL as a test harness for driving
load, it's pretty clear from reading the paper that their primary goal
was to stress the Linux kernel, so the applicability of the paper to
real-world PostgreSQL performance improvement is less than it might
be.  Still, having now actually investigated in some detail many of
the same performance issues that they were struggling with, I have a
much clearer understanding of what's really going on here.  In
PostgreSQL terms, here are the bottlenecks they ran into:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches."  This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads.  In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter.  I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.

I was going to ask if you if had done any benchmarks with scale such
that the tables fit in RAM but not in shared_buffers. I guess you
have.

The attached experimental patch fixed freelist contention on 8 cores.
It would be nice to see what happens above that.

It has been cherry picked up to HEAD, but not tested against it. (Last
tested in Dec 2010, my how time flies)

The approach is to move the important things from an LWLock to a
spinlock, and to do no locking at all for clock-hand increments and
numBufferAllocs updates.
That means that some buffers might occasionally get inspected twice
and some might not get inspected at all during any given clock cycle,
but this should not lead to any correctness problems. (Disclosure:
Tom didn't like this approach when it was last discussed.)
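The unlocked clock-hand increment described above can be sketched as follows. This is a minimal illustration of the idea, not the attached patch itself; the names `clock_hand`, `NBUFFERS`, and `advance_clock_hand` are invented for the example, and the deliberately race-tolerant increment mirrors the "some buffers inspected twice, some skipped" behavior described:

```c
#include <assert.h>
#include <stddef.h>

#define NBUFFERS 1024           /* hypothetical buffer pool size */

/* Plain, unlocked counter: concurrent backends may lose an update,
 * so a buffer can be inspected twice or skipped within one cycle.
 * That is benign for the clock sweep, which needs only rough
 * fairness, not exact coverage of every buffer per cycle. */
static unsigned int clock_hand = 0;

static unsigned int
advance_clock_hand(void)
{
    /* Unlocked read-modify-write: a lost update just means two
     * backends land on the same victim buffer. */
    unsigned int victim = clock_hand;

    clock_hand = (victim + 1) % NBUFFERS;
    return victim;
}
```

Under contention the increment itself is formally a data race, which is part of why this approach drew objections; the bet is that the occasional duplicate or skipped inspection is cheaper than any lock.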

I just offer this for whatever it is worth to you--I'm not proposing
it as an actual patch to be applied.

When data fits in RAM but not shared_buffers, maybe the easiest fix is
to increase shared_buffers. Which brings up the other question I had
for you about your work with Nate's celebrated loaner machine. Have
you tried to reproduce the performance problems that have been
reported (but without public disclosure of how to reproduce) with
shared_buffers > 8GB on machines with RAM >>8GB ?

Cheers,

Jeff

Attachments:

freelist.patch (text/x-patch; charset=US-ASCII)  +33 -13
#17Jeff Janes
jeff.janes@gmail.com
In reply to: Jim Nasby (#10)
Re: mosbench revisited

On Wed, Aug 3, 2011 at 3:21 PM, Jim Nasby <jim@nasby.net> wrote:

On Aug 3, 2011, at 1:21 PM, Robert Haas wrote:

1. "We configure PostgreSQL to use a 2 Gbyte application-level cache
because PostgreSQL protects its free-list with a single lock and thus
scales poorly with smaller caches."  This is a complaint about
BufFreeList lock which, in fact, I've seen as a huge point of
contention on some workloads.  In fact, on read-only workloads, with
my lazy vxid lock patch applied, this is, I believe, the only
remaining unpartitioned LWLock that is ever taken in exclusive mode;
or at least the only one that's taken anywhere near often enough to
matter.  I think we're going to do something about this, although I
don't have a specific idea in mind at the moment.

This has been discussed before: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01406.php (which itself references 2 other threads).

The basic idea is: have a background process that proactively moves buffers onto the free list so that backends should normally never have to run the clock sweep (which is rather expensive). The challenge there is figuring out how to get stuff onto the free list with minimal locking impact. I think one possible option would be to put the freelist under its own lock (IIRC we currently use it to protect the clock sweep as well). Of course, that still means the free list lock could be a point of contention, but presumably it's far faster to add or remove something from the list than it is to run the clock sweep.
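The split-lock idea above might look roughly like this. A minimal sketch, not PostgreSQL code: the freelist gets its own lock, separate from whatever protects the clock sweep, so the background writer pushes reclaimed buffers while backends pop them with only brief contention on the list lock itself. All names here (`FreeBuf`, `freelist_push`, `freelist_pop`) are invented for illustration:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct FreeBuf
{
    int             buf_id;
    struct FreeBuf *next;
} FreeBuf;

/* Freelist under its own lock, independent of the clock sweep. */
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static FreeBuf *freelist_head = NULL;

/* Background-writer side: push a proactively reclaimed buffer. */
static void
freelist_push(FreeBuf *b)
{
    pthread_mutex_lock(&freelist_lock);
    b->next = freelist_head;
    freelist_head = b;
    pthread_mutex_unlock(&freelist_lock);
}

/* Backend side: pop a free buffer, or NULL meaning
 * "fall back to running the clock sweep yourself". */
static FreeBuf *
freelist_pop(void)
{
    FreeBuf *b;

    pthread_mutex_lock(&freelist_lock);
    b = freelist_head;
    if (b != NULL)
        freelist_head = b->next;
    pthread_mutex_unlock(&freelist_lock);
    return b;
}
```

The critical sections are a couple of pointer assignments, which is the "far faster than running the clock sweep" property the proposal depends on.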

Hi Jim,

My experiments have shown that the freelist proper is not
substantially faster than the clock sweep--and that is even
under the assumption that putting things back into the freelist is
absolutely free.  Under all the workloads I've been able to contrive,
other than ones contrived by actually hacking the code itself to make
it pathological, the average number of buffers inspected per run of
the clock sweep is <2.5.  Under contention, the mere act of acquiring
a lock is more traumatic than the actual work carried out under the
lock.

Cheers,

Jeff

#18Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Robert Haas (#1)
Re: mosbench revisited

Robert Haas <robertmhaas@gmail.com> writes:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read_write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.  A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected.  That now
seems unfortunate.  Barring a kernel-level fix, we could try to
implement our own cache to work around this problem.  However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow avoid the same sort of mutex contention that's killing the
kernel in this workload.
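The kernel-side fix Robert alludes to can be illustrated in userspace C11. This is a hedged sketch of the general technique, not the actual kernel patch: on a platform with atomic 8-byte loads, a 64-bit size field can be read with a single relaxed atomic load instead of taking a spinlock. The variable `i_size` here is a stand-in for `inode->i_size`, not the kernel structure itself:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for inode->i_size: a 64-bit size updated by writers. */
static _Atomic int64_t i_size;

/* Lock-free read: on 64-bit platforms this compiles to a plain
 * aligned 8-byte load, so no spinlock (and no cache-line ping-pong
 * on the lock word) is needed just to find out the file size. */
static int64_t
read_i_size_lockfree(void)
{
    return atomic_load_explicit(&i_size, memory_order_relaxed);
}
```

The subtlety the kernel must handle, and which this sketch glosses over, is 32-bit platforms where an 8-byte load is not atomic and torn reads become possible.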

What about making relation extension much less frequent?  It's been
talked about here before: instead of extending 8kB at a time we
could (should) extend by much larger chunks.  I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current one is more than half full, or some such policy.
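The trigger condition for that policy is simple arithmetic. A minimal sketch under the stated assumptions (1GB segments, preallocate once the current segment passes half full); the function name and threshold are illustrative, not an actual proposal:

```c
/* Hypothetical policy check for background preallocation. */
#define SEGMENT_SIZE (1024L * 1024L * 1024L)    /* 1GB segment */

/* Returns 1 when the current segment is more than half full, i.e.
 * when a background worker should start preallocating the next
 * segment so foreground backends rarely hit the 8kB-at-a-time
 * extension path (and its llseek/lock traffic). */
static int
should_preallocate_next_segment(long rel_size_bytes)
{
    long used_in_current_segment = rel_size_bytes % SEGMENT_SIZE;

    return used_in_current_segment > SEGMENT_SIZE / 2;
}
```

Note that a relation size that is an exact multiple of the segment size reads as an empty current segment, so the check stays quiet until the new segment actually fills up.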

Then you have the problem that you can't really use lseek() anymore to
guesstimate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate.  Maybe
reltuples would do?  If not, perhaps its accuracy could be adapted?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#17)
Re: mosbench revisited

Jeff Janes <jeff.janes@gmail.com> writes:

My experiments have shown that the freelist proper is not
substantially faster than the clock sweep--and that is even
under the assumption that putting things back into the freelist is
absolutely free.

The freelist isn't there to make buffer allocation faster, though;
it's there for allocation efficiency. The point is that when some
buffers have become completely useless (eg, because we dropped the table
they were for), they'll be recycled in preference to reclaiming buffers
that contain still-possibly-useful data. It would certainly be simple
to get rid of the freelist and only recycle dead buffers when the clock
sweep reaches them, but I think we'd be paying for that in extra,
unnecessary I/O.

regards, tom lane

#20Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#16)
Re: mosbench revisited

On Sat, Aug 6, 2011 at 1:43 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

The approach is to move the important things from an LWLock to a
spinlock, and not to do any locking for increments to the clock hand
and numBufferAllocs.
That means that some buffers might occasionally get inspected twice
and some might not get inspected at all during any given clock cycle,
but this should not lead to any correctness problems.   (Disclosure:
Tom didn't like this approach when it was last discussed.)

I just offer this for whatever it is worth to you--I'm not proposing
it as an actual patch to be applied.

Interesting approach.

When data fits in RAM but not shared_buffers, maybe the easiest fix is
to increase shared_buffers.  Which brings up the other question I had
for you about your work with Nate's celebrated loaner machine.  Have
you tried to reproduce the performance problems that have been
reported (but without public disclosure of how to reproduce) with
shared_buffers > 8GB on machines with RAM >>8GB ?

No. That's on my list, but thus far has not made it to the top of
said list. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Robert Haas
robertmhaas@gmail.com
In reply to: Dimitri Fontaine (#18)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#19)
#23Jesper Krogh
jesper@krogh.cc
In reply to: Robert Haas (#21)
#24Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Robert Haas (#21)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Dimitri Fontaine (#24)
#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#25)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#26)
#28Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#1)
#29Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#28)