Spinlocks, yet again: analysis and proposed patches
The test case I just posted shows that our spinlock code, which we had
thought largely done, is once again becoming a performance bottleneck.
It's time to resurrect some of the ideas we kicked around in early
2002, and didn't pursue because we decided spinlocks weren't our major
performance problem.
I started investigating this by trying to understand why the "futex
spinlock" patch had helped the Red Hat customer I mentioned, when it
hadn't done anything much in last year's experiments. It turns out
that this customer was using an 8-way Opteron, and the futex patch
helps a lot more on that architecture than others. Some random
experimentation eventually turned up part of the reason: the x86 TAS
assembly code we are using is pessimal for x86_64. Specifically,
s_lock.h uses this for both architectures:
/* Use a non-locking test before asserting the bus lock */
__asm__ __volatile__(
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
"1: \n"
and it turns out that deleting the cmpb and jne makes things go
significantly faster on Opterons. So the "non-locking test" is
not a win at all on x86_64. As best I can tell from testing,
removing it wins because it causes the rate of spinlock delays
to drop significantly. Why this should be is not completely
obvious. I speculate that the Opterons are capable of pulling
in a copy of the cache line that contains the spinlock and then
running 100 iterations of the cmpb test without ever noticing
that the processor holding the lock has now released it. Without
the cmpb, they are forced by the locking xchgb test to actually
look at the real state of the spinlock each time.
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles. Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.
The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does. However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.
I looked into several other methods of doing the spinlock delay
instead. I think all of these were suggested at one point or
another in our earlier discussions of spinlocks:
1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled. This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it. It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil. Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor. In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)
2. Reduce the length of the select() delays in s_lock. The current code
delays in quanta of 10msec, but on recent Linuxen (and I think other
platforms too) the scheduling quantum is 1msec, so we can get the
processor back faster if we ask for a 1msec delay. I tried this and it
is definitely a win on Linux; it doesn't seem to hurt anything on older
systems either, they just round the delay up to 10msec like before.
So I think we should do this, even though it's only a partial answer.
3. Modify the spin loop to do a little more work between TAS() tests.
In the existing code there are only about half a dozen instructions
total in the normal spin loop, most of them very fast, with the result
that the spinning processor does its best to monopolize the system bus
with locked probe instructions. This is obviously not good, as it'll
interfere with the ability of the spinlock holder to complete its work
and release the lock. (The bulk of the spinlock uses are for LWLocks,
and with the current data structures the LWLock's own state is usually
going to be in the same cache line as the spinlock proper, so that
cache grabs for the spinlock hurt the LWLock state updating as well
as the eventual spinlock release.) I modified the code to use an
integer modulo operation as part of the test for SPINS_PER_DELAY;
on most processors this should keep the processor's attention and
keep it away from the system bus for several times longer than an
average instruction. (It's relevant here that the "rep;nop" used
to slow the spin loop on recent Intel processors appears to be
completely ignored by Opterons. Basically we need a non-hardware-
specific trick to slow the loop.)
4. Modify SPINS_PER_DELAY. Looping more times instead of entering the
kernel is a win for SMP boxes: whenever you give up and call the kernel
to delay, it's likely that the spinlock holder will have released the
lock before you even get all the way into the kernel, let alone get
rescheduled again. I found that pushing SPINS_PER_DELAY above 200
caused the select delay rate to fall to almost nil in this test case.
However, that would be counterproductive for uniprocessors where the
optimal strategy is normally to give up the processor immediately.
We've talked about making the spin count user configurable, but I never
cared for that answer. Instead, I've worked up some code to
automatically adjust the spin count depending on whether we seem to be
in a uniprocessor or multiprocessor. The idea is to decrease the spin
count (down to a lower limit) whenever s_lock has to delay in order to
get a lock, and to increase the count (up to a limit) whenever we see
the lock come free before we've had to delay. The former condition
suggests that we are in a uniprocessor and the latter suggests we are in
a multiprocessor. I recall having complained that this idea would
probably be unstable, but in testing it seems to do pretty nicely at
converging to the minimum spin count on a 1-CPU machine and the maximum
limit on an SMP box. It could do with more testing though to see if it
works the same for other people.
My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from
N, runtime: 1 36s 2 61s 4 105s 8 198s
to
N, runtime: 1 35s 2 48s 4 58s 8 105s
At 4 queries, CPU utilization is 100% and the context swap rate nearly
nil, which seems to mean we've gained another (temporary no doubt)
victory over the spinlock problem.
I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4. The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips. (If so, we'd just make it conditional to x86_64.)
Comments and testing invited.
regards, tom lane
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles. Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.
I think an important question is whether this is for x86_64 in
general, or Opteron-specific. It could be that it's not the same
on Intel's EM64Ts.
One reason this might behave differently for Opterons is that
they are cc-NUMA instead of the "normal" SMP.
Something else to consider is the OS you're using. I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.
Kurt
Kurt Roeckx <kurt@roeckx.be> writes:
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors:
I think an important question is whether this is for x86_64 in
general, or Opteron-specific. It could be that it's not the same
on Intel's EM64Ts.
Good point --- anyone have one to try?
Something else to consider is the OS you're using. I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.
It's hard to see how the OS could affect behavior at the level of
processor cache operations --- unless they did something truly
spectacularly stupid, like mark main memory non-cacheable.
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Kurt Roeckx <kurt@roeckx.be> writes:
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors:
I think an important question is whether this is for x86_64 in
general, or Opteron-specific. It could be that it's not the same
on Intel's EM64Ts.
Good point --- anyone have one to try?
I've got one I can test on. I need to upgrade the kernel and some other
things on it though (it's running 2.6.8 atm, and an older
Debian/unstable which I should probably bring up to current).
I'll work on it starting now and post results once I get some.
Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from
N, runtime: 1 36s 2 61s 4 105s 8 198s
em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
N, runtime: 1 31s 2 47s 4 86s 8 159s
to
N, runtime: 1 35s 2 48s 4 58s 8 105s
N, runtime: 1 32s 2 53s 4 90s 8 169s
CPU utilization is definitely higher when running with the patch though.
Hope this helps, happy to do additional testing if necessary.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
N, runtime: 1 31s 2 47s 4 86s 8 159s
N, runtime: 1 32s 2 53s 4 90s 8 169s
Er, which (or both) of the two patches did you apply here?
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
N, runtime: 1 31s 2 47s 4 86s 8 159s
N, runtime: 1 32s 2 53s 4 90s 8 169s
Er, which (or both) of the two patches did you apply here?
Applied both, sorry that wasn't clear.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Er, which (or both) of the two patches did you apply here?
Applied both, sorry that wasn't clear.
Thanks. If you've got the time, could you try the two patches
separately and see what you get?
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Er, which (or both) of the two patches did you apply here?
Applied both, sorry that wasn't clear.
Thanks. If you've got the time, could you try the two patches
separately and see what you get?
Sure.
CVS Head:
N, runtime: 1 31s 2 47s 4 86s 8 159s
With just slock-no-cmpb.patch:
N, runtime: 1 32s 2 39s 4 82s 8 167s
With just spin-delay.patch
N, runtime: 1 32s 2 52s 4 94s 8 164s
With both:
N, runtime: 1 32s 2 53s 4 90s 8 169s
Hope that helps,
Stephen
Tom Lane <tgl@sss.pgh.pa.us> writes:
Something else to consider is the OS you're using. I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.
It's hard to see how the OS could affect behavior at the level of
processor cache operations --- unless they did something truly
spectacularly stupid, like mark main memory non-cacheable.
Well it could schedule processes on processors in ways that force less than
optimal memory usage patterns.
But maybe you should tell the Altix folk with their 32-processor 384GB NUMA
machines what you've "been told" about Linux not being that good in NUMA.
Really, these kind of cargo cult anecdotes are pretty pointless.
--
greg
* Stephen Frost (sfrost@snowman.net) wrote:
Thanks. If you've got the time, could you try the two patches
separately and see what you get?
Sure.
[...]
Just to be clear- this was from a completely default 'make install'
using the Debian configure options from 8.0.3 (which aren't that
particularly interesting really- nls, integer-datetimes, debug,
disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld,
enable-thread-safety). If it'd be useful for the test I can adjust
whichever parameters are appropriate (work_mem, cache_size, etc).
There's absolutely nothing else going on except for these tests, a few
shells, top, etc, on the box.
Enjoy,
Stephen
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
Some changes are based on tests and heuristics, so can we make them into the
configure script so the choice could be made automatically?
It's a bit premature to propose that, when we don't yet know if the
suggested changes are a win or loss under any conditions, let alone
how to automatically tell the difference. Right now we are in the
find-out-the-facts mode ...
regards, tom lane
Tom Lane wrote:
I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4. The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips. (If so, we'd just make it conditional to x86_64.)
2x PIII 1GHz, 2GB, FreeBSD 6.0-BETA4
8.1beta1 (2005-08-28):
N runtime: 1 85s 2 139s 4 220s
8.1beta1 (2005-08-28) + patch 1 (s_lock.h only)
N runtime: 1 89s 2 137s 4 229s
8.1beta1 (2005-08-28) + patch 2
N runtime: 1 84s 2 108s 4 214s
Observe the interesting little speed improvement for patch 2 with 2
processes (seems to be repeatable).
Let me know if you want to see vmstat output for any of these.
regards
Mark
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from
Some changes are based on tests and heuristics, so can we make them into the
configure script so the choice could be made automatically?
Regards,
Qingqing
Tom Lane wrote:
Comments and testing invited.
I have tested the patches on a Dual Xeon 2.4 GHz w/ HT (no EM64T).
(Configured with
"CFLAGS='-O2 -mcpu=pentium4 -march=pentium4' --enable-cassert".)
The results were pretty stable (variance around .2 seconds). I would not
trust the numbers for N=2: Linux, at least 2.4, is not good at avoiding
scheduling two running processes on the two HTs of the same core. Those
values also had the most variance (> 1s). All other measures were quite
stable over several runs.
CVS tip from 2005-09-12 ~16:00
1: 57s 2: 82s 4: 124s 8: 237s
with only slock-no-cmpb.patch applied
1: 55s 2: 79s 4: 119s 8: 229s
with only spin-delay.patch applied
1: 56s 2: 79s 4: 124s 8: 235s
with both patches applied
1: 55s 2: 78s 4: 124s 8: 235s
compare to 7.4.8 on the same machine ;-)
1: 92s 2: 235s 4: 474s 8: did not try ...
It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch
does not really help much on this machine. That seems to match Stephen
Frost's results with EM64T, if I read them correctly.
The cs rate is about 150 on CVS tip without patches and below 100 with the
patches (all three cases).
With 7.4.8 it's 230000-280000 with N>1. 8.1 is clearly the winner here. Great
work, Tom.
I hope some more data helps.
Best Regards,
Michael Paesold
"Michael Paesold" <mpaesold@gmx.at> writes:
It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch
does not really help much on this machine. That seems to match Stephen
Frost's results with EM64T, if I read them correctly.
Yeah, it's interesting that you both see slock-no-cmpb as saving some
cycles and the second patch as giving them back. I wonder whether the
integer-modulo-to-slow-the-loop trick is counterproductive on your
machines. You both seem to be using hardware that will recognize
rep;nop and maybe that's all that's needed.
I probably should have broken down the spindelay patch into multiple
components. But it's only a small change --- could you try simplifying
the patched line
if ((--spins % MAX_SPINS_PER_DELAY) == 0)
to
if (--spins == 0)
and see how the patch does that way?
regards, tom lane
Tom Lane wrote:
I probably should have broken down the spindelay patch into multiple
components. But it's only a small change --- could you try simplifying
the patched lineif ((--spins % MAX_SPINS_PER_DELAY) == 0)
to
if (--spins == 0)
and see how the patch does that way?
I'll do tomorrow morning (CEST, i.e. in about 11 hours).
Best Regards,
Michael Paesold
I wrote:
... and see how the patch does that way?
BTW, please do look at "vmstat 1" while running the test case
corresponding to your number of processors. It's hard to tell from the
runtime alone whether the patch is fully accomplishing its goal of
reducing wasted cycles. If you see user CPU percent go to 100 and
context swaps drop to a low value, though, you know it's working.
regards, tom lane
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does. However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.
Why do you think so? AFAIK in the uncontended case there will be no
kernel access, only an atomic inc/dec. In the contended case you'll
want a task switch anyway, so the futex management should not
matter. Also, this mechanism is specifically optimized for
inter-process locking; I don't think you can get a more efficient
mechanism from the side-effects of generic syscalls.
If you don't want Linux-specific locking in core code, then
it's another matter...
1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled. This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it. It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil. Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor. In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)
This is intended behaviour of sched_yield.
http://lwn.net/Articles/31462/
http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2
--
marko
I wrote:
I'll do tomorrow morning (CEST, i.e. in about 11 hours).
These are the tests with the change:
if ((--spins % MAX_SPINS_PER_DELAY) == 0)
to
if (--spins == 0)
I have called the resulting patch (spin-delay + this change) spin-delay-2.
again with only slock-no-cmpb applied
1: 55s 4: 119s (cpu ~97%)
with only spin-delay-2 applied
1: 56s 4: 125s (cpu ~99.5%)
with slock-no-cmpb and spin-delay-2 applied
1: 56s 4: 125s (cpu ~100%)
(Note that the cpu averages are not calculated but estimated by looking at
vmstat.)
Yesterdays results:
CVS tip from 2005-09-12 ~16:00
1: 57s 2: 82s 4: 124s 8: 237s
with only slock-no-cmpb.patch applied
1: 55s 2: 79s 4: 119s 8: 229s
with only spin-delay.patch applied
1: 56s 2: 79s 4: 124s 8: 235s
with both patches applied
1: 55s 2: 78s 4: 124s 8: 235s
To have other data, I have retested the patches on a single-cpu Intel P4
3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2.4 GHz
dual-Xeon results it's clear that this is in reality only one cpu. While
the runtime for N=1 is better than the other system, for N=4 it's already
worse. The situation with the patches is quite different, though.
Unfortunately.
CVS tip from 2005-09-12:
1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%)
only slock-no-cmpb:
1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%)
(doesn't help this time)
only spin-delay:
1: 36s 2: 86s (cpu =100%) 4: 157s (cpu =100%)
(bad runtime for N=2 (repeatable), cpu not doing real work here?)
slock-no-cmpb and spin-delay:
1: 36s 2: 106s (cpu =100%) 4: 192s (cpu =100%)
(it gets worse)
only spin-delay-2:
1: 36s 2: 85s (cpu =100%) 4: 160s (cpu =100%)
(quite the same as spin-delay)
slock-no-cmpb and spin-delay-2:
1: 36s 2: 109s (cpu =100%) 4: 198s (cpu =100%)
(worse again)
CS rate was low (20-50) for all tests, increasing for N>2 which has to be
expected. For this system I see no case for applying these patches.
Is there a portable way to detect the CPU we are running on? Do you have any
other idea how to implement the delays?
Best Regards,
Michael Paesold