Spinlocks, yet again: analysis and proposed patches
The test case I just posted shows that our spinlock code, which we had
thought largely done, is once again becoming a performance bottleneck.
It's time to resurrect some of the ideas we kicked around in early
2002, and didn't pursue because we decided spinlocks weren't our major
performance problem.
I started investigating this by trying to understand why the "futex
spinlock" patch had helped the Red Hat customer I mentioned, when it
hadn't done anything much in last year's experiments. It turns out
that this customer was using an 8-way Opteron, and the futex patch
helps a lot more on that architecture than others. Some random
experimentation eventually turned up part of the reason: the x86 TAS
assembly code we are using is pessimal for x86_64. Specifically,
s_lock.h uses this for both architectures:
/* Use a non-locking test before asserting the bus lock */
__asm__ __volatile__(
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
"1: \n"
and it turns out that deleting the cmpb and jne makes things go
significantly faster on Opterons. So the "non-locking test" is
not a win at all on x86_64. As best I can tell from testing,
removing it wins because it causes the rate of spinlock delays
to drop significantly. Why this should be is not completely
obvious. I speculate that the Opterons are capable of pulling
in a copy of the cache line that contains the spinlock and then
running 100 iterations of the cmpb test without ever noticing
that the processor holding the lock has now released it. Without
the cmpb, they are forced by the locking xchgb test to actually
look at the real state of the spinlock each time.
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles. Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.
The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does. However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.
I looked into several other methods of doing the spinlock delay
instead. I think all of these were suggested at one point or
another in our earlier discussions of spinlocks:
1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled. This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it. It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil. Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor. In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)
2. Reduce the length of the select() delays in s_lock. The current code
delays in quanta of 10msec, but on recent Linuxen (and I think other
platforms too) the scheduling quantum is 1msec, so we can get the
processor back faster if we ask for a 1msec delay. I tried this and it
is definitely a win on Linux; it doesn't seem to hurt anything on older
systems either, they just round the delay up to 10msec like before.
So I think we should do this, even though it's only a partial answer.
3. Modify the spin loop to do a little more work between TAS() tests.
In the existing code there are only about half a dozen instructions
total in the normal spin loop, most of them very fast, with the result
that the spinning processor does its best to monopolize the system bus
with locked probe instructions. This is obviously not good, as it'll
interfere with the ability of the spinlock holder to complete its work
and release the lock. (The bulk of the spinlock uses are for LWLocks,
and with the current data structures the LWLock's own state is usually
going to be in the same cache line as the spinlock proper, so that
cache grabs for the spinlock hurt the LWLock state updating as well
as the eventual spinlock release.) I modified the code to use an
integer modulo operation as part of the test for SPINS_PER_DELAY;
on most processors this should keep the processor's attention and
keep it away from the system bus for several times longer than an
average instruction. (It's relevant here that the "rep;nop" used
to slow the spin loop on recent Intel processors appears to be
completely ignored by Opterons. Basically we need a non-hardware-
specific trick to slow the loop.)
4. Modify SPINS_PER_DELAY. Looping more times instead of entering the
kernel is a win for SMP boxes: whenever you give up and call the kernel
to delay, it's likely that the spinlock holder will have released the
lock before you even get all the way into the kernel, let alone get
rescheduled again. I found that pushing SPINS_PER_DELAY above 200
caused the select delay rate to fall to almost nil in this test case.
However, that would be counterproductive for uniprocessors where the
optimal strategy is normally to give up the processor immediately.
We've talked about making the spin count user configurable, but I never
cared for that answer. Instead, I've worked up some code to
automatically adjust the spin count depending on whether we seem to be
in a uniprocessor or multiprocessor. The idea is to decrease the spin
count (down to a lower limit) whenever s_lock has to delay in order to
get a lock, and to increase the count (up to a limit) whenever we see
the lock come free before we've had to delay. The former condition
suggests that we are in a uniprocessor and the latter suggests we are in
a multiprocessor. I recall having complained that this idea would
probably be unstable, but in testing it seems to do pretty nicely at
converging to the minimum spin count on a 1-CPU machine and the maximum
limit on an SMP box. It could do with more testing though to see if it
works the same for other people.
My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from
N, runtime: 1 36s 2 61s 4 105s 8 198s
to
N, runtime: 1 35s 2 48s 4 58s 8 105s
At 4 queries, CPU utilization is 100% and the context swap rate nearly
nil, which seems to mean we've gained another (temporary no doubt)
victory over the spinlock problem.
I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4. The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips. (If so, we'd just make it conditional to x86_64.)
Comments and testing invited.
regards, tom lane
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles. Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.
I think an important question is whether this is for x86_64 in
general, or Opteron-specific. It could be that it's not the same
on Intel's EM64Ts.
One reason this might behave differently for Opterons is that
they are cc-NUMA instead of the "normal" SMP.
Something else to consider is the OS you're using. I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.
Kurt
Kurt Roeckx <kurt@roeckx.be> writes:
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors:
I think an important question is whether this is for x86_64 in
general, or Opteron-specific. It could be that it's not the same
on Intel's EM64Ts.
Good point --- anyone have one to try?
Something else to consider is the OS you're using. I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.
It's hard to see how the OS could affect behavior at the level of
processor cache operations --- unless they did something truly
spectacularly stupid, like mark main memory non-cacheable.
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Kurt Roeckx <kurt@roeckx.be> writes:
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors:
I think an important question is whether this is for x86_64 in
general, or Opteron-specific. It could be that it's not the same
on Intel's EM64Ts.
Good point --- anyone have one to try?
I've got one I can test on. I need to upgrade the kernel and some other
things on it though (it's running 2.6.8 atm, and an older
Debian/unstable which I should probably bring up to current).
I'll work on it starting now and post results once I get some.
Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from
N, runtime: 1 36s 2 61s 4 105s 8 198s
em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
N, runtime: 1 31s 2 47s 4 86s 8 159s
to
N, runtime: 1 35s 2 48s 4 58s 8 105s
N, runtime: 1 32s 2 53s 4 90s 8 169s
CPU utilization is definitely higher when running with the patch though.
Hope this helps, happy to do additional testing if necessary.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
N, runtime: 1 31s 2 47s 4 86s 8 159s
N, runtime: 1 32s 2 53s 4 90s 8 169s
Er, which (or both) of the two patches did you apply here?
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
N, runtime: 1 31s 2 47s 4 86s 8 159s
N, runtime: 1 32s 2 53s 4 90s 8 169s
Er, which (or both) of the two patches did you apply here?
Applied both, sorry that wasn't clear.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Er, which (or both) of the two patches did you apply here?
Applied both, sorry that wasn't clear.
Thanks. If you've got the time, could you try the two patches
separately and see what you get?
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Er, which (or both) of the two patches did you apply here?
Applied both, sorry that wasn't clear.
Thanks. If you've got the time, could you try the two patches
separately and see what you get?
Sure.
CVS Head:
N, runtime: 1 31s 2 47s 4 86s 8 159s
With just slock-no-cmpb.patch:
N, runtime: 1 32s 2 39s 4 82s 8 167s
With just spin-delay.patch
N, runtime: 1 32s 2 52s 4 94s 8 164s
With both:
N, runtime: 1 32s 2 53s 4 90s 8 169s
Hope that helps,
Stephen
Tom Lane <tgl@sss.pgh.pa.us> writes:
Something else to consider is the OS you're using. I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.
It's hard to see how the OS could affect behavior at the level of
processor cache operations --- unless they did something truly
spectacularly stupid, like mark main memory non-cacheable.
Well it could schedule processes on processors in ways that force less than
optimal memory usage patterns.
But maybe you should tell the Altix folk with their 32-processor 384GB NUMA
machines what you've "been told" about Linux not being that good in NUMA.
Really, these kind of cargo cult anecdotes are pretty pointless.
--
greg
* Stephen Frost (sfrost@snowman.net) wrote:
Thanks. If you've got the time, could you try the two patches
separately and see what you get?
Sure.
[...]
Just to be clear- this was from a completely default 'make install'
using the Debian configure options from 8.0.3 (which aren't that
particularly interesting really- nls, integer-datetimes, debug,
disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld,
enable-thread-safety). If it'd be useful for the test I can adjust
whichever parameters are appropriate (work_mem, cache_size, etc).
There's absolutely nothing else going on except for these tests, a few
shells, top, etc, on the box.
Enjoy,
Stephen
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
Some changes are based on tests and heuristics, so can we make them into the
configure script so the choice could be made automatically?
It's a bit premature to propose that, when we don't yet know if the
suggested changes are a win or loss under any conditions, let alone
how to automatically tell the difference. Right now we are in the
find-out-the-facts mode ...
regards, tom lane
Tom Lane wrote:
I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4. The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips. (If so, we'd just make it conditional to x86_64.)
2x PIII 1GHz, 2GB, FreeBSD 6.0-BETA4
8.1beta1 (2005-08-28):
N runtime: 1 85s 2 139s 4 220s
8.1beta1 (2005-08-28) + patch 1 (s_lock.h only)
N runtime: 1 89s 2 137s 4 229s
8.1beta1 (2005-08-28) + patch 2
N runtime: 1 84s 2 108s 4 214s
Observe the interesting little speed improvement for patch 2 with 2
processes (seems to be repeatable).
Let me know if you want to see vmstat output for any of these.
regards
Mark
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64. Together, these changes bring
my test case on a 4-way Opteron from
Some changes are based on tests and heuristics, so can we make them into the
configure script so the choice could be made automatically?
Regards,
Qingqing
Tom Lane wrote:
Comments and testing invited.
I have tested the patches on a Dual Xeon 2.4 GHz w/ HT (no EM64T).
(Configured with
"CFLAGS='-O2 -mcpu=pentium4 -march=pentium4' --enable-cassert".)
The results were pretty stable (variance around .2 seconds). I would not
trust the numbers for N=2: Linux, at least 2.4, is not good at avoiding
scheduling two running processes on the two HTs of the same core. Those
values also had the most variance (> 1s). All other measures were quite
stable over several runs.
CVS tip from 2005-09-12 ~16:00
1: 57s 2: 82s 4: 124s 8: 237s
with only slock-no-cmpb.patch applied
1: 55s 2: 79s 4: 119s 8: 229s
with only spin-delay.patch applied
1: 56s 2: 79s 4: 124s 8: 235s
with both patches applied
1: 55s 2: 78s 4: 124s 8: 235s
compare to 7.4.8 on the same machine ;-)
1: 92s 2: 235s 4: 474s 8: did not try ...
It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch
does not really help much on this machine. That seems to match Stephen
Frost's results with EM64T, if I read them correctly.
The cs rate is about 150 on CVS tip without patches and below 100 with the
patches (all three cases).
With 7.4.8 it's 230000-280000 with N>1. 8.1 is clearly the winner here. Great
work, Tom.
I hope some more data helps.
Best Regards,
Michael Paesold
"Michael Paesold" <mpaesold@gmx.at> writes:
It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch
does not really help much on this machine. That seems to match Stephen
Frost's results with EM64T, if I read them correctly.
Yeah, it's interesting that you both see slock-no-cmpb as saving some
cycles and the second patch as giving them back. I wonder whether the
integer-modulo-to-slow-the-loop trick is counterproductive on your
machines. You both seem to be using hardware that will recognize
rep;nop and maybe that's all that's needed.
I probably should have broken down the spindelay patch into multiple
components. But it's only a small change --- could you try simplifying
the patched line
if ((--spins % MAX_SPINS_PER_DELAY) == 0)
to
if (--spins == 0)
and see how the patch does that way?
regards, tom lane
Tom Lane wrote:
I probably should have broken down the spindelay patch into multiple
components. But it's only a small change --- could you try simplifying
the patched lineif ((--spins % MAX_SPINS_PER_DELAY) == 0)
to
if (--spins == 0)
and see how the patch does that way?
I'll do tomorrow morning (CEST, i.e. in about 11 hours).
Best Regards,
Michael Paesold
I wrote:
... and see how the patch does that way?
BTW, please do look at "vmstat 1" while running the test case
corresponding to your number of processors. It's hard to tell from the
runtime alone whether the patch is fully accomplishing its goal of
reducing wasted cycles. If you see user CPU percent go to 100 and
context swaps drop to a low value, though, you know it's working.
regards, tom lane
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does. However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.
Why do you think so? AFAIK in the uncontended case there will be no
kernel access, only an atomic inc/dec. In the contended case you'll
want a task switch anyway, so the futex management should not
matter. Also, this mechanism is specifically optimized for
inter-process locking; I don't think you can get a more efficient
mechanism from the side-effects of generic syscalls.
If you don't want Linux-specific locking in core code, then
it's another matter...
1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled. This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it. It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil. Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield
anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor. In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)
This is intended behaviour of sched_yield.
http://lwn.net/Articles/31462/
http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2
--
marko
I wrote:
I'll do tomorrow morning (CEST, i.e. in about 11 hours).
These are the tests with the change:
if ((--spins % MAX_SPINS_PER_DELAY) == 0)
to
if (--spins == 0)
I have called the resulting patch (spin-delay + this change) spin-delay-2.
again with only slock-no-cmpb applied
1: 55s 4: 119s (cpu ~97%)
with only spin-delay-2 applied
1: 56s 4: 125s (cpu ~99.5%)
with slock-no-cmpb and spin-delay-2 applied
1: 56s 4: 125s (cpu ~100%)
(Note that the cpu averages are not calculated but estimated by looking at
vmstat.)
Yesterdays results:
CVS tip from 2005-09-12 ~16:00
1: 57s 2: 82s 4: 124s 8: 237s
with only slock-no-cmpb.patch applied
1: 55s 2: 79s 4: 119s 8: 229s
with only spin-delay.patch applied
1: 56s 2: 79s 4: 124s 8: 235s
with both patches applied
1: 55s 2: 78s 4: 124s 8: 235s
To have other data, I have retested the patches on a single-cpu Intel P4
3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2.4 GHz
dual-Xeon results it's clear that this is in reality only one cpu. While
the runtime for N=1 is better than the other system, for N=4 it's already
worse. The situation with the patches is quite different, though.
Unfortunately.
CVS tip from 2005-09-12:
1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%)
only slock-no-cmpb:
1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%)
(doesn't help this time)
only spin-delay:
1: 36s 2: 86s (cpu =100%) 4: 157s (cpu =100%)
(bad runtime for N=2 (repeatable), cpu not doing real work here?)
slock-no-cmpb and spin-delay:
1: 36s 2: 106s (cpu =100%) 4: 192s (cpu =100%)
(it gets worse)
only spin-delay-2:
1: 36s 2: 85s (cpu =100%) 4: 160s (cpu =100%)
(quite the same as spin-delay)
slock-no-cmpb and spin-delay-2:
1: 36s 2: 109s (cpu =100%) 4: 198s (cpu =100%)
(worse again)
CS rate was low (20-50) for all tests, increasing for N>2 which has to be
expected. For this system I see no case for applying these patches.
Is there a portable way to detect the CPU we are running on? Do you have any
other idea how to implement the delays?
Best Regards,
Michael Paesold