s_lock() seems too aggressive for machines with many sockets

Started by Jan Wieck · almost 11 years ago · 35 messages · pgsql-hackers
#1 Jan Wieck
JanWieck@Yahoo.com

Hi,

I think I may have found one of the problems, PostgreSQL has on machines
with many NUMA nodes. I am not yet sure what exactly happens on the NUMA
bus, but there seems to be a tipping point at which the spinlock
concurrency wreaks havoc and the performance of the database collapses.

On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total,
a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput
then takes a very sharp dive and reaches around 20,000 TPS at 120
clients. It never recovers from there.

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
The 8 socket machine in question scales to over 350,000 TPS.

The patch is meant to demonstrate this effect only. It has a negative
performance impact on smaller machines and client counts < #cores, so
the real solution will probably look much different. But I thought it
would be good to share this and start the discussion about reevaluating
the spinlock code before PGCon.

Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info

Attachments:

spins_per_delay.diff (text/x-patch, +4/-4)
#2 Andres Freund
andres@anarazel.de
In reply to: Jan Wieck (#1)
Re: s_lock() seems too aggressive for machines with many sockets

On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:

On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
never recovers from there.

85k? Phew, that's pretty bad. What exact type of CPU is this? Which
pgbench scale? Did you use -M prepared?

Could you share a call graph perf profile?

The attached patch demonstrates that less aggressive spinning and
(much) more often delaying improves the performance "on this type of
machine". The 8 socket machine in question scales to over 350,000 TPS.

Even that seems quite low. I've gotten over 500k TPS on a four socket
x86 machine, and about 700k on a 8 socket x86 machine.

Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Bruce Momjian
bruce@momjian.us
In reply to: Jan Wieck (#1)
Re: s_lock() seems too aggressive for machines with many sockets

On Wed, Jun 10, 2015 at 09:18:56AM -0400, Jan Wieck wrote:

The attached patch demonstrates that less aggressive spinning and
(much) more often delaying improves the performance "on this type of
machine". The 8 socket machine in question scales to over 350,000
TPS.

The patch is meant to demonstrate this effect only. It has a
negative performance impact on smaller machines and client counts <
#cores, so the real solution will probably look much different. But
I thought it would be good to share this and start the discussion
about reevaluating the spinlock code before PGCon.

Wow, you are in that code! We kind of guessed on some of those
constants many years ago, and never revisited it. It would be nice to
get a better heuristic for those, but I am concerned it would require
the user to specify the number of CPU cores.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


#4 Jan Wieck
JanWieck@Yahoo.com
In reply to: Andres Freund (#2)
Re: s_lock() seems too aggressive for machines with many sockets

On 06/10/2015 09:28 AM, Andres Freund wrote:

On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:

On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
never recovers from there.

85k? Phew, that's pretty bad. What exact type of CPU is this? Which
pgbench scale? Did you use -M prepared?

model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz

numactl --hardware shows the distance to the attached memory as 10, the
distance to every other node as 21. I interpret that as the machine
having one NUMA bus with all cpu packages attached to that, rather than
individual connections from cpu to cpu or something different.

pgbench scale=300, -Msimple.

Could you share a call graph perf profile?

I do not have them handy at the moment and the machine is in use for
something else until tomorrow. I will forward perf and systemtap based
graphs ASAP.

What led me into that spinlock area was the fact that a wall clock based
systemtap FlameGraph showed a large portion of the time spent in
BufferPin() and BufferUnpin().

The attached patch demonstrates that less aggressive spinning and
(much) more often delaying improves the performance "on this type of
machine". The 8 socket machine in question scales to over 350,000 TPS.

Even that seems quite low. I've gotten over 500k TPS on a four socket
x86 machine, and about 700k on a 8 socket x86 machine.

There is more wrong with the machine in question than just that. But for
the moment I am satisfied with having a machine where I can reproduce
this phenomenon in what appears to be a worst case.

Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.

I have the impression that the code assumes that there is little penalty
for accessing the shared byte in a tight loop from any number of cores
in parallel. That apparently is true for some architectures and core
counts, but no longer holds for these machines with many sockets.

Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


#5 Andres Freund
andres@anarazel.de
In reply to: Jan Wieck (#4)
Re: s_lock() seems too aggressive for machines with many sockets

Hi,

On 2015-06-10 09:54:00 -0400, Jan Wieck wrote:

model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz

numactl --hardware shows the distance to the attached memory as 10, the
distance to every other node as 21. I interpret that as the machine having
one NUMA bus with all cpu packages attached to that, rather than individual
connections from cpu to cpu or something different.

Generally that doesn't say very much - IIRC the distances are defined by
the BIOS.

What led me into that spinlock area was the fact that a wall clock based
systemtap FlameGraph showed a large portion of the time spent in
BufferPin() and BufferUnpin().

I've seen that as a bottleneck in the past as well. My plan to fix that
is to "simply" make buffer pinning lockless for the majority of cases. I
don't have access to hardware to test that at higher node counts atm
though.

My guess is that the pins are on the btree root pages. But it'd be good
to confirm that.

Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.

I have the impression that the code assumes that there is little penalty for
accessing the shared byte in a tight loop from any number of cores in
parallel. That apparently is true for some architectures and core counts,
but no longer holds for these machines with many sockets.

It's just generally a tradeoff. It's beneficial to spin longer if
there's only mild amounts of contention. If the likelihood of getting
the spinlock soon is high (i.e. existing, but low contention), it'll
nearly always be beneficial to spin. If the likelihood is low, it'll be
mostly beneficial to sleep. The latter is especially true if a machine
is sufficiently overcommitted that it's likely that it'll sleep while
holding a spinlock. The danger of sleeping while holding a spinlock,
without targeted wakeup, is why spinlocks in userspace aren't a really
good idea.

My bet is that if you'd measure using different number iterations for
different spinlocks you'd find some where the higher number of
iterations is rather beneficial as well.

Greetings,

Andres Freund


#6 Nils Goroll
slink@schokola.de
In reply to: Jan Wieck (#1)
Re: s_lock() seems too aggressive for machines with many sockets

On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:

Ref: /messages/by-id/4FEDE0BF.7080203@schokola.de

I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.

Nils


#7 Nils Goroll
slink@schokola.de
In reply to: Andres Freund (#5)
Re: s_lock() seems too aggressive for machines with many sockets

On 10/06/15 16:05, Andres Freund wrote:

it'll nearly always be beneficial to spin

Trouble is that postgres cannot know if the process holding the lock actually
does run, so if it doesn't, all we're doing is burn cycles and make the problem
worse.

Contrary to that, the kernel does know, so for a (f|m)utex which fails to
acquire immediately and thus needs to syscall, the kernel has the option to spin
only if the lock holder is running (the "adaptive" mutex).

Nils


#8 Jan Wieck
JanWieck@Yahoo.com
In reply to: Nils Goroll (#6)
Re: s_lock() seems too aggressive for machines with many sockets

On 06/10/2015 10:07 AM, Nils Goroll wrote:

On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:

Ref: /messages/by-id/4FEDE0BF.7080203@schokola.de

I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.

I have played with test code that isolates a stripped down version of
s_lock() and uses it with multiple threads. I then implemented multiple
different versions of that s_lock(). The results with 200 concurrent
threads are that using a __sync_val_compare_and_swap() to acquire the
lock and then falling back to a futex() is limited to about 500,000
locks/second. Spinning for 10 times and then doing a usleep(1000) (one
millisecond) gives me 25 million locks/second.

Note that the __sync_val_compare_and_swap() GCC built in seems identical
in performance with the assembler xchgb operation used by PostgreSQL
today on x86_64.

Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#1)
Re: s_lock() seems too aggressive for machines with many sockets

Jan Wieck <jan@wi3ck.info> writes:

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".

Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)

regards, tom lane


#10 Andres Freund
andres@anarazel.de
In reply to: Nils Goroll (#7)
Re: s_lock() seems too aggressive for machines with many sockets

On 2015-06-10 16:12:05 +0200, Nils Goroll wrote:

On 10/06/15 16:05, Andres Freund wrote:

it'll nearly always be beneficial to spin

Trouble is that postgres cannot know if the process holding the lock actually
does run, so if it doesn't, all we're doing is burn cycles and make the problem
worse.

That's precisely what I referred to in the bit you cut away...

Contrary to that, the kernel does know, so for a (f|m)utex which fails to
acquire immediately and thus needs to syscall, the kernel has the option to spin
only if the lock holder is running (the "adaptive" mutex).

Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far. And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).

I think we should eventually work on replacing most of the currently
spinlock using code to either use lwlocks (which will enter the kernel
under contention, but not otherwise) or use lockless programming
techniques. I think there's relatively few relevant places left. Most
prominently the buffer header spinlocks...


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#10)
Re: s_lock() seems too aggressive for machines with many sockets

Andres Freund <andres@anarazel.de> writes:

Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far. And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).

Which, you'll note, is the ONLY case that's allowed by our coding rules
for spinlock use. If there are any locking sections that are not very
short straight-line code, or at least code with easily predicted branches,
we need to fix those before we worry about the spinlock mechanism per se.
Optimizing for misuse of the mechanism is not the way.

regards, tom lane


#12 Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#9)
Re: s_lock() seems too aggressive for machines with many sockets

On 06/10/2015 10:20 AM, Tom Lane wrote:

Jan Wieck <jan@wi3ck.info> writes:

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".

Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)

The feedback loop leans heavily toward increasing the spin count vs.
decreasing it (100 up vs. 1 down). I have test time booked on the machine
for tomorrow and will test that as well.

However, to me it seems that with the usual minimum sleep() interval of
1ms, once we have to delay at all we are already losing. That spinning
10x still outperforms the same code with 1,000 spins per delay by a factor
of 5 tells me that "on this particular box" something is going horribly
wrong once we get over the tipping point in concurrency. As said, I am
not sure what exactly that is yet. At a minimum the probability that
another CPU package is stealing the cache line from the one, holding the
spinlock, is going up. Which cannot possibly be good for performance.
But I would expect that to show a more gradual drop in throughput than
what I see in the pgbench -S example. Kevin had speculated to me that it
may be possible that at that tipping point the kernel starts feeling the
need to relocate the memory page in question to whichever cpu package
makes the most failing requests and thus ends up with a huge round robin
page relocation project. Unfortunately I don't know how to confirm or
disprove that theory.

This is done on CentOS 7 with kernel 3.10 BTW. And no, I am not at
liberty to install a different distribution or switch to another kernel.

Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


#13 Nils Goroll
slink@schokola.de
In reply to: Andres Freund (#10)
Re: s_lock() seems too aggressive for machines with many sockets

On 10/06/15 16:20, Andres Freund wrote:

That's precisely what I referred to in the bit you cut away...

I apologize, yes.

On 10/06/15 16:25, Tom Lane wrote:

Optimizing for misuse of the mechanism is not the way.

I absolutely agree and I really appreciate all efforts towards lockless data
structures or at least better concurrency using classical mutual exclusion.

But still I am convinced that on today's massively parallel NUMAs, spinlocks are
plain wrong:

- Even if critical sections are kept minimal, they can still become hot spots

- When they do, we get potentially massive negative scalability; it will be
hard to exclude the possibility of a system "tilting" under (potentially yet
unknown) load patterns as long as userland slocks exist.

Briefly: When slocks fail, they fail big time

- slocks optimize for the best case, but I think on today's systems we should
optimize for the worst case.

- The fact that well behaved mutexes have a higher initial cost could even
motivate good use of them rather than optimize misuse.

Cheers,

Nils


#14 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#11)
Re: s_lock() seems too aggressive for machines with many sockets

On 2015-06-10 10:25:32 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far. And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).

Which, you'll note, is the ONLY case that's allowed by our coding rules
for spinlock use. If there are any locking sections that are not very
short straight-line code, or at least code with easily predicted branches,
we need to fix those before we worry about the spinlock mechanism per
se.

We haven't followed that all that strictly imo. While lwlocks are a bit
less problematic in 9.5 (as they take far fewer spinlocks), they're
still far from perfect as we manipulate linked lists while holding a
lock. We also do lots of hard-to-predict stuff while the buffer header
spinlock is held...

Optimizing for misuse of the mechanism is not the way.

Agreed. I'm not particularly interested in optimizing spinlocks. We
should get rid of most.

Greetings,

Andres Freund


#15 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#9)
Re: s_lock() seems too aggressive for machines with many sockets

On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jan Wieck <jan@wi3ck.info> writes:

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".

Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)

The code seems to have been written with the idea that we should
converge to MAX_SPINS_PER_DELAY if spinning *ever* works. The way
that's implemented is that, if we get a spinlock without having to
delay, we add 100 to spins_per_delay, but if we have to delay at least
once (potentially hundreds of times), then we subtract 1.
spins_per_delay will be >900 most of the time even if only 1% of the
lock acquisitions manage to get the lock without delaying.

It is possible that, as you say, all we need to do is alter the
feedback loop so that, say, we subtract 1 every time we delay (rather
than every time we have at least 1 delay) and add 1 (rather than 100)
every time we don't end up needing to delay. I'm a bit concerned,
though, that this would tend to make spins_per_delay unstable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#16 Andres Freund
andres@anarazel.de
In reply to: Nils Goroll (#13)
Re: s_lock() seems too aggressive for machines with many sockets

On 2015-06-10 16:55:31 +0200, Nils Goroll wrote:

But still I am convinced that on today's massively parallel NUMAs, spinlocks are
plain wrong:

Sure. But a large number of installations are not using massive NUMA
systems, so we can't focus on optimizing for NUMA.

We definitely have quite some catchup to do there. Unfortunately most of
the problems are only reproducible on 4, 8 socket machines, and it's
hard to get hands on those for prolonged amounts of time.

- Even if critical sections are kept minimal, they can still become hot spots

That's why we started to remove several of them...

- The fact that well behaved mutexes have a higher initial cost could even
motivate good use of them rather than optimize misuse.

Well. There's many locks in a RDBMS that can't realistically be
avoided. So optimizing for no and moderate contention isn't something
you can simply forgo.


#17 Jan Wieck
JanWieck@Yahoo.com
In reply to: Robert Haas (#15)
Re: s_lock() seems too aggressive for machines with many sockets

On 06/10/2015 10:59 AM, Robert Haas wrote:

On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jan Wieck <jan@wi3ck.info> writes:

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".

Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)

The code seems to have been written with the idea that we should
converge to MAX_SPINS_PER_DELAY if spinning *ever* works. The way
that's implemented is that, if we get a spinlock without having to
delay, we add 100 to spins_per_delay, but if we have to delay at least
once (potentially hundreds of times), then we subtract 1.
spins_per_delay will be >900 most of the time even if only 1% of the
lock acquisitions manage to get the lock without delaying.

And note that spins_per_delay is global. Getting ANY lock without delay
adds 100, regardless of whether that lock is highly or lightly contended.
Your process only needs to hit one low-contention slock every 100 calls to
securely peg this value >=900.

Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


#18 Nils Goroll
slink@schokola.de
In reply to: Jan Wieck (#8)
Re: s_lock() seems too aggressive for machines with many sockets

On 10/06/15 16:18, Jan Wieck wrote:

I have played with test code that isolates a stripped down version of s_lock()
and uses it with multiple threads. I then implemented multiple different
versions of that s_lock(). The results with 200 concurrent threads are that
using a __sync_val_compare_and_swap() to acquire the lock and then falling back
to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.

Note that the __sync_val_compare_and_swap() GCC built in seems identical in
performance with the assembler xchgb operation used by PostgreSQL today on x86_64.

These numbers don't work for me. Do I understand correctly that you are not
holding the lock for any reasonable time? If yes, the test case is invalid (the
uncontended case is never relevant). If no, the numbers don't match up - if you
held one lock for 1ms, you'd not get more than 1,000 locks/s anyway. If you had
200 locks, you'd get 200,000 locks/s.

Can you please explain what the message is you are trying to get across?

Thanks, Nils


#19 Jan Wieck
JanWieck@Yahoo.com
In reply to: Nils Goroll (#18)
Re: s_lock() seems too aggressive for machines with many sockets

On 06/10/2015 11:06 AM, Nils Goroll wrote:

On 10/06/15 16:18, Jan Wieck wrote:

I have played with test code that isolates a stripped down version of s_lock()
and uses it with multiple threads. I then implemented multiple different
versions of that s_lock(). The results with 200 concurrent threads are that
using a __sync_val_compare_and_swap() to acquire the lock and then falling back
to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.

Note that the __sync_val_compare_and_swap() GCC built in seems identical in
performance with the assembler xchgb operation used by PostgreSQL today on x86_64.

These numbers don't work for me. Do I understand correctly that you are not
holding the lock for any reasonable time? If yes, the test case is invalid (the
uncontended case is never relevant). If no, the numbers don't match up - if you
held one lock for 1ms, you'd not get more than 1,000 locks/s anyway. If you had
200 locks, you'd get 200,000 locks/s.

Can you please explain what the message is you are trying to get across?

The test case is that 200 threads are running in a tight loop like this:

    for (...)
    {
        s_lock();
        // do something with a global variable
        s_unlock();
    }

That is the most contended case I can think of, yet the short and
predictable code while holding the lock is the intended use case for a
spinlock.

The code in s_lock() is what is doing multiple CAS attempts, then sleep.
The code is never holding the lock for 1ms. Sorry if that wasn't clear.

Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


#20 Nils Goroll
slink@schokola.de
In reply to: Andres Freund (#16)
Re: s_lock() seems too aggressive for machines with many sockets

On 10/06/15 17:01, Andres Freund wrote:

- The fact that well behaved mutexes have a higher initial cost could even
motivate good use of them rather than optimize misuse.

Well. There's many locks in a RDBMS that can't realistically be
avoided. So optimizing for no and moderate contention isn't something
you can simply forgo.

Let's get back to my initial suggestion:

On 10/06/15 16:07, Nils Goroll wrote:

I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.

Where we use the mutex patch we have not seen any relevant negative impact -
neither in benchmarks nor in production.

So, yes, postgres should still work fine on a 2-core laptop and I don't see any
reason why using posix mutexes *where they are known to behave* would do any harm.

And, to be honest, Linux is quite dominant, so solving the issue for this
platform would be a start at least.

Nils


#21 Nils Goroll
slink@schokola.de
In reply to: Jan Wieck (#19)

#22 Andres Freund
andres@anarazel.de
In reply to: Nils Goroll (#6)

#23 Nils Goroll
slink@schokola.de
In reply to: Andres Freund (#22)

#24 Andres Freund
andres@anarazel.de
In reply to: Jan Wieck (#19)

#25 Andres Freund
andres@anarazel.de
In reply to: Nils Goroll (#23)

#26 Jan Wieck
JanWieck@Yahoo.com
In reply to: Andres Freund (#24)

#27 Nils Goroll
slink@schokola.de
In reply to: Andres Freund (#25)

#28 Andres Freund
andres@anarazel.de
In reply to: Jan Wieck (#26)

#29 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)

#30 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#29)

#31 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#30)

#32 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#31)

#33 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#32)

#34 Jan Wieck
JanWieck@Yahoo.com
In reply to: Jan Wieck (#1)

#35 Robert Haas
robertmhaas@gmail.com
In reply to: Jan Wieck (#34)