s_lock() seems too aggressive for machines with many sockets
Hi,
I think I may have found one of the problems PostgreSQL has on machines
with many NUMA nodes. I am not yet sure what exactly happens on the NUMA
bus, but there seems to be a tipping point at which the spinlock
concurrency wreaks havoc and the performance of the database collapses.
On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total,
a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput
then takes a very sharp dive and reaches around 20,000 TPS at 120
clients. It never recovers from there.
The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
The 8 socket machine in question scales to over 350,000 TPS.
The patch is meant to demonstrate this effect only. It has a negative
performance impact on smaller machines and client counts < #cores, so
the real solution will probably look much different. But I thought it
would be good to share this and start the discussion about reevaluating
the spinlock code before PGCon.
Regards, Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
Attachment: spins_per_delay.diff (text/x-patch)
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 0afcba1..edaa8fd 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -83,8 +83,8 @@ s_lock(volatile slock_t *lock, const char *file, int line)
* the probability of unintended failure) than to fix the total time
* spent.
*/
-#define MIN_SPINS_PER_DELAY 10
-#define MAX_SPINS_PER_DELAY 1000
+#define MIN_SPINS_PER_DELAY 4
+#define MAX_SPINS_PER_DELAY 10
#define NUM_DELAYS 1000
#define MIN_DELAY_USEC 1000L
#define MAX_DELAY_USEC 1000000L
@@ -144,7 +144,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
{
/* we never had to delay */
if (spins_per_delay < MAX_SPINS_PER_DELAY)
- spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
+ spins_per_delay = Min(spins_per_delay + 2, MAX_SPINS_PER_DELAY);
}
else
{
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c63cf54..42eb353 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -973,7 +973,7 @@ extern slock_t dummy_spinlock;
extern int s_lock(volatile slock_t *lock, const char *file, int line);
/* Support for dynamic adjustment of spins_per_delay */
-#define DEFAULT_SPINS_PER_DELAY 100
+#define DEFAULT_SPINS_PER_DELAY 10
extern void set_spins_per_delay(int shared_spins_per_delay);
extern int update_spins_per_delay(int shared_spins_per_delay);
On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:
On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
never recovers from there.
85k? Phew, that's pretty bad. What exact type of CPU is this? Which
pgbench scale? Did you use -M prepared?
Could you share a call graph perf profile?
The attached patch demonstrates that less aggressive spinning and
(much) more often delaying improves the performance "on this type of
machine". The 8 socket machine in question scales to over 350,000 TPS.
Even that seems quite low. I've gotten over 500k TPS on a four socket
x86 machine, and about 700k on an 8 socket x86 machine.
Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.
Greetings,
Andres Freund
On Wed, Jun 10, 2015 at 09:18:56AM -0400, Jan Wieck wrote:
The attached patch demonstrates that less aggressive spinning and
(much) more often delaying improves the performance "on this type of
machine". The 8 socket machine in question scales to over 350,000
TPS.
The patch is meant to demonstrate this effect only. It has a
negative performance impact on smaller machines and client counts <
#cores, so the real solution will probably look much different. But
I thought it would be good to share this and start the discussion
about reevaluating the spinlock code before PGCon.
Wow, you are in that code! We kind of guessed on some of those
constants many years ago, and never revisited it. It would be nice to
get a better heuristic for those, but I am concerned it would require
the user to specify the number of CPU cores.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 06/10/2015 09:28 AM, Andres Freund wrote:
On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:
On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
never recovers from there.
85k? Phew, that's pretty bad. What exact type of CPU is this? Which
pgbench scale? Did you use -M prepared?
model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
numactl --hardware shows the distance to the attached memory as 10, the
distance to every other node as 21. I interpret that as the machine
having one NUMA bus with all cpu packages attached to that, rather than
individual connections from cpu to cpu or something different.
pgbench scale=300, -Msimple.
Could you share a call graph perf profile?
I do not have them handy at the moment and the machine is in use for
something else until tomorrow. I will forward perf and systemtap based
graphs ASAP.
What led me into that spinlock area was the fact that a wall clock based
systemtap FlameGraph showed a large portion of the time spent in
BufferPin() and BufferUnpin().
The attached patch demonstrates that less aggressive spinning and
(much) more often delaying improves the performance "on this type of
machine". The 8 socket machine in question scales to over 350,000 TPS.Even that seems quite low. I've gotten over 500k TPS on a four socket
x86 machine, and about 700k on a 8 socket x86 machine.
There is more wrong with the machine in question than just that. But for
the moment I am satisfied with having a machine where I can reproduce
this phenomenon in what appears to be a worst case.
Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.
I have the impression that the code assumes that there is little penalty
for accessing the shared byte in a tight loop from any number of cores
in parallel. That apparently is true for some architectures and core
counts, but no longer holds for these machines with many sockets.
Regards, Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
Hi,
On 2015-06-10 09:54:00 -0400, Jan Wieck wrote:
model name : Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
numactl --hardware shows the distance to the attached memory as 10, the
distance to every other node as 21. I interpret that as the machine having
one NUMA bus with all cpu packages attached to that, rather than individual
connections from cpu to cpu or something different.
Generally that doesn't say very much - IIRC the distances are defined by
the BIOS.
What led me into that spinlock area was the fact that a wall clock based
systemtap FlameGraph showed a large portion of the time spent in
BufferPin() and BufferUnpin().
I've seen that as a bottleneck in the past as well. My plan to fix that
is to "simply" make buffer pinning lockless for the majority of cases. I
don't have access to hardware to test that at higher node counts atm
though.
My guess is that the pins are on the btree root pages. But it'd be good
to confirm that.
Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.
I have the impression that the code assumes that there is little penalty for
accessing the shared byte in a tight loop from any number of cores in
parallel. That apparently is true for some architectures and core counts,
but no longer holds for these machines with many sockets.
It's just generally a tradeoff. It's beneficial to spin longer if
there's only mild amounts of contention. If the likelihood of getting
the spinlock soon is high (i.e. existing, but low contention), it'll
nearly always be beneficial to spin. If the likelihood is low, it'll be
mostly beneficial to sleep. The latter is especially true if a machine
is sufficiently overcommitted that it's likely that it'll sleep while
holding a spinlock. The danger of sleeping while holding a spinlock,
without targeted wakeup, is why spinlocks in userspace aren't a really
good idea.
My bet is that if you'd measure using different numbers of iterations for
different spinlocks you'd find some where the higher number of
iterations is rather beneficial as well.
Greetings,
Andres Freund
On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:
Ref: /messages/by-id/4FEDE0BF.7080203@schokola.de
I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.
Nils
On 10/06/15 16:05, Andres Freund wrote:
it'll nearly always be beneficial to spin
Trouble is that postgres cannot know if the process holding the lock actually
does run, so if it doesn't, all we're doing is burn cycles and make the problem
worse.
Contrary to that, the kernel does know, so for a (f|m)utex which fails to
acquire immediately and thus needs to syscall, the kernel has the option to spin
only if the lock holder is running (the "adaptive" mutex).
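For illustration, this is roughly how such an adaptive mutex is set up on glibc
(a sketch only; PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable GNU extension and
the lock variable and function names here are made up):

#define _GNU_SOURCE
#include <pthread.h>

/* glibc's "adaptive" mutex spins briefly in userspace and only then
 * falls back to a futex wait in the kernel. */
static pthread_mutex_t demo_lock;

static void
init_adaptive_lock(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&demo_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}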
Nils
On 06/10/2015 10:07 AM, Nils Goroll wrote:
On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:
Ref: /messages/by-id/4FEDE0BF.7080203@schokola.de
I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.
I have played with test code that isolates a stripped down version of
s_lock() and uses it with multiple threads. I then implemented multiple
different versions of that s_lock(). The results with 200 concurrent
threads are that using a __sync_val_compare_and_swap() to acquire the
lock and then falling back to a futex() is limited to about 500,000
locks/second. Spinning for 10 times and then doing a usleep(1000) (one
millisecond) gives me 25 million locks/second.
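Roughly, the spin-then-sleep variant of that test looks like this (a sketch
only, not the actual test harness; the lock variable and helper names are
made up):

#include <unistd.h>

static volatile int test_lock = 0;

static void
test_lock_acquire(void)
{
    for (;;)
    {
        int i;

        for (i = 0; i < 10; i++)
        {
            /* returns the previous value; 0 means we got the lock */
            if (__sync_val_compare_and_swap(&test_lock, 0, 1) == 0)
                return;
        }
        usleep(1000);           /* back off ~1ms before spinning again */
    }
}

static void
test_lock_release(void)
{
    __sync_lock_release(&test_lock);    /* store 0 with release semantics */
}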
Note that the __sync_val_compare_and_swap() GCC built-in seems identical
in performance to the assembler xchgb operation used by PostgreSQL
today on x86_64.
Regards, Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
Jan Wieck <jan@wi3ck.info> writes:
The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.
(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)
regards, tom lane
On 2015-06-10 16:12:05 +0200, Nils Goroll wrote:
On 10/06/15 16:05, Andres Freund wrote:
it'll nearly always be beneficial to spin
Trouble is that postgres cannot know if the process holding the lock actually
does run, so if it doesn't, all we're doing is burn cycles and make the problem
worse.
That's precisely what I referred to in the bit you cut away...
Contrary to that, the kernel does know, so for a (f|m)utex which fails to
acquire immediately and thus needs to syscall, the kernel has the option to spin
only if the lock holder is running (the "adaptive" mutex).
Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far. And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).
I think we should eventually work on replacing most of the currently
spinlock using code to either use lwlocks (which will enter the kernel
under contention, but not otherwise) or use lockless programming
techniques. I think there are relatively few relevant places left. Most
prominently the buffer header spinlocks...
Andres Freund <andres@anarazel.de> writes:
Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far. And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).
Which, you'll note, is the ONLY case that's allowed by our coding rules
for spinlock use. If there are any locking sections that are not very
short straight-line code, or at least code with easily predicted branches,
we need to fix those before we worry about the spinlock mechanism per se.
Optimizing for misuse of the mechanism is not the way.
regards, tom lane
On 06/10/2015 10:20 AM, Tom Lane wrote:
Jan Wieck <jan@wi3ck.info> writes:
The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.
(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)
The feedback loop looks heavily biased towards increasing the spin
count vs. decreasing it (100 up vs. 1 down). I have test time booked on
the machine for tomorrow and will test that as well.
However, to me it seems that with the usual minimum sleep() interval of
1ms, once we have to delay at all we are already losing. That spinning
10x still outperforms the same code with 1,000 spins per delay by a factor
of 5 tells me that "on this particular box" something is going horribly
wrong once we get over the tipping point in concurrency. As said, I am
not sure what exactly that is yet. At a minimum the probability that
another CPU package is stealing the cache line from the one holding the
spinlock is going up. Which cannot possibly be good for performance.
But I would expect that to show a more gradual drop in throughput than
what I see in the pgbench -S example. Kevin had speculated to me that it
may be possible that at that tipping point the kernel starts feeling the
need to relocate the memory page in question to whichever cpu package
makes the most failing requests and thus ends up with a huge round robin
page relocation project. Unfortunately I don't know how to confirm or
disprove that theory.
This is done on CentOS 7 with kernel 3.10 BTW. And no, I am not at
liberty to install a different distribution or switch to another kernel.
Regards, Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
On 10/06/15 16:20, Andres Freund wrote:
That's precisely what I referred to in the bit you cut away...
I apologize, yes.
On 10/06/15 16:25, Tom Lane wrote:
Optimizing for misuse of the mechanism is not the way.
I absolutely agree and I really appreciate all efforts towards lockless data
structures or at least better concurrency using classical mutual exclusion.
But still I am convinced that on today's massively parallel NUMAs, spinlocks are
plain wrong:
- Even if critical sections are kept minimal, they can still become hot spots
- When they do, we get potentially massive negative scalability; it will be
hard to exclude the possibility of a system "tilting" under (potentially yet
unknown) load patterns as long as userland slocks exist.
Briefly: When slocks fail, they fail big time
- slocks optimize for the best case, but I think on today's systems we should
optimize for the worst case.
- The fact that well behaved mutexes have a higher initial cost could even
motivate good use of them rather than optimize misuse.
Cheers,
Nils
On 2015-06-10 10:25:32 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far. And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).
Which, you'll note, is the ONLY case that's allowed by our coding rules
for spinlock use. If there are any locking sections that are not very
short straight-line code, or at least code with easily predicted branches,
we need to fix those before we worry about the spinlock mechanism per
se.
We haven't followed that all that strictly imo. While lwlocks are a bit
less problematic in 9.5 (as they take far fewer spinlocks), they're
still far from perfect as we manipulate linked lists while holding a
lock. We also do lots of hard-to-predict stuff while the buffer header
spinlock is held...
Optimizing for misuse of the mechanism is not the way.
Agreed. I'm not particularly interested in optimizing spinlocks. We
should get rid of most.
Greetings,
Andres Freund
On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jan Wieck <jan@wi3ck.info> writes:
The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.
(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)
The code seems to have been written with the idea that we should
converge to MAX_SPINS_PER_DELAY if spinning *ever* works. The way
that's implemented is that, if we get a spinlock without having to
delay, we add 100 to spins_per_delay, but if we have to delay at least
once (potentially hundreds of times), then we subtract 1.
spins_per_delay will be >900 most of the time even if only 1% of the
lock acquisitions manage to get the lock without delaying.
It is possible that, as you say, all we need to do is alter the
feedback loop so that, say, we subtract 1 every time we delay (rather
than every time we have at least 1 delay) and add 1 (rather than 100)
every time we don't end up needing to delay. I'm a bit concerned,
though, that this would tend to make spins_per_delay unstable.
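Concretely, the symmetric variant would amount to something like this at the
end of s_lock() (a sketch of the idea, not a patch; it assumes the existing
cur_delay and delays counters in that function):

    /* Sketch: adjust by 1 per event in both directions, instead of adding
     * 100 per delay-free acquisition and subtracting 1 per acquisition
     * that delayed at least once. */
    if (cur_delay == 0)
    {
        /* we never had to delay: nudge the spin count up gently */
        spins_per_delay = Min(spins_per_delay + 1, MAX_SPINS_PER_DELAY);
    }
    else
    {
        /* subtract 1 for every sleep performed during this acquisition */
        spins_per_delay = Max(spins_per_delay - delays, MIN_SPINS_PER_DELAY);
    }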
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-10 16:55:31 +0200, Nils Goroll wrote:
But still I am convinced that on today's massively parallel NUMAs, spinlocks are
plain wrong:
Sure. But a large number of installations are not using massive NUMA
systems, so we can't focus solely on optimizing for NUMA.
We definitely have quite some catchup to do there. Unfortunately most of
the problems are only reproducible on 4, 8 socket machines, and it's
hard to get hand on those for prolonged amounts of time.
- Even if critical sections are kept minimal, they can still become hot spots
That's why we started to remove several of them...
- The fact that well behaved mutexes have a higher initial cost could even
motivate good use of them rather than optimize misuse.
Well. There are many locks in an RDBMS that can't realistically be
avoided. So optimizing for no and moderate contention isn't something
you can simply forgo.
On 06/10/2015 10:59 AM, Robert Haas wrote:
On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jan Wieck <jan@wi3ck.info> writes:
The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
Hm. One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help. The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no? Maybe we just need to alter the feedback loop a bit.
(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)
The code seems to have been written with the idea that we should
converge to MAX_SPINS_PER_DELAY if spinning *ever* works. The way
that's implemented is that, if we get a spinlock without having to
delay, we add 100 to spins_per_delay, but if we have to delay at least
once (potentially hundreds of times), then we subtract 1.
spins_per_delay will be >900 most of the time even if only 1% of the
lock acquisitions manage to get the lock without delaying.
And note that spins_per_delay is global. Getting ANY lock without delay
adds 100, regardless of whether it is a highly or lightly contended one. Your
process only needs to hit one low-contention slock every 100 calls to
keep this value pegged at >=900.
Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
On 10/06/15 16:18, Jan Wieck wrote:
I have played with test code that isolates a stripped down version of s_lock()
and uses it with multiple threads. I then implemented multiple different
versions of that s_lock(). The results with 200 concurrent threads are that
using a __sync_val_compare_and_swap() to acquire the lock and then falling back
to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.
Note that the __sync_val_compare_and_swap() GCC built-in seems identical in
performance to the assembler xchgb operation used by PostgreSQL today on x86_64.
These numbers don't work for me. Do I understand correctly that you are not holding the lock for
These numbers don't work for me. Do I understand correctly that you are not
holding the lock for any reasonable time? If yes, the test case is invalid (the
uncontended case is never relevant). If no, the numbers don't match up - if you
held one lock for 1ms, you'd not get more than 1,000 locks/s anyway. If you had
200 locks, you'd get 200,000 locks/s.
Can you please explain what the message is you are trying to get across?
Thanks, Nils
On 06/10/2015 11:06 AM, Nils Goroll wrote:
On 10/06/15 16:18, Jan Wieck wrote:
I have played with test code that isolates a stripped down version of s_lock()
and uses it with multiple threads. I then implemented multiple different
versions of that s_lock(). The results with 200 concurrent threads are that
using a __sync_val_compare_and_swap() to acquire the lock and then falling back
to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.
Note that the __sync_val_compare_and_swap() GCC built-in seems identical in
performance to the assembler xchgb operation used by PostgreSQL today on x86_64.
These numbers don't work for me. Do I understand correctly that you are not holding the lock for
any reasonable time? If yes, the test case is invalid (the uncontended case is
never relevant). If no, the numbers don't match up - if you held one lock for
1ms, you'd not get more than 1,000 locks/s anyway. If you had 200 locks, you'd
get 200,000 locks/s.
Can you please explain what the message is you are trying to get across?
The test case is that 200 threads are running in a tight loop like this:
for (...)
{
    s_lock();
    // do something with a global variable
    s_unlock();
}
That is the most contended case I can think of, yet the short and
predictable code while holding the lock is the intended use case for a
spinlock.
The code in s_lock() is what is doing multiple CAS attempts, then sleep.
The code is never holding the lock for 1ms. Sorry if that wasn't clear.
Regards, Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
On 10/06/15 17:01, Andres Freund wrote:
- The fact that well behaved mutexes have a higher initial cost could even
motivate good use of them rather than optimize misuse.
Well. There are many locks in an RDBMS that can't realistically be
avoided. So optimizing for no and moderate contention isn't something
you can simply forgo.
Let's get back to my initial suggestion:
On 10/06/15 16:07, Nils Goroll wrote:
I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.
Where we use the mutex patch we have not seen any relevant negative impact -
neither in benchmarks nor in production.
So, yes, postgres should still work fine on a 2-core laptop and I don't see any
reason why using posix mutexes *where they are known to behave* would do any harm.
And, to be honest, Linux is quite dominant, so solving the issue for this
platform would be a start at least.
Nils
On 10/06/15 17:12, Jan Wieck wrote:
for (...)
{
    s_lock();
    // do something with a global variable
    s_unlock();
}
OK, I understand now, thank you. I am not sure if this test case is appropriate
for the critical sections in postgres (if it was, we'd not have the problem we
are discussing).
Nils
On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:
Interesting. I've been able to reproduce quite massive slowdowns doing
this on a 4 socket linux machine (after applying the lwlock patch that's
now in 9.5) in readonly workloads. As in 200%+ slower. And that was with
a new kernel/glibc. That was primarily due to buffer header
spinlocks. For write workloads the difference was smaller, but still
noticeable. There, xlog.c's spinlocks, which are usually held only very
briefly, were noticeable.
Do you have any details about the workloads that scaled badly back then?
It'd be interesting to find out which spinlocks they bottlenecked
on.
I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.
If we get rid of the 'hot' spinlocks something very roughly like this
will possibly be realistic (we can't rely on pthreads, but ...). I don't
think it'll be before that.
On 10/06/15 17:17, Andres Freund wrote:
On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:
Interesting. I've been able to reproduce quite massive slowdowns doing
this on a 4 socket linux machine (after applying the lwlock patch that's
now in 9.5)
Sorry, I cannot comment on this; 9.4.1 is the latest we are running in
Sorry, I cannot comment on this, 9.4.1 is the latest we are running in
production and I haven't even tested the patch with 9.5.
As in 200%+ slower.
Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?
Do you have any details about the workloads that scaled badly back then?
It'd be interesting to find out which spinlocks they bottlenecked
on.
OLTP. But really the root cause from back then should be eliminated; this was
with 9.1.3.
I only got woken up by s_lock() in email subjects.
Nils
On 2015-06-10 11:12:46 -0400, Jan Wieck wrote:
The test case is that 200 threads are running in a tight loop like this:
for (...)
{
    s_lock();
    // do something with a global variable
    s_unlock();
}
That is the most contended case I can think of, yet the short and
predictable code while holding the lock is the intended use case for a
spinlock.
But lots of our code isn't that predictable. Check e.g. the buffer
spinlock case:
static void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
    PrivateRefCountEntry *ref;

    /* not moving as we're likely deleting it soon anyway */
    ref = GetPrivateRefCountEntry(buf->buf_id + 1, false);
    Assert(ref != NULL);

    if (fixOwner)
        ResourceOwnerForgetBuffer(CurrentResourceOwner,
                                  BufferDescriptorGetBuffer(buf));

    Assert(ref->refcount > 0);
    ref->refcount--;
    if (ref->refcount == 0)
    {
        /* I'd better not still hold any locks on the buffer */
        Assert(!LWLockHeldByMe(buf->content_lock));
        Assert(!LWLockHeldByMe(buf->io_in_progress_lock));

        LockBufHdr(buf);

        /* Decrement the shared reference count */
        Assert(buf->refcount > 0);
        buf->refcount--;

        /* Support LockBufferForCleanup() */
        if ((buf->flags & BM_PIN_COUNT_WAITER) &&
            buf->refcount == 1)
        {
            /* we just released the last pin other than the waiter's */
            int         wait_backend_pid = buf->wait_backend_pid;

            buf->flags &= ~BM_PIN_COUNT_WAITER;
            UnlockBufHdr(buf);
            ProcSendSignal(wait_backend_pid);
        }
        else
            UnlockBufHdr(buf);

        ForgetPrivateRefCountEntry(ref);
    }
}
If you check the section where the spinlock is held there's nontrivial
code executed. Under contention you'll have the problem that if a backend
tries to acquire the spinlock while another backend holds the lock,
it'll "steal" the cacheline on which the lock and the rest of the buffer
descriptor resides. Which means that operations like buf->refcount--, the
if (), and the &= will now have to wait till the cacheline is
transferred back (as the cacheline will directly be marked as modified
on the attempted locker). All the while other backends will also try to
acquire the pin, causing the same kind of trouble.
If this'd instead be written as
ret = pg_atomic_fetch_sub_u32(&buf->state, 1);

if (ret & BM_PIN_COUNT_WAITER)
{
    pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
    /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
}
the whole thing looks entirely different. While there's obviously still
cacheline bouncing, every operation is guaranteed to make progress (on
x86 at least, but even elsewhere we'll never block another locker).
Now unlock is the simpler case, but...
- Andres
On 2015-06-10 17:30:33 +0200, Nils Goroll wrote:
On 10/06/15 17:17, Andres Freund wrote:
On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
Interesting. I've been able to reproduce quite massive slowdowns doing
this on a 4 socket linux machine (after applying the lwlock patch that's
now in 9.5)
Sorry, I cannot comment on this; 9.4.1 is the latest we are running in
production and I haven't even tested the patch with 9.5.
Ok.
As in 200%+ slower.
Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?
Yes.
Do you have any details about the workloads that scaled badly back then?
It'd be interesting to find out which spinlocks they bottlenecked
on.
OLTP. But really the root cause from back then should be eliminated; this was
with 9.1.3.
Hm, ok. Any chance you have profiles from back then? It'd be very
interesting to know which spinlock you were contending on. If we convert
spinlocks into something more heavyweight we'll want to benchmark the
actually contending cases to avoid regressions.
On 06/10/2015 11:34 AM, Andres Freund wrote:
If you check the section where the spinlock is held there's nontrivial
code executed. Under contention you'll have the problem that if a backend
tries to acquire the spinlock while another backend holds the lock,
it'll "steal" the cacheline on which the lock and the rest of the buffer
descriptor resides. Which means that operations like buf->refcount--, the
if (), and the &= will now have to wait till the cacheline is
transferred back (as the cacheline will directly be marked as modified
on the attempted locker). All the while other backends will also try to
acquire the pin, causing the same kind of trouble.
Yes, that is the possibility of "cache line stealing while holding the
lock" that I was talking about. It looks like we are in wild agreement
(as Kevin usually puts it).
If this'd instead be written as
ret = pg_atomic_fetch_sub_u32(&buf->state, 1);

if (ret & BM_PIN_COUNT_WAITER)
{
    pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
    /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
}
There are atomic AND and OR functions too, at least in the GCC built-in
parts. We might be able to come up with pg_atomic_...() versions of them
and avoid the race part.
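For reference, the builtins in question look roughly like this (illustrative
only; buf_state here is a hypothetical plain uint32, and this by itself does
not address the race discussed further down in the thread):

uint32      old;

/* atomically clear the flag bit, returning the previous value of the word */
old = __sync_fetch_and_and(&buf_state, ~BM_PIN_COUNT_WAITER);

/* atomically set the flag bit */
old = __sync_fetch_and_or(&buf_state, BM_PIN_COUNT_WAITER);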
the whole thing looks entirely different. While there's obviously still
cacheline bouncing, every operation is guaranteed to make progress (on
x86 at least, but even elsewhere we'll never block another locker).
Now unlock is the simpler case, but...
While some locks may be avoidable and some may be replaced by atomic
operations, I believe that we will still be left with some of them.
Optimizing spinlocks, if we can do it in a generic fashion that does not
hurt other platforms, will still give us something.
Regards, Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
As in 200%+ slower.
Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?
Yes.
Ok, if this can be validated, we might have a new case now for which my
suggestion would not be helpful. Reviewed, optimized code with short critical
sections and no hotspots by design could indeed be an exception where slocks
should be kept as they are.
Hm, ok. Any chance you have profiles from back then?
IIUC I had shared all relevant data on the list. Does this help?
/messages/by-id/4FE9EB27.9020502@schokola.de
Thanks, Nils
On 2015-06-10 11:51:06 -0400, Jan Wieck wrote:
ret = pg_atomic_fetch_sub_u32(&buf->state, 1);

if (ret & BM_PIN_COUNT_WAITER)
{
    pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
    /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
}
There are atomic AND and OR functions too, at least in the GCC built-in
parts. We might be able to come up with pg_atomic_...() versions of them and
avoid the race part.
The race part isn't actually about that. It's that BM_PIN_COUNT_WAITER
might have been set after the fetch_sub above.
fetch_sub() itself would actually be race-free to unset a flag, if it's
a proper power of two.
While some locks may be avoidable, some may be replaced by atomic
operations, I believe that we will still be left with some of them.
Besides the two xlog.c ones and lwlock.c, which are hot otherwise? I
think we pretty much removed the rest?
Optimizing spinlocks if we can do it in a generic fashion that does not hurt
other platforms will still give us something.
Sure, I'm just doubtful that's easy.
I think we should just gank spinlocks asap. The hard part is removing
them from lwlock.c's slow path and the buffer headers imo. After that we
should imo be fine replacing them with lwlocks.
On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund <andres@anarazel.de> wrote:
I think we should just gank spinlocks asap. The hard part is removing
them from lwlock.c's slow path and the buffer headers imo. After that we
should imo be fine replacing them with lwlocks.
Mmmph. I'm not convinced there's any point in replacing
lightly-contended spinlocks with slower, more-complex lwlocks. But
maybe it's best to stay away from that argument and focus on getting
rid of the spinlocks that are hosing us right now.
I'm not altogether convinced that a simple replacement of spinlocks
with atomics is going to solve this problem. If it does, great. But
that doesn't eliminate spinning; it just moves it from the spinlock
loop to a CAS loop. This is clearly better: when the CAS works,
you're done, as opposed to having to then manipulate the cache line
further and release the spinlock, during any of which the cache line
may get stolen from you. However, I'm worried that it will still be
possible to create the same kinds of performance collapses that we see
with spinlocks with the CAS loops with which we replace them - or even
worse problems, because clearly the spin_delay stuff *works*, and
we're unlikely to incorporate similarly sophisticated logic into every
CAS loop.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-10 13:19:14 -0400, Robert Haas wrote:
On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund <andres@anarazel.de> wrote:
I think we should just gank spinlocks asap. The hard part is removing
them from lwlock.c's slow path and the buffer headers imo. After that we
should imo be fine replacing them with lwlocks.
Mmmph. I'm not convinced there's any point in replacing
lightly-contended spinlocks with slower, more-complex lwlocks. But
maybe it's best to stay away from that argument and focus on getting
rid of the spinlocks that are hosing us right now.
In the uncontended case lwlocks are just as fast as spinlocks now, with
the exception of the local tracking array. They're faster if there's a
mix of read and write lockers.
I'm not altogether convinced that a simple replacement of spinlocks
with atomics is going to solve this problem. If it does, great. But
that doesn't eliminate spinning; it just moves it from the spinlock
loop to a CAS loop.
Well, not necessarily. If you can write your algorithm in a way that
xadd etc are used, instead of a lock cmpxchg, you're actually never
spinning on x86 as it's guaranteed to succeed. I think that's possible
for most of the places that currently lock buffer headers.
(I had a version of the lwlock changes that used xadd for shared lock
acquisition - but the code needed to back out in error cases made things
more complicated, and the benefit on a four socket machine wasn't that
large)
This is clearly better: when the CAS works,
you're done, as opposed to having to then manipulate the cache line
further and release the spinlock, during any of which the cache line
may get stolen from you.
It's not just that, it's also that there's no chance of sleeping while
holding a lock. Which doesn't happen that infrequently in contended, cpu
bound workloads.
However, I'm worried that it will still be possible to create the same
kinds of performance collapses that we see with spinlocks with the CAS
loops with which we place them - or even worse problems, because
clearly the spin_delay stuff *works*, and we're unlikely to
incorporate similarly sophisticated logic into every CAS loop.
I doubt it's really necessary, but it shouldn't be too hard to
generalize the sleeping logic so it could be used with little code in
the individual callsites.
Greetings,
Andres Freund
On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
In the uncontended case lwlocks are just as fast as spinlocks now, with
the exception of the local tracking array. They're faster if there's
differences with read/write lockers.
If nothing else, the spinlock calls are inline, while the lwlock calls
involve a function call. So in the uncontended case I think the
spinlock stuff is like 2 instructions, an atomic and a branch.
I'm not altogether convinced that a simple replacement of spinlocks
with atomics is going to solve this problem. If it does, great. But
that doesn't eliminate spinning; it just moves it from the spinlock
loop to a CAS loop.
Well, not necessarily. If you can write your algorithm in a way that
xadd etc are used, instead of a lock cmpxchg, you're actually never
spinning on x86 as it's guaranteed to succeed. I think that's possible
for most of the places that currently lock buffer headers.
Well, it will succeed by looping inside the instruction, I suppose. But OK.
(I had a version of the lwlock changes that used xadd for shared lock
acquisition - but the code needed to back out in error cases made things
more complicated, and the benefit on a four socket machine wasn't that
large)
Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
try your patch there, bound to varying numbers of sockets.
This is clearly better: when the CAS works,
you're done, as opposed to having to then manipulate the cache line
further and release the spinlock, during any of which the cache line
may get stolen from you.
It's not just that, it's also that there's no chance of sleeping while
holding a lock. Which doesn't happen that infrequently in contended, cpu
bound workloads.
That's a really good point.
However, I'm worried that it will still be possible to create the same
kinds of performance collapses that we see with spinlocks with the CAS
loops with which we place them - or even worse problems, because
clearly the spin_delay stuff *works*, and we're unlikely to
incorporate similarly sophisticated logic into every CAS loop.
I doubt it's really necessary, but it shouldn't be too hard to
generalize the sleeping logic so it could be used with little code in
the individual callsites.
We should probably experiment with it to see whether it is needed. I
am fine with leaving it out if it isn't, but it's worth testing.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-10 13:52:14 -0400, Robert Haas wrote:
On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund <andres@anarazel.de> wrote:
Well, not necessarily. If you can write your algorithm in a way that
xadd etc are used, instead of a lock cmpxchg, you're actually never
spinning on x86 as it's guaranteed to succeed. I think that's possible
for most of the places that currently lock buffer headers.
Well, it will succeed by looping inside the instruction, I suppose. But OK.
On x86 atomic ops hold the bus lock for a short while - that's why
they're that expensive - and in that case you can directly do useful
work (xadd) or just return after reading the current value if there was
a concurrent change (cmpxchg). Afaik there's a more fundamental
difference than one variant just doing the retry in microcode. It's hard
to get definitive answers to that.
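In builtin terms the contrast is roughly this (a sketch; the counter is just
an example variable):

uint32      counter = 0;
uint32      old;

/* xadd-style: a single locked instruction that always makes progress */
old = __sync_fetch_and_add(&counter, 1);

/* cmpxchg-style: the caller may have to retry if the value changed underneath */
do
{
    old = counter;
} while (!__sync_bool_compare_and_swap(&counter, old, old + 1));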
(I had a version of the lwlock changes that used xadd for shared lock
acquisition - but the code needed to back out in error cases made things
more complicated, and the benefit on a four socket machine wasn't that
large)
Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
try your patch there, bound to varying numbers of sockets.
It'd be a significant amount of work to rebase it on top of current HEAD. I
guess the easiest thing would be to try an older version of the patch
with the xadd in place, and use a tree from back then.
Greetings,
Andres Freund
On Wed, Jun 10, 2015 at 1:58 PM, Andres Freund <andres@anarazel.de> wrote:
Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
try your patch there, bound to varying numbers of sockets.
It'd be a significant amount of work to rebase it on top of current HEAD. I
guess the easiest thing would be to try an older version of the patch
with the xadd in place, and use a tree from back then.
We could do that, I guess. But we've made other significant
improvements in the meantime, so...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The whole thing turns out to be based on wrong baseline data, taken with
a pgbench client running from a remote machine. It all started out from
an investigation against 9.3. Curiously enough, the s_lock() problem
that existed in 9.3 has a very similar effect on throughput as a network
bottleneck has on 9.5.
Sorry for the noise, Jan
On 06/10/2015 09:18 AM, Jan Wieck wrote:
Hi,
I think I may have found one of the problems PostgreSQL has on machines
with many NUMA nodes. I am not yet sure what exactly happens on the NUMA
bus, but there seems to be a tipping point at which the spinlock
concurrency wreaks havoc and the performance of the database collapses.
On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total,
a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput
then takes a very sharp dive and reaches around 20,000 TPS at 120
clients. It never recovers from there.
The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".
The 8 socket machine in question scales to over 350,000 TPS.
The patch is meant to demonstrate this effect only. It has a negative
performance impact on smaller machines and client counts < #cores, so
the real solution will probably look much different. But I thought it
would be good to share this and start the discussion about reevaluating
the spinlock code before PGCon.
Regards, Jan
--
Jan Wieck
Senior Software Engineer
http://slony.info
On Sun, Jun 14, 2015 at 9:52 AM, Jan Wieck <jan@wi3ck.info> wrote:
The whole thing turns out to be based on wrong baseline data, taken with a
pgbench client running from a remote machine. It all started out from an
investigation against 9.3. Curiously enough, the s_lock() problem that
existed in 9.3 has a very similar effect on throughput as a network
bottleneck has on 9.5.
So, changing max_spins_per_delay no longer helps on 9.5?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company