why roll-your-own s_lock? / improving scalability
Hi,
I am currently trying to understand what looks like really bad scalability of
9.1.3 on a 64core 512GB RAM system: the system runs OK when at 30% usr, but only
marginal amounts of additional load seem to push it to 70% and the application
becomes highly unresponsive.
My current understanding basically matches the issues being addressed by various
9.2 improvements, well summarized in
http://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU-performance-in-9.2.pdf
An additional aspect is that, in order to address the latent risk of data loss &
corruption with WBCs and async replication, we have deliberately moved the db
from a similar system with WB cached storage to ssd based storage without a WBC,
which, by design, has (in the best WBC case) approx. 100x higher latencies, but
much higher sustained throughput.
On the new system, even with an "acceptable" load of 30% user, oprofile makes
significant lock contention apparent:
opreport --symbols --merge tgid -l /mnt/db1/hdd/pgsql-9.1/bin/postgres
Profiling through timer interrupt
samples % image name symbol name
30240 27.9720 postgres s_lock
5069 4.6888 postgres GetSnapshotData
3743 3.4623 postgres AllocSetAlloc
3167 2.9295 libc-2.12.so strcoll_l
2662 2.4624 postgres SearchCatCache
2495 2.3079 postgres hash_search_with_hash_value
2143 1.9823 postgres nocachegetattr
1860 1.7205 postgres LWLockAcquire
1642 1.5189 postgres base_yyparse
1604 1.4837 libc-2.12.so __strcmp_sse42
1543 1.4273 libc-2.12.so __strlen_sse42
1156 1.0693 libc-2.12.so memcpy
Unfortunately I don't have profiling data for the high-load / contention
condition yet, but I fear the picture will be worse and pointing in the same
direction.
<pure speculation>
In particular, the _impression_ is that lock contention could also be related to
I/O latencies, making me fear that cases could exist where spin locks are being
held while blocking on I/O.
</pure speculation>
Looking at the code, it appears to me that the roll-your-own s_lock code cannot
handle a couple of cases, for instance it will also spin when the lock holder is
not running at all or blocking on IO (which could even be implicit, e.g. for a
page flush). These issues have long been addressed by adaptive mutexes and futexes.
Also, the s_lock code tries to be somewhat adaptive using spins_per_delay (when
having spun for long, but not blocked, spin even longer in the future), which
appears to me to have the potential of becoming highly counter-productive.
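For illustration, the spin-then-sleep pattern in question looks roughly like
this. This is a simplified sketch; the names and constants here are mine, not
the actual PostgreSQL symbols:

```c
#include <sched.h>
#include <unistd.h>

/* Simplified sketch of a test-and-set spinlock with a spin-then-sleep
 * fallback, in the spirit of (but not identical to) PostgreSQL's s_lock.
 * All identifiers here are illustrative. */

typedef volatile int slock_t;

#define SPINS_BEFORE_SLEEP 100   /* stand-in for spins_per_delay */

static void demo_spin_lock(slock_t *lock)
{
    int spins = 0;
    /* __sync_lock_test_and_set is the GCC atomic test-and-set primitive */
    while (__sync_lock_test_and_set(lock, 1))
    {
        if (++spins < SPINS_BEFORE_SLEEP)
            sched_yield();       /* keep spinning; real code uses a pause/delay */
        else
        {
            usleep(1000);        /* back off: sleep 1ms, like pg_usleep() */
            spins = 0;
        }
    }
    /* Lock acquired. The problem described above: if the holder is
     * descheduled or blocked on I/O, waiters keep spinning regardless. */
}

static void demo_spin_unlock(slock_t *lock)
{
    __sync_lock_release(lock);
}

int demo_spinlock_roundtrip(void)
{
    slock_t lock = 0;
    demo_spin_lock(&lock);
    int held = (lock == 1);      /* 1 while held */
    demo_spin_unlock(&lock);
    return held && lock == 0;    /* 1 on success */
}
```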
Now that the scene is set, here's the simple question: Why all this? Why not
simply use posix mutexes which, on modern platforms, will map to efficient
implementations like adaptive mutexes or futexes?
Thanks, Nils
On Tue, Jun 26, 2012 at 12:02 PM, Nils Goroll <slink@schokola.de> wrote:
Now that the scene is set, here's the simple question: Why all this? Why not
simply use posix mutexes which, on modern platforms, will map to efficient
implementations like adaptive mutexes or futexes?
Well, that would introduce a backend dependency on pthreads, which is
unpleasant. Also you'd need to feature test via
_POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between
processes (and configure your mutexes as such when you do). There are
probably other reasons why this can't be done, but I personally don't
know of any.
Also, it's forbidden to do things like invoke i/o in the backend while
holding only a spinlock. As to your larger point, it's an interesting
assertion -- some data to back it up would help.
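Concretely, such a feature test and process-shared setup might look like the
following. This is an illustrative sketch, not PostgreSQL code; the mutex must
live in shared memory and be explicitly marked PTHREAD_PROCESS_SHARED:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: a pthread mutex usable across fork()ed processes. Guarded by the
 * _POSIX_THREAD_PROCESS_SHARED feature test; without the setpshared() call,
 * cross-process behavior is undefined. Illustrative only. */

int demo_shared_mutex(void)
{
#if defined(_POSIX_THREAD_PROCESS_SHARED) && _POSIX_THREAD_PROCESS_SHARED > 0
    pthread_mutex_t *m = mmap(NULL, sizeof(pthread_mutex_t),
                              PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0)
        return -1;               /* platform advertises but refuses support */
    pthread_mutex_init(m, &attr);

    pthread_mutex_lock(m);       /* any child mapping m could contend here */
    pthread_mutex_unlock(m);

    pthread_mutex_destroy(m);
    munmap(m, sizeof(pthread_mutex_t));
    return 0;
#else
    return 1;                    /* platform claims no support at all */
#endif
}
```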
merlin
Nils Goroll <slink@schokola.de> writes:
Now that the scene is set, here's the simple question: Why all this? Why not
simply use posix mutexes which, on modern platforms, will map to efficient
implementations like adaptive mutexes or futexes?
(1) They do not exist everywhere.
(2) There is absolutely no evidence to suggest that they'd make things better.
If someone cared to rectify (2), we could consider how to use them as an
alternative implementation. But if you start with "let's not support
any platforms that don't have this feature", you're going to get a cold
reception.
regards, tom lane
Hi Merlin,
_POSIX_THREAD_PROCESS_SHARED
sure.
Also, it's forbidden to do things like invoke i/o in the backend while
holding only a spinlock. As to your larger point, it's an interesting
assertion -- some data to back it up would help.
Let's see if I can get any. ATM I've only got indications, but no proof.
Nils
But if you start with "let's not support any platforms that don't have this feature"
This will never be my intention.
Nils
On Tue, Jun 26, 2012 at 01:46:06PM -0500, Merlin Moncure wrote:
Well, that would introduce a backend dependency on pthreads, which is
unpleasant. Also you'd need to feature test via
_POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between
processes (and configure your mutexes as such when you do). There are
probably other reasons why this can't be done, but I personally don't
know of any.
And then you have fabulous things like:
https://git.reviewboard.kde.org/r/102145/
(OSX defines _POSIX_THREAD_PROCESS_SHARED but does not actually support
it.)
Seems not very well tested in any case.
It might be worthwhile testing futexes on Linux though, they are
specifically supported on any kind of shared memory (shm/mmap/fork/etc)
and quite well tested.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
Martijn van Oosterhout <kleptog@svana.org> writes:
It might be worthwhile testing futexes on Linux though, they are
specifically supported on any kind of shared memory (shm/mmap/fork/etc)
and quite well tested.
Yeah, a Linux-specific replacement of spinlocks with futexes seems like
a lot safer idea than "let's rely on posix mutexes everywhere". It's
still unproven whether it'd be an improvement, but you could expect to
prove it one way or the other with a well-defined amount of testing.
regards, tom lane
It's
still unproven whether it'd be an improvement, but you could expect to
prove it one way or the other with a well-defined amount of testing.
I've hacked the code to use adaptive pthread mutexes instead of spinlocks; see
attached patch. The patch is for the git head, but it can easily be applied for
9.1.3, which is what I did for my tests.
This had disastrous effects on Solaris because it does not use anything similar
to futexes for PTHREAD_PROCESS_SHARED mutexes (only the _PRIVATE mutexes do
without syscalls for the simple case).
But I was surprised to see that it works relatively well on linux. Here's a
glimpse of my results:
hacked code 9.1.3:
-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 476.873261 (including connections establishing)
tps = 485.964355 (excluding connections establishing)
LOG: received smart shutdown request
LOG: autovacuum launcher shutting down
-bash-4.1$ LOG: shutting down
LOG: database system is shut down
210.58user 78.88system 0:50.64elapsed 571%CPU (0avgtext+0avgdata
1995968maxresident)k
0inputs+1153872outputs (0major+2464649minor)pagefaults 0swaps
original code (vanilla build on amd64) 9.1.3:
-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 499.993685 (including connections establishing)
tps = 510.410883 (excluding connections establishing)
LOG: received smart shutdown request
-bash-4.1$ LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
196.21user 71.38system 0:47.99elapsed 557%CPU (0avgtext+0avgdata
1360800maxresident)k
0inputs+1147904outputs (0major+2375965minor)pagefaults 0swaps
config:
-bash-4.1$ egrep '^[a-z]' /tmp/test_template_data/postgresql.conf
max_connections = 1800 # (change requires restart)
shared_buffers = 10GB # min 128kB
temp_buffers = 64MB # min 800kB
work_mem = 256MB # min 64kB, default 1MB
maintenance_work_mem = 2GB # min 1MB, default 16MB
bgwriter_delay = 10ms # 10-10000ms between rounds
bgwriter_lru_maxpages = 1000 # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 10.0 # 0-10.0 multiplier on buffers scanned/round
wal_level = hot_standby # minimal, archive, or hot_standby
wal_buffers = 64MB # min 32kB, -1 sets based on shared_buffers
commit_delay = 10000 # range 0-100000, in microseconds
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8' # locale for system error message
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 1.5 # same scale as above (default: 4.0)
cpu_tuple_cost = 0.005
cpu_index_tuple_cost = 0.0025
cpu_operator_cost = 0.0001
effective_cache_size = 192GB
So it looks like using pthread_mutexes could at least be an option on Linux.
Using futexes directly could be even cheaper.
As a side note, it looks like I have not expressed myself clearly:
I did not intend to suggest to replace proven, working code (which probably is
the best you can get for some platforms) with posix calls. I apologize for the
provocative question.
Regarding the actual production issue, I did not manage to synthetically provoke
the saturation we are seeing in production using pgbench - I could not even get
anywhere near the production load. So I cannot currently test if reducing the
amount of spinning and waking up exactly one waiter (which is what linux/nptl
pthread_mutex_unlock does) would solve/mitigate the production issue I am
working on, and I'd highly appreciate any pointers in this direction.
Cheers, Nils
Attachments:
experimental_linux_pthread_mutex_instead_of_spinlick.patch (text/plain, +118 -2)
On Wed, Jun 27, 2012 at 12:58:47AM +0200, Nils Goroll wrote:
So it looks like using pthread_mutexes could at least be an option on Linux.
Using futexes directly could be even cheaper.
Note that below this you only have the futex(2) system call. Futexes
require all counter manipulation to happen in userspace, just like now,
so all the per architecture stuff remains. On Linux pthread mutexes
are really just a thin wrapper on top of this.
The futex(2) system call merely provides an interface for handling the
blocking and waking of other processes and releasing locks on process
exit (so everything can still work after a kill -9).
So it's more a replacement for the SysV semaphores than anything else.
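To make that concrete, a minimal Drepper-style lock looks roughly like this
(an illustrative sketch, not proposed code): the counter is manipulated with
userspace atomics, and futex(2) is entered only when there is contention.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

/* Minimal futex-based lock after Drepper's "Futexes Are Tricky":
 * 0 = unlocked, 1 = locked, 2 = locked with waiters. The counter lives in
 * (shared) user memory; the kernel is involved only to block and wake. */

static long sys_futex(uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void futex_lock(uint32_t *f)
{
    uint32_t c = __sync_val_compare_and_swap(f, 0, 1);
    if (c == 0)
        return;                          /* fast path: no syscall at all */
    do {
        /* announce that we wait (state 2), then sleep until woken */
        if (c == 2 || __sync_val_compare_and_swap(f, 1, 2) != 0)
            sys_futex(f, FUTEX_WAIT, 2);
    } while ((c = __sync_val_compare_and_swap(f, 0, 2)) != 0);
}

void futex_unlock(uint32_t *f)
{
    if (__sync_fetch_and_sub(f, 1) != 1) /* state was 2: waiters exist */
    {
        *f = 0;
        sys_futex(f, FUTEX_WAKE, 1);     /* wake exactly one waiter */
    }
}
```

Note that the uncontended lock/unlock pair above touches only the shared
counter, which is exactly the property that makes the fast path cheap.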
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
Using futexes directly could be even cheaper.
Note that below this you only have the futex(2) system call.
I was only referring to the fact that we could save one function and one library
call, which could make a difference for the uncontended case.
On Tue, Jun 26, 2012 at 3:58 PM, Nils Goroll <slink@schokola.de> wrote:
It's
still unproven whether it'd be an improvement, but you could expect to
prove it one way or the other with a well-defined amount of testing.

I've hacked the code to use adaptive pthread mutexes instead of spinlocks.
...
hacked code 9.1.3:
...
tps = 485.964355 (excluding connections establishing)

original code (vanilla build on amd64) 9.1.3:
...
tps = 510.410883 (excluding connections establishing)
It looks like the hacked code is slower than the original. That
doesn't seem so good to me. Am I misreading this?
Also, 20 transactions per connection is not enough of a run to make
any evaluation on.
How many cores are you testing on?
Regarding the actual production issue, I did not manage to synthetically provoke
the saturation we are seeing in production using pgbench - I could not even get
anywhere near the production load.
What metrics/tools are you using to compare the two loads? What is
the production load like?
Each transaction has to update one of ten pgbench_branch rows, so you
can't have more than ten transactions productively active at any given
time, even though you have 768 connections. So you need to jack up
the pgbench scale, or switch to using -N mode.
Also, you should use -M prepared, otherwise you spend more time
parsing and planning the statements than executing them.
Cheers,
Jeff
On Thu, Jun 28, 2012 at 11:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Also, 20 transactions per connection is not enough of a run to make
any evaluation on.
FWIW, I kicked off a looong benchmarking run on this a couple of days
ago on the IBM POWER7 box, testing pgbench -S, regular pgbench, and
pgbench --unlogged-tables at various client counts with and without
the patch; three half-hour test runs for each test configuration. It
should be done tonight and I will post the results once they're in.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
I'll reply to Jeff, with a brief thank-you to Robert at the bottom.
First of all, here's an update:
I have slightly modified the patch; I'll attach what I have at the moment. The
main differences are
- loops around the pthread_mutex calls: As the locking function signature is to
return void at the moment, there is no error handling code in the callers
(so, theoretically, there is a chance of an infinite loop on a
spinlock in the current code if you SIGKILL a spinlock holder (which you
shouldn't do, sure). Using robust mutexes, we could avoid this issue).
Retrying is probably the best we can do without implementing error recovery
in all callers.
- ereport(FATAL,"") instead of assertions, which is really what we should do
(imagine setting PTHREAD_PROCESS_SHARED fails and we still start up)
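The robust-mutex recovery path alluded to above would look roughly like this
(an illustrative sketch, not the patch itself): if a holder dies, even via
kill -9, the next locker gets EOWNERDEAD instead of hanging and can repair the
protected state.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <errno.h>

/* Sketch: a process-shared robust mutex surviving the death of its holder.
 * The child takes the lock and exits without releasing it; the parent then
 * observes EOWNERDEAD and marks the mutex consistent again. */

int demo_robust_mutex(void)
{
    pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(m, &attr);

    pid_t pid = fork();
    if (pid == 0)                       /* child: take the lock and die */
    {
        pthread_mutex_lock(m);
        _exit(0);                       /* exits while holding the mutex */
    }
    waitpid(pid, NULL, 0);

    int rc = pthread_mutex_lock(m);     /* expect EOWNERDEAD, not a hang */
    if (rc == EOWNERDEAD)
        pthread_mutex_consistent(m);    /* declare the state repaired */
    pthread_mutex_unlock(m);
    munmap(m, sizeof(*m));
    return rc;
}
```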
Some insights:
- I noticed that, for the simple pgbench tests I ran, PTHREAD_MUTEX_ADAPTIVE_NP
yielded worse results than PTHREAD_MUTEX_NORMAL, which is somewhat
counterintuitive, because _ADAPTIVE is closer to the current spinlock logic, but
yet syscalling in the first place seems to be more efficient than spinning
a little first and then syscalling (for the contended case).
The increase in usr/sys time for my tests was in the order of 10-20%.
- Also I noticed a general issue with linking to libpthread: My understanding is
that this should also change the code to be reentrant when compiling with
gcc (does anyone know precisely?), which we don't need - we only need the
locking code, unless we want to roll our own futex implementation (see below).
I am not sure if this is really root-caused because I have not fully
understood what is going on, but when compiling with LDFLAGS=-lpthread
for the top level Makefile, usr increases by some 10% for my tests.
The code is more efficient when I simply leave out -lpthread, libpthread
gets linked anyway.
- I had a look at futex sample code, for instance
http://locklessinc.com/articles/mutex_cv_futex/ and Ulrich Drepper's paper, but I must say
at this point
I don't feel ready to roll own futex code for this most critical piece of
code. There is simply too much which can go wrong and major mistakes are very
hard to spot.
I'd very much prefer to use an existing, proven implementation.
At this point, I'd guess pulling in the relevant code from glibc/nptl
would be one of the safest bets, but even this path is risky.
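For reference, choosing between the two mutex flavours compared above is just a
type attribute at init time; _ADAPTIVE spins briefly in userspace before
sleeping, _NORMAL goes to the slow path at once. This is a hypothetical helper,
not part of the patch:

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Illustrative helper: initialize a mutex as either adaptive (glibc's
 * non-portable PTHREAD_MUTEX_ADAPTIVE_NP, where available) or normal.
 * Returns 0 on success, like pthread_mutex_init(). */

int demo_init_mutex(pthread_mutex_t *m, int adaptive)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
#ifdef PTHREAD_MUTEX_ADAPTIVE_NP
    if (adaptive)
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    else
#endif
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
    return pthread_mutex_init(m, &attr);
}
```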
On benchmarks:
With the same pgbench parameters as before, I ended up with comparable results
for unpatched and patched in terms of resource consumption:
Test setup for both:
for i in {1..10} ; do
rsync -av --delete /tmp/test_template_data/ /tmp/data/
/usr/bin/time ./postgres -D /tmp/data -p 55502 & ppid=$!
pid=$(pgrep -P $ppid)
sleep 15
./pgbench -c 256 -t 20 -j 128 -p 55502 postgres
kill $pid
wait $ppid
wait
while pgrep -f 55502 ; do
echo procs still running - hm
sleep 1
done
done
unpatched (bins postgresql-server-91-9.1.3-1PGDG.rhel6.rpm)
-bash-4.1$ grep elapsed /var/tmp/20120627_noslock_check/orig_code_2_perf
34.55user 20.07system 0:25.63elapsed 213%CPU (0avgtext+0avgdata 1360688maxresident)k
35.26user 19.90system 0:25.38elapsed 217%CPU (0avgtext+0avgdata 1360704maxresident)k
38.04user 21.68system 0:26.24elapsed 227%CPU (0avgtext+0avgdata 1360704maxresident)k
36.72user 21.95system 0:27.21elapsed 215%CPU (0avgtext+0avgdata 1360688maxresident)k
37.19user 22.00system 0:26.44elapsed 223%CPU (0avgtext+0avgdata 1360704maxresident)k
37.88user 22.58system 0:25.70elapsed 235%CPU (0avgtext+0avgdata 1360704maxresident)k
35.70user 20.90system 0:25.63elapsed 220%CPU (0avgtext+0avgdata 1360688maxresident)k
40.24user 21.65system 0:26.02elapsed 237%CPU (0avgtext+0avgdata 1360688maxresident)k
44.93user 22.96system 0:26.38elapsed 257%CPU (0avgtext+0avgdata 1360704maxresident)k
38.10user 21.51system 0:26.66elapsed 223%CPU (0avgtext+0avgdata 1360688maxresident)k
-bash-4.1$ grep elapsed /var/tmp/20120627_noslock_check/orig_code_2_perf | tail -10 | sed 's:[^0-9. ]::g' | awk '{ u+=$1; s+=$2; c++;} END { print "avg " u/c " " s/c; }'
avg 37.861 21.52
patched (based upon modified source rpm of the above)
-bash-4.1$ egrep elapsed /var/tmp/20120627_noslock_check/with_slock_6_nocompile_without_top_-lpthread
42.32user 27.16system 0:28.18elapsed 246%CPU (0avgtext+0avgdata 2003488maxresident)k
39.14user 26.31system 0:27.24elapsed 240%CPU (0avgtext+0avgdata 2003504maxresident)k
38.81user 26.17system 0:26.67elapsed 243%CPU (0avgtext+0avgdata 2003520maxresident)k
41.04user 27.80system 0:29.00elapsed 237%CPU (0avgtext+0avgdata 2003520maxresident)k
35.41user 22.85system 0:27.15elapsed 214%CPU (0avgtext+0avgdata 2003504maxresident)k
32.74user 21.87system 0:25.62elapsed 213%CPU (0avgtext+0avgdata 2003504maxresident)k
35.68user 24.86system 0:27.16elapsed 222%CPU (0avgtext+0avgdata 2003520maxresident)k
32.10user 20.18system 0:27.26elapsed 191%CPU (0avgtext+0avgdata 2003504maxresident)k
31.32user 18.67system 0:26.95elapsed 185%CPU (0avgtext+0avgdata 2003488maxresident)k
29.99user 19.78system 0:32.08elapsed 155%CPU (0avgtext+0avgdata 2003504maxresident)k
-bash-4.1$ egrep elapsed /var/tmp/20120627_noslock_check/with_slock_6_nocompile_without_top_-lpthread | sed 's:[^0-9. ]::g' | awk '{ u+=$1; s+=$2; c++;} END { print "avg " u/c " " s/c; }'
avg 35.855 23.565
Hopefully I will get a chance to run this in production soon, unless I get
feedback from anyone with reasons why I shouldn't do this.
On 06/28/12 05:21 PM, Jeff Janes wrote:
It looks like the hacked code is slower than the original. That
doesn't seem so good to me. Am I misreading this?
No, you are right - in a way. This is not about maximizing tps; this is about
maximizing efficiency under load situations which I can't even simulate at the
moment. So what I am looking for is "comparable" resource consumption and
"comparable" tps - but no risk of concurrent spins on locks.
For minimal contention, using pthread_ functions _must_ be slightly slower than
the current s_lock spin code, but they _should_ scale *much* better at high
contention.
The tps values I got for the runs mentioned above are:
## original code
# egrep ^tps orig_code_2_perf | grep excl | tail -10 | tee /dev/tty | awk '{ a+=
$3; c++; } END { print a/c; }'
tps = 607.241375 (excluding connections establishing)
tps = 622.255763 (excluding connections establishing)
tps = 615.397928 (excluding connections establishing)
tps = 632.821217 (excluding connections establishing)
tps = 620.415654 (excluding connections establishing)
tps = 611.083542 (excluding connections establishing)
tps = 631.301615 (excluding connections establishing)
tps = 612.337597 (excluding connections establishing)
tps = 606.433209 (excluding connections establishing)
tps = 574.031095 (excluding connections establishing)
613.332
## patched code
# egrep ^tps with_slock_6_nocompile_without_top_-lpthread | grep excl | tail -10
| tee /dev/tty | awk '{ a+= $3; c++; } END { print a/c; }'
tps = 584.761390 (excluding connections establishing)
tps = 620.994437 (excluding connections establishing)
tps = 630.983695 (excluding connections establishing)
tps = 502.116770 (excluding connections establishing)
tps = 595.879789 (excluding connections establishing)
tps = 679.814563 (excluding connections establishing)
tps = 655.053339 (excluding connections establishing)
tps = 603.453768 (excluding connections establishing)
tps = 679.481280 (excluding connections establishing)
tps = 440.999884 (excluding connections establishing)
599.354
Also, 20 transactions per connection is not enough of a run to make
any evaluation on.
As you can see I've repeated the tests 10 times. I've tested slight variations
as mentioned above, so I was looking for quick results with acceptable variation.
How many cores are you testing on?
64 x AMD64 1.6GHz (4x6262HE in one box)
Regarding the actual production issue, I did not manage to synthetically provoke
the saturation we are seeing in production using pgbench - I could not even get
anywhere near the production load.

What metrics/tools are you using to compare the two loads?
We've got cpu + load avg statistics for the old+new machine and compared values
before/after the migration. The user load presumably is comparable and the main
metric is "users complaining" vs. "users happy".
I wish we had a synthetic benchmark close to the actual load, and I hope that
one of the insights from this will be that the customer should have one.
During what I believe is an overload situation with very high lock contention,
the load avg rises well above 300 and usr+sys well above 80%.
The temporary relief was to move some databases off to other machines.
Interestingly, moving away <10% of the load returned the system to a well
behaved state with usr+sys in the order of 20-30%, which is the main reason why
I believe that this must be a negative scalability issue for situations beyond
some saturation point determined by concurrency on locks.
What is the production load like?
Here's an anonymized excerpt from a pgFouine analysis of 137 seconds worth of
query logs at "average production user load".
Type Count Percentage
SELECT 80,217 27.1
INSERT 6,248 2.1
UPDATE 37,159 12.6
DELETE 4,579 1.5
Queries that took up the most time:
Rank Total duration Times executed Av. duration s Query
1 3m39s 83,667 0.00 COMMIT;
2 54.4s 2 27.18 SELECT ...
3 41.1s 281 0.15 UPDATE ...
4 25.4s 18,960 0.00 UPDATE ...
5 21.9s ...
(the 9th rank is already below 10 seconds total duration)
Each transaction has to update one of ten pgbench_branch rows, so you
can't have more than ten transactions productively active at any given
time, even though you have 768 connections. So you need to jack up
the pgbench scale, or switch to using -N mode.
Sorry for having omitted that detail. I had initialized pgbench with -i -s 100
Also, you should use -M prepared, otherwise you spend more time
parsing and planning the statements than executing them.
Ah, good point, thank you. As you will have noticed, I don't have years' worth of
background with pgbench yet.
On 06/28/12 05:29 PM, Robert Haas wrote:
FWIW, I kicked off a looong benchmarking run on this a couple of days
ago on the IBM POWER7 box, testing pgbench -S, regular pgbench, and
pgbench --unlogged-tables at various client counts with and without
the patch; three half-hour test runs for each test configuration. It
should be done tonight and I will post the results once they're in.
Sounds great! I am really curious.
Nils
Attachments:
0001-experimental-use-pthread-mutexes-instead-of-spinloc.patch (text/plain)
On Friday, June 29, 2012 07:07:11 PM Nils Goroll wrote:
Also, 20 transactions per connection is not enough of a run to make
any evaluation on.

As you can see I've repeated the tests 10 times. I've tested slight
variations as mentioned above, so I was looking for quick results with
acceptable variation.
Running only 20 transactions is still meaningless. Quite often that will mean
that no backends run concurrently, because starting up takes longer than
processing those 20 transactions. You need at the very, very least 10s. Check
out -T.
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
You need at the very, very least 10s.
ok, thanks.
On Fri, Jun 29, 2012 at 12:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On Friday, June 29, 2012 07:07:11 PM Nils Goroll wrote:
Running only 20 transactions is still meaningless. Quite often that will mean
that no backends run concurrently, because starting up takes longer than
processing those 20 transactions. You need at the very, very least 10s. Check
out -T.
yeah. also, standard pgbench is typically very much i/o bound on
typical hardware. it would be much more interesting to see
performance in spinlock heavy workloads -- the OP noted one when
introducing the thread. would it be possible to simulate those
conditions?
merlin
On Fri, Jun 29, 2012 at 1:07 PM, Nils Goroll <slink@schokola.de> wrote:
FWIW, I kicked off a looong benchmarking run on this a couple of days
ago on the IBM POWER7 box, testing pgbench -S, regular pgbench, and
pgbench --unlogged-tables at various client counts with and without
the patch; three half-hour test runs for each test configuration. It
should be done tonight and I will post the results once they're in.

Sounds great! I am really curious.
Here are the results. Each result is the median of three 30-minute
test runs on an IBM POWER7 system with 16 cores, 64 hardware threads.
Configuration was shared_buffers = 8GB, maintenance_work_mem = 1GB,
synchronous_commit = off, checkpoint_segments = 300,
checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
wal_writer_delay = 20ms, log_line_prefix = '%t [%p] '. Lines
beginning with m show performance on master; lines beginning with p
show performance with patch; the following number is the # of clients
used for the test.
Permanent Tables
================
m01 tps = 1364.521373 (including connections establishing)
m08 tps = 9175.281381 (including connections establishing)
m32 tps = 14770.652793 (including connections establishing)
m64 tps = 14183.495875 (including connections establishing)
p01 tps = 1366.447001 (including connections establishing)
p08 tps = 9406.181857 (including connections establishing)
p32 tps = 14608.766540 (including connections establishing)
p64 tps = 14182.576636 (including connections establishing)
Unlogged Tables
===============
m01 tps = 1459.649000 (including connections establishing)
m08 tps = 11872.102025 (including connections establishing)
m32 tps = 32834.258026 (including connections establishing)
m64 tps = 33404.988834 (including connections establishing)
p01 tps = 1481.876584 (including connections establishing)
p08 tps = 11787.657258 (including connections establishing)
p32 tps = 32959.342248 (including connections establishing)
p64 tps = 33672.008244 (including connections establishing)
SELECT-only
===========
m01 tps = 8777.971832 (including connections establishing)
m08 tps = 70695.558964 (including connections establishing)
m32 tps = 201762.696020 (including connections establishing)
m64 tps = 310137.544470 (including connections establishing)
p01 tps = 8914.165586 (including connections establishing)
p08 tps = 71351.501358 (including connections establishing)
p32 tps = 201946.425301 (including connections establishing)
p64 tps = 305627.413716 (including connections establishing)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you, Robert.
as this patch was not targeted towards increasing tps, I am happy to hear
that your benchmarks also suggest that performance is "comparable".
But my main question is: how about resource consumption? For the issue I am
working on, my current working hypothesis is that spinning on locks saturates
resources and brings down overall performance in a high-contention situation.
Do you have any getrusage figures or anything equivalent?
Thanks, Nils
test runs on an IBM POWER7 system with 16 cores, 64 hardware threads.
Could you add the CPU Type / clock speed please?
On Sun, Jul 1, 2012 at 11:13 AM, Nils Goroll <slink@schokola.de> wrote:
as this patch was not targeted towards increasing tps, I am happy to hear
that your benchmarks also suggest that performance is "comparable".

But my main question is: how about resource consumption? For the issue I am
working on, my current working hypothesis is that spinning on locks saturates
resources and brings down overall performance in a high-contention situation.

Do you have any getrusage figures or anything equivalent?
Spinlock contentions cause tps to go down. The fact that tps didn't
change much in this case suggests that either these workloads don't
generate enough spinlock contention to benefit from your patch, or
your patch doesn't meaningfully reduce it, or both. We might need a
test case that is more spinlock-bound to observe an effect.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company