Better LWLocks with compare-and-swap (9.4)

Started by Heikki Linnakangas in May 2013 · 21 messages · pgsql-hackers
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

I've been working on-and-off on the WAL-insert scaling patch. It's in
pretty good shape now, and I'll post it shortly, but one thing I noticed
is that it benefits a lot from using an atomic compare-and-swap
instruction for the contention-critical part.

I realized that we could also use compare-and-swap to make LWLocks scale
better. The LWLock struct is too large to compare-and-swap atomically,
but we can still use CAS to increment/decrement the shared/exclusive
counters, when there's no need to manipulate the wait queue. That would
help with workloads where you have a lot of CPUs, and a lot of backends
need to acquire the same lwlock in shared mode, but there's no real
contention (i.e. few exclusive lockers).

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile,
when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
theory is that after that point all the cores are busy, and processes
start to be sometimes context switched while holding the spinlock, which
kills performance. Has anyone else seen that pattern? Curiously, I don't
see that when connecting pgbench via TCP over localhost, only when
connecting via unix domain sockets. Overall performance is higher over
unix domain sockets, so I guess the TCP layer adds some overhead,
hurting performance, and also affects scheduling somehow, making the
steep drop go away.

Using a compare-and-swap instruction in place of spinning solves the
problem (green line in attached graph). This needs some more testing
with different workloads to make sure it doesn't hurt other scenarios,
but I don't think it will. I'm also waiting for a colleague to test this
on a different machine, where he saw a similar steep drop in performance.

The attached patch is still work-in-progress. There needs to be a
configure test and fallback to spinlock if a CAS instruction is not
available. I used the gcc __sync_val_compare_and_swap() builtin
directly, that needs to be abstracted. Also, in the case that the wait
queue needs to be manipulated, the code spins on the CAS instruction,
but there is no delay mechanism like there is on a regular spinlock;
that needs to be added in somehow.

- Heikki

Attachments:

pgbench-lwlock-cas-local-clients-sets.png (image/png)
lwlock-cas-2.patch (text/x-diff, +165 -45)
#2 Merlin Moncure
mmoncure@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The attached patch is still work-in-progress. There needs to be a configure
test and fallback to spinlock if a CAS instruction is not available. I used
the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
abstracted. Also, in the case that the wait queue needs to be manipulated,
the code spins on the CAS instruction, but there is no delay mechanism like
there is on a regular spinlock; that needs to be added in somehow.

These are really interesting results. Why is the CAS method so much
faster than TAS? Did you see any contention?

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Daniel Farina
daniel@heroku.com
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
- tas
- 99.83% s_lock
- 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
- 46.78% LWLockRelease
GetSnapshotData
+ GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance.

I have. I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.


#4 Daniel Farina
daniel@heroku.com
In reply to: Daniel Farina (#3)
Re: Better LWLocks with compare-and-swap (9.4)

On Wed, May 15, 2013 at 3:08 PM, Daniel Farina <daniel@heroku.com> wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
- tas
- 99.83% s_lock
- 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
- 46.78% LWLockRelease
GetSnapshotData
+ GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance.

I accidentally dropped some important last words from Heikki's mail,
which makes my correspondence pretty bizarre to understand at the
outset. Apologies. He wrote:

[...] Has anyone else seen that pattern?

I have. I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.


#5 Stephen Frost
sfrost@snowman.net
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)

* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:

My theory is that after that point all the cores are busy,
and processes start to be sometimes context switched while holding
the spinlock, which kills performance. Has anyone else seen that
pattern?

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

Curiously, I don't see that when connecting pgbench via TCP
over localhost, only when connecting via unix domain sockets.
Overall performance is higher over unix domain sockets, so I guess
the TCP layer adds some overhead, hurting performance, and also
affects scheduling somehow, making the steep drop go away.

I wonder if the kernel locks around unix domain sockets are helping us
out here, while it's not able to take advantage of such knowledge about
the process that's waiting when it's a TCP connection? Just a hunch.

Thanks,

Stephen

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#5)
Re: Better LWLocks with compare-and-swap (9.4)

Stephen Frost <sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

regards, tom lane


#7 Dickson S. Guedes
listas@guedesoft.net
In reply to: Heikki Linnakangas (#1)
Re: Better LWLocks with compare-and-swap (9.4)


On 13-05-2013 09:50, Heikki Linnakangas wrote:

I've been working on-and-off on the WAL-insert scaling patch. It's
in pretty good shape now, and I'll post it shortly, but one thing I
noticed is that it benefits a lot from using an atomic
compare-and-swap instruction for the contention-critical part.

I realized that we could also use compare-and-swap to make LWLocks
scale better. The LWLock struct is too large to compare-and-swap
atomically, but we can still use CAS to increment/decrement the
shared/exclusive counters, when there's no need to manipulate the
wait queue. That would help with workloads where you have a lot of
CPUs, and a lot of backends need to acquire the same lwlock in
shared mode, but there's no real contention (ie. few exclusive
lockers).

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

I'd like to test this here, but I couldn't reproduce that perf output
on 64-core or 24-core machines. Could you post the changes to
postgresql.conf and the perf arguments that you used?

Thanks!

[]s
- --
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://github.net/guedes - twitter: @guediz


#8 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dickson S. Guedes (#7)
Re: Better LWLocks with compare-and-swap (9.4)

On 18.05.2013 03:52, Dickson S. Guedes wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

I'd like to test this here, but I couldn't reproduce that perf output
on 64-core or 24-core machines. Could you post the changes to
postgresql.conf and the perf arguments that you used?

Sure, here are the non-default postgresql.conf settings:

name | current_setting
----------------------------+-----------------------------------------
application_name | psql
autovacuum | off
checkpoint_segments | 70
config_file | /home/hlinnakangas/data/postgresql.conf
data_directory | /home/hlinnakangas/data
default_text_search_config | pg_catalog.english
hba_file | /home/hlinnakangas/data/pg_hba.conf
ident_file | /home/hlinnakangas/data/pg_ident.conf
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
log_timezone | US/Pacific
logging_collector | on
max_connections | 100
max_stack_depth | 2MB
server_encoding | UTF8
shared_buffers | 2GB
synchronous_commit | off
TimeZone | US/Pacific
transaction_deferrable | off
transaction_isolation | read committed
transaction_read_only | off
wal_buffers | 16MB

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the
default cpu-cycles event didn't work on the box)

- Heikki


#9 Dickson S. Guedes
listas@guedesoft.net
In reply to: Heikki Linnakangas (#8)
Re: Better LWLocks with compare-and-swap (9.4)

On Sun, 2013-05-19 at 09:29 +0300, Heikki Linnakangas wrote:

On 18.05.2013 03:52, Dickson S. Guedes wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when I run "pgbench -S -c64 -j64 -T60 -M prepared" on a
32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

I'd like to test this here, but I couldn't reproduce that perf output
on 64-core or 24-core machines. Could you post the changes to
postgresql.conf and the perf arguments that you used?

Sure, here are the non-default postgresql.conf settings:

Thank you for your information.

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the
default cpu-cycles event didn't work on the box)

Hum, I was supposing that I was doing something wrong, but I'm getting
the same result as before even using your test case, and my results are
still different from yours:

+ 71,27% postgres postgres         [.] AtEOXact_Buffers
+  7,67% postgres postgres         [.] AtEOXact_CatCache
+  6,30% postgres postgres         [.] AllocSetCheck
+  5,34% postgres libc-2.12.so     [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page

It's a 64-core machine with PGDATA on a SAN.

vendor_id : GenuineIntel
cpu family : 6
model : 47
model name : Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz
stepping : 2
cpu MHz : 1064.000
cache size : 24576 KB
physical id : 3
siblings : 16
core id : 24
cpu cores : 8
apicid : 241
initial apicid : 241
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good
xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx
smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes
lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 4255.87
clflush size : 64
cache_alignment : 64
address sizes : 44 bits physical, 48 bits virtual
power management:

Would you like me to test some other configuration to try to simulate
that expected workload?

[]s
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A

#10 Andres Freund
andres@anarazel.de
In reply to: Dickson S. Guedes (#9)
Re: Better LWLocks with compare-and-swap (9.4)

On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:

Hum, I was supposing that I was doing something wrong, but I'm getting
the same result as before even using your test case, and my results are
still different from yours:

+ 71,27% postgres postgres         [.] AtEOXact_Buffers
+  7,67% postgres postgres         [.] AtEOXact_CatCache
+  6,30% postgres postgres         [.] AllocSetCheck
+  5,34% postgres libc-2.12.so     [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably
also --enable-profiling? The former will give completely distorted
performance results...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11 Dickson S. Guedes
listas@guedesoft.net
In reply to: Andres Freund (#10)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, 2013-05-20 at 14:35 +0200, Andres Freund wrote:

On 2013-05-20 09:31:15 -0300, Dickson S. Guedes wrote:

Hum, I was supposing that I was doing something wrong, but I'm getting
the same result as before even using your test case, and my results are
still different from yours:

+ 71,27% postgres postgres         [.] AtEOXact_Buffers
+  7,67% postgres postgres         [.] AtEOXact_CatCache
+  6,30% postgres postgres         [.] AllocSetCheck
+  5,34% postgres libc-2.12.so     [.] __mcount_internal
+  2,14% postgres [kernel.kallsyms][k] activate_page

That looks like you have configured with --enable-cassert and probably
also --enable-profiling? The former will give completely distorted
performance results...

Ah! Wrong PATH, so wrong binaries. Thanks Andres.

--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A

#12 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Merlin Moncure (#2)
Re: Better LWLocks with compare-and-swap (9.4)

On 13.05.2013 17:21, Merlin Moncure wrote:

On Mon, May 13, 2013 at 7:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The attached patch is still work-in-progress. There needs to be a configure
test and fallback to spinlock if a CAS instruction is not available. I used
the gcc __sync_val_compare_and_swap() builtin directly, that needs to be
abstracted. Also, in the case that the wait queue needs to be manipulated,
the code spins on the CAS instruction, but there is no delay mechanism like
there is on a regular spinlock; that needs to be added in somehow.

These are really interesting results. Why is the CAS method so much
faster than TAS?

If my theory is right, the steep fall is caused by backends being
sometimes context switched while holding the spinlock. The CAS method
alleviates that, because the lock is never "held" by anyone. With a
spinlock, if a process gets context switched while holding the lock,
everyone else will have to wait until the process gets context switched
back, and releases the spinlock. With the CAS method, if a process gets
switched in the CAS-loop, it doesn't hurt others.

With the patch, the CAS-loop still spins, just like a spinlock, when the
wait queue has to be manipulated. So when that happens, it will behave
more or less like the spinlock implementation. I'm not too concerned
about optimizing that further; sleeping and signaling sleeping
backends are heavy-weight operations anyway.

Did you see any contention?

With the current spinlock implementation? Yeah, per the profile I
posted, a lot of CPU time was spent spinning, which is a sign of
contention. I didn't see contention with the CAS implementation.

Attached is a new version of the patch. I cleaned it up a lot, added a
configure test, and added a fallback to the good old spinlock
implementation if compare-and-swap is not available. Also, if the CAS
loop needs to spin like a spinlock, it now sleeps after a while and
updates "spins_per_delay" like a regular spinlock.

- Heikki

Attachments:

lwlock-cas-3.patch (text/x-diff, +636 -240)
#13 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#6)
Re: Better LWLocks with compare-and-swap (9.4)

On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:

Stephen Frost <sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#12)
Re: Better LWLocks with compare-and-swap (9.4)
diff --git a/configure.in b/configure.in
index 4ea5699..ff8470e 100644
--- a/configure.in
+++ b/configure.in
@@ -1445,17 +1445,6 @@ fi
AC_CHECK_FUNCS([strtoll strtoq], [break])
AC_CHECK_FUNCS([strtoull strtouq], [break])

-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
- [int lock = 0;
- __sync_lock_test_and_set(&lock, 1);
- __sync_lock_release(&lock);],
- [pgac_cv_gcc_int_atomics="yes"],
- [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
- AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#13)
Re: Better LWLocks with compare-and-swap (9.4)

On 20.05.2013 23:01, Bruce Momjian wrote:

On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:

Stephen Frost<sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

All modern architectures have an atomic compare-and-swap instruction (or
something more powerful that can be used to implement it). That includes
x86, x86-64, ARM, PowerPC, among others.

There are some differences in how wide values can be swapped with it;
386 only supported 32-bit, until Pentium, which added a 64-bit variant.
I used the 64-bit variant in the patch, but for lwlocks, we could
actually get away with the 32-bit variant if we packed the booleans and
the shared lock counter more tightly.

- Heikki


#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#14)
Re: Better LWLocks with compare-and-swap (9.4)

On 20.05.2013 23:11, Alvaro Herrera wrote:

diff --git a/configure.in b/configure.in
index 4ea5699..ff8470e 100644
--- a/configure.in
+++ b/configure.in
@@ -1445,17 +1445,6 @@ fi
AC_CHECK_FUNCS([strtoll strtoq], [break])
AC_CHECK_FUNCS([strtoull strtouq], [break])

-AC_CACHE_CHECK([for builtin locking functions], pgac_cv_gcc_int_atomics,
-[AC_TRY_LINK([],
- [int lock = 0;
- __sync_lock_test_and_set(&lock, 1);
- __sync_lock_release(&lock);],
- [pgac_cv_gcc_int_atomics="yes"],
- [pgac_cv_gcc_int_atomics="no"])])
-if test x"$pgac_cv_gcc_int_atomics" = x"yes"; then
- AC_DEFINE(HAVE_GCC_INT_ATOMICS, 1, [Define to 1 if you have __sync_lock_test_and_set(int *) and friends.])
-fi
-

Careful here --- s_lock.h has some code conditional on
HAVE_GCC_INT_ATOMICS which your patch is not touching, yet it is
removing the definition, unless I'm misreading.

Thanks, good catch. I renamed HAVE_GCC_INT_ATOMICS to
HAVE_GCC_INT_TEST_AND_SET because "atomics" seems too generic when we
also test for __sync_val_compare_and_swap(p, oldval, newval).

- Heikki


#17 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#15)
Re: Better LWLocks with compare-and-swap (9.4)

On Mon, May 20, 2013 at 11:16:41PM +0300, Heikki Linnakangas wrote:

On 20.05.2013 23:01, Bruce Momjian wrote:

On Thu, May 16, 2013 at 12:08:40PM -0400, Tom Lane wrote:

Stephen Frost<sfrost@snowman.net> writes:

Isn't this the same issue which has prompted multiple people to propose
(sometimes with code, as I recall) to rip out our internal spinlock
system and replace it with kernel-backed calls which do it better,
specifically by dealing with issues like the above? Have you seen those
threads in the past? Any thoughts about moving in that direction?

All of the proposals of that sort that I've seen had a flavor of
"my OS is the only one that matters". While I don't object to
platform-dependent implementations of spinlocks as such, they're not
much of a cure for a generic performance issue.

Uh, is this an x86-64-only optimization? Seems so.

All modern architectures have an atomic compare-and-swap instruction
(or something more powerful that can be used to implement it). That
includes x86, x86-64, ARM, PowerPC, among others.

There are some differences in how wide values can be swapped with
it; 386 only supported 32-bit, until Pentium, which added a 64-bit
variant. I used the 64-bit variant in the patch, but for lwlocks, we
could actually get away with the 32-bit variant if we packed the
booleans and the shared lock counter more tightly.

So we are going to need to add this kind of assembly language
optimization for every CPU type?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


#18 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Daniel Farina (#3)
Re: Better LWLocks with compare-and-swap (9.4)

On 16.05.2013 01:08, Daniel Farina wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
- tas
- 99.83% s_lock
- 53.22% LWLockAcquire
+ 99.87% GetSnapshotData
- 46.78% LWLockRelease
GetSnapshotData
+ GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that after that point all the cores are busy, and processes start to be
sometimes context switched while holding the spinlock, which kills
performance.

I have. I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature.

Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

+#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
+
  static __inline__ int
  tas(volatile slock_t *lock)
  {

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

/*
* On Opteron, using a non-locking test before the locking instruction
* is a huge loss. On EM64T, it appears to be a wash or small loss,
* so we needn't bother to try to distinguish the sub-architectures.
*/

On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to only attempt to acquire a lock when it is free.

Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons.

Perhaps we should just sleep earlier, ie. lower MAX_SPINS_PER_DELAY.
That way, even if each TAS_SPIN test is very expensive, we don't spend
too much time spinning if it's really busy, or held by a process that is
sleeping.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#18)
Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))

On 21.05.2013 00:20, Heikki Linnakangas wrote:

On 16.05.2013 01:08, Daniel Farina wrote:

On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

pgbench -S is such a workload. With 9.3beta1, I'm seeing this
profile, when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux
machine:

-  64.09%  postgres  postgres           [.] tas
   - tas
      - 99.83% s_lock
         - 53.22% LWLockAcquire
            + 99.87% GetSnapshotData
         - 46.78% LWLockRelease
              GetSnapshotData
            + GetTransactionSnapshot
+ 2.97% postgres postgres [.] tas
+ 1.53% postgres libc-2.13.so [.] 0x119873
+ 1.44% postgres postgres [.] GetSnapshotData
+ 1.29% postgres [kernel.kallsyms] [k] arch_local_irq_enable
+ 1.18% postgres postgres [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My
theory
is that after that point all the cores are busy, and processes start
to be
sometimes context switched while holding the spinlock, which kills
performance.

I have, I also used linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> the number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature

Stop the press! I'm getting the same speedup on that 32-core box I got
with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
static __inline__ int
tas(volatile slock_t *lock)
{

So, on this system, doing a non-locked test before the locked xchg
instruction while spinning is a very good idea. That contradicts the
testing that was done earlier when the x86-64 implementation was added,
as we have this comment in the tas() implementation:

/*
* On Opteron, using a non-locking test before the locking instruction
* is a huge loss. On EM64T, it appears to be a wash or small loss,
* so we needn't bother to try to distinguish the sub-architectures.
*/

On my test system, the non-locking test is a big win. I tested this
because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/.
It says very explicitly that the non-locking test is a good idea:

Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait
loops is attempting to spin on an atomic instruction instead of
spinning on a volatile read. Spinning on a dirty read instead of
attempting to acquire a lock consumes less time and resources. This
allows an application to only attempt to acquire a lock when it
is free.

Now, I'm not sure what to do about this. If we put the non-locking test
in there, according to the previous testing that would be a huge loss on
Opterons.

I think we should apply the attached patch to put a non-locked test in
TAS_SPIN() on x86_64.

Tom tested this back in 2011 on a 32-core Opteron [1] and found little
difference. And on an 80-core Xeon (160 with hyperthreading) system, he
saw a quite significant gain from it [2]. Reading the thread, I don't
understand why it wasn't applied to x86_64 back then. Maybe it was for
the fear of having a negative impact on performance on old Opterons, but
I strongly suspect that the performance would be OK even on old Opterons
if you only do the non-locked test in TAS_SPIN() and not in TAS(). Back
when the comment about poor performance with a non-locking test on
Opteron was written, we used the same TAS() macro for the first attempt
and while spinning.

I tried to find some sort of an authoritative guide from AMD that would
say how spinlocks should be implemented, but didn't find any. The Intel
guide I linked above says we should apply the patch. Linux uses a ticket
spinlock thingie nowadays, which is slightly different, but if I'm
reading the older pre-ticket version correctly, they used to do the
same. Ie. non-locking test while spinning, but not before the first
attempt to grab the lock. Same in FreeBSD.

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.

[1]: /messages/by-id/8292.1314641721@sss.pgh.pa.us
[2]: /messages/by-id/25989.1314734752@sss.pgh.pa.us

- Heikki

Attachments:

non-locked-tas-spin-x86_64.patch (text/x-diff, +9 -0)
#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#19)
Re: Spinlock implementation on x86_64 (was Re: Better LWLocks with compare-and-swap (9.4))

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

So, my plan is to apply the attached non-locked-tas-spin-x86_64.patch to
master. But I would love to get feedback from people running different
x86_64 hardware.

Surely this patch should update the existing comment at line 209? Or at
least point out that a non-locked test in TAS_SPIN is not the same as a
non-locked test in tas() itself.

Other than the commenting, I have no objection to this. I think you're
probably right that the old tests in which this looked like a bad idea
were adding the unlocked test to tas() itself, not only to the spin case.

regards, tom lane


#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#20)