Improving spin-lock implementation on ARM.

krunalbauskar@gmail.com

over 5 years ago

Improving spin-lock implementation on ARM.
------------------------------------------------------------

* Spin-Lock is known to have a significant effect on performance
with increasing scalability.

* Existing Spin-Lock implementation for ARM is sub-optimal due to
use of TAS (test-and-set).

* TAS is implemented on ARM as load-store, so even if the lock is not free,
the store operation will execute and replace the value with the same value.
This redundant operation (mainly the store) is costly.

* CAS is implemented on ARM as load-check-store-check, which means that if
the lock is not free, the check after the load causes the loop to
return, thereby saving the costlier store operation. [1]

* x86 uses an optimized xchg operation.
ARM also started supporting this (using the Large System Extensions) with
ARM-v8.1, but since it is not supported on ARM-v8.0, GCC by default tends
to emit the more generic load-store assembly code.

* From gcc-9.4 onwards there is support for outline-atomics, which can emit
both variants of the code (load-store and cas/swp) and pick the proper
variant at run time based on the underlying architecture. However, a lot
of distros still don't ship GCC-9.4 as the default compiler.

* In light of this, we would like to propose a CAS-based approach. Our
local testing has shown improvements in the range of 10-40%
(graph attached).

* The patch enables the CAS-based approach if CAS is supported, depending on
the existing compile-time flag HAVE_GCC__ATOMIC_INT32_CAS.
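
As an illustration of the two approaches described in the list above, here is a minimal sketch using GCC's atomic builtins. The names slock_t, tas_swap, tas_cas and spin_unlock are hypothetical and chosen for this example, and the choice of builtins and memory orderings is an assumption, not necessarily what the attached patch does. Both acquire functions return 0 when the lock was acquired and nonzero when it was already held.

```c
#include <stdbool.h>

typedef int slock_t;            /* hypothetical lock word: 0 = free, 1 = held */

/*
 * TAS flavor: unconditionally swaps 1 into the lock word and returns the
 * previous value, so a store is attempted even when the lock is held.
 */
static inline int
tas_swap(volatile slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

/*
 * CAS flavor: stores 1 only if the lock word currently holds 0.
 * The builtin returns true on success, hence the negation.
 */
static inline int
tas_cas(volatile slock_t *lock)
{
    slock_t expected = 0;

    return !__atomic_compare_exchange_n(lock, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}

/* Releases the lock by writing 0 with release semantics. */
static inline void
spin_unlock(volatile slock_t *lock)
{
    __sync_lock_release(lock);
}
```

Compiling this with "gcc -O2 -S" for an ARMv8.0 target should make it possible to compare the loops generated for the two acquire functions.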

(Thanks to Amit Khandekar for rigorously performance testing this patch
with different combinations).

[1]: https://godbolt.org/z/jqbEsa

P.S: Sorry if I missed any standard pgsql protocol since I am just starting
with pgsql.

---
Krunal Bauskar
#mysqlonarm
Huawei Technologies

Michael Paquier

michael@paquier.xyz

over 5 years ago

In reply to: Krunal Bauskar (#1)

Re: Improving spin-lock implementation on ARM.

On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:

(Thanks to Amit Khandekar for rigorously performance testing this patch
with different combinations).

For the simple-update and tpcb-like graphs, do you have any actual
numbers to share between 128 and 1024 connections? The blue lines
look like they are missing some measurements in-between, so it is hard
to tell if this is an actual improvement or just some lack of data.
--
Michael

krunalbauskar@gmail.com

over 5 years ago

In reply to: Michael Paquier (#2)

Re: Improving spin-lock implementation on ARM.

scalability        baseline             patched
               update     tpcb      update     tpcb
----------------------------------------------------
128            107932    78554      108081    78569
256             82877    64682      101543    73774
512             55174    46494       77886    61105
1024            32267    27020       33170    30597

configuration:
https://github.com/mysqlonarm/benchmark-suites/blob/master/pgsql-pbench/conf/pgsql.cnf/postgresql.conf

On Thu, 26 Nov 2020 at 10:36, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:

(Thanks to Amit Khandekar for rigorously performance testing this patch
with different combinations).

For the simple-update and tpcb-like graphs, do you have any actual
numbers to share between 128 and 1024 connections? The blue lines
look like they are missing some measurements in-between, so it is hard
to tell if this is an actual improvement or just some lack of data.
--
Michael

--
Regards,
Krunal Bauskar

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Michael Paquier (#2)

Re: Improving spin-lock implementation on ARM.

Michael Paquier <michael@paquier.xyz> writes:

On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:

(Thanks to Amit Khandekar for rigorously performance testing this patch
with different combinations).

For the simple-update and tpcb-like graphs, do you have any actual
numbers to share between 128 and 1024 connections?

Also, exactly what hardware/software platform were these curves
obtained on?

regards, tom lane

krunalbauskar@gmail.com

over 5 years ago

In reply to: Tom Lane (#4)

Re: Improving spin-lock implementation on ARM.

On Thu, 26 Nov 2020 at 10:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michael Paquier <michael@paquier.xyz> writes:

On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:

(Thanks to Amit Khandekar for rigorously performance testing this patch
with different combinations).

For the simple-update and tpcb-like graphs, do you have any actual
numbers to share between 128 and 1024 connections?

Also, exactly what hardware/software platform were these curves
obtained on?

Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
server and 8 for client) [2 numa nodes]
Storage: 3.2 TB NVMe SSD
OS: CentOS Linux release 7.6
PGSQL: baseline = Release Tag 13.1
Invocation suite:
https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses
pgbench)

regards, tom lane

--
Regards,
Krunal Bauskar

Amit Khandekar

amitdkhan.pg@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#5)

Re: Improving spin-lock implementation on ARM.

On Thu, 26 Nov 2020 at 10:55, Krunal Bauskar <krunalbauskar@gmail.com> wrote:

Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for server and 8 for client) [2 numa nodes]
Storage: 3.2 TB NVMe SSD
OS: CentOS Linux release 7.6
PGSQL: baseline = Release Tag 13.1
Invocation suite: https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses pgbench)

Using the same hardware, attached are my improvement figures, which
are pretty much in line with your figures, except that I did not run
with more than 400 clients. Also, I am getting some
improvement even for select-only workloads with 200-400
clients. For the read-write load, I had seen that the s_lock() contention
occurred when XLogFlush() uses the spinlock. But for the read-only
case, I have not analyzed where the improvement occurred.

The .png files in the attached tar have the graphs for head versus patch.

The GUCs that I changed :

work_mem=64MB
shared_buffers=128GB
maintenance_work_mem = 1GB
min_wal_size = 20GB
max_wal_size = 100GB
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
full_page_writes = on
synchronous_commit = on
effective_io_concurrency = 200
log_checkpoints = on

For backends, 64 CPUs were allotted (covering 2 NUMA nodes), and for
pgbench clients a separate set of 28 CPUs was allotted on a different
socket. The server was pre_warmed().

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 5 years ago

In reply to: Krunal Bauskar (#1)

Re: Improving spin-lock implementation on ARM.

On 26/11/2020 06:30, Krunal Bauskar wrote:

Improving spin-lock implementation on ARM.
------------------------------------------------------------

* Spin-Lock is known to have a significant effect on performance
with increasing scalability.

* Existing Spin-Lock implementation for ARM is sub-optimal due to
use of TAS (test and swap)

* TAS is implemented on ARM as load-store so even if the lock is not free,
store operation will execute to replace the same value.
This redundant operation (mainly store) is costly.

* CAS is implemented on ARM as load-check-store-check that means if the
lock is not free, check operation, post-load will cause the loop to
return there-by saving on costlier store operation. [1]

Can you add some code comments to explain why CAS is cheaper than
TAS on ARM?

Is there some official ARM documentation, like a programmer's reference
manual or something like that, that would show a reference
implementation of a spinlock on ARM? It would be good to refer to an
authoritative source on this.

- Heikki

aekorotkov@gmail.com

over 5 years ago

In reply to: Heikki Linnakangas (#7)

Re: Improving spin-lock implementation on ARM.

On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 26/11/2020 06:30, Krunal Bauskar wrote:

Improving spin-lock implementation on ARM.
------------------------------------------------------------

* Spin-Lock is known to have a significant effect on performance
with increasing scalability.

* Existing Spin-Lock implementation for ARM is sub-optimal due to
use of TAS (test and swap)

* TAS is implemented on ARM as load-store so even if the lock is not free,
store operation will execute to replace the same value.
This redundant operation (mainly store) is costly.

* CAS is implemented on ARM as load-check-store-check that means if the
lock is not free, check operation, post-load will cause the loop to
return there-by saving on costlier store operation. [1]

Can you add some code comments to explain that why CAS is cheaper than
TAS on ARM?

Is there some official ARM documentation, like a programmer's reference
manual or something like that, that would show a reference
implementation of a spinlock on ARM? It would be good to refer to an
authoritative source on this.

Let me add my 2 cents.

I've compared assembly output of gcc implementations of CAS and TAS.
The sample C-program is attached. I've compiled it on raspberry pi 4
using gcc 9.3.0.

The inner loop of CAS is as follows. If the value loaded by ldxr
doesn't match the expected value, we immediately quit the loop.

.L3:
    ldxr    w3, [x0]
    cmp     w3, w1
    bne     .L4
    stlxr   w4, w2, [x0]
    cbnz    w4, .L3
.L4:

The inner loop of TAS is as follows. It really does "stxr"
unconditionally. In principle, it could check whether the new value
matches the observed value, in which case there is nothing to do, but
it doesn't. Moreover, stxr might fail, and then we can spend multiple
iterations of ldxr/stxr due to concurrent memory access. AFAIK, those
concurrent accesses could reflect not only lock release, but also other
unsuccessful lock attempts. So there is a good chance that the extra
iterations are useless.

.L7:
    ldxr    w2, [x0]
    stxr    w3, w1, [x0]
    cbnz    w3, .L7

I've also googled for spinlock implementation on arm and found a blog
post about spinlock implementation in linux kernel [1]. Surprisingly
it doesn't use the trick to skip stxr if the lock is busy. Instead,
they use some arm-specific power-saving option called WFE.

So, I think it's quite clear that switching from TAS to CAS on arm
would be a win. But there could be other options to do this even
better.

Links
1. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html

------
Regards,
Alexander Korotkov

krunalbauskar@gmail.com

over 5 years ago

In reply to: Heikki Linnakangas (#7)

Re: Improving spin-lock implementation on ARM.

On Thu, 26 Nov 2020 at 16:02, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 26/11/2020 06:30, Krunal Bauskar wrote:

Improving spin-lock implementation on ARM.
------------------------------------------------------------

* Spin-Lock is known to have a significant effect on performance
with increasing scalability.

* Existing Spin-Lock implementation for ARM is sub-optimal due to
use of TAS (test and swap)

* TAS is implemented on ARM as load-store so even if the lock is not free,
store operation will execute to replace the same value.
This redundant operation (mainly store) is costly.

* CAS is implemented on ARM as load-check-store-check that means if the
lock is not free, check operation, post-load will cause the loop to
return there-by saving on costlier store operation. [1]

Can you add some code comments to explain that why CAS is cheaper than
TAS on ARM?

1. As Alexander also pointed out in his follow-up email:
CAS = load value -> check if the value is expected -> if yes then replace
(store new value) -> else exit/break
TAS = load value -> store new value

This means each TAS is converted into 2 operations, LOAD and STORE, and
of course STORE is costlier in terms of latency. CAS optimizes this
with an early check.

Let's look at some micro-benchmarking. I implemented a simple spin-loop
using both approaches and, as expected, with increasing scalability CAS
continues to outperform TAS by a margin of up to 50%.

---- TAS ----
Running 128 parallelism
Elapsed time: 1.34271 s
Running 256 parallelism
Elapsed time: 3.6487 s
Running 512 parallelism
Elapsed time: 11.3725 s
Running 1024 parallelism
Elapsed time: 43.5376 s
---- CAS ----
Running 128 parallelism
Elapsed time: 1.00131 s
Running 256 parallelism
Elapsed time: 2.53202 s
Running 512 parallelism
Elapsed time: 7.66829 s
Running 1024 parallelism
Elapsed time: 22.6294 s
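
A spin-loop micro-benchmark of this kind could look roughly like the sketch below. It is not the program that produced the numbers above: the names (run_spin_test, worker, try_tas, try_cas) and the use of pthreads, timespec_get and GCC builtins are assumptions made for illustration. Each thread repeatedly takes the lock, increments a shared counter, and releases the lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

static volatile int lock_word;      /* 0 = free, 1 = held */
static long shared_counter;         /* only touched while holding lock_word */

/* Both helpers return 0 when the lock was acquired. */
static int try_tas(void) { return __sync_lock_test_and_set(&lock_word, 1); }

static int try_cas(void)
{
    int expected = 0;

    return !__atomic_compare_exchange_n(&lock_word, &expected, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}

struct worker_arg { int (*try_lock)(void); long iters; };

static void *worker(void *p)
{
    struct worker_arg *arg = p;

    for (long i = 0; i < arg->iters; i++)
    {
        while (arg->try_lock() != 0)
            ;                           /* spin until the lock is ours */
        shared_counter++;               /* critical section */
        __sync_lock_release(&lock_word);
    }
    return NULL;
}

/*
 * Runs nthreads workers, each doing iters lock/unlock cycles. Stores the
 * wall-clock time in *elapsed_sec and returns the final counter value,
 * or -1 if allocation or thread creation failed.
 */
static long
run_spin_test(bool use_cas, int nthreads, long iters, double *elapsed_sec)
{
    struct worker_arg arg = { use_cas ? try_cas : try_tas, iters };
    pthread_t *tids = malloc(sizeof(pthread_t) * (size_t) nthreads);
    struct timespec t0, t1;
    int started = 0;

    if (tids == NULL)
        return -1;
    lock_word = 0;
    shared_counter = 0;

    timespec_get(&t0, TIME_UTC);
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, &arg) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    timespec_get(&t1, TIME_UTC);

    free(tids);
    *elapsed_sec = (double) (t1.tv_sec - t0.tv_sec) +
                   (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (started == nthreads) ? shared_counter : -1;
}
```

Calling run_spin_test(false, n, iters, &t) and then run_spin_test(true, n, iters, &t) for increasing n gives a TAS-versus-CAS comparison; the returned counter should equal n * iters if the lock is working.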

This can also be observed in the perf profile:

TAS:
15.57 │44: ldxr w0, [x19]
83.93 │ stxr w1, w21, [x19]

CAS:
81.29 │58: ↓ b.ne cc
....
9.86 │cc: ldaxr w0, [x22]
8.84 │ cmp w0, #0x0
│ ↑ b.ne 58
│ stxr w1, w20, [x22]

*In TAS: STORE is pretty costly.*

2. I have added the needed comment in the patch. Updated patch attached.

----------------------
Thanks for taking a look at this, and do let me know if any more info is
needed.

Is there some official ARM documentation, like a programmer's reference
manual or something like that, that would show a reference
implementation of a spinlock on ARM? It would be good to refer to an
authoritative source on this.

- Heikki

--
Regards,
Krunal Bauskar

#10

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Krunal Bauskar (#5)

Re: Improving spin-lock implementation on ARM.

Krunal Bauskar <krunalbauskar@gmail.com> writes:

On Thu, 26 Nov 2020 at 10:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, exactly what hardware/software platform were these curves
obtained on?

Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
server and 8 for client) [2 numa nodes]
Storage: 3.2 TB NVMe SSD
OS: CentOS Linux release 7.6
PGSQL: baseline = Release Tag 13.1

Hmm, might not be the sort of hardware ordinary mortals can get their
hands on. What's likely to be far more common ARM64 hardware in the
near future is Apple's new gear. So I thought I'd try this on the new
M1 mini I just got.

... and, after retrieving my jaw from the floor, I present the
attached. Apple's chips evidently like this style of spinlock a LOT
better. The difference is so remarkable that I wonder if I made a
mistake somewhere. Can anyone else replicate these results?

Test conditions are absolutely brain dead:

Today's HEAD (dcfff74fb), no special build options

All server parameters are out-of-the-box defaults, except
I had to raise max_connections for the larger client counts

pgbench scale factor 100

Read-only tests are like
pgbench -S -T 60 -c 32 -j 16 bench
Quoted figure is median of three runs; except for the lowest
client count, results were quite repeatable. (I speculate that
at -c 4, the scheduler might've been doing something funny about
sometimes using the slow cores instead of fast cores.)

Read-write tests are like
pgbench -T 300 -c 16 -j 8 bench
I didn't have the patience to run three full repetitions,
but again the numbers seemed pretty repeatable.

I used -j equal to half -c, except I could not get -j above 128
to work, so the larger client counts have -j 128. Did not try
to run down that problem yet, but I'm probably hitting some ulimit
somewhere. (I did have to raise "ulimit -n" to get these results.)

Anyway, this seems to be a slam-dunk win on M1.

regards, tom lane

#11

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Alexander Korotkov (#8)

Re: Improving spin-lock implementation on ARM.

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Is there some official ARM documentation, like a programmer's reference
manual or something like that, that would show a reference
implementation of a spinlock on ARM? It would be good to refer to an
authoritative source on this.

I've compared assembly output of gcc implementations of CAS and TAS.

FWIW, I see quite different assembly using Apple's clang on their M1
processor. What I get for SpinLockAcquire on HEAD is (lock pointer
initially in x0):

mov x19, x0
mov w8, #1
swpal w8, w8, [x0]
cbz w8, LBB0_2
adrp x1, l_.str@PAGE
add x1, x1, l_.str@PAGEOFF
adrp x3, l___func__.foo@PAGE
add x3, x3, l___func__.foo@PAGEOFF
mov x0, x19
mov w2, #12
bl _s_lock
LBB0_2:
... lock is acquired

while SpinLockRelease is just

stlr wzr, [x19]

With the patch, I get

mov x19, x0
mov w8, #0
mov w9, #1
casa w8, w9, [x0]
cmp w8, #0 ; =0
b.eq LBB0_2
adrp x1, l_.str@PAGE
add x1, x1, l_.str@PAGEOFF
adrp x3, l___func__.foo@PAGE
add x3, x3, l___func__.foo@PAGEOFF
mov x0, x19
mov w2, #12
bl _s_lock
LBB0_2:
... lock is acquired

and SpinLockRelease is the same.

Don't know much of anything about ARM assembly, so I don't
know if these instructions are late-model-only.

regards, tom lane

#12

aekorotkov@gmail.com

over 5 years ago

In reply to: Tom Lane (#10)

Re: Improving spin-lock implementation on ARM.

On Fri, Nov 27, 2020 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Krunal Bauskar <krunalbauskar@gmail.com> writes:

On Thu, 26 Nov 2020 at 10:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, exactly what hardware/software platform were these curves
obtained on?

Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
server and 8 for client) [2 numa nodes]
Storage: 3.2 TB NVMe SSD
OS: CentOS Linux release 7.6
PGSQL: baseline = Release Tag 13.1

Hmm, might not be the sort of hardware ordinary mortals can get their
hands on. What's likely to be far more common ARM64 hardware in the
near future is Apple's new gear. So I thought I'd try this on the new
M1 mini I just got.

... and, after retrieving my jaw from the floor, I present the
attached. Apple's chips evidently like this style of spinlock a LOT
better. The difference is so remarkable that I wonder if I made a
mistake somewhere. Can anyone else replicate these results?

Results look very surprising to me. I didn't expect there would be
any very busy spin-lock when the number of clients is as low as 4.
Especially in read-only pgbench.

I don't have an M1 at hand. Could you do some profiling to identify
the source of such a huge difference?

------
Regards,
Alexander Korotkov

#13

aekorotkov@gmail.com

over 5 years ago

In reply to: Tom Lane (#11)

Re: Improving spin-lock implementation on ARM.

On Fri, Nov 27, 2020 at 2:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Thu, Nov 26, 2020 at 1:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Is there some official ARM documentation, like a programmer's reference
manual or something like that, that would show a reference
implementation of a spinlock on ARM? It would be good to refer to an
authoritative source on this.

I've compared assembly output of gcc implementations of CAS and TAS.

FWIW, I see quite different assembly using Apple's clang on their M1
processor. What I get for SpinLockAcquire on HEAD is (lock pointer
initially in x0):

Yep, arm v8.1 implements single-instruction atomic operations swpal
and casa, which look much more like x86 atomic instructions than
loops of ldxr/stlxr.

So, all the reasoning upthread shouldn't apply here, yet the advantage
is even larger.

------
Regards,
Alexander Korotkov

#14

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Alexander Korotkov (#12)

Re: Improving spin-lock implementation on ARM.

Alexander Korotkov <aekorotkov@gmail.com> writes:

On Fri, Nov 27, 2020 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

... and, after retrieving my jaw from the floor, I present the
attached. Apple's chips evidently like this style of spinlock a LOT
better. The difference is so remarkable that I wonder if I made a
mistake somewhere. Can anyone else replicate these results?

Results look very surprising to me. I didn't expect there would be
any very busy spin-lock when the number of clients is as low as 4.

Yeah, that wasn't making sense to me either. The most likely explanation
seems to be that I messed up the test somehow ... but I don't see where.
So, again, I'm wondering if anyone else can replicate or refute this.
I can't be the only geek around here who sprang for an M1.

regards, tom lane

#15

Michael Paquier

michael@paquier.xyz

over 5 years ago

In reply to: Tom Lane (#14)

Re: Improving spin-lock implementation on ARM.

On Fri, Nov 27, 2020 at 02:50:30AM -0500, Tom Lane wrote:

Yeah, that wasn't making sense to me either. The most likely explanation
seems to be that I messed up the test somehow ... but I don't see where.
So, again, I'm wondering if anyone else can replicate or refute this.

I do find your results extremely surprising not only for 4, but for
all tests with connection numbers lower than 32. With a scale factor
of 100 that's suspiciously a lot of difference.

I can't be the only geek around here who sprang for an M1.

Not planning to buy one here, though everything I have read about it
suggests that it is worth a performance study.
--
Michael

#16

aekorotkov@gmail.com

over 5 years ago

In reply to: Michael Paquier (#15)

Re: Improving spin-lock implementation on ARM.

On Fri, Nov 27, 2020 at 11:55 AM Michael Paquier <michael@paquier.xyz> wrote:

Not planning to buy one here, anything I have read on that tells that
it is worth a performance study.

Another interesting area for experiments is AWS graviton2 instances.
Specification says it supports arm v8.2, so it should have swpal/casa
instructions as well.

------
Regards,
Alexander Korotkov

#17

Peter Eisentraut

peter_e@gmx.net

over 5 years ago

In reply to: Tom Lane (#10)

Re: Improving spin-lock implementation on ARM.

On 2020-11-26 23:55, Tom Lane wrote:

... and, after retrieving my jaw from the floor, I present the
attached. Apple's chips evidently like this style of spinlock a LOT
better. The difference is so remarkable that I wonder if I made a
mistake somewhere. Can anyone else replicate these results?

I tried this on an M1 MacBook Air. I cannot reproduce these results.
The unpatched numbers are in the neighborhood of what you showed,
but the patched numbers are only a few percent better, not the
1.5x or 2x change that you showed.

#18

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Peter Eisentraut (#17)

Re: Improving spin-lock implementation on ARM.

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

I tried this on a M1 MacBook Air. I cannot reproduce these results.
The unpatched numbers are about in the neighborhood of what you showed,
but the patched numbers are only about a few percent better, not the
1.5x or 2x change that you showed.

After redoing the test, I can't find any outside-the-noise difference
at all between HEAD and the patch. So clearly, I screwed up yesterday.
The most likely theory is that I managed to measure an assert-enabled
build of HEAD.

It might be that this hardware is capable of showing a difference with a
better-tuned pgbench test, but with an untuned pgbench run, we just aren't
sufficiently sensitive to the spinlock properties. (Which I guess is good
news, really.)

One thing that did hold up is that the thermal performance of this box
is pretty ridiculous. After being beat on for a solid hour, the fan
still hasn't turned on to any noticeable level, and the enclosure is
only a little warm to the touch. Try that with Intel hardware ;-)

regards, tom lane

#19

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Tom Lane (#18)

Re: Improving spin-lock implementation on ARM.

I wrote:

It might be that this hardware is capable of showing a difference with a
better-tuned pgbench test, but with an untuned pgbench run, we just aren't
sufficiently sensitive to the spinlock properties. (Which I guess is good
news, really.)

It occurred to me that if we don't insist on a semi-realistic test case,
it's not that hard to just pound on a spinlock and see what happens.
I made up a simple C function (attached) to repeatedly call
XLogGetLastRemovedSegno, which is basically just a spinlock
acquire/release. Using this as a "transaction":

$ cat bench.sql
select drive_spinlocks(50000);

I get this with HEAD:

$ pgbench -f bench.sql -n -T 60 -c 1 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 127597
latency average = 0.470 ms
tps = 2126.479699 (including connections establishing)
tps = 2126.595015 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 2 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 60 s
number of transactions actually processed: 108979
latency average = 1.101 ms
tps = 1816.051930 (including connections establishing)
tps = 1816.150556 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 4 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 60 s
number of transactions actually processed: 42862
latency average = 5.601 ms
tps = 714.202152 (including connections establishing)
tps = 714.237301 (excluding connections establishing)

(With only 4 high-performance cores, it's probably not
interesting to go further; involving the slower cores
will just confuse matters.) And this with the patch:

$ pgbench -f bench.sql -n -T 60 -c 1 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 130455
latency average = 0.460 ms
tps = 2174.098284 (including connections establishing)
tps = 2174.217097 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 2 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 60 s
number of transactions actually processed: 51533
latency average = 2.329 ms
tps = 858.765176 (including connections establishing)
tps = 858.811132 (excluding connections establishing)

$ pgbench -f bench.sql -n -T 60 -c 4 bench
transaction type: bench.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 1
duration: 60 s
number of transactions actually processed: 31154
latency average = 7.705 ms
tps = 519.116788 (including connections establishing)
tps = 519.144375 (excluding connections establishing)

So at least on Apple's hardware, it seems like the CAS
implementation might be a shade faster when uncontended,
but it's very clearly worse when there is contention for
the spinlock. That's interesting, because the argument
that CAS should involve strictly less work seems valid ...
but that's what I'm getting.

It might be useful to try this on other ARM platforms,
but I lack the energy right now (plus the only other
thing I've got is a Raspberry Pi, which might not be
something we particularly care about performance-wise).

regards, tom lane

#20

aekorotkov@gmail.com

over 5 years ago

In reply to: Tom Lane (#19)

Re: Improving spin-lock implementation on ARM.

On Sat, Nov 28, 2020 at 5:36 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So at least on Apple's hardware, it seems like the CAS
implementation might be a shade faster when uncontended,
but it's very clearly worse when there is contention for
the spinlock. That's interesting, because the argument
that CAS should involve strictly less work seems valid ...
but that's what I'm getting.

It might be useful to try this on other ARM platforms,
but I lack the energy right now (plus the only other
thing I've got is a Raspberry Pi, which might not be
something we particularly care about performance-wise).

I guess that might depend on the implementation of CAS and TAS. I bet
using CAS in a spinlock gives an advantage when ldxr/stxr are used, but
not when swpal/casa are used. I found out that I can force clang to
use swpal/casa by setting "-march=armv8-a+lse". I'm going to run
some experiments on a multicore AWS graviton2 instance with different
atomic implementations.
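
To check which flavor of atomics a given build was allowed to use, one option is the ACLE feature macro __ARM_FEATURE_ATOMICS, which, to my understanding, compilers define when LSE instructions such as swpal/casa may be emitted directly (for example with "-march=armv8-a+lse"). A minimal sketch; the function names are hypothetical:

```c
/*
 * Returns 1 if this translation unit was compiled with LSE atomics
 * enabled (e.g. -march=armv8-a+lse), 0 otherwise.
 */
static int
built_with_lse(void)
{
#ifdef __ARM_FEATURE_ATOMICS
    return 1;
#else
    return 0;
#endif
}

/*
 * Atomic swap whose generated code can be inspected with -S: a single
 * swp-family instruction with LSE, an ldxr/stxr-style loop without it.
 * Returns the previous value of the lock word.
 */
static int
swap_in_one(volatile int *lock)
{
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE);
}
```

Note that with outline-atomics the macro would presumably stay undefined even though LSE instructions may still be selected at run time, so the assembly output remains the more reliable check.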

------
Regards,
Alexander Korotkov

#21

aekorotkov@gmail.com

over 5 years ago

In reply to: Alexander Korotkov (#20)

#22

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#1)

#23

krunalbauskar@gmail.com

over 5 years ago

In reply to: Alexander Korotkov (#21)

#24

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Krunal Bauskar (#23)

#25

krunalbauskar@gmail.com

over 5 years ago

In reply to: Tom Lane (#24)

#26

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Krunal Bauskar (#25)

#27

krunalbauskar@gmail.com

over 5 years ago

In reply to: Tom Lane (#26)

#28

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#27)

#29

aekorotkov@gmail.com

over 5 years ago

In reply to: Tom Lane (#26)

#30

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Alexander Korotkov (#29)

#31

aekorotkov@gmail.com

over 5 years ago

In reply to: Tom Lane (#30)

#32

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#23)

#33

krunalbauskar@gmail.com

over 5 years ago

In reply to: Alexander Korotkov (#32)

#34

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Alexander Korotkov (#31)

#35

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#33)

#36

aekorotkov@gmail.com

over 5 years ago

In reply to: Tom Lane (#34)

#37

krunalbauskar@gmail.com

over 5 years ago

In reply to: Alexander Korotkov (#35)

#38

Amit Khandekar

amitdkhan.pg@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#37)

#39

krunalbauskar@gmail.com

over 5 years ago

In reply to: Alexander Korotkov (#31)

#40

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#39)

#41

aekorotkov@gmail.com

over 5 years ago

In reply to: Amit Khandekar (#38)

#42

krunalbauskar@gmail.com

over 5 years ago

In reply to: Alexander Korotkov (#40)

#43

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#42)

#44

Zidenberg, Tsahi

tsahee@amazon.com

over 5 years ago

In reply to: Alexander Korotkov (#41)

#45

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Alexander Korotkov (#43)

#46

aekorotkov@gmail.com

over 5 years ago

In reply to: Zidenberg, Tsahi (#44)

#47

Zidenberg, Tsahi

tsahee@amazon.com

over 5 years ago

In reply to: Alexander Korotkov (#46)

#48

krunalbauskar@gmail.com

over 5 years ago

In reply to: Tom Lane (#45)

#49

krunalbauskar@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#48)

#50

tgl@sss.pgh.pa.us

over 5 years ago

In reply to: Krunal Bauskar (#49)

#51

aekorotkov@gmail.com

over 5 years ago

In reply to: Tom Lane (#50)

#52

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#48)

#53

aekorotkov@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#48)

#54

Amit Khandekar

amitdkhan.pg@gmail.com

over 5 years ago

In reply to: Alexander Korotkov (#53)

#55

krunalbauskar@gmail.com

over 5 years ago

In reply to: Tom Lane (#50)

#56

krunalbauskar@gmail.com

over 5 years ago

In reply to: Krunal Bauskar (#55)

#57