Use a non-locking initial test in TAS_SPIN on AArch64
My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that
showed an enormous amount of s_lock() time going to the
__sync_lock_test_and_set() call in the AArch64 implementation of tas().
Upon closer inspection, I noticed that we don't implement a custom
TAS_SPIN() for this architecture, so I quickly hacked together the attached
patch and ran a couple of benchmarks that stressed the spinlock code. I
found no discussion about TAS_SPIN() on ARM in the archives, but I did
notice that the initial AArch64 support was added [0] before x86_64 started
using a non-locking test [1].
These benchmarks are for a c8g.24xlarge running a select-only pgbench with
256 clients and pg_stat_statements.track_planning enabled.
without the patch:
90.04% postgres [.] s_lock
1.07% pg_stat_statements.so [.] pgss_store
0.59% postgres [.] LWLockRelease
0.56% postgres [.] perform_spin_delay
0.31% [kernel] [k] arch_local_irq_enable
| while (TAS_SPIN(lock))
| {
| perform_spin_delay(&delayStatus);
0.12 |2c: -> bl perform_spin_delay
| tas():
| return __sync_lock_test_and_set(lock, 1);
0.01 |30: swpa w20, w1, [x19]
| s_lock():
99.87 | add x0, sp, #0x28
| while (TAS_SPIN(lock))
0.00 | ^ cbnz w1, 2c
tps = 74135.100891 (without initial connection time)
with the patch:
30.46% postgres [.] s_lock
5.88% postgres [.] perform_spin_delay
4.61% [kernel] [k] arch_local_irq_enable
3.31% [kernel] [k] next_uptodate_page
2.50% postgres [.] hash_search_with_hash_value
| while (TAS_SPIN(lock))
| {
| perform_spin_delay(&delayStatus);
0.63 |2c:+-->add x0, sp, #0x28
0.07 | |-> bl perform_spin_delay
| |while (TAS_SPIN(lock))
0.25 |34:| ldr w0, [x19]
65.19 | +--cbnz w0, 2c
| tas():
| return __sync_lock_test_and_set(lock, 1);
0.00 | swpa w20, w0, [x19]
| s_lock():
33.82 | ^ cbnz w0, 2c
tps = 549462.785554 (without initial connection time)
[0]: https://postgr.es/c/5c7603c
[1]: https://postgr.es/c/b03d196
--
nathan
Attachments:
v1-0001-Use-a-non-locking-initial-test-in-TAS_SPIN-on-AAr.patch (text/plain; charset=us-ascii)
From a69b7d38c40f0fb49f714c49bb45c292e7c8587d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 22 Oct 2024 14:33:39 -0500
Subject: [PATCH v1 1/1] Use a non-locking initial test in TAS_SPIN on AArch64.
---
src/include/storage/s_lock.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index e94ed5f48b..07f343baf8 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -270,6 +270,12 @@ tas(volatile slock_t *lock)
*/
#if defined(__aarch64__)
+/*
+ * On ARM64, it's a win to use a non-locking test before the TAS proper. It
+ * may be a win on 32-bit ARM, too, but nobody's tested it yet.
+ */
+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
#define SPIN_DELAY() spin_delay()
static __inline__ void
--
2.39.5 (Apple Git-154)
Hi~
Upon closer inspection, I noticed that we don't implement a custom
TAS_SPIN() for this architecture, so I quickly hacked together the attached
patch and ran a couple of benchmarks that stressed the spinlock code. I
found no discussion about TAS_SPIN() on ARM in the archives, but I did
notice that the initial AArch64 support was added [0] before x86_64 started
using a non-locking test [1].
It reminds me of a discussion about improving spinlock performance on ARM
in 2020 [0], though the discussion is about CAS and TAS, not TAS_SPIN()
itself.
tps = 74135.100891 (without initial connection time)
tps = 549462.785554 (without initial connection time)
The result looks great, but the discussion in [0] shows that the result may
vary among different ARM chips. Could you provide the chip model of this
test? So that we can do a cross validation of this patch. Not sure if the
compiler version is necessary too. I'm willing to test it on Alibaba Cloud
Yitian 710 if I have time.
[0]: /messages/by-id/CAPpHfdsGqVd6EJ4mr_RZVE5xSiCNBy4MuSvdTrKmTpM0eyWGpg@mail.gmail.com
Regards,
Jingtang
On Wed, Oct 23, 2024 at 11:01:05AM +0800, Jingtang Zhang wrote:
The result looks great, but the discussion in [0] shows that the result may
vary among different ARM chips. Could you provide the chip model of this
test? So that we can do a cross validation of this patch.
This is on a c8g.24xlarge, which is using Neoverse-V2 and Armv9.0-a [0].
I'm willing to test it on Alibaba Cloud Yitian 710 if I have time.
That would be great. I have a couple of Apple M-series machines I can
test, too.
[0]: https://github.com/aws/aws-graviton-getting-started/blob/main/README.md#building-for-graviton
--
nathan
On Wed, Oct 23, 2024 at 09:46:56AM -0500, Nathan Bossart wrote:
I have a couple of Apple M-series machines I can test, too.
After some preliminary tests on an M3, I'm not seeing any gains outside the
noise range. That's not too surprising because it's likely more difficult
to create a lot of spinlock contention on these smaller machines. But, at
the very least, I'm not seeing a regression.
--
nathan
On Tue, Oct 22, 2024 at 02:54:57PM -0500, Nathan Bossart wrote:
My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that
showed an enormous amount of s_lock() time going to the
__sync_lock_test_and_set() call in the AArch64 implementation of tas().
Upon closer inspection, I noticed that we don't implement a custom
TAS_SPIN() for this architecture, so I quickly hacked together the attached
patch and ran a couple of benchmarks that stressed the spinlock code. I
found no discussion about TAS_SPIN() on ARM in the archives, but I did
notice that the initial AArch64 support was added [0] before x86_64 started
using a non-locking test [1].
These benchmarks are for a c8g.24xlarge running a select-only pgbench with
256 clients and pg_stat_statements.track_planning enabled.
without the patch:
[...]
tps = 74135.100891 (without initial connection time)
with the patch:
[...]
tps = 549462.785554 (without initial connection time)
Are there any objections to proceeding with this change? So far, it's been
tested on a c8g.24xlarge and an Apple M3 (which seems to be too small to
show any effect). If anyone has access to a larger ARM machine, additional
testing would be greatly appreciated. I think it would be unfortunate if
this slipped to v19.
--
nathan
Hi,
On 2025-01-08 12:12:19 -0600, Nathan Bossart wrote:
On Tue, Oct 22, 2024 at 02:54:57PM -0500, Nathan Bossart wrote:
My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that
showed an enormous amount of s_lock() time going to the
__sync_lock_test_and_set() call in the AArch64 implementation of tas().
Upon closer inspection, I noticed that we don't implement a custom
TAS_SPIN() for this architecture, so I quickly hacked together the attached
patch and ran a couple of benchmarks that stressed the spinlock code. I
found no discussion about TAS_SPIN() on ARM in the archives, but I did
notice that the initial AArch64 support was added [0] before x86_64 started
using a non-locking test [1].
These benchmarks are for a c8g.24xlarge running a select-only pgbench with
256 clients and pg_stat_statements.track_planning enabled.
without the patch:
[...]
tps = 74135.100891 (without initial connection time)
with the patch:
[...]
tps = 549462.785554 (without initial connection time)
Are there any objections to proceeding with this change? So far, it's been
tested on a c8g.24xlarge and an Apple M3 (which seems to be too small to
show any effect). If anyone has access to a larger ARM machine, additional
testing would be greatly appreciated. I think it would be unfortunate if
this slipped to v19.
Seems reasonable. It's not surprising that spinning with an atomic operation
doesn't scale very well once you have more than a handful of cores.
Of course the whole way locking works in pg_stat_statements is a scalability
disaster, but that's a bigger project to fix.
Out of curiosity, have you measured whether this has a positive effect without
pg_stat_statements? I think it'll e.g. also affect lwlocks, as they also use
perform_spin_delay().
Greetings,
Andres
Nathan Bossart <nathandbossart@gmail.com> writes:
Are there any objections to proceeding with this change? So far, it's been
tested on a c8g.24xlarge and an Apple M3 (which seems to be too small to
show any effect). If anyone has access to a larger ARM machine, additional
testing would be greatly appreciated. I think it would be unfortunate if
this slipped to v19.
I just acquired an M4 Pro, which may also be too small to show any
effect, but perhaps running the test there would at least give us
more confidence that there's not a bad effect. Which test case(s)
would you recommend trying?
regards, tom lane
On Wed, Jan 08, 2025 at 03:23:45PM -0500, Tom Lane wrote:
I just acquired an M4 Pro, which may also be too small to show any
effect, but perhaps running the test there would at least give us
more confidence that there's not a bad effect. Which test case(s)
would you recommend trying?
Thanks! A select-only pgbench with many clients (I used 256 upthread) and
pg_stat_statements.track_planning enabled seems to be a pretty easy way to
stress spinlocks. The same test without track_planning enabled might also
be interesting. I'm looking for a way to stress LWLocks, too, and I will
share here when I either find an existing test or write something of my
own.
--
nathan
On Wed, Jan 08, 2025 at 03:07:24PM -0500, Andres Freund wrote:
Out of curiosity, have you measured whether this has a positive effect without
pg_stat_statements? I think it'll e.g. also affect lwlocks, as they also use
perform_spin_delay().
AFAICT TAS_SPIN() is only used for s_lock(), which doesn't appear to be
used by LWLocks. But I did retry my test from upthread without
pg_stat_statements and was surprised to find a reproducible 4-6%
regression. I'm not seeing any obvious differences in perf, but I do see
that the thread for adding TAS_SPIN() for PPC mentions a regression at
lower contention levels [0]. Perhaps the non-locked test is failing often
enough to hurt performance in this case... Whatever it is, it'll be mighty
frustrating to miss out on a >7x gain because of a 4% regression.
[0]: https://postgr.es/me/26496.1325625436%40sss.pgh.pa.us
--
nathan
Hi,
On 2025-01-08 16:01:19 -0600, Nathan Bossart wrote:
On Wed, Jan 08, 2025 at 03:07:24PM -0500, Andres Freund wrote:
Out of curiosity, have you measured whether this has a positive effect without
pg_stat_statements? I think it'll e.g. also affect lwlocks, as they also use
perform_spin_delay().
AFAICT TAS_SPIN() is only used for s_lock(), which doesn't appear to be
used by LWLocks.
Brainfart on my part, sorry. I was thinking of SPIN_DELAY() for a moment...
But I did retry my test from upthread without pg_stat_statements and was
surprised to find a reproducible 4-6% regression.
Uh, huh. I assume this was readonly pgbench with 256 clients just as you had
tested upthread? I don't think there's any hot spinlock meaningfully involved
in that workload? A r/w workload is a different story, but upthread you
mentioned select-only.
Do you see any spinlock in profiles?
I'm not seeing any obvious differences in perf, but I do see that the thread
for adding TAS_SPIN() for PPC mentions a regression at lower contention
levels [0]. Perhaps the non-locked test is failing often enough to hurt
performance in this case... Whatever it is, it'll be mighty frustrating to
miss out on a >7x gain because of a 4% regression.
I don't think the explanation can be that simple - even with TAS_SPIN defined,
we do try to acquire the lock once without using TAS_SPIN:
#if !defined(S_LOCK)
#define S_LOCK(lock) \
(TAS(lock) ? s_lock((lock), __FILE__, __LINE__, __func__) : 0)
#endif /* S_LOCK */
Only s_lock() then uses TAS_SPIN(lock).
I wonder if you're hitting an extreme case of binary-layout related effects?
I've never seen them at this magnitude though. I'd suggest using either lld
or mold as linker and comparing the numbers for a few
-Wl,--shuffle-sections=$seed seed values.
Greetings,
Andres Freund
On Wed, Jan 08, 2025 at 05:25:24PM -0500, Andres Freund wrote:
On 2025-01-08 16:01:19 -0600, Nathan Bossart wrote:
But I did retry my test from upthread without pg_stat_statements and was
surprised to find a reproducible 4-6% regression.
Uh, huh. I assume this was readonly pgbench with 256 clients just as you had
tested upthread? I don't think there's any hot spinlock meaningfully involved
in that workload? A r/w workload is a different story, but upthread you
mentioned select-only.
Do you see any spinlock in profiles?
Yes, this was using 256 clients. Looking closer, I don't see anything
spinlock related anywhere near the top of perf.
I'm not seeing any obvious differences in perf, but I do see that the thread
for adding TAS_SPIN() for PPC mentions a regression at lower contention
levels [0]. Perhaps the non-locked test is failing often enough to hurt
performance in this case... Whatever it is, it'll be mighty frustrating to
miss out on a >7x gain because of a 4% regression.
I don't think the explanation can be that simple - even with TAS_SPIN defined,
we do try to acquire the lock once without using TAS_SPIN:
#if !defined(S_LOCK)
#define S_LOCK(lock) \
(TAS(lock) ? s_lock((lock), __FILE__, __LINE__, __func__) : 0)
#endif /* S_LOCK */
Only s_lock() then uses TAS_SPIN(lock).
Ah, right. FWIW I tried setting a cap on the number of times we do a
non-locked test, and the results still showed the regression, which seems
to match your intuition here.
I wonder if you're hitting an extreme case of binary-layout related effects?
I've never seen them at this magnitude though. I'd suggest using either lld
or mold as linker and comparing the numbers for a few
-Wl,--shuffle-sections=$seed seed values.
Will do.
--
nathan
Nathan Bossart <nathandbossart@gmail.com> writes:
AFAICT TAS_SPIN() is only used for s_lock(), which doesn't appear to be
used by LWLocks. But I did retry my test from upthread without
pg_stat_statements and was surprised to find a reproducible 4-6%
regression.
On what hardware?
I just spent an hour beating on my M4 Pro (the 14-core variant)
and could not detect any outside-the-noise effect of this patch,
with or without pg_stat_statements loaded. There does seem to be
a small fraction-of-a-percent-ish benefit. But the run-to-run
variation with 60-second "pgbench -S" tests is a couple of percent,
so I can't say that that's real.
I do feel pretty sure that the patch doesn't hurt on this
class of hardware.
regards, tom lane
On Wed, Jan 08, 2025 at 06:07:44PM -0500, Tom Lane wrote:
Nathan Bossart <nathandbossart@gmail.com> writes:
AFAICT TAS_SPIN() is only used for s_lock(), which doesn't appear to be
used by LWLocks. But I did retry my test from upthread without
pg_stat_statements and was surprised to find a reproducible 4-6%
regression.On what hardware?
This was on a c8g.24xlarge (Neoverse-V2, Armv9.0-a) [0].
I just spent an hour beating on my M4 Pro (the 14-core variant)
and could not detect any outside-the-noise effect of this patch,
with or without pg_stat_statements loaded. There does seem to be
a small fraction-of-a-percent-ish benefit. But the run-to-run
variation with 60-second "pgbench -S" tests is a couple of percent,
so I can't say that that's real.
I do feel pretty sure that the patch doesn't hurt on this
class of hardware.
Great. This matches what I saw on an M3.
[0]: https://github.com/aws/aws-graviton-getting-started/blob/main/README.md#building-for-graviton
--
nathan
On Wed, Jan 08, 2025 at 04:38:02PM -0600, Nathan Bossart wrote:
On Wed, Jan 08, 2025 at 05:25:24PM -0500, Andres Freund wrote:
I wonder if you're hitting an extreme case of binary-layout related effects?
I've never seen them at this magnitude though. I'd suggest using either lld
or mold as linker and comparing the numbers for a few
-Wl,--shuffle-sections=$seed seed values.
Will do.
Actually, I think I may have just had bad luck and/or not warmed things up
enough. I just re-ran the test a few dozen times, carefully ensuring the
data was in the cache and periodically alternating between the binary with
the patch applied and the one without it. The results converged to within
1-2% of each other, with the patched version even winning about half the
time. The averages across all the runs showed a ~0.4% regression, which I
suspect is well within the noise range.
--
nathan
On Fri, Jan 10, 2025 at 10:51:40AM -0600, Nathan Bossart wrote:
Actually, I think I may have just had bad luck and/or not warmed things up
enough. I just re-ran the test a few dozen times, carefully ensuring the
data was in the cache and periodically alternating between the binary with
the patch applied and the one without it. The results converged to within
1-2% of each other, with the patched version even winning about half the
time. The averages across all the runs showed a ~0.4% regression, which I
suspect is well within the noise range.
I went ahead and committed this patch. Please let me know if there are any
remaining concerns.
--
nathan
Hi, Nathan.
I just realized that I almost forgot about this thread :)
The result looks great, but the discussion in [0] shows that the result may
vary among different ARM chips. Could you provide the chip model of this
test? So that we can do a cross validation of this patch. Not sure if compiler
version is necessary too. I'm willing to test it on Alibaba Cloud Yitian 710
if I have time.
I did some benchmark on Yitian 710.
On c8y.16xlarge (64 cores):
Without the patch:
80.31% postgres [.] __aarch64_swp4_acq
1.77% postgres [.] __aarch64_ldadd4_acq_rel
1.13% postgres [.] hash_search_with_hash_value
0.87% pg_stat_statements.so [.] __aarch64_swp4_acq
0.72% postgres [.] perform_spin_delay
0.44% postgres [.] _bt_compare
tps = 295272.628421 (including connections establishing)
tps = 295335.660323 (excluding connections establishing)
Patched:
9.94% postgres [.] s_lock
6.07% postgres [.] __aarch64_swp4_acq
5.73% postgres [.] hash_search_with_hash_value
2.81% postgres [.] perform_spin_delay
2.29% postgres [.] _bt_compare
2.15% postgres [.] PinBuffer
tps = 864519.764125 (including connections establishing)
tps = 864638.244443 (excluding connections establishing)
Seems that great performance could be gained if s_lock contention is severe.
This may be more likely to happen on bigger machines.
On c8y.2xlarge (8 cores), I failed to make s_lock contended severely, and
as a result this patch didn’t bring any difference outside the noise.
Regards,
Jingtang
On Wed, Jan 15, 2025 at 07:50:38PM +0800, Jingtang Zhang wrote:
Seems that great performance could be gained if s_lock contention is severe.
This may be more likely to happen on bigger machines.
On c8y.2xlarge (8 cores), I failed to make s_lock contended severely, and
as a result this patch didn't bring any difference outside the noise.
Thanks for sharing.
--
nathan