Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Started by Salvatore Dipietro · 9 months ago · 23 messages
#1Salvatore Dipietro
dipietro.salvatore@gmail.com
1 attachment(s)

Hi,
we would like to propose the removal of the Instruction
Synchronization Barrier (isb) for aarch64 architectures. Based on our
testing on Graviton instances (m7g.16xlarge), we can see on average
over multiple iterations up to 12% better performance using PGBench
select-only and up to 9% with Sysbench oltp_read_only workloads. On
Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
select-only and up to 6% with Sysbench oltp_read_only workloads.
We have also tested it putting more pressure on the spin_delay
function, enabling pg_stat_statements.track_planning with PGBench
read-only [0] and, on average, the patch shows up to 27% better
performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

Testing environment:
- PostgreSQL version: 17.2
- Operating System: Ubuntu 22.04
- Test Platform: AWS Graviton instances (m6g.16xlarge, m7g.16xlarge
and m8g.24xlarge)

Our benchmark results on PGBench select-only without
pg_stat_statements.track_planning:
```
# Load DB on m7g.16xlarge
$ pgbench -i --fillfactor=90 --scale=5644 --host=172.31.32.85
--username=postgres pgtest

# Without patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared
-P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
"transaction type: <builtin: select only>",
"scaling factor: 5644",
"query mode: prepared",
"number of clients: 256",
"number of threads: 96",
"duration: 600 s",
"number of transactions actually processed: 359864937",
"latency average = 0.420 ms",
"latency stddev = 1.755 ms",
"tps = 599770.727316 (including connections establishing)",
"tps = 599826.788919 (excluding connections establishing)"

# With patch
$ pgbench --host 172.31.32.85 --username=postgres --protocol=prepared
-P 10 -b select-only --time=600 --client=256 --jobs=96 pgtest
...
"transaction type: <builtin: select only>",
"scaling factor: 5644",
"query mode: prepared",
"number of clients: 256",
"number of threads: 96",
"duration: 600 s",
"number of transactions actually processed: 405891881",
"latency average = 0.371 ms",
"latency stddev = 0.569 ms",
"tps = 676480.900049 (including connections establishing)",
"tps = 676523.557293 (excluding connections establishing)"
```

[0]: /messages/by-id/ZxgDEb_VpWyNZKB_@nathan

Attachments:

0001-Remove-Instruction-Synchronization-Barrier-in-spin_d.patch (application/octet-stream)
From ac3a0b6be0d272abbb41f446a3de2a7625ddcb43 Mon Sep 17 00:00:00 2001
From: Salvatore Dipietro <dipiets@amazon.com>
Date: Tue, 29 Apr 2025 14:01:27 -0700
Subject: [PATCH] Remove Instruction Synchronization Barrier in spin_delay()
 for ARM64 architecture

---
 src/include/storage/s_lock.h | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 2f73f9fcf57a..e1e01b94fe68 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -274,16 +274,7 @@ tas(volatile slock_t *lock)
 #define SPIN_DELAY() spin_delay()
 
 static __inline__ void
-spin_delay(void)
-{
-	/*
-	 * Using an ISB instruction to delay in spinlock loops appears beneficial
-	 * on high-core-count ARM64 processors.  It seems mostly a wash for smaller
-	 * gear, and ISB doesn't exist at all on pre-v7 ARM chips.
-	 */
-	__asm__ __volatile__(
-		" isb;				\n");
-}
+spin_delay(void){}
 
 #endif	 /* __aarch64__ */
 #endif	 /* HAVE_GCC__SYNC_INT32_TAS */
-- 
2.49.0
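
For context, here is a minimal standalone sketch of where spin_delay() sits in a
spinlock wait loop. It is illustrative only: the demo_ names are invented, and
PostgreSQL's real retry loop lives in s_lock.c/s_lock.h and additionally applies
exponential backoff before sleeping. It only shows the role of the ISB that the
patch removes:

```
/*
 * Illustrative sketch only -- not PostgreSQL code.  It shows the role of
 * spin_delay() (and thus the ISB the patch removes): a pause executed on
 * every failed acquisition attempt while spinning on a contended lock.
 */
#include <stdatomic.h>

typedef atomic_int demo_slock_t;

static inline void
demo_spin_delay(void)
{
#if defined(__aarch64__)
	/* the instruction whose removal is proposed in this thread */
	__asm__ __volatile__(" isb;\n");
#endif
	/* with the patch applied, this function body would be empty */
}

static void
demo_spin_lock(demo_slock_t *lock)
{
	/* keep retrying the atomic test-and-set until we get the lock */
	while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0)
		demo_spin_delay();	/* pause between attempts to ease contention */
}

static void
demo_spin_unlock(demo_slock_t *lock)
{
	atomic_store_explicit(lock, 0, memory_order_release);
}

int
main(void)
{
	demo_slock_t lock = 0;

	demo_spin_lock(&lock);
	demo_spin_unlock(&lock);
	return 0;
}
```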

#2Robert Haas
robertmhaas@gmail.com
In reply to: Salvatore Dipietro (#1)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
<dipietro.salvatore@gmail.com> wrote:

we would like to propose the removal of the Instruction
Synchronization Barrier (isb) for aarch64 architectures. Based on our
testing on Graviton instances (m7g.16xlarge), we can see on average
over multiple iterations up to 12% better performance using PGBench
select-only and up to 9% with Sysbench oltp_read_only workloads. On
Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
select-only and up to 6% with Sysbench oltp_read_only workloads.
We have also tested it putting more pressure on the spin_delay
function, enabling pg_stat_statements.track_planning with PGBench
read-only [0] and, on average, the patch shows up to 27% better
performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

Hmm. This was added only 3 years ago, supposedly because it made
performance better:

commit a82a5eee314df52f3183cedc0ecbcac7369243b1
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed Apr 6 18:57:57 2022 -0400

Use ISB as a spin-delay instruction on ARM64.

This seems beneficial on high-core-count machines, and not harmful
on lesser hardware. However, older ARM32 gear doesn't have this
instruction, so restrict the patch to ARM64.

Geoffrey Blake

Discussion:
/messages/by-id/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com

I think you should make some kind of argument about why the previous
conclusion was wrong, or why something's changed between then and now.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
<dipietro.salvatore@gmail.com> wrote:

we would like to propose the removal of the Instruction
Synchronization Barrier (isb) for aarch64 architectures. Based on our
testing on Graviton instances (m7g.16xlarge), we can see on average
over multiple iterations up to 12% better performance using PGBench
select-only and up to 9% with Sysbench oltp_read_only workloads. On
Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
select-only and up to 6% with Sysbench oltp_read_only workloads.
We have also tested it putting more pressure on the spin_delay
function, enabling pg_stat_statements.track_planning with PGBench
read-only [0] and, on average, the patch shows up to 27% better
performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

I think you should make some kind of argument about why the previous
conclusion was wrong, or why something's changed between then and now.

TBH, those numbers are large enough that I flat out don't believe
them. As noted in the previous thread, we've managed to squeeze out
a lot of our dependencies on spinlock performance, via algorithmic
changes, migration to atomic ops, etc. So I think ten-percent-ish
improvement on a plain pgbench test just isn't very plausible.
We certainly didn't see that kind of effect when we were doing that
earlier round of testing --- we had to use a custom testing lashup
to get numbers that were outside the noise at all.

Of course, microbenchmarking is a tricky business, so it's possible
that a different custom testing lashup would show the opposite
results. But what's quoted above is sufficiently unlike our prior
results that I can't help thinking something is wrong.

One other thing that comes to mind is that pg_stat_statements
has stretched the intention of "short straight-line code segment"
to the point of unrecognizability. Maybe it needs to stop using
spinlocks to protect pgssEntry updates. But I'm not sure if that
would move the needle on whether ISB is a good idea or not.

regards, tom lane

#4Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#3)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Thu, May 01, 2025 at 02:48:59PM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Apr 30, 2025 at 4:53 AM Salvatore Dipietro
<dipietro.salvatore@gmail.com> wrote:

we would like to propose the removal of the Instruction
Synchronization Barrier (isb) for aarch64 architectures. Based on our
testing on Graviton instances (m7g.16xlarge), we can see on average
over multiple iterations up to 12% better performance using PGBench
select-only and up to 9% with Sysbench oltp_read_only workloads. On
Graviton4 (m8g.24xlarge) results are up to 8% better using PGBench
select-only and up to 6% with Sysbench oltp_read_only workloads.
We have also tested it putting more pressure on the spin_delay
function, enabling pg_stat_statements.track_planning with PGBench
read-only [0] and, on average, the patch shows up to 27% better
performance on m6g.16xlarge and up to 37% on m7g.16xlarge.

I think you should make some kind of argument about why the previous
conclusion was wrong, or why something's changed between then and now.

TBH, those numbers are large enough that I flat out don't believe
them. As noted in the previous thread, we've managed to squeeze out
a lot of our dependencies on spinlock performance, via algorithmic
changes, migration to atomic ops, etc. So I think ten-percent-ish
improvement on a plain pgbench test just isn't very plausible.
We certainly didn't see that kind of effect when we were doing that
earlier round of testing --- we had to use a custom testing lashup
to get numbers that were outside the noise at all.

I'm not sure there's actually any hot spinlock involved in a regular
select-only pgbench (unless perhaps pg_stat_statements was enabled or
something). But I do know pg_stat_statements.track_planning stresses the
spinlock code quite a bit, and commit 3d0b4b1 recently added a non-locking
initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
appropriate. It'd be interesting to see the performance difference of
removing the ISB with and without commit 3d0b4b1 applied.

--
nathan
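
For readers unfamiliar with the pattern mentioned above: a "non-locking initial
test" (test-and-test-and-set) peeks at the lock with a plain load and only
attempts the atomic operation once the lock looks free, so waiters spin on a
shared cache line instead of bouncing it between cores. A minimal standalone
sketch, not the actual s_lock.h macros (the demo_ names are invented):

```
/*
 * Illustrative sketch of "TAS" vs. "TAS_SPIN with a non-locking initial
 * test" -- not the actual s_lock.h definitions.
 */
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_int demo_slock_t;

/* Unconditional atomic test-and-set; returns true if the lock was held. */
static inline bool
demo_tas(demo_slock_t *lock)
{
	return atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0;
}

/*
 * Retry path: peek with an ordinary load first, and only attempt the
 * atomic exchange once the lock looks free.  Waiters then spin on a
 * shared (read-only) copy of the cache line instead of fighting for
 * exclusive ownership of it on every iteration.
 */
static inline bool
demo_tas_spin(demo_slock_t *lock)
{
	return atomic_load_explicit(lock, memory_order_relaxed) != 0 ||
		demo_tas(lock);
}

int
main(void)
{
	demo_slock_t lock = 0;

	(void) demo_tas(&lock);			/* initial acquisition attempt */
	(void) demo_tas_spin(&lock);	/* what the retry loop would use */
	atomic_store_explicit(&lock, 0, memory_order_release);
	return 0;
}
```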

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#4)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Nathan Bossart <nathandbossart@gmail.com> writes:

... commit 3d0b4b1 recently added a non-locking
initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
appropriate. It'd be interesting to see the performance difference of
removing the ISB with and without commit 3d0b4b1 applied.

Oh! That's an excellent point. The OP didn't mention if their tests
were done before or after 3d0b4b1, but that might well matter.

I still think pgbench is a very blunt tool for this type of testing,
though. I recommend resurrecting the test_shm_mq-based hack discussed
in the prior thread and seeing what that shows.

regards, tom lane

#6Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#5)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Thu, May 01, 2025 at 04:08:06PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

... commit 3d0b4b1 recently added a non-locking
initial test in AArch64's TAS_SPIN, so I wonder if the ISB is still
appropriate. It'd be interesting to see the performance difference of
removing the ISB with and without commit 3d0b4b1 applied.

Oh! That's an excellent point. The OP didn't mention if their tests
were done before or after 3d0b4b1, but that might well matter.

I still think pgbench is a very blunt tool for this type of testing,
though. I recommend resurrecting the test_shm_mq-based hack discussed
in the prior thread and seeing what that shows.

Well, I have interesting results. This is all on a c8g.24xlarge (96 cores,
Neoverse-V2, Armv9.0-a).

For the first test_shm_mq test, I ran the following:

SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 2);
SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
...

This gave me the following results (values are in seconds):

       w/o 3d0b4b1          w/ 3d0b4b1
       ISB     no ISB       ISB     no ISB
  1    1.4     1.6          1.5     1.6
  2    2.1     2.0          2.1     2.1
  4    3.2     3.5          3.3     3.5
  8    7.4     8.1          7.2     8.4
 16   18.0    35.9         22.7    23.4
 32   35.7    85.6         53.7    49.5
 64   85.1       ?        147.6   100.1

For the second test_shm_mq test, I ran at higher concurrency, so I had to
reduce the loop counts:

SELECT test_shm_mq_pipelined(16384, 'xyzzy', 100000, 32);
...

That gave me the following:

        w/o 3d0b4b1          w/ 3d0b4b1
        ISB     no ISB       ISB     no ISB
  32    0.4     0.8          0.5     0.6
  64    2.0     4.8          1.3     1.1
 128    6.1    29.3          7.5     2.1
 256   43.0    66.4         24.4     4.5

Finally, I ran the pgbench select-only test with
pg_stat_statements.track_planning enabled (values are in thousands of
transactions per second):

     w/o 3d0b4b1          w/ 3d0b4b1
     ISB     no ISB       ISB     no ISB
    71.4     67.4       538.2   891.2

So...

* The ISB does seem to have a positive effect without commit 3d0b4b1
applied.

* With commit 3d0b4b1 applied, removing the ISB seems to have a positive
effect at high concurrencies. This is especially pronounced in the
pgbench test.

* With commit 3d0b4b1 applied, removing the ISB doesn't change much at
lower concurrencies, and there might even be a small regression.

* At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
some test_shm_mq tests. Removing the ISB instruction appears to help in
some cases, but not all.

--
nathan

#7Salvatore Dipietro
dipietro.salvatore@gmail.com
In reply to: Tom Lane (#5)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Thu, 1 May 2025 at 13:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh! That's an excellent point. The OP didn't mention if their tests
were done before or after 3d0b4b1, but that might well matter.

The benchmarks we conducted are based on the REL_17_2 branch, which does
not include the TAS_SPIN(lock) change for ARM yet.

#8Salvatore Dipietro
dipietro.salvatore@gmail.com
In reply to: Nathan Bossart (#6)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Thu, 1 May 2025 at 14:50, Nathan Bossart <nathandbossart@gmail.com> wrote:

So...

* The ISB does seem to have a positive effect without commit 3d0b4b1
applied.

* With commit 3d0b4b1 applied, removing the ISB seems to have a positive
effect at high concurrencies. This is especially pronounced in the
pgbench test.

* With commit 3d0b4b1 applied, removing the ISB doesn't change much at
lower concurrencies, and there might even be a small regression.

* At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
some test_shm_mq tests. Removing the ISB instruction appears to help in
some cases, but not all.

Based on your findings Nathan, what is the best way to proceed for this change?
Do we need more validation for it? If yes, which kind?

Thanks,
Salvatore

#9Nathan Bossart
nathandbossart@gmail.com
In reply to: Salvatore Dipietro (#8)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Tue, May 13, 2025 at 11:37:45AM -0700, Salvatore Dipietro wrote:

On Thu, 1 May 2025 at 14:50, Nathan Bossart <nathandbossart@gmail.com> wrote:

So...

* The ISB does seem to have a positive effect without commit 3d0b4b1
applied.

* With commit 3d0b4b1 applied, removing the ISB seems to have a positive
effect at high concurrencies. This is especially pronounced in the
pgbench test.

* With commit 3d0b4b1 applied, removing the ISB doesn't change much at
lower concurrencies, and there might even be a small regression.

* At mostly lower concurrencies, commit 3d0b4b1 actually seems to regress
some test_shm_mq tests. Removing the ISB instruction appears to help in
some cases, but not all.

Based on your findings Nathan, what is the best way to proceed for this change?
Do we need more validation for it? If yes, which kind?

Well, I am confused because your recent message [0] indicated that you saw
improvement without the non-locking initial test in TAS_SPIN, which seems
to contradict my findings [1]. Could you retry your tests on v18devel? It
might also be useful to repeat the tests on a variety of hardware to ensure
it's a win across the board.

[0]: /messages/by-id/CAGnuAhXNQXCcS1nCeD6E0Dyfi4Ms-b0sjcm79Y9iMi5WxUqM4g@mail.gmail.com
[1]: /messages/by-id/aBPsrFbjnrqp3_8S@nathan

--
nathan

#10Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#9)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Mon, May 19, 2025 at 11:38:39AM -0500, Nathan Bossart wrote:

On Tue, May 13, 2025 at 11:37:45AM -0700, Salvatore Dipietro wrote:

Based on your findings Nathan, what is the best way to proceed for this change?
Do we need more validation for it? If yes, which kind?

Well, I am confused because your recent message [0] indicated that you saw
improvement without the non-locking initial test in TAS_SPIN, which seems
to contradict my findings [1]. Could you retry your tests on v18devel? It
might also be useful to repeat the tests on a variety of hardware to ensure
it's a win across the board.

I should probably also mention that we are in feature-freeze until v19
development opens in July.

--
nathan

#11Salvatore Dipietro
dipietro.salvatore@gmail.com
In reply to: Nathan Bossart (#9)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Mon, 19 May 2025 at 09:38, Nathan Bossart <nathandbossart@gmail.com> wrote:

Could you retry your tests on v18devel? It might also be useful to
repeat the tests on a variety of hardware to ensure it's a win across
the board.

Hi Nathan,
Thanks for your clarification. As you requested, I have performed more
tests on different instance types and sizes. In particular, I have run
the `test_shm_mq_pipelined` benchmark using Ubuntu 22.04 on
m7g.[2,8,16]xlarge and c8g.[2,8,24]xlarge instances with the PG master
branch (commit: 84914e964b4). Each test was repeated 30 times; below are
the averages (in seconds) and the ratio relative to the Master baseline.

Graviton3 instances (m7g) results:
| Concurrency | Loops    | 2xl Master | 2xl No-ISB    | 8xl Master | 8xl No-ISB    | 16xl Master | 16xl No-ISB   |
|-------------|----------|------------|---------------|------------|---------------|-------------|---------------|
| 1           | 10000000 | 1.9s       | 1.9s (1.00x)  | 1.9s       | 1.8s (1.06x)  | 1.7s        | 1.8s (0.94x)  |
| 2           | 10000000 | 2.4s       | 2.4s (1.00x)  | 2.5s       | 2.3s (1.09x)  | 2.3s        | 2.3s (1.00x)  |
| 4           | 10000000 | 3.8s       | 3.8s (1.00x)  | 3.8s       | 3.6s (1.06x)  | 3.5s        | 3.8s (0.92x)  |
| 8           | 10000000 | 8.9s       | 10.3s (0.86x) | 7.5s       | 8.6s (0.87x)  | 7.8s        | 8.9s (0.88x)  |
| 16          | 10000000 | 21.6s      | 22.5s (0.96x) | 22.5s      | 23.6s (0.95x) | 21.4s       | 24.9s (0.86x) |
| 32          | 10000000 | 42.8s      | 41.3s (1.04x) | 114.7s     | 52.0s (2.21x) | 88.6s       | 49.9s (1.78x) |
| 64          | 10000000 | 81.8s      | 73.3s (1.12x) | 395.9s     | 85.2s (4.65x) | 381.3s      | 97.0s (3.93x) |
| 32          | 100000   | 0.4s       | 0.4s (1.00x)  | 1.1s       | 0.5s (2.20x)  | 1.1s        | 0.6s (1.83x)  |
| 64          | 100000   | 0.8s       | 0.8s (1.00x)  | 3.9s       | 0.9s (4.33x)  | 3.9s        | 1.1s (3.55x)  |
| 128         | 100000   | 1.6s       | 1.5s (1.07x)  | 8.5s       | 1.9s (4.47x)  | 13.3s       | 2.0s (6.65x)  |
| 256         | 100000   | 3.2s       | 3.1s (1.03x)  | 19.8s      | 4.0s (4.95x)  | 35.9s       | 4.1s (8.76x)  |

Graviton4 instances (c8g) results:
| Concurrency | Loops    | 2xl Master | 2xl No-ISB    | 8xl Master | 8xl No-ISB    | 24xl Master | 24xl No-ISB    |
|-------------|----------|------------|---------------|------------|---------------|-------------|----------------|
| 1           | 10000000 | 1.7s       | 1.6s (1.06x)  | 1.6s       | 1.6s (1.00x)  | 1.6s        | 1.5s (1.07x)   |
| 2           | 10000000 | 2.2s       | 2.2s (1.00x)  | 2.2s       | 2.2s (1.00x)  | 2.2s        | 2.1s (1.05x)   |
| 4           | 10000000 | 3.4s       | 3.5s (0.97x)  | 3.5s       | 3.4s (1.03x)  | 3.5s        | 3.4s (1.03x)   |
| 8           | 10000000 | 10.9s      | 13.9s (0.78x) | 8.2s       | 9.4s (0.87x)  | 7.8s        | 8.2s (0.95x)   |
| 16          | 10000000 | 23.6s      | 27.0s (0.87x) | 26.3s      | 26.1s (1.01x) | 27.1s       | 28.1s (0.96x)  |
| 32          | 10000000 | 44.6s      | 46.9s (0.95x) | 60.6s      | 47.7s (1.27x) | 62.1s       | 50.4s (1.23x)  |
| 64          | 10000000 | 81.4s      | 81.5s (1.00x) | 189.4s     | 91.5s (2.07x) | 176.9s      | 101.3s (1.75x) |
| 32          | 100000   | 0.5s       | 0.5s (1.00x)  | 0.6s       | 0.5s (1.20x)  | 0.6s        | 0.5s (1.20x)   |
| 64          | 100000   | 0.8s       | 0.8s (1.00x)  | 1.7s       | 0.9s (1.89x)  | 2.1s        | 1.2s (1.75x)   |
| 128         | 100000   | 1.5s       | 1.6s (0.94x)  | 4.5s       | 1.9s (2.37x)  | 7.8s        | 2.1s (3.71x)   |
| 256         | 100000   | 3.3s       | 3.1s (1.06x)  | 9.7s       | 4.1s (2.37x)  | 22.0s       | 4.5s (4.89x)   |

We can notice that with low concurrency (1,2,4) results are similar
while with medium concurrency (8,16)
the No-ISB approach can introduce some regression especially on
smaller instances. However, we can see some significant
positive performance impact with high concurrency (>=32) settings on
large instances (up to 8.76x on m7g.16xl with 256 concurrency).

#12Nathan Bossart
nathandbossart@gmail.com
In reply to: Salvatore Dipietro (#11)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote:

We can notice that with low concurrency (1,2,4) results are similar
while with medium concurrency (8,16)
the No-ISB approach can introduce some regression especially on
smaller instances. However, we can see some significant
positive performance impact with high concurrency (>=32) settings on
large instances (up to 8.76x on m7g.16xl with 256 concurrency).

Given these mixed results, it's unclear to me how exactly we should
proceed. Perhaps there is another approach that reduces the regressions to
a negligible level while still producing gains at higher levels of
concurrency. Or maybe we can convince ourselves that these regressions
aren't worth worrying about, but that seems like a bit of a stretch to me.

--
nathan

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#12)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Nathan Bossart <nathandbossart@gmail.com> writes:

On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote:

We can notice that with low concurrency (1,2,4) results are similar
while with medium concurrency (8,16)
the No-ISB approach can introduce some regression especially on
smaller instances. However, we can see some significant
positive performance impact with high concurrency (>=32) settings on
large instances (up to 8.76x on m7g.16xl with 256 concurrency).

Given these mixed results, it's unclear to me how exactly we should
proceed. Perhaps there is another approach that reduces the regressions to
a negligible level while still producing gains at higher levels of
concurrency. Or maybe we can convince ourselves that these regressions
aren't worth worrying about, but that seems like a bit of a stretch to me.

I agree; I'm also worried that this may be optimizing for one
particular ARM implementation and have negative effects on others.

Perhaps a better way forward is to identify which spinlock is
suffering such huge contention and try to alleviate that at a
higher design level. (That could benefit more than just ARM, too.)

regards, tom lane

#14Álvaro Herrera
alvherre@kurilemu.de
In reply to: Tom Lane (#3)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On 2025-May-01, Tom Lane wrote:

One other thing that comes to mind is that pg_stat_statements
has stretched the intention of "short straight-line code segment"
to the point of unrecognizability. Maybe it needs to stop using
spinlocks to protect pgssEntry updates. But I'm not sure if that
would move the needle on whether ISB is a good idea or not.

Yeah, it looks like pgss_store() is being too generous on the amount of
code run with that spinlock held. However, changing that spinlock to an
lwlock doesn't look easy, because of the way each pgss entry is created
as a dynahash entry, and then deallocated from there. With spinlocks we
can just reinit the spinlock each time, but that doesn't work with
lwlocks. We have no easy way to associate then disassociate each entry
from a specific lwlock.

Maybe we'd need to do
RequestNamedLWLockTranche("pg_stat_statements", 1 + pgss_max)
at startup, then put them all in an lwlock-freelist: a list of unused
lwlocks in this tranche, to take one on entry_alloc and put it back on
entry_dealloc.

I think Salvatore's comment that enabling track_planning makes the
performance gain more pronounced is evidence that changing this spinlock
to lwlock would be beneficial.

Another benefit of using an lwlock: pg_stat_statements_internal() could
use shared mode to read each entry. Probably not as critical (because
it requires the view to be read), but it'd reduce the performance hit
when the user is reading the view.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu. Five minutes later I realize that it's also talking
about food" (Donald Knuth)
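
To make the freelist idea in the message above concrete, here is a standalone
model in plain C: a fixed pool of locks created at startup (standing in for the
RequestNamedLWLockTranche() call) plus a mutex-protected stack of unused indexes
that entry_alloc()/entry_dealloc() would pop and push. Pthread mutexes stand in
for LWLocks, and all names and sizes here are invented:

```
/*
 * Standalone model of the freelist idea -- not extension code.  Pre-create
 * a fixed pool of locks at startup and hand out / reclaim lock indexes
 * through a freelist as entries are allocated and deallocated.
 */
#include <pthread.h>
#include <stdio.h>

#define POOL_SIZE 8				/* stand-in for 1 + pgss_max */

static pthread_mutex_t lock_pool[POOL_SIZE];	/* the "named tranche" */
static int	free_idx[POOL_SIZE];				/* indexes of unused locks */
static int	nfree;
static pthread_mutex_t freelist_mutex = PTHREAD_MUTEX_INITIALIZER;

static void
pool_startup(void)
{
	for (int i = 0; i < POOL_SIZE; i++)
	{
		pthread_mutex_init(&lock_pool[i], NULL);
		free_idx[i] = i;
	}
	nfree = POOL_SIZE;
}

/* entry_alloc() would pop an index and attach that lock to the new entry. */
static int
lock_alloc(void)
{
	int			idx = -1;

	pthread_mutex_lock(&freelist_mutex);
	if (nfree > 0)
		idx = free_idx[--nfree];
	pthread_mutex_unlock(&freelist_mutex);
	return idx;					/* -1 means the pool is exhausted */
}

/* entry_dealloc() would push the index back for reuse. */
static void
lock_release_to_pool(int idx)
{
	pthread_mutex_lock(&freelist_mutex);
	free_idx[nfree++] = idx;
	pthread_mutex_unlock(&freelist_mutex);
}

int
main(void)
{
	pool_startup();

	int			idx = lock_alloc();		/* new entry gets lock_pool[idx] */

	pthread_mutex_lock(&lock_pool[idx]);	/* "update the counters" */
	pthread_mutex_unlock(&lock_pool[idx]);
	lock_release_to_pool(idx);				/* entry evicted */
	printf("used lock %d from the pool\n", idx);
	return 0;
}
```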

#15Nathan Bossart
nathandbossart@gmail.com
In reply to: Álvaro Herrera (#14)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Thu, Aug 14, 2025 at 11:29:08AM +0200, Álvaro Herrera wrote:

On 2025-May-01, Tom Lane wrote:

One other thing that comes to mind is that pg_stat_statements
has stretched the intention of "short straight-line code segment"
to the point of unrecognizability. Maybe it needs to stop using
spinlocks to protect pgssEntry updates. But I'm not sure if that
would move the needle on whether ISB is a good idea or not.

Yeah, it looks like pgss_store() is being too generous on the amount of
code run with that spinlock held. However, changing that spinlock to an
lwlock doesn't look easy, because of the way each pgss entry is created
as a dynahash entry, and then deallocated from there. With spinlocks we
can just reinit the spinlock each time, but that doesn't work with
lwlocks. We have no easy way to associate then disassociate each entry
from a specific lwlock.

I've attempted multiple times now to replace this spinlock with LWLocks or
atomics, but thus far haven't produced anything that improved matters [0].
IIRC I just dynamically created LWLocks with LWLockInitialize() as needed,
which is tedious but easy enough. My recollection is that the use of
floats for many of the values greatly complicated the atomics attempts.
So, fixing it might require some larger changes.

[0]: /messages/by-id/Zs4hJ6-Fg8DMgU_P@nathan

--
nathan

#16Andres Freund
andres@anarazel.de
In reply to: Álvaro Herrera (#14)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Hi,

On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:

On 2025-May-01, Tom Lane wrote:

One other thing that comes to mind is that pg_stat_statements
has stretched the intention of "short straight-line code segment"
to the point of unrecognizability. Maybe it needs to stop using
spinlocks to protect pgssEntry updates. But I'm not sure if that
would move the needle on whether ISB is a good idea or not.

Yeah, it looks like pgss_store() is being too generous on the amount of
code run with that spinlock held.

Indeed. To the point that these days pgss is basically unusable for workloads
with many short queries.

However, changing that spinlock to an lwlock doesn't look easy, because of
the way each pgss entry is created as a dynahash entry, and then deallocated
from there. With spinlocks we can just reinit the spinlock each time, but
that doesn't work with lwlocks. We have no easy way to associate then
disassociate each entry from a specific lwlock.

I'm not following? The lwlock can just be inside the struct, just like the
spinlock is? "Association" is just LWLockInitialize() and deassociation is not
needed.

Greetings,

Andres Freund

#17Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#16)
1 attachment(s)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote:

On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:

However, changing that spinlock to an lwlock doesn't look easy, because of
the way each pgss entry is created as a dynahash entry, and then deallocated
from there. With spinlocks we can just reinit the spinlock each time, but
that doesn't work with lwlocks. We have no easy way to associate then
disassociate each entry from a specific lwlock.

I'm not following? The lwlock can just be inside the struct, just like the
spinlock is? "Association" is just LWLockInitialize() and deassociation is not
needed.

Indeed. I rebased an old patch that I had lying around to demonstrate. If
my past testing [0] is to be trusted, this actually hurts performance,
unfortunately.

[0]: /messages/by-id/Zs4hJ6-Fg8DMgU_P@nathan

--
nathan

Attachments:

v1-0001-replace-p_s_s-entry-spinlocks-with-lwlocks.patch (text/plain; charset=us-ascii)
From ecb70a2bb492c5ca47e86cadbce91d395f011316 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 1 Aug 2024 09:30:32 -0500
Subject: [PATCH v1 1/1] replace p_s_s entry spinlocks with lwlocks

---
 contrib/pg_stat_statements/pg_stat_statements.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9fc9635d330..d1fdd77acc3 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -240,7 +240,7 @@ typedef struct pgssEntry
 	int			encoding;		/* query text encoding */
 	TimestampTz stats_since;	/* timestamp of entry allocation */
 	TimestampTz minmax_stats_since; /* timestamp of last min/max values reset */
-	slock_t		mutex;			/* protects the counters only */
+	LWLock		mutex;			/* protects the counters only */
 } pgssEntry;
 
 /*
@@ -256,6 +256,7 @@ typedef struct pgssSharedState
 	int			n_writers;		/* number of active writers to query file */
 	int			gc_count;		/* query file garbage collection cycle count */
 	pgssGlobalStats stats;		/* global statistics for pgss */
+	int			tranche;
 } pgssSharedState;
 
 /*---- Local variables ----*/
@@ -554,6 +555,7 @@ pgss_shmem_startup(void)
 		pgss->gc_count = 0;
 		pgss->stats.dealloc = 0;
 		pgss->stats.stats_reset = GetCurrentTimestamp();
+		pgss->tranche = LWLockNewTrancheId();
 	}
 
 	info.keysize = sizeof(pgssHashKey);
@@ -565,6 +567,8 @@ pgss_shmem_startup(void)
 
 	LWLockRelease(AddinShmemInitLock);
 
+	LWLockRegisterTranche(pgss->tranche, "pg_stat_statements counters");
+
 	/*
 	 * If we're in the postmaster (or a standalone backend...), set up a shmem
 	 * exit hook to dump the statistics to disk.
@@ -1411,7 +1415,7 @@ pgss_store(const char *query, int64 queryId,
 		 * Grab the spinlock while updating the counters (see comment about
 		 * locking rules at the head of the file)
 		 */
-		SpinLockAcquire(&entry->mutex);
+		LWLockAcquire(&entry->mutex, LW_EXCLUSIVE);
 
 		/* "Unstick" entry if it was previously sticky */
 		if (IS_STICKY(entry->counters))
@@ -1511,7 +1515,7 @@ pgss_store(const char *query, int64 queryId,
 		else if (planOrigin == PLAN_STMT_CACHE_CUSTOM)
 			entry->counters.custom_plan_calls++;
 
-		SpinLockRelease(&entry->mutex);
+		LWLockRelease(&entry->mutex);
 	}
 
 done:
@@ -1901,9 +1905,9 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
 		}
 
 		/* copy counters to a local variable to keep locking time short */
-		SpinLockAcquire(&entry->mutex);
+		LWLockAcquire(&entry->mutex, LW_SHARED);
 		tmp = entry->counters;
-		SpinLockRelease(&entry->mutex);
+		LWLockRelease(&entry->mutex);
 
 		/*
 		 * The spinlock is not required when reading these two as they are
@@ -2134,7 +2138,7 @@ entry_alloc(pgssHashKey *key, Size query_offset, int query_len, int encoding,
 		/* set the appropriate initial usage count */
 		entry->counters.usage = sticky ? pgss->cur_median_usage : USAGE_INIT;
 		/* re-initialize the mutex each time ... we assume no one using it */
-		SpinLockInit(&entry->mutex);
+		LWLockInitialize(&entry->mutex, pgss->tranche);
 		/* ... and don't forget the query text metadata */
 		Assert(query_len >= 0);
 		entry->query_offset = query_offset;
-- 
2.39.5 (Apple Git-154)

#18Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#17)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Hi,

On 2025-08-15 12:57:52 -0500, Nathan Bossart wrote:

On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote:

On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:

However, changing that spinlock to an lwlock doesn't look easy, because of
the way each pgss entry is created as a dynahash entry, and then deallocated
from there. With spinlocks we can just reinit the spinlock each time, but
that doesn't work with lwlocks. We have no easy way to associate then
disassociate each entry from a specific lwlock.

I'm not following? The lwlock can just be inside the struct, just like the
spinlock is? "Association" is just LWLockInitialize() and deassociation is not
needed.

Indeed. I rebased an old patch that I had lying around to demonstrate. If
my past testing [0] is to be trusted, this actually hurts performance,
unfortunately.

FWIW, rather interesting result of testing the patch briefly:

On my older workstation, the patch is a substantial *gain* when there's a lot
of contention. But on my newer workstation it's a *loss*.

The penalty from enabling pg_stat_statements for readonly pgbench on the newer
workstation is rather bad - about 1/3 the throughput.

I think the main reason that lwlocks loose on the newer machine is that we
loose spinning. The newer machine has more cores and more numa domains and the
fairer locks lead to more cacheline pingpong...

IMO, the only way to actually make pg_stat_statements scale is to move to a
model much more like our regular stats. I.e. accumulate counters in backend
local memory and only occasionally update the shared stats. Even if you were
to move pgss successfully to atomics, the cacheline contention still would be
terrible for performance.

FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
compared to using the spinlocks. You'd replace one atomic operation with
dozens, to update all those fields individually. With loads of cacheline
pingpong inbetween.

Greetings,

Andres Freund
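
A standalone model of the accumulate-locally, flush-occasionally scheme
described above; a pthread mutex stands in for whatever would protect the
shared entry, and the names and flush threshold are invented for illustration:

```
/*
 * Standalone model of "accumulate counters in backend-local memory and only
 * occasionally update the shared stats" -- not pgstats or pgss code.
 */
#include <pthread.h>
#include <string.h>

typedef struct DemoCounters
{
	long		calls;
	double		total_time;
	long		rows;
} DemoCounters;

/* "Shared" entry, touched only once per flush instead of once per query. */
static DemoCounters shared_entry;
static pthread_mutex_t shared_entry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Backend-local pending counters: updated with no locking at all. */
static DemoCounters pending;
static int	calls_since_flush;

#define FLUSH_EVERY_N_CALLS 64	/* invented threshold; could also be timed */

static void
flush_pending(void)
{
	pthread_mutex_lock(&shared_entry_lock);
	shared_entry.calls += pending.calls;
	shared_entry.total_time += pending.total_time;
	shared_entry.rows += pending.rows;
	pthread_mutex_unlock(&shared_entry_lock);

	memset(&pending, 0, sizeof(pending));
	calls_since_flush = 0;
}

/* Hot path: called once per query, touches only backend-local memory. */
static void
record_query(double elapsed_ms, long nrows)
{
	pending.calls++;
	pending.total_time += elapsed_ms;
	pending.rows += nrows;
	if (++calls_since_flush >= FLUSH_EVERY_N_CALLS)
		flush_pending();
}

int
main(void)
{
	for (int i = 0; i < 1000; i++)
		record_query(0.1, 1);
	flush_pending();			/* final flush, e.g. at backend exit */
	return 0;
}
```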

#19Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#18)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote:

IMO, the only way to actually make pg_stat_statements scale is to move to a
model much more like our regular stats. I.e. accumulate counters in backend
local memory and only occasionally update the shared stats.

Agreed. I remember discussing something similar at pgconf.dev this year.

FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
compared to using the spinlocks. You'd replace one atomic operation with
dozens, to update all those fields individually. With loads of cacheline
pingpong inbetween.

Right. At some point I tried moving most things to atomics and leaving the
rest behind the spinlock, and IIRC there wasn't really any improvement. I
didn't dig into whether that was because of atomic-related cache line
ping-pong or the existing spinlock, but regardless, I quickly discarded
that idea.

--
nathan

#20Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#19)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Fri, Aug 15, 2025 at 03:25:20PM -0500, Nathan Bossart wrote:

On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote:

IMO, the only way to actually make pg_stat_statements scale is to move to a
model much more like our regular stats. I.e. accumulate counters in backend
local memory and only occasionally update the shared stats.

Agreed. I remember discussing something similar at pgconf.dev this year.

BTW I'm planning to give this a try...

--
nathan

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#18)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

Andres Freund <andres@anarazel.de> writes:

FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
compared to using the spinlocks. You'd replace one atomic operation with
dozens, to update all those fields individually. With loads of cacheline
pingpong inbetween.

Not only that, but you'd no longer have any semblance of read-consistency
between the fields of a stats entry.

regards, tom lane

#22Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#20)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Fri, Aug 15, 2025 at 03:54:29PM -0500, Nathan Bossart wrote:

On Fri, Aug 15, 2025 at 03:25:20PM -0500, Nathan Bossart wrote:

On Fri, Aug 15, 2025 at 04:13:30PM -0400, Andres Freund wrote:

IMO, the only way to actually make pg_stat_statements scale is to move to a
model much more like our regular stats. I.e. accumulate counters in backend
local memory and only occasionally update the shared stats.

Agreed. I remember discussing something similar at pgconf.dev this year.

BTW I'm planning to give this a try...

This means using the pluggable pgstats API to achieve that. FWIW,
one difficulty I am foreseeing is how we want to cap and control the
number of PGSS entries that are saved in the pgstats hash table, based
on the maximum that PGSS can be set to. Just saying that we'll most
likely rely on a separate zone of shared memory to track this
information. Most of that should happen in the flush callback, based
on the current infra in place, I guess.

More stats are good to have, but capping them is key to keeping the
backend under control, and it does not make sense to me to use a fixed
chunk of shared memory for this (meaning a fixed-sized pgstats
kind for PGSS). Another benefit is to make the max reloadable.
--
Michael

#23Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#13)
Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

On Mon, Aug 11, 2025 at 04:12:48PM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

On Thu, Jun 19, 2025 at 12:10:52PM -0700, Salvatore Dipietro wrote:

We can notice that with low concurrency (1,2,4) results are similar
while with medium concurrency (8,16)
the No-ISB approach can introduce some regression especially on
smaller instances. However, we can see some significant
positive performance impact with high concurrency (>=32) settings on
large instances (up to 8.76x on m7g.16xl with 256 concurrency).

Given these mixed results, it's unclear to me how exactly we should
proceed. Perhaps there is another approach that reduces the regressions to
a negligible level while still producing gains at higher levels of
concurrency. Or maybe we can convince ourselves that these regressions
aren't worth worrying about, but that seems like a bit of a stretch to me.

I agree; I'm also worried that this may be optimizing for one
particular ARM implementation and have negative effects on others.

Based on this discussion, I am marking the patch as Returned with Feedback.

--
nathan