Add spin_delay() implementation for Arm in s_lock.h

Started by Blake, Geoff · about 4 years ago · 15 messages
#1 Blake, Geoff <blakgeof@amazon.com>
1 attachment(s)

Hi,

I have a tiny patch that adds an implementation of spin_delay() for Arm64 processors, matching the behavior of x86's PAUSE instruction. I see negligible benefit on the pgbench tpcb-like workload, so at worst it appears to do no harm, but it should help workloads that experience lock contention and need to spin.
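
For reference, the core of the change looks roughly like the sketch below (following the existing x86 spin_delay() conventions in s_lock.h and shown here guarded for ARM64 only; see the attached patch for the real thing, which may differ in details such as which Arm variants it covers):

#if defined(__aarch64__)
#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
	/*
	 * ISB briefly stalls the instruction stream, playing the same
	 * "back off for a moment" role that PAUSE (rep; nop) plays in the
	 * x86 implementations of spin_delay().
	 */
	__asm__ __volatile__(
		" isb;				\n");
}
#endif

spin_delay() is what perform_spin_delay() in s_lock.c invokes inside its retry loop while a spinlock is contended, so the delay only comes into play when a backend is actually spinning.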

Thanks,
Geoffrey Blake

Attachments:

0001-Add-spin_delay-implementation-for-Arm-processors-by-.patch (application/octet-stream)
#2 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Blake, Geoff (#1)
Re: Add spin_delay() implementation for Arm in s_lock.h

"Blake, Geoff" <blakgeof@amazon.com> writes:

> I have a tiny patch that adds an implementation of spin_delay() for Arm64 processors, matching the behavior of x86's PAUSE instruction. I see negligible benefit on the pgbench tpcb-like workload, so at worst it appears to do no harm, but it should help workloads that experience lock contention and need to spin.

Given the very wide variety of ARM implementations out there,
I'm not sure that we want to take a patch like this on the basis of
exactly zero evidence. It could as easily be a net loss as a win.
What did you test exactly?

regards, tom lane

#3 Blake, Geoff <blakgeof@amazon.com>
In reply to: Tom Lane (#2)
Re: Add spin_delay() implementation for Arm in s_lock.h

Hi Tom,

> What did you test exactly?

I tested 3 benchmark configurations on an m6g.16xlarge (Graviton2, 64 CPUs, 256GB RAM).
I set the scale factor to consume about 1/3 of the 256GB of RAM; the other parameters are on the next line.
pgbench setup: -F 90 -s 5622 -c 256
pgbench select-only w/ patch: 662804 tps (-0.5%)
w/o patch: 666354 tps
pgbench tpcb-like w/ patch: 35844 tps (0%)
w/o patch: 35835 tps

We also test with HammerDB when evaluating patches; it shows a +3% gain with this patch:
HammerDB (192 warehouses, 256 clients)
w/ patch: 1147463 NOPM (+3%)
w/o patch: 1112908 NOPM

I've run pgbench more than once and the measured TPS values overlap; even though the means on select-only show a small degradation, at the moment I am concluding it is noise. On HammerDB, the results show a measurable difference.

Thanks,
Geoff

#4 Blake, Geoff <blakgeof@amazon.com>
In reply to: Tom Lane (#2)
Re: Add spin_delay() implementation for Arm in s_lock.h

Tom,

Hope everything is going well into the new year. I'd like to pick this discussion back up and get your thoughts on the patch given the data I posted two weeks ago. Is there more data that would be helpful? A different setup? Data on older versions of PostgreSQL, to ascertain whether it makes more sense on versions before the large rework of the snapshot algorithm, which exhibited quite a bit of synchronization contention?

Thanks,
Geoff

#5 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Blake, Geoff (#4)
1 attachment(s)
Re: Add spin_delay() implementation for Arm in s_lock.h

"Blake, Geoff" <blakgeof@amazon.com> writes:

> Hope everything is going well into the new year. I'd like to pick this discussion back up and get your thoughts on the patch given the data I posted two weeks ago. Is there more data that would be helpful? A different setup? Data on older versions of PostgreSQL, to ascertain whether it makes more sense on versions before the large rework of the snapshot algorithm, which exhibited quite a bit of synchronization contention?

I spent some time working on this. I don't have a lot of faith in
pgbench as a measurement testbed for spinlock contention, because over
the years we've done a good job of getting rid of that in our main
code paths (both the specific change you mention, and many others).
After casting around a bit and thinking about writing a bespoke test
framework, I landed on the idea of adding some intentional spinlock
contention to src/test/modules/test_shm_mq, which is a prefab test
framework for passing data among multiple worker processes. The
attached quick-hack patch makes it grab and release a spinlock once
per passed message. I'd initially expected that this would show only
marginal changes, because you'd hope that a spinlock acquisition would
be reasonably cheap compared to shm_mq_receive plus shm_mq_send.
Turns out not though.
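
The attachment has the real version, but the gist of the hack is just a couple of lines in the per-message send/receive path, something like the sketch below (where "hdr" stands for the test_shm_mq shared-memory header; the exact field names and placement are per the attached patch, not this sketch):

	/* once per message passed, bounce the existing setup spinlock */
	SpinLockAcquire(&hdr->mutex);
	SpinLockRelease(&hdr->mutex);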

The proposed test case is

(1) patch test_shm_mq as below

(2) time this query:

SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, n);

for various values of "n" up to about how many cores you have.
(You'll probably need to bump up max_worker_processes.)

For context, on my Intel-based main workstation (8-core Xeon W-2245),
the time to do this with stock test_shm_mq is fairly level.
Reporting best-of-3 runs:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1386.413 ms (00:01.386)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 1302.503 ms (00:01.303)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 1373.121 ms (00:01.373)

However, after applying the contention patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1346.362 ms (00:01.346)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3313.490 ms (00:03.313)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 7660.329 ms (00:07.660)

So this seems like (a) it's a plausible model for code that has
unoptimized spinlock contention, and (b) the effects are large
enough that you needn't fret too much about measurement noise.

I tried this out on a handy Apple M1 mini, which I concede
is not big iron but it's pretty decent aarch64 hardware.
With current HEAD's spinlock code, I get (again best-of-3):

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1630.255 ms (00:01.630)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3495.066 ms (00:03.495)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19541.929 ms (00:19.542)

With your spin-delay patch:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1643.524 ms (00:01.644)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3404.625 ms (00:03.405)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 19260.721 ms (00:19.261)

So I don't see a lot of reason to think your patch changes anything.
Maybe on something with more cores?

For grins I also tried this same test with the use-CAS-for-TAS patch
that was being discussed November before last, and it didn't
really show up as any improvement either:

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 1);
Time: 1608.642 ms (00:01.609)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 4);
Time: 3396.564 ms (00:03.397)

regression=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, 8);
Time: 20092.683 ms (00:20.093)

Maybe that's a little better in the uncontended (single-worker)
case, but it's worse at the high end.
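
For anyone who doesn't recall that discussion: "use CAS for TAS" means implementing the spinlock's test-and-set with a compare-and-swap instead of the unconditional atomic swap that TAS() normally compiles to. A rough sketch of the idea in terms of generic GCC builtins -- illustrative only, not the actual patch, and ignoring s_lock.h's slock_t typedefs -- looks like:

static inline int
tas_using_cas(volatile int *lock)
{
	int		expected = 0;

	/*
	 * Try to change the lock word from 0 to 1 with a compare-and-swap.
	 * Return 0 if we got the lock, nonzero if it was already held; the
	 * hope is that a CAS that fails because the lock is taken is cheaper
	 * under contention than an unconditional swap.
	 */
	return !__atomic_compare_exchange_n(lock, &expected, 1, false,
										__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}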

I'm really curious to hear if this measurement method shows
any interesting improvements on your hardware.

regards, tom lane

Attachments:

test-spinlock-contention.patch (text/x-diff; charset=us-ascii)
#6 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#5)
Re: Add spin_delay() implementation for Arm in s_lock.h

Hi,

> I landed on the idea of adding some intentional spinlock
> contention to src/test/modules/test_shm_mq, which is a prefab test
> framework for passing data among multiple worker processes. The
> attached quick-hack patch makes it grab and release a spinlock once
> per passed message.

I wonder if this will show the full set of spinlock contention issues - isn't
this only causing contention for one spinlock between two processes? It's not
too hard to imagine delays being more important the more processes contend for
one cacheline. I only skimmed your changes, so I might also just have
misunderstood what you were doing...

Greetings,

Andres Freund

#7 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#6)
Re: Add spin_delay() implementation for Arm in s_lock.h

Andres Freund <andres@anarazel.de> writes:

>> I landed on the idea of adding some intentional spinlock
>> contention to src/test/modules/test_shm_mq, which is a prefab test
>> framework for passing data among multiple worker processes. The
>> attached quick-hack patch makes it grab and release a spinlock once
>> per passed message.

> I wonder if this will show the full set of spinlock contention issues - isn't
> this only causing contention for one spinlock between two processes?

I don't think so -- the point of using the "pipelined" variant is
that messages are passing between all N worker processes concurrently.
(With the proposed test, I see N processes all pinning their CPUs;
if I use the non-pipelined API, they are busy but nowhere near 100%.)

It is just one spinlock, true, but I think the point is to gauge
what happens with N processes all contending for the same lock.
We could add some more complexity to use multiple locks, but
does that really add anything but complexity?

regards, tom lane

#8 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#7)
Re: Add spin_delay() implementation for Arm in s_lock.h

Hi,

On 2022-01-06 21:39:57 -0500, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:

>> I wonder if this will show the full set of spinlock contention issues - isn't
>> this only causing contention for one spinlock between two processes?

> I don't think so -- the point of using the "pipelined" variant is
> that messages are passing between all N worker processes concurrently.
> (With the proposed test, I see N processes all pinning their CPUs;
> if I use the non-pipelined API, they are busy but nowhere near 100%.)

My understanding of the shm_mq code is that that ends up with N shm_mq
instances, one for each worker. After all:

* shm_mq.c
* single-reader, single-writer shared memory message queue

These separate shm_mq instances forward messages in a circle,
"leader"->worker_1->worker_2->...->"leader". So there isn't a single contended
spinlock, but a bunch of different spinlocks, each with at most two backends
accessing it?

> It is just one spinlock, true, but I think the point is to gauge
> what happens with N processes all contending for the same lock.
> We could add some more complexity to use multiple locks, but
> does that really add anything but complexity?

Right, I agree that that's what we should test - it's just not immediately
obvious to me that we are.

Greetings,

Andres Freund

#9 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Andres Freund (#8)
Re: Add spin_delay() implementation for Arm in s_lock.h

Andres Freund <andres@anarazel.de> writes:

> These separate shm_mq instances forward messages in a circle,
> "leader"->worker_1->worker_2->...->"leader". So there isn't a single contended
> spinlock, but a bunch of different spinlocks, each with at most two backends
> accessing it?

No; there's just one spinlock. I'm re-purposing the spinlock that
test_shm_mq uses to protect its setup operations (and thereafter
ignores). AFAICS the N+1 shm_mq instances don't internally contain
spinlocks; they all use atomic ops.

(Well, on crappy architectures maybe there's spinlocks underneath
the atomic ops, but I don't think we care about such cases here.)

regards, tom lane

#10 Andres Freund <andres@anarazel.de>
In reply to: Tom Lane (#9)
Re: Add spin_delay() implementation for Arm in s_lock.h

On 2022-01-06 22:23:38 -0500, Tom Lane wrote:

> No; there's just one spinlock. I'm re-purposing the spinlock that
> test_shm_mq uses to protect its setup operations (and thereafter
> ignores).

Oh, sorry, misread :(

> AFAICS the N+1 shm_mq instances don't internally contain
> spinlocks; they all use atomic ops.

They contain spinlocks too, and the naming is similar enough that I got
confused:
struct shm_mq
{
	slock_t		mq_mutex;

We don't use them for all that much anymore though...

Greetings,

Andres Freund

#11 Blake, Geoff <blakgeof@amazon.com>
In reply to: Andres Freund (#10)
Re: Add spin_delay() implementation for Arm in s_lock.h

Tom, Andres,

I spun up a 64-core Graviton2 instance (where I reported seeing improvement with this patch) and ran the provided regression test with and without my proposed patch on top of mainline PG. I did 4 runs each with 63 workers, where we should see the most contention and the most impact from the patch. I am reporting the average and standard deviation; the average with the patch is 10% lower latency, but the standard deviations overlap. I'll gather additional data at lower worker counts and post it later to see what the trend is.

Cmd: postgres=# SELECT test_shm_mq_pipelined(16384, 'xyzzy', 10000000, workers);

Avg +/- standard dev
63 workers w/o patch: 552443ms +/- 22841ms
63 workers w/ patch: 502727ms +/- 45253ms

Best results
w/o patch: 521216ms
w/ patch: 436442ms

Thanks,
Geoff

#12 Blake, Geoff <blakgeof@amazon.com>
In reply to: Andres Freund (#10)
Re: Add spin_delay() implementation for Arm in s_lock.h

As promised, here is the remaining data:

1 worker, w/o patch: 5236 ms +/- 252ms
1 worker, w/ patch: 5529 ms +/- 168ms

2 worker, w/o patch: 4917 ms +/- 180ms
2 worker, w/ patch: 4745 ms +/- 169ms

4 worker, w/o patch: 6564 ms +/- 336ms
4 worker, w/ patch: 6105 ms +/- 177ms

8 worker, w/o patch: 9575 ms +/- 2375ms
8 worker, w/ patch: 8115 ms +/- 391ms

16 worker, w/o patch: 19367 ms +/- 3543ms
16 worker, w/ patch: 18004 ms +/- 3701ms

32 worker, w/o patch: 101509 ms +/- 22651ms
32 worker, w/ patch: 104234 ms +/- 26821ms

48 worker, w/o patch: 243329 ms +/- 70037ms
48 worker, w/ patch: 189965 ms +/- 79459ms

64 worker, w/o patch: 552443 ms +/- 22841ms
64 worker, w/ patch: 502727 ms +/- 45253ms

From this data, on average the patch is beneficial at the high worker (CPU) counts tested (48 and 63). At 32 workers and below, the performance is relatively close with and without the patch.

Thanks,
Geoff

#13 Blake, Geoff <blakgeof@amazon.com>
In reply to: Blake, Geoff (#12)
Re: Add spin_delay() implementation for Arm in s_lock.h

Hi Tom, Andres,

Any additional feedback for this patch?

Thanks,
Geoff Blake

#14 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Blake, Geoff (#13)
Re: Add spin_delay() implementation for Arm in s_lock.h

"Blake, Geoff" <blakgeof@amazon.com> writes:

> Hi Tom, Andres,
> Any additional feedback for this patch?

I did some more research and testing:

* Using a Mac with the M1 Pro chip (marginally beefier than the M1
I was testing on before), I think I can see some benefit in the
test case I proposed upthread. It's marginal though.

* On a Raspberry Pi 3B+, there's no outside-the-noise difference.

* ISB doesn't exist in pre-V7 ARM, so it seems prudent to restrict
the patch to ARM64. I doubt any flavor of ARM32 would be able to
benefit anyway. (Googling finds that MariaDB made this same
choice not long ago [1].)

So what we've got is that there seems to be benefit at high
core counts, and it at least doesn't hurt at lower ones.
That's good enough for me, so pushed.

regards, tom lane

[1]: https://jira.mariadb.org/browse/MDEV-25807

#15 Blake, Geoff <blakgeof@amazon.com>
In reply to: Tom Lane (#14)
Re: Add spin_delay() implementation for Arm in s_lock.h

Thanks for all the help Tom!
