remove adaptive spins_per_delay code
(creating new thread from [0])
On Wed, Apr 10, 2024 at 09:52:59PM -0400, Tom Lane wrote:
On fourth thought ... the number of tries to acquire the lock, or
in this case number of tries to observe the lock free, is not
NUM_DELAYS but NUM_DELAYS * spins_per_delay. Decreasing
spins_per_delay should therefore increase the risk of unexpected
"stuck spinlock" failures. And finish_spin_delay will decrement
spins_per_delay in any cycle where we slept at least once.
It's plausible therefore that this coding with finish_spin_delay
inside the main wait loop puts more downward pressure on
spins_per_delay than the algorithm is intended to cause.

I kind of wonder whether the premises finish_spin_delay is written
on even apply anymore, given that nobody except some buildfarm
dinosaurs runs Postgres on single-processor hardware anymore.
Maybe we should rip out the whole mechanism and hard-wire
spins_per_delay at 1000 or so.
I've been looking at spinlock contention on newer hardware, and while I do
not yet have any proposal to share for that, I saw this adaptive
spins_per_delay code and wondered about this possibility of "downward
pressure on spins_per_delay" for contended locks. ISTM it could make
matters worse in some cases.
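For reference, the mechanism under discussion looks roughly like this (a
simplified sketch of src/backend/storage/lmgr/s_lock.c from memory; the
constants and the cur_delay escalation details may not match any particular
version exactly):

#define MIN_SPINS_PER_DELAY 10
#define MAX_SPINS_PER_DELAY 1000
#define NUM_DELAYS          1000    /* sleeps allowed before PANIC */

static int  spins_per_delay = 100;  /* the adaptive per-backend value */

typedef struct
{
    int     spins;      /* lock probes since the last sleep */
    int     delays;     /* sleeps so far */
    int     cur_delay;  /* current sleep length, in usec; 0 = never slept */
} SpinDelayStatus;

void
perform_spin_delay(SpinDelayStatus *status)
{
    SPIN_DELAY();       /* cpu-level delay between probes ("pause" on x86) */

    /* sleep only after spins_per_delay consecutive failed probes */
    if (++(status->spins) >= spins_per_delay)
    {
        if (++(status->delays) > NUM_DELAYS)
            elog(PANIC, "stuck spinlock detected");
        if (status->cur_delay == 0)
            status->cur_delay = 1000;   /* 1ms; the real code then escalates */
        pg_usleep(status->cur_delay);
        status->spins = 0;
    }
}

void
finish_spin_delay(SpinDelayStatus *status)
{
    if (status->cur_delay == 0)
        /* never slept: spinning longer might have avoided sleeping */
        spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
    else
        /* slept at least once: spin a bit less next time */
        spins_per_delay = Max(spins_per_delay - 1, MIN_SPINS_PER_DELAY);
}

Since the stuck-lock check counts sleeps rather than probes, the total number
of probes before a PANIC is roughly NUM_DELAYS * spins_per_delay, which is why
downward pressure on spins_per_delay also lowers the stuck-spinlock threshold.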
Anyway, I'm inclined to agree that the premise of the adaptive
spins_per_delay code probably doesn't apply anymore, so here's a patch to
remove it.
[0]: /messages/by-id/65063.1712800379@sss.pgh.pa.us
--
nathan
Attachments:
v1-0001-remove-adaptive-spins_per_delay-code.patch (text/plain, +10/-116)
Hi,
On 2024-08-27 11:16:15 -0500, Nathan Bossart wrote:
(creating new thread from [0])
On Wed, Apr 10, 2024 at 09:52:59PM -0400, Tom Lane wrote:
On fourth thought ... the number of tries to acquire the lock, or
in this case number of tries to observe the lock free, is not
NUM_DELAYS but NUM_DELAYS * spins_per_delay. Decreasing
spins_per_delay should therefore increase the risk of unexpected
"stuck spinlock" failures. And finish_spin_delay will decrement
spins_per_delay in any cycle where we slept at least once.
It's plausible therefore that this coding with finish_spin_delay
inside the main wait loop puts more downward pressure on
spins_per_delay than the algorithm is intended to cause.

I kind of wonder whether the premises finish_spin_delay is written
on even apply anymore, given that nobody except some buildfarm
dinosaurs runs Postgres on single-processor hardware anymore.
Maybe we should rip out the whole mechanism and hard-wire
spins_per_delay at 1000 or so.

I've been looking at spinlock contention on newer hardware, and while I do
not yet have any proposal to share for that, I saw this adaptive
spins_per_delay code and wondered about this possibility of "downward
pressure on spins_per_delay" for contended locks. ISTM it could make
matters worse in some cases.

Anyway, I'm inclined to agree that the premise of the adaptive
spins_per_delay code probably doesn't apply anymore, so here's a patch to
remove it.
FWIW, I've seen cases on multi-socket machines where performance was vastly
worse under contention with some values of spins_per_delay, and the good
values were quite different on smaller machines. Most new-ish server CPUs
these days basically behave like a multi-socket machine internally, due to
being partitioned into multiple chiplets. And it's pretty clear that that
trend isn't going to go away. So finding a good value probably isn't easy.
We don't have a whole lot of contended spin.h spinlocks left, except that we
have one very critical one, XLogCtl->Insert.insertpos_lck. And of course we
use the same spinning logic for buffer header locks - which can be heavily
contended.
I suspect that eventually we ought to replace all our userspace
spinlock-like-things with a framework for writing properly "waiting" locks
with some spinning. We can't just use lwlocks because support for
reader-writer locks makes them much more heavyweight (mainly because it
implies having to use an atomic operation for lock release, which shows up
substantially).
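To illustrate the asymmetry (hypothetical code, not the actual LWLock
implementation): an exclusive-only lock can be released with a plain release
store, whereas a reader-writer lock has to do an atomic read-modify-write on
release, because the releaser can't know whether other readers still hold it:

#include <stdatomic.h>

/* exclusive-only: release is a single store with release ordering */
typedef struct { atomic_int locked; } ExclusiveLock;

static void
exclusive_release(ExclusiveLock *lock)
{
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

/* reader-writer: the state packs a shared-holder count, so release must
 * be an atomic decrement -- the extra cost referred to above */
typedef struct { atomic_uint state; } RWLock;   /* low bits: reader count */

static void
shared_release(RWLock *lock)
{
    /* a plain store of "count - 1" could lose a concurrent reader's
     * update, hence the atomic fetch-sub */
    atomic_fetch_sub_explicit(&lock->state, 1, memory_order_release);
}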
Greetings,
Andres Freund
On Tue, Aug 27, 2024 at 02:27:00PM -0400, Andres Freund wrote:
FWIW, I've seen cases on multi-socket machines where performance was vastly
worse under contention with some values of spins_per_delay, and the good
values were quite different on smaller machines. Most new-ish server CPUs
these days basically behave like a multi-socket machine internally, due to
being partitioned into multiple chiplets. And it's pretty clear that that
trend isn't going to go away. So finding a good value probably isn't easy.
Yeah.
We don't have a whole lot of contended spin.h spinlocks left, except that we
have one very critical one, XLogCtl->Insert.insertpos_lck. And of course we
use the same spinning logic for buffer header locks - which can be heavily
contended.
Another one I've been looking into is pgssEntry->mutex, which shows up
prominently when pg_stat_statements.track_planning is on. There was some
previous discussion about this [0], which resulted in that parameter
getting turned off by default (commit d1763ea). I tried converting those
locks to LWLocks, but that actually hurt performance. I also tried
changing the counters to atomics, which AFAICT is mostly doable except for
"usage". That one would require some more thought to be able to convert it
away from a double.
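The integer counters convert mechanically; "usage" is the awkward one because
there is no atomic add for doubles, so it would need a compare-and-swap loop
over the value's bit pattern. Roughly (a hypothetical sketch using
PostgreSQL's atomics API, not code from any patch):

#include <string.h>
#include "port/atomics.h"

/* an integer counter converts directly */
static pg_atomic_uint64 calls;

static void
count_call(void)
{
    pg_atomic_fetch_add_u64(&calls, 1);
}

/* a double has no atomic add, so "usage" would need a CAS loop over its
 * bit pattern -- one reason it needs more thought */
static pg_atomic_uint64 usage_bits;     /* double stored as raw bits */

static void
add_usage(double delta)
{
    uint64      old_bits = pg_atomic_read_u64(&usage_bits);

    for (;;)
    {
        double      old_val;
        double      new_val;
        uint64      new_bits;

        memcpy(&old_val, &old_bits, sizeof(old_val));
        new_val = old_val + delta;
        memcpy(&new_bits, &new_val, sizeof(new_bits));

        /* on failure, old_bits is refreshed with the current value */
        if (pg_atomic_compare_exchange_u64(&usage_bits, &old_bits, new_bits))
            break;
    }
}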
I suspect that eventually we ought to replace all our userspace
spinlock-like-things with a framework for writing properly "waiting" locks
with some spinning. We can't just use lwlocks because support for
reader-writer locks makes them much more heavyweight (mainly because it
implies having to use an atomic operation for lock release, which shows up
substantially).
Another approach I'm investigating is adding exponential backoff via extra
spins in perform_spin_delay(). I'm doubtful this will be a popular
suggestion, as appropriate settings seem to be hardware/workload dependent
and therefore will require a GUC or two, but it does seem to help
substantially on machines with many cores. In any case, I think we ought
to do _something_ in this area for v18.
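To make the idea concrete (a hypothetical sketch, not an actual patch; the
constants are placeholders, and picking them well is exactly the
hardware/workload-dependent part):

#define BACKOFF_MIN_SPINS   8
#define BACKOFF_MAX_SPINS   (8 * 1024)

/* spin for a budget that doubles after each failed acquisition, so that
 * waiters progressively back off the contended cache line */
static void
backoff_spin(int *budget)
{
    for (int i = 0; i < *budget; i++)
        SPIN_DELAY();           /* cpu "pause"-style hint */

    *budget = Min(*budget * 2, BACKOFF_MAX_SPINS);
}

A caller would reset the budget to BACKOFF_MIN_SPINS for each acquisition
attempt and invoke this between probes, roughly where perform_spin_delay()
spins today.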
[0]: /messages/by-id/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com
--
nathan