spinlocks: generalizing "non-locking test"

Started by Neil Conwayabout 21 years ago7 messages
#1Neil Conway
neilc@samurai.com

Currently, the assembly for TAS() on x86 does a non-locking test before
using an atomic operation to attempt to acquire the spinlock:

__asm__ __volatile__(
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
"1: \n"
: "+q"(_res), "+m"(*lock)
:
: "memory", "cc");

The reason this is a good idea is that if we fail to immediately acquire
the spinlock, s_lock() will spin SPINS_PER_DELAY times in userspace
calling TAS() each time before going to sleep. If we do an atomic
operation for each spin, this generates a lot more bus traffic than is
necessary. Doing a non-locking test (followed by an atomic operation to
acquire the spinlock if appropriate) is therefore better on SMP systems.

Currently x86 is the only platform on which we do this -- ISTM that all
the other platforms that implement spinlocks via atomic operations could
benefit from this technique.

We could fix this by tweaking each platform's assembler to add a
non-blocking test, but there might be an easier way. Rather than
modifying platform-specific assembler, I believe this C sequence is
equivalent to the non-locking test:

volatile slock_t *lock = ...;

if (*lock == 0)
TAS(lock);

Because the lock variable is volatile, the compiler should reload it
from memory for each loop iteration. (If this is actually not a
sufficient non-locking test, please let me know...)

We could add a new s_lock.h macro, TAS_INNER_LOOP(), whose default
implementation would be:

#define TAS_INNER_LOOP(lock) \
if ((*lock) == 0) \
TAS(lock);

And then remove the x86-specific non-locking test from TAS.

Comments?

-Neil

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#1)
Re: spinlocks: generalizing "non-locking test"

Neil Conway <neilc@samurai.com> writes:

Currently x86 is the only platform on which we do this -- ISTM that all
the other platforms that implement spinlocks via atomic operations could
benefit from this technique.

Do you have any actual evidence for that opinion? ISTM this is
dependent on a large set of assumptions about the CPU's bus behavior,
boiling down to the conclusion that an extra conditional branch is
cheaper than a locked bus cycle. It would be a serious error to
transfer that conclusion to non-Intel chips without investigation.
(For that matter, I'm not that thrilled about it even for the Intel
chips, since the extra test is certainly 100% wasted cycles on any
single-CPU machine, or indeed anywhere that the lock is not contended
for.)

I believe that an extra test would be a dead loss on PPC, for instance,
because the initial load is already nonblocking there.

if (*lock == 0)
TAS(lock);

This will certainly not work on HPPA, and it really shouldn't assume
that zero is the non-locked state, and what happened to testing the
TAS result?

On the whole it seems far easier to stick an extra load into the asm
code on those platforms where we think it's a win. Especially since
we have already done that ;-)

regards, tom lane

#3Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#2)
Re: spinlocks: generalizing "non-locking test"

On Mon, 2004-10-18 at 04:13, Tom Lane wrote:

Do you have any actual evidence for that opinion? ISTM this is
dependent on a large set of assumptions about the CPU's bus behavior,
boiling down to the conclusion that an extra conditional branch is
cheaper than a locked bus cycle.

I think the conditional branch is effectively free: we're spinning in a
busy-wait loop, so we're effectively throwing away cycles until the lock
is free anyway. The important things are: (a) we interfere as little as
possible with concurrent system activity while spinning (b) we notice
promptly that the lock is free. ISTM that doing a non-locking test
before the locking test achieves those two conditions.

(For that matter, I'm not that thrilled about it even for the Intel
chips, since the extra test is certainly 100% wasted cycles on any
single-CPU machine, or indeed anywhere that the lock is not contended
for.)

Yes, but the proper fix for this is not to spin at all on UP machines; I
have a grungy patch to do this that I will clean up and send in for 8.1.

if (*lock == 0)
TAS(lock);

This will certainly not work on HPPA, and it really shouldn't assume
that zero is the non-locked state, and what happened to testing the
TAS result?

Erm, perhaps I should have emphasized that that was pseudo-code :) A
more realistic implementation would be:

#define TAS_INNER_LOOP(lock) (S_LOCK_FREE(lock) ? TAS(lock) : 1)

-Neil

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#3)
Re: spinlocks: generalizing "non-locking test"

Neil Conway <neilc@samurai.com> writes:

On Mon, 2004-10-18 at 04:13, Tom Lane wrote:

Do you have any actual evidence for that opinion? ISTM this is
dependent on a large set of assumptions about the CPU's bus behavior,
boiling down to the conclusion that an extra conditional branch is
cheaper than a locked bus cycle.

I think the conditional branch is effectively free: we're spinning in a
busy-wait loop, so we're effectively throwing away cycles until the lock
is free anyway.

Only once we've begun to spin. The first time through, it's not at all
clear whether the extra test is worthwhile --- it's certainly a win if
the lock is always already held, and certainly a loss if the lock is
always free, and otherwise you have to do some benchmarking to decide
if you want it or not. We have the ASM-level test on those platforms
where people seem to think that it is worthwhile, but not everywhere.
As near as I can tell, your proposal is to never have the extra test on
the initial TAS (because you'll remove it from the ASM) and to have it
always on repeated tests (inside the spin loop). This is not what is
done elsewhere AFAICS, and I'm not prepared to take on faith that it's
a win.

regards, tom lane

#5Mark Wong
markw@osdl.org
In reply to: Neil Conway (#1)
Re: spinlocks: generalizing "non-locking test"

On Sun, Oct 17, 2004 at 11:16:50PM +1000, Neil Conway wrote:

Currently, the assembly for TAS() on x86 does a non-locking test before
using an atomic operation to attempt to acquire the spinlock:

__asm__ __volatile__(
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
"1: \n"
: "+q"(_res), "+m"(*lock)
:
: "memory", "cc");

The reason this is a good idea is that if we fail to immediately acquire
the spinlock, s_lock() will spin SPINS_PER_DELAY times in userspace
calling TAS() each time before going to sleep. If we do an atomic
operation for each spin, this generates a lot more bus traffic than is
necessary. Doing a non-locking test (followed by an atomic operation to
acquire the spinlock if appropriate) is therefore better on SMP systems.

Currently x86 is the only platform on which we do this -- ISTM that all
the other platforms that implement spinlocks via atomic operations could
benefit from this technique.

We could fix this by tweaking each platform's assembler to add a
non-blocking test, but there might be an easier way. Rather than
modifying platform-specific assembler, I believe this C sequence is
equivalent to the non-locking test:

volatile slock_t *lock = ...;

if (*lock == 0)
TAS(lock);

Because the lock variable is volatile, the compiler should reload it
from memory for each loop iteration. (If this is actually not a
sufficient non-locking test, please let me know...)

We could add a new s_lock.h macro, TAS_INNER_LOOP(), whose default
implementation would be:

#define TAS_INNER_LOOP(lock) \
if ((*lock) == 0) \
TAS(lock);

And then remove the x86-specific non-locking test from TAS.

Comments?

-Neil

Steve Hemminger had this to say:

The linux kernel code is:
#define UNLOCKED 1

...
while (atomic_dec(lock) != 0) {
do {
rep_nop();
} while(*lock != UNLOCKED);
}

To do the equivalent thing in postgres would mean

while(TAS(lock)) {
do {
rep_nop();
} while (*lock);

The point is do the locking test first (assume success is possible)
if that doesn't work spin without doing locked operations and make
sure and do the rep; nop; for the hyperthreaded CPU's

#6Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#4)
Re: spinlocks: generalizing "non-locking test"

On Mon, 2004-10-18 at 11:53, Tom Lane wrote:

Only once we've begun to spin. The first time through, it's not at all
clear whether the extra test is worthwhile --- it's certainly a win if
the lock is always already held, and certainly a loss if the lock is
always free

Granted, but I think you've mostly conceded my point: every _subsequent_
time TAS() is invoked, the non-locking test is a clear win (with the
possible exception of PPC). Therefore we have two cases: "initial TAS"
and "TAS inside loop", so so the logical implementation is two distinct
macros. Of course, there may well be platforms on which TAS() is defined
to TAS_INNER_LOOP() or vice versa, but this decision will vary by
platform.

you have to do some benchmarking to decide if you want it or not.

I agree that benchmarking is worth doing before making changes.

We have the ASM-level test on those platforms
where people seem to think that it is worthwhile, but not everywhere.

That is certainly an optimistic interpretation :-) I would say an
equally likely theory is that there is only one platform on which people
have bothered to try a non-blocking test and see if it improves
performance, and accordingly we only have one platform on which a
non-locking test is used.

(If anyone out there _has_ modified the spinlock implementation for
PostgreSQL on a particular platform to use a non-locking initial test
and found it hasn't improved performance, please speak up.)

-Neil

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#6)
Re: spinlocks: generalizing "non-locking test"

Neil Conway <neilc@samurai.com> writes:

Granted, but I think you've mostly conceded my point: every _subsequent_
time TAS() is invoked, the non-locking test is a clear win (with the
possible exception of PPC).

I'm not real sure. One point here is that the standard advice about
this stuff is generally thinking in terms of an *extremely* tight spin
loop, ie

while (TAS(lock))
;

The loop in s_lock.c has a bit more overhead than that. Also, because
we only use spinlocks to protect LWLocks, the expected hold time for a
spinlock is just a couple dozen instructions, which is probably less
than the expected time in most other uses of spinlocks. So I think it's
less than clear that we should expect TAS to fail, even within the loop.

Basically I'd like to see some tests proving that there's actually any
value in it before we go complicating the assembly-code API ...

regards, tom lane