TAS patch for building on armel/armhf thumb
Hello,
if you build postgresql (tested all releases from 8.4 up to trunk) for
ARM with the -mthumb instruction set (much better performance), it
fails with
gcc -g -O2 -g -Wall -O2 -fPIC -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5 -c -o xlog.o xlog.c
/tmp/cc8Wkglp.s: Assembler messages:
/tmp/cc8Wkglp.s:1456: Error: selected processor does not support `swpb r3,r3,[r0]'
/tmp/cc8Wkglp.s:1587: Error: selected processor does not support `swpb r2,r2,[r0]'
A fair while ago, Alexander Sack from Linaro applied a patch to our
packages to drop the Assembler bits and instead use gcc's atomic
builtins [1]http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html -- Martin Pitt | http://www.piware.de Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org), which provide a proper implementation for thumb, too.
The original patch spectacularly failed on our slightly newer Panda
boards (our old builders were Freescale Babbage boards), but I got
that to work yesterday. Now it's working on Babbage, Panda, both with
and without hard float (armhf) enabled.
I'm not sure how appropriate it is for upstream to have GCC-isms in
the code, but even if it can't land upstream, perhaps it is useful for
other porters/packagers.
Thanks,
Martin
[1]: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html -- Martin Pitt | http://www.piware.de Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Attachments:
01-armel-tas.patchtext/x-diff; charset=us-asciiDownload+3-9
On 16.12.2011 11:36, Martin Pitt wrote:
if you build postgresql (tested all releases from 8.4 up to trunk) for
ARM with the -mthumb instruction set (much better performance), it
fails withgcc -g -O2 -g -Wall -O2 -fPIC -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5 -c -o xlog.o xlog.c
/tmp/cc8Wkglp.s: Assembler messages:
/tmp/cc8Wkglp.s:1456: Error: selected processor does not support `swpb r3,r3,[r0]'
/tmp/cc8Wkglp.s:1587: Error: selected processor does not support `swpb r2,r2,[r0]'A fair while ago, Alexander Sack from Linaro applied a patch to our
packages to drop the Assembler bits and instead use gcc's atomic
builtins [1], which provide a proper implementation for thumb, too.The original patch spectacularly failed on our slightly newer Panda
boards (our old builders were Freescale Babbage boards), but I got
that to work yesterday. Now it's working on Babbage, Panda, both with
and without hard float (armhf) enabled.I'm not sure how appropriate it is for upstream to have GCC-isms in
the code, but even if it can't land upstream, perhaps it is useful for
other porters/packagers.
Should be ok, as long as it's within #ifdef __GNUC__.
An even better approach would be to have a configure test for
__sync_lock_test_and_set. A quick google search suggests that Intel C
Compiler version >= 11.0 also supports __sync_lock_test_and_set, for
example. It probably makes sense to use it on any platform where it's
defined. Presumably an implementation provided by the compiler is always
going to be at least as good as any magic assembler incantations we can
come up with.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Hello all,
Heikki Linnakangas wrote:
An even better approach would be to have a configure test for
__sync_lock_test_and_set. A quick google search suggests that Intel
C Compiler version >= 11.0 also supports __sync_lock_test_and_set,
for example.
Right,
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
explicitly refers to intel CC as well.
It probably makes sense to use it on any platform where it's
defined. Presumably an implementation provided by the compiler is
always going to be at least as good as any magic assembler
incantations we can come up with.
I agree. How about a patch like this? It uses builtin atomics if
available, and falls back to the custom implementations if not.
Note that a simple AC_CHECK_FUNC(__sync_lock_test_and_set) does not
work, so I used a slightly more complicated AC_TRY_LINK() which also
verifies that __sync_lock_release() is available.
The patch is against 9.1.2, but I suppose it also applies to trunk
(except perhaps the autogenerated configure part, this needs an
autoheader/autoconf run).
Thanks for considering,
Martin
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Attachments:
01-atomic-builtins.patchtext/x-diff; charset=us-asciiDownload+97-1
On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt <mpitt@debian.org> wrote:
It probably makes sense to use it on any platform where it's
defined. Presumably an implementation provided by the compiler is
always going to be at least as good as any magic assembler
incantations we can come up with.I agree. How about a patch like this? It uses builtin atomics if
available, and falls back to the custom implementations if not.
-1. Absent some evidence that gcc's implementations are superior to
ours, I think we should not change stuff that works now. That's
likely to lead to subtle bugs that are hard to find and perhaps
dependent on the exact compiler version used.
But I'm completely cool with doing this for platforms where we haven't
otherwise got an implementation. Any port in a storm.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 19.12.2011 16:31, Robert Haas wrote:
On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt<mpitt@debian.org> wrote:
It probably makes sense to use it on any platform where it's
defined. Presumably an implementation provided by the compiler is
always going to be at least as good as any magic assembler
incantations we can come up with.I agree. How about a patch like this? It uses builtin atomics if
available, and falls back to the custom implementations if not.-1. Absent some evidence that gcc's implementations are superior to
ours, I think we should not change stuff that works now. That's
likely to lead to subtle bugs that are hard to find and perhaps
dependent on the exact compiler version used.
Ok, we're in disagreement on that then. I don't feel very strongly about
it, let's see what others think.
One thing that caught my eye: if you use __sync_lock_and_test() to
implement S_LOCK(), you really should be using __sync_lock_release() for
S_UNLOCK().
Actually, I believe our Itanium (and possibly ARM, too) implementation
of S_UNLOCK() is wrong as it is. There is no platform-specific
S_UNLOCK() defined for Itanium, so we're using the generic implementation:
#if !defined(S_UNLOCK)
#define S_UNLOCK(lock) (*((volatile slock_t *) (lock)) = 0)
#endif /* S_UNLOCK */
That is not sufficient on platforms with a weak memory model, like Itanium.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 19.12.2011 16:31, Robert Haas wrote:
On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt<mpitt@debian.org> wrote:
I agree. How about a patch like this? It uses builtin atomics if
available, and falls back to the custom implementations if not.
-1. Absent some evidence that gcc's implementations are superior to
ours, I think we should not change stuff that works now. That's
likely to lead to subtle bugs that are hard to find and perhaps
dependent on the exact compiler version used.
Ok, we're in disagreement on that then. I don't feel very strongly about
it, let's see what others think.
I agree with Robert. There is no evidence whatsoever that this would
be an improvement, and unless somebody cares to provide such evidence,
we shouldn't risk changing code that's so full of portability hazards.
Actually, I believe our Itanium (and possibly ARM, too) implementation
of S_UNLOCK() is wrong as it is.
Hmm. Anybody got a large itanium box we could play with? If it is
wrong, I'd expect it would show up pretty quickly under pgbench or
similar.
regards, tom lane
Robert Haas [2011-12-19 9:31 -0500]:
On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt <mpitt@debian.org> wrote:
It probably makes sense to use it on any platform where it's
defined. Presumably an implementation provided by the compiler is
always going to be at least as good as any magic assembler
incantations we can come up with.I agree. How about a patch like this? It uses builtin atomics if
available, and falls back to the custom implementations if not.-1. Absent some evidence that gcc's implementations are superior to
ours, I think we should not change stuff that works now. That's
likely to lead to subtle bugs that are hard to find and perhaps
dependent on the exact compiler version used.But I'm completely cool with doing this for platforms where we haven't
otherwise got an implementation. Any port in a storm.
Sure, then the other option is to stuff this at the end of s_lock.h if
we don't already have HAS_TEST_AND_SET. This would then mean that we
need to remove the armel implementation, as it doesn't really work on
anything non-ancient, and the gcc one got some fairly good test
coverage by now.
I'm happy to work out the patch for this. I'll just wait a bit if
there are more comments on this.
Thanks,
Martin
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Heikki Linnakangas [2011-12-19 17:09 +0200]:
One thing that caught my eye: if you use __sync_lock_and_test() to
implement S_LOCK(), you really should be using __sync_lock_release()
for S_UNLOCK().
Right, the patch I send does that:
#define S_UNLOCK(lock) __sync_lock_release(lock)
Martin
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Tom Lane [2011-12-19 10:25 -0500]:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 19.12.2011 16:31, Robert Haas wrote:
On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt<mpitt@debian.org> wrote:
I agree. How about a patch like this? It uses builtin atomics if
available, and falls back to the custom implementations if not.-1. Absent some evidence that gcc's implementations are superior to
ours, I think we should not change stuff that works now. That's
likely to lead to subtle bugs that are hard to find and perhaps
dependent on the exact compiler version used.Ok, we're in disagreement on that then. I don't feel very strongly about
it, let's see what others think.I agree with Robert. There is no evidence whatsoever that this would
be an improvement, and unless somebody cares to provide such evidence,
we shouldn't risk changing code that's so full of portability hazards.
OK, with you and Robert preferring this as a fallback instead of a
preferred way, and with Heikki's "I don't care much", I'll rework the
patch.
Thanks,
Martin
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
On 19.12.2011 17:39, Martin Pitt wrote:
Heikki Linnakangas [2011-12-19 17:09 +0200]:
One thing that caught my eye: if you use __sync_lock_and_test() to
implement S_LOCK(), you really should be using __sync_lock_release()
for S_UNLOCK().Right, the patch I send does that:
#define S_UNLOCK(lock) __sync_lock_release(lock)
Oh, I'm blind...
Can you comment on or test whether the existing TAS implementation is
broken on ARM, because the existing S_UNLOCK() on ARM doesn't act as a
memory barrier? If we're going to keep the existing snippet of inline
assembly for those variants of ARM where it works, we need to either fix
S_UNLOCK or convince ourselves that it's not broken.
I'll take a closer look at the same on Itanium.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 19.12.2011 17:25, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:
On 19.12.2011 16:31, Robert Haas wrote:
Actually, I believe our Itanium (and possibly ARM, too) implementation
of S_UNLOCK() is wrong as it is.Hmm. Anybody got a large itanium box we could play with? If it is
wrong, I'd expect it would show up pretty quickly under pgbench or
similar.
We've been running pgbench heavily recently in the company on large HP
Itanium boxes with 32 and 64 cores, and haven't seen any issues. I'm not
sure how it would manifest itself, though. It's also possible that while
it's a genuine bug at the source level, it happens to be masked by some
compiler decisions. It sure looks wrong.
I'll try to construct a self-contained test program to exercise that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Robert Haas [2011-12-19 9:31 -0500]:
-1. Absent some evidence that gcc's implementations are superior to
ours, I think we should not change stuff that works now. That's
likely to lead to subtle bugs that are hard to find and perhaps
dependent on the exact compiler version used.But I'm completely cool with doing this for platforms where we haven't
otherwise got an implementation. Any port in a storm.
The updated patch only uses the gcc builtins if there is no explicit
implementation, but drops the arm one as this doesn't work on ARMv7
and newer, as stated in the original mail.
Thanks,
Martin
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Attachments:
01-atomic-builtins.patchtext/x-diff; charset=us-asciiDownload+96-23
On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
Actually, I believe our Itanium (and possibly ARM, too) implementation
of S_UNLOCK() is wrong as it is. There is no platform-specific
S_UNLOCK() defined for Itanium, so we're using the generic
implementation:#if !defined(S_UNLOCK)
#define S_UNLOCK(lock) (*((volatile slock_t *) (lock)) = 0)
#endif /* S_UNLOCK */That is not sufficient on platforms with a weak memory model, like Itanium.
Other processors may observe the lock as held after its release, but there's no
correctness problem.
Noah Misch <noah@leadboat.com> writes:
On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
That is not sufficient on platforms with a weak memory model, like Itanium.
Other processors may observe the lock as held after its release, but there's no
correctness problem.
How weak is the memory model, exactly?
A correctness problem would ensue if out-of-order stores are possible,
ie other processors could observe the lock being freed (and then acquire
it) before seeing changes to shared variables that had been made while
holding the lock.
I'm a little dubious that this applies to Itanium, because I don't see
any memory fence instruction in the TAS macro. If we needed one in
S_UNLOCK I'd rather expect there to be one in TAS as well.
regards, tom lane
On 19.12.2011 22:03, Noah Misch wrote:
On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
Actually, I believe our Itanium (and possibly ARM, too) implementation
of S_UNLOCK() is wrong as it is. There is no platform-specific
S_UNLOCK() defined for Itanium, so we're using the generic
implementation:#if !defined(S_UNLOCK)
#define S_UNLOCK(lock) (*((volatile slock_t *) (lock)) = 0)
#endif /* S_UNLOCK */That is not sufficient on platforms with a weak memory model, like Itanium.
Other processors may observe the lock as held after its release, but there's no
correctness problem.
I thought it would also be legal for a store to become visible to other
processors, *after* the releasing of the lock. Which would be bad. For
example, if you have:
volatile bool *shared = ...
SpinLockAcquire(lock);
shared->variable = true;
SpinLockRelease(lock);
more code
The macro-expanded code would look like:
<test and set> lock
shared->variable = true;
(*((volatile slock_t *) (lock)) = 0;
more code
I believe on an architecture with a weak memory model, like Itanium,
there's no guarantee that the assignments will happen in that order. The
lock might appear as released *before* the variable is set.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 19.12.2011 22:12, Tom Lane wrote:
Noah Misch<noah@leadboat.com> writes:
On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
That is not sufficient on platforms with a weak memory model, like Itanium.
Other processors may observe the lock as held after its release, but there's no
correctness problem.How weak is the memory model, exactly?
A correctness problem would ensue if out-of-order stores are possible,
ie other processors could observe the lock being freed (and then acquire
it) before seeing changes to shared variables that had been made while
holding the lock.I'm a little dubious that this applies to Itanium, because I don't see
any memory fence instruction in the TAS macro. If we needed one in
S_UNLOCK I'd rather expect there to be one in TAS as well.
There's a pretty good manual called "Implementing Spinlocks on Itanium
and PA-RISC" from HP at:
http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf.
It says, in section "3.2.1 Try to get a spinlock":
Note also, that the �xchg� instruction always
has the �acquire� semantics, so it enforces the correct memory ordering.
But the current implementation seems to be safe, anyway. In section
"3.2.3 Freeing a spinlock", that manual says:
Note, that a simple assignment statement into a volatile variable
will not work, as we are
assuming that the +Ovolatile=__unordered compile option is being used.
So +Ovolatile=__unordered would allow the kind of optimization that I
thought was possible, and we would have a problem if we used it. But the
default is more conservative, and a simple assignment does work.
I compiled the attached test program on an HP Itanium box, using the
same flags you get from PostgreSQL's configure on that box. The relevant
assembler output is:
xchg4 r14 = [r15], r14 // M [slocktest.c: 66/3]
//file/line/col slocktest.c/67/3
ld1.acq r16 = [r11] // M [slocktest.c: 67/3]
nop.i 0 // I
//file/line/col slocktest.c/68/3
st1.rel [r11] = r10 ;; // M [slocktest.c: 68/3]
//file/line/col slocktest.c/69/3
st4.rel [r15] = r0 // M [slocktest.c: 69/3]
//file/line/col slocktest.c/70/1
The trick I missed is that the compiler attaches .rel to all the stores
and .acq to all the loads through a volatile pointer. gcc seems to do
the same. So we're safe.
It would be interesting to test whether using +Ovolatile=__unordered
would make a difference to performance. Tacking those memory fence
modifiers to all the volatile loads and stores might be expensive.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
slocktest.ctext/x-csrc; name=slocktest.cDownload
On Mon, Dec 19, 2011 at 11:25:06PM +0200, Heikki Linnakangas wrote:
I compiled the attached test program on an HP Itanium box, using the
same flags you get from PostgreSQL's configure on that box. The relevant
assembler output is:xchg4 r14 = [r15], r14 // M [slocktest.c: 66/3]
//file/line/col slocktest.c/67/3
ld1.acq r16 = [r11] // M [slocktest.c: 67/3]
nop.i 0 // I
//file/line/col slocktest.c/68/3
st1.rel [r11] = r10 ;; // M [slocktest.c: 68/3]
//file/line/col slocktest.c/69/3
st4.rel [r15] = r0 // M [slocktest.c: 69/3]
//file/line/col slocktest.c/70/1The trick I missed is that the compiler attaches .rel to all the stores
and .acq to all the loads through a volatile pointer. gcc seems to do
the same. So we're safe.
The Intel compiler appears not to follow this convention:
http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/copts/ccpp_options/option_qserialize-volatile.htm
If you have that compiler installed, could you see which opcode it generates?
Thanks,
nm
Martin Pitt <mpitt@debian.org> writes:
The updated patch only uses the gcc builtins if there is no explicit
implementation, but drops the arm one as this doesn't work on ARMv7
and newer, as stated in the original mail.
Getting this thread back to the original patch ... I'm afraid that if we
apply this as-is, what will happen is that we fix ARMv7 and break older
versions. Some googling found this, for instance:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33413
which suggests that (1) ARM gcc hasn't had __sync_lock_test_and_set for
very long, and (2) what it generates doesn't work pre-ARMv6.
So I'm thinking that removing the swpb ASM option is not such a good
idea. We could possibly test for __sync_lock_test_and_set first, and
only use swpb if we're on ARM and don't have the builtin.
Another thing that is bothering me is that according to the gcc manual,
eg here,
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
__sync_lock_test_and_set is nominally provided for datatypes 1, 2, 4,
or 8 bytes in length, but the underlying hardware doesn't necessarily
support all those widths natively. If you pick the wrong width then
you don't get an inline operation at all, but a call to some possibly
inefficient library subroutine. I see that your patch just assumes that
"int" will be a good width for the lock type, but it's unclear to me
what that choice is based on and whether or not it might be a really bad
choice on some platforms. A look through s_lock.h suggests that only a
minority of platforms prefer int-width locks ... but I have no idea
how many of those assembly snippets could have been coded to use a
different lock datatype without penalty. Some other evidence that
4-byte __sync_lock_test_and_set isn't universal is here:
https://svn.boost.org/trac/boost/ticket/2525
Google is also finding some rather worrisome suggestions that
__sync_lock_test_and_set might involve a kernel call on some flavors of
ARM. That would be pretty disastrous from a performance standpoint.
regards, tom lane
Hello Tom, all,
Tom Lane [2011-12-20 21:14 -0500]:
Getting this thread back to the original patch ... I'm afraid that if we
apply this as-is, what will happen is that we fix ARMv7 and break older
versions.
Right, I guess it's dependent on the compiler version, too. That's why
my original patch tested for this first and used it if it was
available, then we can have any code in the #else part further down.
So I'm thinking that removing the swpb ASM option is not such a good
idea. We could possibly test for __sync_lock_test_and_set first, and
only use swpb if we're on ARM and don't have the builtin.
OK, that would be kind of a hybrid solution between the first and
second version of the patch. Can work on this (but only next year,
need to do some christmas prep, and then holidays/AFK).
Another thing that is bothering me is that according to the gcc
manual, eg here,
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
__sync_lock_test_and_set is nominally provided for datatypes 1, 2,
4, or 8 bytes in length, but the underlying hardware doesn't
necessarily support all those widths natively. If you pick the
wrong width then you don't get an inline operation at all, but a
call to some possibly inefficient library subroutine.
It's even worse. Our original version of the patch used a char, and
while that worked fine on the older Babbage board, it completely broke
at runtime on e. g. a Panda board. It wasn't just slow, it caused
hangs and test failures all over the place. With int it works
flawlessly.
I see that your patch just assumes that "int" will be a good width
for the lock type, but it's unclear to me what that choice is based
on and whether or not it might be a really bad choice on some
platforms.
Unfortunately the gcc documentation doesn't give any further
conditions. As an "int" is usually meant to fit the standard word size
of a processor, if that function call/machine op code supports just
one datatype, it will most likely be "int", not something which you
have to mask on read/write (char) or potentially involves multiple
words (long/long long).
The intel definition [1]http://www.cs.fsu.edu/~engelen/courses/HPC-adv/intref_cls.pdf p. 198 (and also in above gcc doc) says "32 or 64
bit integer", which also matches "int" (unless we are looking at some
really small architectures like the old Atmels with their 16 bit
words, but these are three magnitudes too small for PostgreSQL anyway
:) ) So with the intel compiler "char" is clearly forbidden.
Beyond that I don't have any further data.
So, I'm happy with keeping this patch in Debian/Ubuntu only for the
time being, as there we have a pretty clear idea of the supported ARM
platforms, and can actually test the patches on all the platforms. I
just found it prudent to at least bring it up here.
Thanks, and happy end-of-year holidays!
Martin
[1]: http://www.cs.fsu.edu/~engelen/courses/HPC-adv/intref_cls.pdf p. 198
--
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Martin Pitt <mpitt@debian.org> writes:
Tom Lane [2011-12-20 21:14 -0500]:
Another thing that is bothering me is that according to the gcc
manual, eg here,
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
__sync_lock_test_and_set is nominally provided for datatypes 1, 2,
4, or 8 bytes in length, but the underlying hardware doesn't
necessarily support all those widths natively. If you pick the
wrong width then you don't get an inline operation at all, but a
call to some possibly inefficient library subroutine.
It's even worse. Our original version of the patch used a char, and
while that worked fine on the older Babbage board, it completely broke
at runtime on e. g. a Panda board. It wasn't just slow, it caused
hangs and test failures all over the place. With int it works
flawlessly.
Yeah, that was another thing I found worrisome while googling: there
were a disturbingly large number of claims that __sync_lock_test_and_set
and/or __sync_lock_release were flat-out broken on various combinations
of gcc version and platform. After reading that, there is no way at all
that I'd accept your original patch to use these functions everywhere.
For the moment I'm inclined to consider using these functions *only* on
ARM, so as to limit our exposure to such bugs. That would also limit
the risk of using an inappropriate choice of lock width.
regards, tom lane